vitobotta opened this issue 5 years ago
We have some fixes for it, but they weren't able to make it into the v0.5.0 release. Some performance improvements and lower CPU utilization should be in the v0.6.0 release. We will also add the ability to apply QoS to volumes later, so you can control how much CPU and IOPS you want to assign to a volume.
Thanks, looking forward to these fixes.
I still have very high CPU load after updating to 0.6.1...
OS: Ubuntu Server 18.04, Docker: 19.03.2
If I shut down the Longhorn Docker resources, the load comes down immediately. At the OS level you can see a lot of context switches and interrupts caused by longhorn and calico.
Will you have a look at that? Do you need any additional information?
@sybnex Yeah, we're looking into it. We're tracking v0.6.1 CPU utilization at https://github.com/longhorn/longhorn/issues/767
Can you provide us with the result of
kubectl top node
kubectl -n longhorn-system top pods
and, ideally, ps --sort=-pcpu aux | head -n 20
(note this may contain sensitive information, so please review it before posting) and docker stats
output from each node you have.
Here are the first results:
~$ kubectl top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
192.168.1.61 3472m 86% 5712Mi 36%
192.168.1.62 1980m 99% 1687Mi 21%
~$ kubectl -n longhorn-system top pods
NAME CPU(cores) MEMORY(bytes)
csi-attacher-5b788cc68-4bslg 4m 7Mi
csi-attacher-5b788cc68-vjjxk 1m 6Mi
csi-provisioner-7b5cc54875-sb9pg 1m 5Mi
csi-provisioner-7b5cc54875-tx92h 4m 8Mi
engine-image-ei-7c102650-frhc6 10m 16Mi
engine-image-ei-7c102650-t9dwr 9m 16Mi
instance-manager-e-8c2cf8c2 46m 7Mi
instance-manager-e-8f0a01fb 73m 36Mi
instance-manager-r-183e3e28 81m 56Mi
instance-manager-r-523854ed 52m 45Mi
longhorn-csi-plugin-l9t9p 1m 10Mi
longhorn-csi-plugin-p9fs5 1m 10Mi
longhorn-driver-deployer-5546f4f466-w7swb 0m 7Mi
longhorn-manager-gd6pm 61m 40Mi
longhorn-manager-xs862 20m 31Mi
longhorn-ui-c8cd878-468b5 0m 3Mi
After removing Longhorn (especially the instance-managers):
~$ kubectl top node
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
192.168.1.61 1037m 25% 5398Mi 34%
192.168.1.62 335m 16% 1549Mi 19%
@sybnex For the v0.6.1 release issue, check #767. We found the culprit and are fixing it in the v0.6.2 release. Leaving this issue open for the CPU utilization under load.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/unstale
Hi,
I forgot to answer: as soon as I updated to the latest version, the CPU load normalized.
Thx guys!
Hi,
I have Rancher 2.3.5 and installed the longhorn-system Rancher chart with version 0.7.0 of the template. I did this installation 3 days ago and have three volumes configured (5, 10, and 50 GB). I initially had 2 replicas for each volume, but then deleted 1 replica, as I thought the high CPU usage was coming from the replica management. The following image shows the CPU usage increasing over time on the longhorn-manager pod:
[graph: longhorn-manager pod CPU usage climbing over time]
The only thing currently running on this cluster is the Rancher cluster monitoring based on Prometheus (which has its storage on the 50 GB volume).
Is that CPU usage expected for such a setup? And why does it keep increasing without anything changing?
@guylabs So the graph is only for the longhorn-manager pod? That's interesting. We would expect higher CPU on the instance-manager pods, since they process the disk data, but not on the longhorn-manager pod. Can you show me the result of kubectl -n longhorn-system top pods?
@yasker This is the output of kubectl -n longhorn-system top pods:
❯ kubectl -n longhorn-system top pods
NAME CPU(cores) MEMORY(bytes)
compatible-csi-attacher-d8bc66556-8k5sg 1m 11Mi
csi-attacher-76876c69cf-4c9gv 1m 11Mi
csi-attacher-76876c69cf-5wnph 1m 9Mi
csi-attacher-76876c69cf-k7xhq 1m 8Mi
csi-provisioner-586d7867bb-626tv 1m 9Mi
csi-provisioner-586d7867bb-j9cw4 1m 13Mi
csi-provisioner-586d7867bb-wbgfp 1m 9Mi
engine-image-ei-ec95b5ad-d76wr 3m 18Mi
instance-manager-e-7b81a471 10m 87Mi
instance-manager-r-0b6dc4e3 9m 146Mi
longhorn-csi-plugin-n2bzc 1m 36Mi
longhorn-driver-deployer-5558df9859-m4drd 1m 10Mi
longhorn-manager-zspvs 232m 39Mi
longhorn-ui-7fd87bdf76-pkrvk 0m 3Mi
When it reached 70% CPU usage I redeployed the manager, and since then the CPU usage has dropped. But looking at the graph, it increases steadily. The blue line was the old manager that reached 70%, and the orange line is the new manager averaging around 15%. It started at 12% and now it's already at 17% again after 8 hours. I will monitor it further.
This is the output of kubectl -n longhorn-system top pods:
❯ kubectl -n longhorn-system top pods
NAME CPU(cores) MEMORY(bytes)
compatible-csi-attacher-d8bc66556-8k5sg 2m 11Mi
csi-attacher-76876c69cf-4c9gv 1m 11Mi
csi-attacher-76876c69cf-5wnph 1m 9Mi
csi-attacher-76876c69cf-k7xhq 1m 8Mi
csi-provisioner-586d7867bb-626tv 1m 9Mi
csi-provisioner-586d7867bb-j9cw4 2m 13Mi
csi-provisioner-586d7867bb-wbgfp 1m 9Mi
engine-image-ei-ec95b5ad-d76wr 3m 19Mi
instance-manager-e-7b81a471 14m 88Mi
instance-manager-r-0b6dc4e3 9m 148Mi
longhorn-csi-plugin-n2bzc 1m 36Mi
longhorn-driver-deployer-5558df9859-m4drd 1m 10Mi
longhorn-manager-v2kmf 55m 35Mi
longhorn-ui-7fd87bdf76-pkrvk 1m 3Mi
Do you have any update on your side on this?
@yasker The issue is still not resolved in 0.8.0. I still get these high CPU values on the longhorn-manager. Any update on this?
Using 0.8.0, I also have a high CPU usage problem (almost 200% across all of Longhorn's processes!).
Note that I have 76 pods and 13 volumes (we're going to have a lot more, around 300), if that matters.
Same issue with version 0.7.0; the CPU usage dropped when the longhorn-manager pod was redeployed. Very critical.
We've recently fixed a ticker leak in master. @meldafrawi, can you check if there is a way we can reproduce the issue of a long-running manager process taking more and more CPU over time?
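For context, a ticker leak in Go is typically a time.Ticker that is created but never stopped, so its underlying timer keeps firing for the life of the process. A minimal sketch of the pattern and its fix (illustrative only, not the actual Longhorn code):

```go
package main

import "time"

// leakyPoll starts a ticker that is never stopped. Every call leaks a
// runtime timer (and a goroutine) that keeps firing forever, so a
// long-running process that calls this repeatedly accumulates timers
// and burns more and more CPU on wakeups.
func leakyPoll(check func()) {
	ticker := time.NewTicker(time.Second)
	go func() {
		for range ticker.C {
			check()
		}
	}()
	// BUG: ticker.Stop() is never called and the goroutine never exits.
}

// fixedPoll ties the ticker's lifetime to a stop channel and releases
// the timer when polling ends.
func fixedPoll(check func(), stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	go func() {
		defer ticker.Stop() // free the timer when the loop exits
		for {
			select {
			case <-ticker.C:
				check()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	stop := make(chan struct{})
	fixedPoll(func() {}, stop)
	time.Sleep(3 * time.Second)
	close(stop) // the polling goroutine exits and the ticker is freed
}
```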
Still happens in 1.0.0
I'm facing the same problem: one of the longhorn-manager pods has had high CPU usage for a long time, and I don't see it decreasing.
Here is the output of kubectl -n longhorn-system top pods:
[root@izt4n4k5dvd9o0zq2qqj54z ~]# kubectl -n longhorn-system top pods
NAME CPU(cores) MEMORY(bytes)
csi-attacher-5cc849c8dd-2m47j 1m 9Mi
csi-attacher-5cc849c8dd-8mgt5 2m 15Mi
csi-attacher-5cc849c8dd-fr945 1m 9Mi
csi-provisioner-74557755-56zw5 1m 16Mi
csi-provisioner-74557755-kv7gk 1m 8Mi
csi-provisioner-74557755-zn6xc 1m 7Mi
csi-resizer-686bd4b6d7-686b8 2m 9Mi
csi-resizer-686bd4b6d7-758dg 1m 10Mi
csi-resizer-686bd4b6d7-gjkkz 1m 13Mi
engine-image-ei-eee5f438-k4579 1m 10Mi
engine-image-ei-eee5f438-s57sg 2m 3Mi
engine-image-ei-eee5f438-z7xpl 1m 3Mi
instance-manager-e-47df2e30 5m 43Mi
instance-manager-e-545b093c 10m 45Mi
instance-manager-e-7807de6f 963m 49Mi
instance-manager-r-2e35103c 47m 93Mi
instance-manager-r-9c1f9a82 33m 67Mi
instance-manager-r-c184d5aa 7m 64Mi
longhorn-csi-plugin-5bdbn 1m 29Mi
longhorn-csi-plugin-cwrzc 1m 23Mi
longhorn-csi-plugin-dbtmp 2m 25Mi
longhorn-driver-deployer-cd74cb75b-wrzvl 1m 10Mi
longhorn-manager-5nw2l 14m 73Mi
longhorn-manager-6phrd 2m 67Mi
longhorn-manager-xt7kd 21m 92Mi
longhorn-ui-8486987944-66wjp 1m 3Mi
And here is part of the logs of pod instance-manager-e-7807de6f:
Jul 10 16:12:47.728560: ->10.42.0.74:10031 P[ 0kB] 36980us
Jul 10 16:12:47.948435: ->10.42.2.6:10031 P[ 0kB] 379us
Jul 10 16:12:24.732872: ->10.42.1.66:10001 P[ 0kB] 24000990us failed
Jul 10 16:12:32.733303: ->10.42.1.66:10001 P[ 0kB] 16000564us failed
Jul 10 16:12:40.733502: ->10.42.1.66:10001 P[ 0kB] 8000365us failed
Jul 10 16:12:40.733502: ->10.42.1.66:10001 P[ 0kB] pending
Jul 10 16:12:24.732872: ->10.42.1.66:10001 P[ 0kB] pending
Jul 10 16:12:32.733303: ->10.42.1.66:10001 P[ 0kB] pending
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:12:53Z" level=error msg="Error reading from wire: EOF"
[longhorn-instance-manager] time="2020-07-10T16:12:53Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:12:53Z" level=debug msg="Process Manager: got logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:12:55Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:12:55Z" level=debug msg="Process Manager: got logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:12:56Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:12:57Z" level=debug msg="Process Manager: got logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:12:58Z" level=error msg="Error reading from wire: EOF"
[longhorn-instance-manager] time="2020-07-10T16:13:00Z" level=debug msg="Process Manager: start getting logs for process pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d"
[longhorn-instance-manager] time="2020-07-10T16:13:01Z" level=debug msg="Process Manager: got logs for process pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d"
[longhorn-instance-manager] time="2020-07-10T16:13:01Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:13:01Z" level=debug msg="Process Manager: got logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:06Z" level=info msg="Removing backend: tcp://10.42.1.66:10030"
time="2020-07-10T16:13:06Z" level=info msg="Closing: 10.42.1.66:10030"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:13:06Z" level=info msg="Removing backend: tcp://10.42.1.66:10000"
time="2020-07-10T16:13:06Z" level=info msg="Closing: 10.42.1.66:10000"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:13Z" level=error msg="Error reading from wire: EOF"
[longhorn-instance-manager] time="2020-07-10T16:13:14Z" level=debug msg="Process Manager: start getting logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[longhorn-instance-manager] time="2020-07-10T16:13:14Z" level=debug msg="Process Manager: got logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:14Z" level=info msg="Connecting to remote: 10.42.1.66:10000"
[longhorn-instance-manager] time="2020-07-10T16:13:15Z" level=debug msg="Process Manager: start getting logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[longhorn-instance-manager] time="2020-07-10T16:13:15Z" level=debug msg="Process Manager: got logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:13:16Z" level=info msg="Connecting to remote: 10.42.1.66:10030"
[longhorn-instance-manager] time="2020-07-10T16:13:17Z" level=debug msg="Process Manager: start getting logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[longhorn-instance-manager] time="2020-07-10T16:13:17Z" level=debug msg="Process Manager: got logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:19Z" level=info msg="Removing backend: tcp://10.42.1.66:10045"
time="2020-07-10T16:13:19Z" level=info msg="Closing: 10.42.1.66:10045"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:30Z" level=info msg="Opening: 10.42.1.66:10000"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:31Z" level=info msg="Snapshot: 10.42.2.6:10000 2b319147-751a-4c76-b550-dac9cd2f9fe7 UserCreated false Created at 2020-07-10T16:13:31Z, Labels map[]"
time="2020-07-10T16:13:31Z" level=info msg="Snapshot: 10.42.0.74:10060 2b319147-751a-4c76-b550-dac9cd2f9fe7 UserCreated false Created at 2020-07-10T16:13:31Z, Labels map[]"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:32Z" level=info msg="Snapshot: 10.42.1.66:10000 2b319147-751a-4c76-b550-dac9cd2f9fe7 UserCreated false Created at 2020-07-10T16:13:31Z, Labels map[]"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:34Z" level=info msg="Adding backend: tcp://10.42.1.66:10000"
time="2020-07-10T16:13:34Z" level=info msg="Start monitoring tcp://10.42.1.66:10000"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:34Z" level=info msg="Set revision counter of 10.42.1.66:10000 to : 0"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:34Z" level=info msg="Set backend tcp://10.42.1.66:10000 revision counter to 0"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:34Z" level=info msg="Connecting to remote: 10.42.1.66:10045"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] panic: runtime error: index out of range [0] with length 0
goroutine 2781167 [running]:
github.com/longhorn/longhorn-engine/pkg/controller.getReplicaDisksAndHead(0xc0004336a0, 0x16, 0xc0004614a0, 0xc000433b60, 0x13, 0x0, 0x0)
/go/src/github.com/longhorn/longhorn-engine/pkg/controller/rebuild.go:28 +0x4dc
github.com/longhorn/longhorn-engine/pkg/controller.(*Controller).PrepareRebuildReplica(0xc0001c9560, 0xc0004336a0, 0x16, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/longhorn/longhorn-engine/pkg/controller/rebuild.go:179 +0x3d6
github.com/longhorn/longhorn-engine/pkg/controller/rpc.(*ControllerServer).ReplicaPrepareRebuild(0xc000010158, 0x10a8240, 0xc0007bc1b0, 0xc0007bc1e0, 0xc000010158, 0xc0007bc1b0, 0xc0004d5a58)
/go/src/github.com/longhorn/longhorn-engine/pkg/controller/rpc/server.go:222 +0x4e
github.com/longhorn/longhorn-engine/proto/ptypes._ControllerService_ReplicaPrepareRebuild_Handler(0xe79940, 0xc000010158, 0x10a8240, 0xc0007bc1b0, 0xc0003a24b0, 0x0, 0x10a8240, 0xc0007bc1b0, 0xc000433680, 0x18)
/go/src/github.com/longhorn/longhorn-engine/proto/ptypes/controller.pb.go:1774 +0x217
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200, 0xc0000f0660, 0x189e438, 0x0, 0x0, 0x0)
/go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:998 +0x46a
google.golang.org/grpc.(*Server).handleStream(0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200, 0x0)
/go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:1278 +0xd97
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc0004dd2a0, 0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200)
/go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:717 +0xbb
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:715 +0xa1
[longhorn-instance-manager] time="2020-07-10T16:13:35Z" level=info msg="Process Manager: process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa error out, error msg: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:13:35Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:35Z" level=info msg="Opening: 10.42.1.66:10045"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:36Z" level=info msg="Snapshot: 10.42.0.74:10045 1ce403a8-61fa-44ec-baae-12535a8fd551 UserCreated false Created at 2020-07-10T16:13:35Z, Labels map[]"
time="2020-07-10T16:13:36Z" level=info msg="Snapshot: 10.42.2.6:10075 1ce403a8-61fa-44ec-baae-12535a8fd551 UserCreated false Created at 2020-07-10T16:13:35Z, Labels map[]"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:36Z" level=info msg="Snapshot: 10.42.1.66:10045 1ce403a8-61fa-44ec-baae-12535a8fd551 UserCreated false Created at 2020-07-10T16:13:35Z, Labels map[]"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:36Z" level=info msg="Adding backend: tcp://10.42.1.66:10045"
time="2020-07-10T16:13:36Z" level=info msg="Start monitoring tcp://10.42.1.66:10045"
send_msg: fail to write magic version
tgtd: bs_longhorn_request(91) fail to write at 557707264 for 28672
tgtd: bs_longhorn_request(150) io error 0x55b38e32c940 2a -22 28672 557707264, Broken pipe
send_msg: fail to write magic version
tgtd: bs_longhorn_request(91) fail to write at 536870912 for 4096
tgtd: bs_longhorn_request(150) io error 0x55b38e32c940 2a -22 4096 536870912, Broken pipe
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:45Z" level=info msg="Set revision counter of 10.42.1.66:10045 to : 0"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:45Z" level=info msg="Set backend tcp://10.42.1.66:10045 revision counter to 0"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:45Z" level=info msg="Synchronizing volume-head-003.img.meta to volume-head-001.img.meta@10.42.1.66:10048"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:45Z" level=info msg="Done synchronizing volume-head-003.img.meta to volume-head-001.img.meta@10.42.1.66:10048"
timeout_handler: Timeout request 41667 due to disconnection
tgtd: bs_longhorn_request(91) fail to write at 328925184 for 20480
tgtd: bs_longhorn_request(150) io error 0x55b38e358a10 2a -14 20480 328925184, Success
send_msg: fail to write magic version
tgtd: bs_longhorn_request(91) fail to write at 0 for 4096
tgtd: bs_longhorn_request(150) io error 0x55b38e358a10 2a -22 4096 0, Broken pipe
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:16:31Z" level=info msg="Get backend tcp://10.42.0.74:10045 revision counter 395"
time="2020-07-10T16:16:31Z" level=info msg="Set revision counter of 10.42.1.66:10045 to : 395"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:16:31Z" level=info msg="Set backend tcp://10.42.1.66:10045 revision counter to 395"
time="2020-07-10T16:16:31Z" level=info msg="WO replica tcp://10.42.1.66:10045's chain verified, update mode to RW"
time="2020-07-10T16:16:31Z" level=info msg="Set replica tcp://10.42.1.66:10045 to mode RW"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:26Z" level=info msg="Connecting to remote: 10.42.1.66:10030"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:26Z" level=info msg="Opening: 10.42.1.66:10030"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:26Z" level=info msg="Snapshot: 10.42.2.6:10030 9811450c-60bf-4316-95c6-ccc086d3f507 UserCreated false Created at 2020-07-10T16:22:26Z, Labels map[]"
time="2020-07-10T16:22:26Z" level=info msg="Snapshot: 10.42.0.74:10030 9811450c-60bf-4316-95c6-ccc086d3f507 UserCreated false Created at 2020-07-10T16:22:26Z, Labels map[]"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:26Z" level=info msg="Snapshot: 10.42.1.66:10030 9811450c-60bf-4316-95c6-ccc086d3f507 UserCreated false Created at 2020-07-10T16:22:26Z, Labels map[]"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:26Z" level=info msg="Adding backend: tcp://10.42.1.66:10030"
time="2020-07-10T16:22:26Z" level=info msg="Start monitoring tcp://10.42.1.66:10030"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:29Z" level=info msg="Set revision counter of 10.42.1.66:10030 to : 0"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:29Z" level=info msg="Set backend tcp://10.42.1.66:10030 revision counter to 0"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:29Z" level=info msg="Synchronizing volume-head-002.img.meta to volume-head-001.img.meta@10.42.1.66:10033"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:29Z" level=info msg="Done synchronizing volume-head-002.img.meta to volume-head-001.img.meta@10.42.1.66:10033"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:23:20Z" level=info msg="Get backend tcp://10.42.0.74:10030 revision counter 62"
time="2020-07-10T16:23:20Z" level=info msg="Set revision counter of 10.42.1.66:10030 to : 62"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:23:20Z" level=info msg="Set backend tcp://10.42.1.66:10030 revision counter to 62"
time="2020-07-10T16:23:20Z" level=info msg="WO replica tcp://10.42.1.66:10030's chain verified, update mode to RW"
time="2020-07-10T16:23:20Z" level=info msg="Set replica tcp://10.42.1.66:10030 to mode RW"
[longhorn-instance-manager] time="2020-07-10T16:28:18Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:29:31Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:30:32Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:34:59Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:35:16Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:37:11Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:40:02Z" level=debug msg="Process Manager: prepare to delete process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:40:02Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:40:02Z" level=debug msg="Process Manager: deleted process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:40:02Z" level=info msg="Process Manager: successfully unregistered process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:40:02Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:40:15Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-11T04:08:09Z" level=error msg="Ping timeout on replica 10.42.0.74:10031 seq= 28022"
time="2020-07-11T04:08:09Z" level=error msg="Retry 1 on replica 10.42.0.74:10031 seq= 28022 size= 0 (kB)"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-11T04:08:10Z" level=error msg="Ping timeout on replica 10.42.0.74:10046 seq= 27913"
time="2020-07-11T04:08:10Z" level=error msg="Retry 1 on replica 10.42.0.74:10046 seq= 27913 size= 0 (kB)"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-11T04:36:54Z" level=error msg="Ping timeout on replica 10.42.0.74:10046 seq= 28763"
time="2020-07-11T04:36:54Z" level=error msg="Retry 1 on replica 10.42.0.74:10046 seq= 28763 size= 0 (kB)"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-11T14:05:13Z" level=error msg="Ping timeout on replica 10.42.0.74:10031 seq= 45917"
time="2020-07-11T14:05:13Z" level=error msg="Retry 1 on replica 10.42.0.74:10031 seq= 45917 size= 0 (kB)"
Hope this can help you guys solve the problem. Thanks, and sorry for my bad English.
One of the engines has a panic in this log:
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] panic: runtime error: index out of range [0] with length 0
goroutine 2781167 [running]:
github.com/longhorn/longhorn-engine/pkg/controller.getReplicaDisksAndHead(0xc0004336a0, 0x16, 0xc0004614a0, 0xc000433b60, 0x13, 0x0, 0x0)
/go/src/github.com/longhorn/longhorn-engine/pkg/controller/rebuild.go:28 +0x4dc
github.com/longhorn/longhorn-engine/pkg/controller.(*Controller).PrepareRebuildReplica(0xc0001c9560, 0xc0004336a0, 0x16, 0x0, 0x0, 0x0, 0x0, 0x0)
/go/src/github.com/longhorn/longhorn-engine/pkg/controller/rebuild.go:179 +0x3d6
github.com/longhorn/longhorn-engine/pkg/controller/rpc.(*ControllerServer).ReplicaPrepareRebuild(0xc000010158, 0x10a8240, 0xc0007bc1b0, 0xc0007bc1e0, 0xc000010158, 0xc0007bc1b0, 0xc0004d5a58)
/go/src/github.com/longhorn/longhorn-engine/pkg/controller/rpc/server.go:222 +0x4e
github.com/longhorn/longhorn-engine/proto/ptypes._ControllerService_ReplicaPrepareRebuild_Handler(0xe79940, 0xc000010158, 0x10a8240, 0xc0007bc1b0, 0xc0003a24b0, 0x0, 0x10a8240, 0xc0007bc1b0, 0xc000433680, 0x18)
/go/src/github.com/longhorn/longhorn-engine/proto/ptypes/controller.pb.go:1774 +0x217
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200, 0xc0000f0660, 0x189e438, 0x0, 0x0, 0x0)
/go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:998 +0x46a
google.golang.org/grpc.(*Server).handleStream(0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200, 0x0)
/go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:1278 +0xd97
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc0004dd2a0, 0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200)
/go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:717 +0xbb
created by google.golang.org/grpc.(*Server).serveStreams.func1
/go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:715 +0xa1
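The message "index out of range [0] with length 0" means the engine indexed the first element of an empty slice, presumably a replica that reported an empty disk chain during rebuild preparation. A minimal sketch of that failure mode and the guard that avoids it (hypothetical names, not the actual rebuild.go code):

```go
package main

import (
	"errors"
	"fmt"
)

// headOf reproduces the crash pattern: indexing disks[0] when a replica
// reports an empty disk chain panics with
// "index out of range [0] with length 0".
func headOf(disks []string) string {
	return disks[0] // panics on an empty slice
}

// headOfSafe guards the access and surfaces the bad replica state as an
// error the caller can handle, instead of killing the engine process.
func headOfSafe(disks []string) (string, error) {
	if len(disks) == 0 {
		return "", errors.New("replica returned an empty disk chain")
	}
	return disks[0], nil
}

func main() {
	if _, err := headOfSafe(nil); err != nil {
		fmt.Println("handled:", err)
	}
}
```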
How many volumes are being attached and used on this node?
This is an issue in my low-power cluster, where the CPU load on my master node spikes to 60-100(!) and causes a cascading failure, as nodes can then no longer reach the master node.
Seeing high CPU usage on longhorn-manager here on 1.0.2 too.
Same problem here on 1.0.2; top process: tgtd.
Same problem too
longhorn-manager goes up to our set limit of 200m and stays there. It only seems to increase on nodes that are running pods with PVs. If we let it, it would eat up to 600m CPU... not sure what to do.
@aarlint I guess you want to set a limit for the pod but are actually setting a request. If you want to limit the max CPU the workload will use, you need to set spec.containers[].resources.limits.cpu.
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits
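To illustrate the difference using the Kubernetes Go API types (a sketch only; in a chart or manifest these are the equivalent YAML fields, and the 20m/200m values here are just examples): a request reserves CPU for scheduling, while a limit is the ceiling the kubelet enforces by throttling.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// requests.cpu reserves a share for the scheduler; limits.cpu is the
	// hard cap. Capping longhorn-manager means setting the limit, not
	// raising the request.
	res := corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU: resource.MustParse("20m"), // reserved share
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU: resource.MustParse("200m"), // throttling ceiling
		},
	}
	fmt.Printf("request=%s limit=%s\n", res.Requests.Cpu(), res.Limits.Cpu())
}
```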
I think I was unclear. I am setting a CPU limit and that works... but I'm not sure what I'm throttling when I do. The default CPU request for the longhorn manager is 20m, but in our case the longhorn manager takes 500m if/when we don't set a CPU limit. Is that normal? What is the longhorn manager doing?
How much CPU the longhorn manager will eat depends on how many volumes and replicas are in your system. Typically, longhorn managers are responsible for:
https://longhorn.io/docs/1.1.0/concepts/#11-the-longhorn-manager-and-the-longhorn-engine
BTW, is there any reason that you need to put a hard CPU limit on longhorn managers?
If I don't put a hard limit, their CPU usage goes up to over 500m, which really impacts node performance. The requested CPU is 20m in the standard install, so I thought they should use much less than 500m constantly.
20m is the minimum requirement. As I said, the actual resources it uses depend on the number of volumes. I just tried attaching 30 volumes to a node; the longhorn manager pod on that node then ate 300m CPU, while the other longhorn manager pods used only 10m~50m CPU.
longhorn-manager-2kz6p 2m 32Mi
longhorn-manager-fx4qq 297m 33Mi
longhorn-manager-jmdht 3m 29Mi
But if only a limited number of volumes are attached to the node and the longhorn manager pod on the node still eats 500m CPU, then something may be wrong. In that scenario, you can generate a support bundle and send it to us.
If this is normal/expected behavior, then there isn't a problem. We weren't aware the longhorn manager's CPU usage was tied directly to volume count.
Thanks for your help!
Hi, I'm still facing this issue, where the longhorn manager takes up quite a bit of CPU. Is the current workaround to set a CPU limit?
No. Longhorn manager pods will temporarily eat some CPU when operations like attaching/detaching come in. Are the longhorn manager pods continuously taking up high CPU? If YES, there may be something wrong.
On a node with 10 attachments (no local disk), I'm seeing 600m longhorn-manager usage and 100m instance-manager usage. So a total of 700m, or ~70m per attachment. This is on arm64 at a 2 GHz clock rate.
Is ~70m per attachment about what should be expected? I have to say it does seem rather high, since most of the pods which require these attachments use much less than ~70m, so Longhorn ends up being the biggest CPU user for my applications.
The processes in the instance manager are essential to volume I/O; 100m for 10 volumes is definitely not enough. Considering the total CPU is 2000m and future upgrades, 300~400m might be the right choice. And as I mentioned, you don't need to set a limit for longhorn managers, since most of the time they won't eat much CPU. The resource requirement investigation (for the instance manager) is at https://github.com/longhorn/longhorn/issues/1691
In my situation most volumes are pretty much idle, and longhorn-manager uses 6x more CPU than instance-manager on the given node. I don't have either limited. Are you suggesting that longhorn-manager can be limited, and instance-manager should not be? I'm OK with 100m usage, but when Longhorn is the majority CPU user, cumulatively using almost a core with mostly zero I/O, I start to worry about headroom as I add more volumes.
@vaskozl Mind sharing a support bundle with us? You can attach it here or send an email to longhorn-support-bundle@rancher.com, referencing this thread and describing your cluster environment (CPU, memory, disk type, network, whether the control plane is dedicated, etc.).
I mean, ideally, you don't need to limit either one. Besides, you need to guarantee CPU resources for the instance managers. Since you said that longhorn-manager uses 6x more CPU than instance-manager, something may be wrong. You can send us a support bundle along with the CPU-related info (the support bundle doesn't contain CPU usage info).
With around 250 Mbps of reads on a volume, we see the Longhorn engine using a full 6 cores (12 threads) of a Xeon E-2288G.
Message from syslogd@ns31354176 at Jul 15 18:21:28 ... "ns31354176" 18:21 15-Jul-21
kernel:[2920835.996469] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [longhorn:1771918]
While transferring the contents of a volume, we're seeing the system locked up by Longhorn.
@ChipWolf are there replica rebuilds running at the time of the above message?
Can you include the full kernel call trace that was printed after the
kernel:[2920835.996469] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [longhorn:1771918]
It should look something like this:
kernel: BUG: soft lockup - CPU#X stuck for Ys! [longhorn:...]
... ...
kernel: Call Trace:
kernel: [...] :
kernel: [...] :
kernel: [...] :
If I witness it again, I'll grab the logs; we don't aggregate syslog right now. Thanks.
I mentioned this on Slack as well, but I thought I'd open an issue about it anyway. I am testing many things at the moment, including various things concerning Longhorn. I really like Longhorn, but I am having a big issue with CPU usage when using volumes with actual data. One of my tests is copying around 25 GB of mixed data (small files, large videos) to a volume used by a Nextcloud installation, so I can test things with real data. I am downloading this data from another server with SSH+tar, and each time I run this test the load average climbs into the tens (right now it's at around 30). I have also tried restoring the same data from a Velero/Restic backup, and again the load goes up quite a lot, quite quickly.
I am evaluating storage options while I learn Kubernetes for a project I had started to work on, but while, like I said, I like Longhorn a lot (especially the built-in backups and the great UI), this problem makes it difficult for me to actually use it. So my question is: is it just me, or does this happen to others?
At the moment I am testing with Hetzner Cloud servers with 4 cores and 16 GB of RAM, but I have also tested with UpCloud servers, which are the fastest I have found (especially for storage), so I don't think it's an issue with the performance of the servers, even though of course there are differences between providers.
Any idea what could cause this high CPU usage and how to solve it? Exactly the same tests work fine without particular problems if I use OpenEBS (in that case the load doesn't go above 5-10 max, even when copying large files).
Thanks in advance!