longhorn / longhorn

Cloud-Native distributed storage built on and for Kubernetes
https://longhorn.io
Apache License 2.0

[TASK] Investigate cases causing high CPU usage #538

Open vitobotta opened 5 years ago

vitobotta commented 5 years ago

I mentioned this on Slack as well, but I thought I'd open an issue about it. I am testing many things at the moment, including various things concerning Longhorn. I really like Longhorn, but I am having a big issue with CPU usage when using volumes with actual data. One of my tests is copying around 25 GB of mixed data (small files, large videos) to a volume used by a Nextcloud installation, so I can test things with real data. I am downloading this data from another server with SSH+tar, and each time I do this test the load average climbs into the tens (right now it's at around 30). I have also tried restoring the same data from a Velero/Restic backup, and again the load goes up quite a lot, quite quickly.

I am evaluating storage options while I learn Kubernetes for a project I had started to work on, but, as I said, while I like Longhorn a lot (especially the built-in backups and the great UI), this problem makes it difficult for me to actually use it. So my question is: is it just me, or does this happen to others?

At the moment I am testing with Hetzner Cloud servers with 4 cores and 16 GB of RAM, but I have also tested with UpCloud servers, which are the fastest I have found (especially for storage), so I don't think it's an issue with server performance, even though of course there are differences between providers.

Any idea what could cause this high CPU usage and how to solve it? Exactly the same tests work fine without particular problems if I use OpenEBS (in that case the load doesn't go above 5-10 max, even when copying large files).

Thanks in advance!

yasker commented 5 years ago

We have some fixes for it, but they weren't able to make it into the v0.5.0 release. Some performance improvements and lower CPU utilization should be in the v0.6.0 release. We will also add the ability to apply QoS to volumes later, so you can control how much CPU and how many IOPS you want to assign to a volume.

vitobotta commented 5 years ago

Thanks, looking forward to these fixes.

sybnex commented 5 years ago

I still have very high CPU load after updating to 0.6.1...

OS: Ubuntu Server 18.04 Docker: 19.03.2

If I shut down the Longhorn Docker resources, the load comes down immediately. In the OS you see a lot of context switches and interrupts caused by Longhorn and Calico.

Will you have a look at that? Do you need additional information?

yasker commented 5 years ago

@sybnex Yeah, we're looking into it. We're tracking v0.6.1 CPU utilization at https://github.com/longhorn/longhorn/issues/767

Can you provide us with the results of

kubectl top node
kubectl -n longhorn-system top pods

And ideally, the output of ps --sort=-pcpu -aux | head -n 20 (note this may contain sensitive information, please review it before posting) and docker stats from each node you have.
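
For convenience, a small script along these lines could collect the requested diagnostics in one place when run on each node (a sketch only; the output file name is just an example):

#!/bin/sh
# Cluster-level view of node and Longhorn pod CPU/memory
kubectl top node > longhorn-diag.txt
kubectl -n longhorn-system top pods >> longhorn-diag.txt
# Top 20 CPU consumers on this node (review for sensitive information before sharing)
ps --sort=-pcpu -aux | head -n 20 >> longhorn-diag.txt
# One-shot container stats on this node
docker stats --no-stream >> longhorn-diag.txt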

sybnex commented 5 years ago

Here are the first results:


~$ kubectl top node
NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
192.168.1.61   3472m        86%    5712Mi          36%       
192.168.1.62   1980m        99%    1687Mi          21%       

~$ kubectl -n longhorn-system top pods
NAME                                        CPU(cores)   MEMORY(bytes)   
csi-attacher-5b788cc68-4bslg                4m           7Mi             
csi-attacher-5b788cc68-vjjxk                1m           6Mi             
csi-provisioner-7b5cc54875-sb9pg            1m           5Mi             
csi-provisioner-7b5cc54875-tx92h            4m           8Mi             
engine-image-ei-7c102650-frhc6              10m          16Mi            
engine-image-ei-7c102650-t9dwr              9m           16Mi            
instance-manager-e-8c2cf8c2                 46m          7Mi             
instance-manager-e-8f0a01fb                 73m          36Mi            
instance-manager-r-183e3e28                 81m          56Mi            
instance-manager-r-523854ed                 52m          45Mi            
longhorn-csi-plugin-l9t9p                   1m           10Mi            
longhorn-csi-plugin-p9fs5                   1m           10Mi            
longhorn-driver-deployer-5546f4f466-w7swb   0m           7Mi             
longhorn-manager-gd6pm                      61m          40Mi            
longhorn-manager-xs862                      20m          31Mi            
longhorn-ui-c8cd878-468b5                   0m           3Mi       
sybnex commented 5 years ago

(screenshot)

sybnex commented 5 years ago

After removing Longhorn ... especially the instance-managers:


~$ kubectl top node
NAME           CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%   
192.168.1.61   1037m        25%    5398Mi          34%       
192.168.1.62   335m         16%    1549Mi          19%       
yasker commented 5 years ago

@sybnex For the issue with the v0.6.1 release, check #767. We found the culprit and are fixing it in the v0.6.2 release. Leaving this issue open for the CPU utilization under load.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

dnauck commented 4 years ago

/unstale

sybnex commented 4 years ago

Hi,

I forgot to answer ... as soon as I updated to the latest version, the CPU load normalized.

Thx guys!

guylabs commented 4 years ago

Hi,

I have Rancher 2.3.5 and installed the longhorn-system Rancher chart with version 0.7.0 of the template. I did this installation 3 days ago and I have three volumes configured (5, 10 and 50 GB). I initially had 2 replicas for each volume, but then I deleted 1 replica as I thought the high CPU usage was coming from the replica management. The following image shows the CPU usage increasing over time on the longhorn-manager pod:

(screenshot: Screen Shot 2020-02-16 at 11 21 06)

The only thing currently running on this cluster is the Rancher cluster monitoring based on Prometheus (which has its storage on the 50 GB volume).

Is this CPU usage expected for such a setup? And why does it keep increasing without anything changing?

yasker commented 4 years ago

@guylabs So the graph is only for the longhorn-manager pod? That's interesting. We would expect higher CPU on the instance manager pods since they process the disk data, but not on the longhorn-manager pod. Can you show me the result of kubectl -n longhorn-system top pods?

guylabs commented 4 years ago

@yasker This is the output of kubectl -n longhorn-system top pods:

❯ kubectl -n longhorn-system top pods
NAME                                        CPU(cores)   MEMORY(bytes)   
compatible-csi-attacher-d8bc66556-8k5sg     1m           11Mi            
csi-attacher-76876c69cf-4c9gv               1m           11Mi            
csi-attacher-76876c69cf-5wnph               1m           9Mi             
csi-attacher-76876c69cf-k7xhq               1m           8Mi             
csi-provisioner-586d7867bb-626tv            1m           9Mi             
csi-provisioner-586d7867bb-j9cw4            1m           13Mi            
csi-provisioner-586d7867bb-wbgfp            1m           9Mi             
engine-image-ei-ec95b5ad-d76wr              3m           18Mi            
instance-manager-e-7b81a471                 10m          87Mi            
instance-manager-r-0b6dc4e3                 9m           146Mi           
longhorn-csi-plugin-n2bzc                   1m           36Mi            
longhorn-driver-deployer-5558df9859-m4drd   1m           10Mi            
longhorn-manager-zspvs                      232m         39Mi            
longhorn-ui-7fd87bdf76-pkrvk                0m           3Mi
guylabs commented 4 years ago

When it reached 70% CPU usage I redeployed the manager, and since then the CPU usage has dropped. But looking at the graph, it increases steadily. The blue line was the old manager that reached 70%, and the orange line is the new manager, averaging around 15%. It started at 12% and now it's already at 17% again after 8 hours. I will monitor it further.

(screenshot: Screen Shot 2020-02-19 at 07 34 35)

This is the output of kubectl -n longhorn-system top pods:

❯ kubectl -n longhorn-system top pods
NAME                                        CPU(cores)   MEMORY(bytes)   
compatible-csi-attacher-d8bc66556-8k5sg     2m           11Mi            
csi-attacher-76876c69cf-4c9gv               1m           11Mi            
csi-attacher-76876c69cf-5wnph               1m           9Mi             
csi-attacher-76876c69cf-k7xhq               1m           8Mi             
csi-provisioner-586d7867bb-626tv            1m           9Mi             
csi-provisioner-586d7867bb-j9cw4            2m           13Mi            
csi-provisioner-586d7867bb-wbgfp            1m           9Mi             
engine-image-ei-ec95b5ad-d76wr              3m           19Mi            
instance-manager-e-7b81a471                 14m          88Mi            
instance-manager-r-0b6dc4e3                 9m           148Mi           
longhorn-csi-plugin-n2bzc                   1m           36Mi            
longhorn-driver-deployer-5558df9859-m4drd   1m           10Mi            
longhorn-manager-v2kmf                      55m          35Mi            
longhorn-ui-7fd87bdf76-pkrvk                1m           3Mi 

Do you have any update on your side on this?

guylabs commented 4 years ago

@yasker The issue is still not resolved in 0.8.0. I still get these high CPU values on the longhorn-manager. Any update on this?

Arno500 commented 4 years ago

Using 0.8.0, I also have a high CPU usage problem (almost 200% across all of Longhorn's processes!).
Note that I have 76 pods and 13 volumes (we're going to have a lot more, around 300), if that matters.

(screenshot)

all4innov commented 4 years ago

Same issue with version 0.7.0; the CPU usage dropped when the longhorn-manager pod was redeployed (screenshot attached). This is very critical.

yasker commented 4 years ago

We've recently fixed a ticker leak in master. @meldafrawi can you check if there is a way to reproduce the issue of a long-running manager process taking more and more CPU over time?

timmy59100 commented 4 years ago

Still happens in 1.0.0

(screenshots)

laeo commented 4 years ago

I'm facing the same problem: one of the longhorn-manager pods has had high CPU usage for a long time, and I don't see it decreasing.

Here is the output of kubectl -n longhorn-system top pods:

[root@izt4n4k5dvd9o0zq2qqj54z ~]# kubectl -n longhorn-system top pods
NAME                                       CPU(cores)   MEMORY(bytes)
csi-attacher-5cc849c8dd-2m47j              1m           9Mi
csi-attacher-5cc849c8dd-8mgt5              2m           15Mi
csi-attacher-5cc849c8dd-fr945              1m           9Mi
csi-provisioner-74557755-56zw5             1m           16Mi
csi-provisioner-74557755-kv7gk             1m           8Mi
csi-provisioner-74557755-zn6xc             1m           7Mi
csi-resizer-686bd4b6d7-686b8               2m           9Mi
csi-resizer-686bd4b6d7-758dg               1m           10Mi
csi-resizer-686bd4b6d7-gjkkz               1m           13Mi
engine-image-ei-eee5f438-k4579             1m           10Mi
engine-image-ei-eee5f438-s57sg             2m           3Mi
engine-image-ei-eee5f438-z7xpl             1m           3Mi
instance-manager-e-47df2e30                5m           43Mi
instance-manager-e-545b093c                10m          45Mi
instance-manager-e-7807de6f                963m         49Mi
instance-manager-r-2e35103c                47m          93Mi
instance-manager-r-9c1f9a82                33m          67Mi
instance-manager-r-c184d5aa                7m           64Mi
longhorn-csi-plugin-5bdbn                  1m           29Mi
longhorn-csi-plugin-cwrzc                  1m           23Mi
longhorn-csi-plugin-dbtmp                  2m           25Mi
longhorn-driver-deployer-cd74cb75b-wrzvl   1m           10Mi
longhorn-manager-5nw2l                     14m          73Mi
longhorn-manager-6phrd                     2m           67Mi
longhorn-manager-xt7kd                     21m          92Mi
longhorn-ui-8486987944-66wjp               1m           3Mi

And here is part of the logs of pod instance-manager-e-7807de6f:

Jul 10 16:12:47.728560: ->10.42.0.74:10031 P[  0kB]    36980us
Jul 10 16:12:47.948435: ->10.42.2.6:10031 P[  0kB]      379us
Jul 10 16:12:24.732872: ->10.42.1.66:10001 P[  0kB] 24000990us failed
Jul 10 16:12:32.733303: ->10.42.1.66:10001 P[  0kB] 16000564us failed
Jul 10 16:12:40.733502: ->10.42.1.66:10001 P[  0kB]  8000365us failed
Jul 10 16:12:40.733502: ->10.42.1.66:10001 P[  0kB] pending
Jul 10 16:12:24.732872: ->10.42.1.66:10001 P[  0kB] pending
Jul 10 16:12:32.733303: ->10.42.1.66:10001 P[  0kB] pending
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:12:53Z" level=error msg="Error reading from wire: EOF"
[longhorn-instance-manager] time="2020-07-10T16:12:53Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:12:53Z" level=debug msg="Process Manager: got logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:12:55Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:12:55Z" level=debug msg="Process Manager: got logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:12:56Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:12:57Z" level=debug msg="Process Manager: got logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:12:58Z" level=error msg="Error reading from wire: EOF"
[longhorn-instance-manager] time="2020-07-10T16:13:00Z" level=debug msg="Process Manager: start getting logs for process pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d"
[longhorn-instance-manager] time="2020-07-10T16:13:01Z" level=debug msg="Process Manager: got logs for process pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d"
[longhorn-instance-manager] time="2020-07-10T16:13:01Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:13:01Z" level=debug msg="Process Manager: got logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:06Z" level=info msg="Removing backend: tcp://10.42.1.66:10030"
time="2020-07-10T16:13:06Z" level=info msg="Closing: 10.42.1.66:10030"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:13:06Z" level=info msg="Removing backend: tcp://10.42.1.66:10000"
time="2020-07-10T16:13:06Z" level=info msg="Closing: 10.42.1.66:10000"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:13Z" level=error msg="Error reading from wire: EOF"
[longhorn-instance-manager] time="2020-07-10T16:13:14Z" level=debug msg="Process Manager: start getting logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[longhorn-instance-manager] time="2020-07-10T16:13:14Z" level=debug msg="Process Manager: got logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:14Z" level=info msg="Connecting to remote: 10.42.1.66:10000"
[longhorn-instance-manager] time="2020-07-10T16:13:15Z" level=debug msg="Process Manager: start getting logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[longhorn-instance-manager] time="2020-07-10T16:13:15Z" level=debug msg="Process Manager: got logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:13:16Z" level=info msg="Connecting to remote: 10.42.1.66:10030"
[longhorn-instance-manager] time="2020-07-10T16:13:17Z" level=debug msg="Process Manager: start getting logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[longhorn-instance-manager] time="2020-07-10T16:13:17Z" level=debug msg="Process Manager: got logs for process pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:19Z" level=info msg="Removing backend: tcp://10.42.1.66:10045"
time="2020-07-10T16:13:19Z" level=info msg="Closing: 10.42.1.66:10045"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:30Z" level=info msg="Opening: 10.42.1.66:10000"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:31Z" level=info msg="Snapshot: 10.42.2.6:10000 2b319147-751a-4c76-b550-dac9cd2f9fe7 UserCreated false Created at 2020-07-10T16:13:31Z, Labels map[]"
time="2020-07-10T16:13:31Z" level=info msg="Snapshot: 10.42.0.74:10060 2b319147-751a-4c76-b550-dac9cd2f9fe7 UserCreated false Created at 2020-07-10T16:13:31Z, Labels map[]"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:32Z" level=info msg="Snapshot: 10.42.1.66:10000 2b319147-751a-4c76-b550-dac9cd2f9fe7 UserCreated false Created at 2020-07-10T16:13:31Z, Labels map[]"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:34Z" level=info msg="Adding backend: tcp://10.42.1.66:10000"
time="2020-07-10T16:13:34Z" level=info msg="Start monitoring tcp://10.42.1.66:10000"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:34Z" level=info msg="Set revision counter of 10.42.1.66:10000 to : 0"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] time="2020-07-10T16:13:34Z" level=info msg="Set backend tcp://10.42.1.66:10000 revision counter to 0"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:34Z" level=info msg="Connecting to remote: 10.42.1.66:10045"
[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] panic: runtime error: index out of range [0] with length 0

goroutine 2781167 [running]:
github.com/longhorn/longhorn-engine/pkg/controller.getReplicaDisksAndHead(0xc0004336a0, 0x16, 0xc0004614a0, 0xc000433b60, 0x13, 0x0, 0x0)
    /go/src/github.com/longhorn/longhorn-engine/pkg/controller/rebuild.go:28 +0x4dc
github.com/longhorn/longhorn-engine/pkg/controller.(*Controller).PrepareRebuildReplica(0xc0001c9560, 0xc0004336a0, 0x16, 0x0, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/longhorn/longhorn-engine/pkg/controller/rebuild.go:179 +0x3d6
github.com/longhorn/longhorn-engine/pkg/controller/rpc.(*ControllerServer).ReplicaPrepareRebuild(0xc000010158, 0x10a8240, 0xc0007bc1b0, 0xc0007bc1e0, 0xc000010158, 0xc0007bc1b0, 0xc0004d5a58)
    /go/src/github.com/longhorn/longhorn-engine/pkg/controller/rpc/server.go:222 +0x4e
github.com/longhorn/longhorn-engine/proto/ptypes._ControllerService_ReplicaPrepareRebuild_Handler(0xe79940, 0xc000010158, 0x10a8240, 0xc0007bc1b0, 0xc0003a24b0, 0x0, 0x10a8240, 0xc0007bc1b0, 0xc000433680, 0x18)
    /go/src/github.com/longhorn/longhorn-engine/proto/ptypes/controller.pb.go:1774 +0x217
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200, 0xc0000f0660, 0x189e438, 0x0, 0x0, 0x0)
    /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:998 +0x46a
google.golang.org/grpc.(*Server).handleStream(0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200, 0x0)
    /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:1278 +0xd97
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc0004dd2a0, 0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200)
    /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:717 +0xbb
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:715 +0xa1
[longhorn-instance-manager] time="2020-07-10T16:13:35Z" level=info msg="Process Manager: process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa error out, error msg: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:13:35Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:35Z" level=info msg="Opening: 10.42.1.66:10045"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:36Z" level=info msg="Snapshot: 10.42.0.74:10045 1ce403a8-61fa-44ec-baae-12535a8fd551 UserCreated false Created at 2020-07-10T16:13:35Z, Labels map[]"
time="2020-07-10T16:13:36Z" level=info msg="Snapshot: 10.42.2.6:10075 1ce403a8-61fa-44ec-baae-12535a8fd551 UserCreated false Created at 2020-07-10T16:13:35Z, Labels map[]"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:36Z" level=info msg="Snapshot: 10.42.1.66:10045 1ce403a8-61fa-44ec-baae-12535a8fd551 UserCreated false Created at 2020-07-10T16:13:35Z, Labels map[]"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:36Z" level=info msg="Adding backend: tcp://10.42.1.66:10045"
time="2020-07-10T16:13:36Z" level=info msg="Start monitoring tcp://10.42.1.66:10045"
send_msg: fail to write magic version
tgtd: bs_longhorn_request(91) fail to write at 557707264 for 28672
tgtd: bs_longhorn_request(150) io error 0x55b38e32c940 2a -22 28672 557707264, Broken pipe
send_msg: fail to write magic version
tgtd: bs_longhorn_request(91) fail to write at 536870912 for 4096
tgtd: bs_longhorn_request(150) io error 0x55b38e32c940 2a -22 4096 536870912, Broken pipe
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:45Z" level=info msg="Set revision counter of 10.42.1.66:10045 to : 0"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:45Z" level=info msg="Set backend tcp://10.42.1.66:10045 revision counter to 0"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:45Z" level=info msg="Synchronizing volume-head-003.img.meta to volume-head-001.img.meta@10.42.1.66:10048"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:13:45Z" level=info msg="Done synchronizing volume-head-003.img.meta to volume-head-001.img.meta@10.42.1.66:10048"
timeout_handler: Timeout request 41667 due to disconnection
tgtd: bs_longhorn_request(91) fail to write at 328925184 for 20480
tgtd: bs_longhorn_request(150) io error 0x55b38e358a10 2a -14 20480 328925184, Success
send_msg: fail to write magic version
tgtd: bs_longhorn_request(91) fail to write at 0 for 4096
tgtd: bs_longhorn_request(150) io error 0x55b38e358a10 2a -22 4096 0, Broken pipe
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:16:31Z" level=info msg="Get backend tcp://10.42.0.74:10045 revision counter 395"
time="2020-07-10T16:16:31Z" level=info msg="Set revision counter of 10.42.1.66:10045 to : 395"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-10T16:16:31Z" level=info msg="Set backend tcp://10.42.1.66:10045 revision counter to 395"
time="2020-07-10T16:16:31Z" level=info msg="WO replica tcp://10.42.1.66:10045's chain verified, update mode to RW"
time="2020-07-10T16:16:31Z" level=info msg="Set replica tcp://10.42.1.66:10045 to mode RW"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:26Z" level=info msg="Connecting to remote: 10.42.1.66:10030"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:26Z" level=info msg="Opening: 10.42.1.66:10030"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:26Z" level=info msg="Snapshot: 10.42.2.6:10030 9811450c-60bf-4316-95c6-ccc086d3f507 UserCreated false Created at 2020-07-10T16:22:26Z, Labels map[]"
time="2020-07-10T16:22:26Z" level=info msg="Snapshot: 10.42.0.74:10030 9811450c-60bf-4316-95c6-ccc086d3f507 UserCreated false Created at 2020-07-10T16:22:26Z, Labels map[]"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:26Z" level=info msg="Snapshot: 10.42.1.66:10030 9811450c-60bf-4316-95c6-ccc086d3f507 UserCreated false Created at 2020-07-10T16:22:26Z, Labels map[]"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:26Z" level=info msg="Adding backend: tcp://10.42.1.66:10030"
time="2020-07-10T16:22:26Z" level=info msg="Start monitoring tcp://10.42.1.66:10030"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:29Z" level=info msg="Set revision counter of 10.42.1.66:10030 to : 0"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:29Z" level=info msg="Set backend tcp://10.42.1.66:10030 revision counter to 0"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:29Z" level=info msg="Synchronizing volume-head-002.img.meta to volume-head-001.img.meta@10.42.1.66:10033"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:22:29Z" level=info msg="Done synchronizing volume-head-002.img.meta to volume-head-001.img.meta@10.42.1.66:10033"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:23:20Z" level=info msg="Get backend tcp://10.42.0.74:10030 revision counter 62"
time="2020-07-10T16:23:20Z" level=info msg="Set revision counter of 10.42.1.66:10030 to : 62"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-10T16:23:20Z" level=info msg="Set backend tcp://10.42.1.66:10030 revision counter to 62"
time="2020-07-10T16:23:20Z" level=info msg="WO replica tcp://10.42.1.66:10030's chain verified, update mode to RW"
time="2020-07-10T16:23:20Z" level=info msg="Set replica tcp://10.42.1.66:10030 to mode RW"
[longhorn-instance-manager] time="2020-07-10T16:28:18Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:29:31Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:30:32Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:34:59Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:35:16Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:37:11Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:40:02Z" level=debug msg="Process Manager: prepare to delete process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:40:02Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:40:02Z" level=debug msg="Process Manager: deleted process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:40:02Z" level=info msg="Process Manager: successfully unregistered process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[longhorn-instance-manager] time="2020-07-10T16:40:02Z" level=debug msg="Process update: pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa: state error: Error: exit status 2"
[longhorn-instance-manager] time="2020-07-10T16:40:15Z" level=debug msg="Process Manager: start getting logs for process pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-11T04:08:09Z" level=error msg="Ping timeout on replica 10.42.0.74:10031 seq= 28022"
time="2020-07-11T04:08:09Z" level=error msg="Retry  1 on replica 10.42.0.74:10031 seq= 28022 size= 0 (kB)"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-11T04:08:10Z" level=error msg="Ping timeout on replica 10.42.0.74:10046 seq= 27913"
time="2020-07-11T04:08:10Z" level=error msg="Retry  1 on replica 10.42.0.74:10046 seq= 27913 size= 0 (kB)"
[pvc-2702aa10-4795-4d39-916d-72c50866ea6d-e-795fd18c] time="2020-07-11T04:36:54Z" level=error msg="Ping timeout on replica 10.42.0.74:10046 seq= 28763"
time="2020-07-11T04:36:54Z" level=error msg="Retry  1 on replica 10.42.0.74:10046 seq= 28763 size= 0 (kB)"
[pvc-0d3a8007-5f40-4ab8-a26d-ac2e3167480f-e-1d87f72d] time="2020-07-11T14:05:13Z" level=error msg="Ping timeout on replica 10.42.0.74:10031 seq= 45917"
time="2020-07-11T14:05:13Z" level=error msg="Retry  1 on replica 10.42.0.74:10031 seq= 45917 size= 0 (kB)"

Hope this can help you guys solve the problem. Thanks, and sorry for my bad English.

yasker commented 4 years ago

One of the engines has a panic in this log:

[pvc-e9ad8ad5-cd6e-4cfc-9192-60bb939f5c21-e-114210fa] panic: runtime error: index out of range [0] with length 0

goroutine 2781167 [running]:
github.com/longhorn/longhorn-engine/pkg/controller.getReplicaDisksAndHead(0xc0004336a0, 0x16, 0xc0004614a0, 0xc000433b60, 0x13, 0x0, 0x0)
    /go/src/github.com/longhorn/longhorn-engine/pkg/controller/rebuild.go:28 +0x4dc
github.com/longhorn/longhorn-engine/pkg/controller.(*Controller).PrepareRebuildReplica(0xc0001c9560, 0xc0004336a0, 0x16, 0x0, 0x0, 0x0, 0x0, 0x0)
    /go/src/github.com/longhorn/longhorn-engine/pkg/controller/rebuild.go:179 +0x3d6
github.com/longhorn/longhorn-engine/pkg/controller/rpc.(*ControllerServer).ReplicaPrepareRebuild(0xc000010158, 0x10a8240, 0xc0007bc1b0, 0xc0007bc1e0, 0xc000010158, 0xc0007bc1b0, 0xc0004d5a58)
    /go/src/github.com/longhorn/longhorn-engine/pkg/controller/rpc/server.go:222 +0x4e
github.com/longhorn/longhorn-engine/proto/ptypes._ControllerService_ReplicaPrepareRebuild_Handler(0xe79940, 0xc000010158, 0x10a8240, 0xc0007bc1b0, 0xc0003a24b0, 0x0, 0x10a8240, 0xc0007bc1b0, 0xc000433680, 0x18)
    /go/src/github.com/longhorn/longhorn-engine/proto/ptypes/controller.pb.go:1774 +0x217
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200, 0xc0000f0660, 0x189e438, 0x0, 0x0, 0x0)
    /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:998 +0x46a
google.golang.org/grpc.(*Server).handleStream(0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200, 0x0)
    /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:1278 +0xd97
google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc0004dd2a0, 0xc000001980, 0x10b0680, 0xc0005f6780, 0xc000013200)
    /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:717 +0xbb
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:715 +0xa1

How many volumes are being attached and used on this node?
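
For reference, one hedged way to count how many Longhorn volumes are attached per node from the CLI (this assumes the Volume CR records its attachment node in status.currentNodeID; the field name may differ between versions):

# Count attached Longhorn volumes per node by their recorded attachment node
kubectl -n longhorn-system get volumes.longhorn.io \
  -o jsonpath='{range .items[*]}{.status.currentNodeID}{"\n"}{end}' | sort | uniq -c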

yurividal commented 4 years ago

Having the same problem on LH 1.0.

(screenshot)

We currently have 3 nodes, and about 20 volumes.

container (1).log

ShadowJonathan commented 4 years ago

This is an issue in my low-power cluster, where the CPU load on my master node spikes to 60-100(!) and causes a cascading failure, as nodes can then no longer reach the master node.

cubic3d commented 4 years ago

Seeing high CPU usage on longhorn-manager here on 1.0.2 too.

devthejo commented 3 years ago

Same problem here on 1.0.2; top process: tgtd.

aleksandrmetik commented 3 years ago

Same problem too

aarlint commented 3 years ago

longhorn-manager goes up to our set limit of 200m and stays there. It only seems to increase on nodes that are running pods with PVs. If we let it, it would eat up to 600m CPU... not sure what to do.

shuo-wu commented 3 years ago

@aarlint I guess you want to set a limit for the pod but actually set a request. If you want to cap the maximum CPU the workload will use, you need to set spec.containers[].resources.limits.cpu.

https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#requests-and-limits
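
As a minimal sketch of the difference, this sets both a request and a limit on the longhorn-manager DaemonSet from the CLI (the values are illustrative only, and a Helm/Rancher-managed install may overwrite direct edits like this):

# Request 20m (scheduling guarantee) but cap usage at 200m (throttling limit)
kubectl -n longhorn-system set resources daemonset/longhorn-manager \
  --requests=cpu=20m --limits=cpu=200m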

aarlint commented 3 years ago

I think I was unclear. I am setting a CPU limit and that works... but I'm not sure what I'm throttling when I do. The default CPU request for longhorn-manager is 20m, but in our case it takes 500m if/when we don't set a CPU limit. Is that normal? What is the longhorn-manager doing?

shuo-wu commented 3 years ago

How much CPU the longhorn manager will use depends on how many volumes and replicas are in your system (a rough way to check this per node is sketched after the link below). Typically, longhorn managers are responsible for:

  1. Creating/updating/deleting the custom resources (CRs, e.g., Longhorn volumes) defined by Longhorn.
  2. Dynamically launching some Longhorn workloads.
  3. Handling recurring volume jobs, etc.

https://longhorn.io/docs/1.1.0/concepts/#11-the-longhorn-manager-and-the-longhorn-engine
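
As a rough way to correlate manager CPU with the volumes attached to each node, something like the following could be used (the status.currentNodeID field and the app=longhorn-manager label are assumptions based on the stock manifests and may vary by version):

# Each Longhorn volume, the node it is currently attached to, and its state
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,NODE:.status.currentNodeID,STATE:.status.state
# CPU usage of the manager pods only, to compare against the attachment counts
kubectl -n longhorn-system top pods -l app=longhorn-manager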

BTW, is there any reason that you need to put a hard CPU limit on longhorn managers?

aarlint commented 3 years ago

If I don't put a hard limit, their CPU usage goes up to over 500m, which really impacts node performance. The requested CPU is 20m in the standard install, so I thought they should use much less than 500m constantly.

shuo-wu commented 3 years ago

20m is the minimum requirement. As I said, the actual resources it uses depend on the number of volumes. I just tried attaching 30 volumes to a node; the longhorn manager pod on that node then eats 300m CPU, while the other longhorn manager pods use only 10m~50m CPU.

longhorn-manager-2kz6p                      2m           32Mi
longhorn-manager-fx4qq                      297m         33Mi
longhorn-manager-jmdht                      3m           29Mi

But if there is only a limited number of volumes attached to the node and the longhorn manager pod on the node eats 500m CPU, then something may be wrong. In this scenario, you can generate a support bundle and send it to us.

aarlint commented 3 years ago

If this is normal/expected behavior then there isn't a problem. We weren't aware the longhorn-manager's CPU was tied directly to volume count.

aarlint commented 3 years ago

Thanks for your help!

kirinnee commented 3 years ago

Hi, I'm still facing this issue, where longhorn-manager takes up quite a bit of CPU. Is the current workaround to put a CPU limit on it?

shuo-wu commented 3 years ago

Hi, I'm still facing this issue, where longhorn-manager takes up quite a bit of CPU. Is the current workaround to put a CPU limit on it?

No. Longhorn manager pods will eat some CPU temporarily when operations like attaching/detaching come in. Do the longhorn manager pods keep taking up high CPU? If YES, something may be wrong.

vaskozl commented 3 years ago

On a node with 10 attachments (no local disk), I'm seeing 600m longhorn-manager usage and 100m instance-manager usage, so a total of 700m, or ~70m per attachment. This is on arm64 at a 2 GHz clock rate.

Is ~70m per attachment about what should be expected? I have to say it does seem rather high, since most of the pods that require these attachments use much less than ~70m, so Longhorn ends up being the biggest CPU user for my applications.

shuo-wu commented 3 years ago

On a node with 10 attachments (no local disk), I'm seeing 600m longhorn-manager usage and 100m instance-manager usage, so a total of 700m, or ~70m per attachment. This is on arm64 at a 2 GHz clock rate.

Is ~70m per attachment about what should be expected? I have to say it does seem rather high, since most of the pods that require these attachments use much less than ~70m, so Longhorn ends up being the biggest CPU user for my applications.

The processes in the instance manager do the essential volume I/O; 100m for 10 volumes is definitely not enough. Considering that the total CPU is 2000m, and allowing for further growth, 300~400m might be the right choice. And as I mentioned, you don't need to set a limit for longhorn managers since most of the time they won't eat a lot of CPU. The resource requirement investigation (for the instance manager): https://github.com/longhorn/longhorn/issues/1691
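
If you do want to reserve CPU for the instance managers, Longhorn exposes this through a setting (called "Guaranteed Engine CPU" in releases around this time; the name and granularity vary by version). A hedged way to find and inspect it from the CLI:

# List Longhorn settings and look for the CPU reservation setting (names vary by version)
kubectl -n longhorn-system get settings.longhorn.io | grep -i cpu
# Inspect the current value (setting name assumed; confirm it against the output above)
kubectl -n longhorn-system get settings.longhorn.io guaranteed-engine-cpu -o yaml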

vaskozl commented 3 years ago

In my situation most volumes are pretty much idle, and longhorn-manager uses 6x more CPU than instance-manager on the given node. I don't have either of them limited. Are you suggesting that longhorn-manager can be limited and instance-manager should not be? I'm OK with 100m usage, but when Longhorn is the majority CPU user, cumulatively using almost a core with mostly zero I/O, I start to worry about headroom as I add more volumes.

joshimoo commented 3 years ago

@vaskozl Would you mind sharing a support bundle with us? You can attach it here or send an email to longhorn-support-bundle@rancher.com with a reference to this thread and your cluster environment (CPU, memory, disk type, network, dedicated control plane, etc.).

shuo-wu commented 3 years ago

In my situation most volumes are pretty much idle, and longhorn-manager uses 6x more CPU than instance-manager on the given node. I don't have either of them limited. Are you suggesting that longhorn-manager can be limited and instance-manager should not be? I'm OK with 100m usage, but when Longhorn is the majority CPU user, cumulatively using almost a core with mostly zero I/O, I start to worry about headroom as I add more volumes.

I mean, ideally, you don't need to limit either of them. You do, however, need to guarantee the CPU resources for the instance managers. Given that you said longhorn-manager uses 6x more CPU than instance-manager, something may be wrong. You can send us a support bundle along with the CPU-related info (the support bundle doesn't contain CPU usage information).

ChipWolf commented 3 years ago

With around 250 Mbps of reads on a volume, we see the Longhorn engine using a good 6 full cores (12 threads) of a Xeon E-2288G.

ChipWolf commented 3 years ago

Message from syslogd@ns31354176 at Jul 15 18:21:28 ...
 kernel:[2920835.996469] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [longhorn:1771918]

While transferring the contents of a volume, we're getting system lockups caused by Longhorn.

joshimoo commented 3 years ago

@ChipWolf Were there replica rebuilds running at the time of the above message? Can you include the full kernel call trace that was printed after the kernel:[2920835.996469] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [longhorn:1771918] line? It should look something like this:

kernel: BUG: soft lockup - CPU#X stuck for Ys! [longhorn:...]
... ...
kernel: Call Trace:
kernel: [...] :
kernel: [...] :
kernel: [...] :
ChipWolf commented 3 years ago

@ChipWolf Were there replica rebuilds running at the time of the above message? Can you include the full kernel call trace that was printed after the kernel:[2920835.996469] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [longhorn:1771918] line? It should look something like this:

kernel: BUG: soft lockup - CPU#X stuck for Ys! [longhorn:...]
... ...
kernel: Call Trace:
kernel: [...] :
kernel: [...] :
kernel: [...] :

If I witness it again, I'll grab the logs; we don't aggregate syslog right now. Thanks.