[BUG] Instance-manager uses high CPU on one node

Smartich0ke commented 2 weeks ago

Describe the bug

I have 3 nodes in my k3s cluster and one of them has very high CPU usage because of longhorn. I can hear the fans revving up like a jet engine on the node. Longhorn has been fine recently, so this is kind of out of the blue.

Nothing seems unusual in the logs, and I've only been able to narrow it down to the instance-manager on that one node. Longhorn is also using quite a lot of RAM, around 2-3GB on each node, but this has always been the case so I suspect that is a different issue. I have tried fully rebooting the node and the issue still remains.

The problem started when I was running ~~1.6.3~~ 1.6.2, so I updated to 1.7.0 to see if that would fix the issue, but it didn't. The instance manager uses a fair bit less CPU than it did before, but still an abnormally high amount.

I have tried fully rebooting the node several times, and it has not solved the problem.

I ran `top` inside the instance-manager pod, and here is a picture of the output: Screenshot from 2024-08-24 22-05-19

And here are some grafana screenshots showing the high CPU usage

To Reproduce

Nothing to reproduce. I just let longhorn start normally and it happens.

Expected behavior

That instance-manager uses a more reasonable amount of CPU, like other nodes.

Support bundle for troubleshooting

supportbundle_1abee738-58cb-4a06-8e8b-3ceed44e6282_2024-08-24T22-18-15Z.zip

Environment

Longhorn version: 1.7.0
Impacted volume (PV): None, the PVs continue to work normally.
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s (v1.29.6+k3s2)
- Number of control plane nodes in the cluster: 1, no HA
- Number of worker nodes in the cluster: 2
Node config
- OS type and version: Debian 12
- Kernel version: Linux 6.1.0-22-amd64
- CPU per node: 4 cores allocated
- Memory per node: 8GB allocated for now, planning to allocate more in the future.
- Disk type (e.g. SSD/NVMe/HDD): NVMe SSD
- Network bandwidth between the nodes (Gbps): gigabit
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): QEMU/Proxmox
Number of Longhorn volumes in the cluster: 28

derekbit commented 2 weeks ago

> The problem started when I was running 1.6.3

Is it 1.6.2? Did you see the issues in v1.6.1 or v1.6.0?

derekbit commented 2 weeks ago

There are some replica rebuilding events in the support bundle. Can you help check whether the high CPU usage of the instance-manager pod remains when no replica rebuild is in progress?

Smartich0ke commented 2 weeks ago

Yes, sorry, I meant 1.6.2. I started having the problem in 1.6.2 and then updated to 1.7.0 to see if it would go away. I rebooted the node once again and waited until everything settled down. Eventually the high CPU usage stopped, and it is back to normal.

derekbit commented 2 weeks ago

@Smartich0ke Thanks for the update. When the issue happens again, could you keep the environment as it is so that we can collect more information from your cluster? Thank you.

PhanLe1010 commented 2 weeks ago

@Smartich0ke When you reboot the node, Longhorn needs to rebuild the replicas on the newly rebooted node, so CPU usage on that node will temporarily be high. After all 28 replicas are rebuilt, CPU usage on that node should go down. If it remains high, that is a problem. Please ping us if it remains high.
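
To tell "still rebuilding" apart from "rebuilt but CPU is still high", something like the following rough sketch might help. It assumes `kubectl` is on the PATH, that Longhorn is installed in the `longhorn-system` namespace, and that each Volume custom resource reports `status.state` and `status.robustness`; the function names are made up for illustration. Note that a volume that is not `healthy` is not necessarily rebuilding, so this is only a first check before looking at CPU graphs.

```python
import json
import subprocess


def volumes_not_healthy(volume_list: dict) -> list[str]:
    """Return names of attached volumes whose robustness is not 'healthy'.

    `volume_list` is the parsed output of
    `kubectl get volumes.longhorn.io -o json` (assumed shape).
    """
    pending = []
    for item in volume_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        status = item.get("status", {})
        # Detached volumes run no replicas, so they are skipped here.
        if status.get("state") != "attached":
            continue
        if status.get("robustness") != "healthy":
            pending.append(name)
    return pending


def fetch_volumes(namespace: str = "longhorn-system") -> dict:
    """Fetch Longhorn Volume resources via kubectl (namespace is an assumption)."""
    result = subprocess.run(
        ["kubectl", "-n", namespace, "get", "volumes.longhorn.io", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    pending = volumes_not_healthy(fetch_volumes())
    if pending:
        print("Volumes not yet healthy (high CPU may be expected):")
        for name in pending:
            print(f"  {name}")
    else:
        print("All attached volumes are healthy; high CPU now is worth reporting.")
```

If the list is empty and the instance-manager is still using a lot of CPU, that is the situation worth reporting.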

Smartich0ke commented 2 weeks ago

Ok thanks for the help guys. I will report if it happens again. Closing for now.
