openebs-archive / node-disk-manager

Kubernetes Storage Device Management
https://openebs.io/docs
Apache License 2.0

NDM looping constantly causing high cpu usage with `Error: unreachable state` #674

Open magnetised opened 2 years ago

magnetised commented 2 years ago

**What steps did you take and what happened:** I've just installed OpenEBS as part of k0s on an AWS EC2 instance with two disks: the host disk and a separate EBS data volume. Everything seems to be working fine, but one of the NDM pods sits at a constant 20% CPU usage. Looking at the logs, it appears to be stuck in a loop querying the host/node disks.

On another server with the same NDM version but a simpler, single-disk setup, the exact same thing is happening.

**What did you expect to happen:** I expected the NDM process not to spin in a loop, consuming CPU continuously.

**The output of the following commands will help us better understand what's going on:** [Pasting long output into a GitHub gist or other pastebin is fine.]

* `kubectl get blockdevices -n openebs -o yaml`

```
apiVersion: v1
items:
```


* `kubectl logs <ndm daemon pod name> -n openebs`

Just including two iterations of the loop; the log goes on like this permanently.

https://gist.github.com/magnetised/c1f2bef4242b663721d87898f8416d65

* `lsblk` from nodes where ndm daemonset is running 

```
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme1n1       259:0    0  128G  0 disk
└─nvme1n1p1   259:4    0  128G  0 part /var/openebs
nvme0n1       259:1    0  128G  0 disk
├─nvme0n1p1   259:2    0    1M  0 part
└─nvme0n1p2   259:3    0  128G  0 part /
```


**Environment:**
- OpenEBS version

`openebs.io/version=3.0.0`
`node-disk-manager:1.7.0`

- Kubernetes version (use `kubectl version`):

```
Client Version: v1.24.2
Kustomize Version: v4.5.4
Server Version: v1.23.6+k0s
```


- Kubernetes installer & version:

k0s version v1.23.6+k0s.0

- Cloud provider or hardware configuration:

AWS EC2 instance
- Type of disks connected to the nodes (e.g. Virtual Disks, GCE/EBS Volumes, Physical drives etc.):

host root disk `nvme0n1`
EBS data volume `nvme1n1` with a single partition `nvme1n1p1` mounted at `/var/openebs`

- OS (e.g. from `/etc/os-release`):

NAME="Red Hat Enterprise Linux" VERSION="8.6 (Ootpa)" ID="rhel" ID_LIKE="fedora" VERSION_ID="8.6" PLATFORM_ID="platform:el8" PRETTY_NAME="Red Hat Enterprise Linux 8.6 (Ootpa)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos" HOME_URL="https://www.redhat.com/" DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/" BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8" REDHAT_BUGZILLA_PRODUCT_VERSION=8.6 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="8.6"

artem-zinnatullin commented 1 year ago

Exact same issue with a vanilla k0s v1.27.2+k0s.0 installation with the OpenEBS extension enabled (openebs/node-disk-manager:1.9.0): NDM consumes over 60% CPU while the cluster is completely idle, with no PVs at all. This is really bad.

gervaso commented 1 year ago

Hi, we had the same issue on-premises, and it turned out to be caused by the presence of `/dev/sr1` on the VM, so I think the filter should be updated to exclude unusable devices.
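One workaround along these lines is to extend NDM's `path-filter` so the offending device is ignored entirely. Below is a minimal sketch, assuming the default `openebs-ndm-config` ConfigMap layout shipped with the OpenEBS operator manifests (key names and the default exclude list can vary between releases); note that the stock exclude list covers `sr0` but not `sr1`, which is how a second CD-ROM device can slip through:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openebs-ndm-config   # default name in the operator manifests
  namespace: openebs
data:
  node-disk-manager.config: |
    filterconfigs:
      # Only the path filter is shown; the full default ConfigMap also
      # carries probeconfigs plus the os-disk-exclude and vendor filters.
      - key: path-filter
        name: path filter
        state: true
        include: ""
        # The default list excludes sr0 but not sr1; the path filter does
        # substring matching, so the broader "/dev/sr" entry makes NDM
        # skip every CD-ROM device node.
        exclude: "loop,fd0,sr0,/dev/ram,/dev/dm-,/dev/md,/dev/rbd,/dev/zd,/dev/sr"
```

NDM reads this config at startup, so the daemonset pods (typically named `openebs-ndm`) need to be restarted after editing the ConfigMap for the new filter to take effect.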