openebs-archive / node-disk-manager

Kubernetes Storage Device Management
https://openebs.io/docs
Apache License 2.0

NDM looping constantly causing high cpu usage with `Error: unreachable state` #674

Open magnetised opened 2 years ago

magnetised commented 2 years ago

**What steps did you take and what happened:** I've just installed OpenEBS as part of k0s on an AWS EC2 instance with two disks: the host disk and a separate EBS data volume. Everything seems to be working fine, but one of the NDM pods sits at a constant 20% CPU usage. Looking at the logs, it appears to be stuck in a loop querying the host/node disks.

Looking at another server with the same ndm version but a simpler, single-disk setup, the exact same thing is happening.
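
To confirm it's the NDM daemonset pod specifically, a quick check (a sketch; assumes metrics-server is installed in the cluster, and that the daemonset uses the stock `name=openebs-ndm` label from the OpenEBS operator YAML, so adjust the selector to your install):

```sh
# Per-pod CPU/memory; the looping NDM pod shows a steady non-zero
# CPU figure even when the node is otherwise idle.
kubectl top pod -n openebs -l name=openebs-ndm

# Follow the logs of one NDM pod to watch the scan loop repeat.
kubectl logs -n openebs -f <ndm daemon pod name>
```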

**What did you expect to happen:** I expected the NDM process not to be burning CPU in a constant loop.

The output of the following commands will help us better understand what's going on: [Pasting long output into a GitHub gist or other pastebin is fine.]

* `kubectl get blockdevices -n openebs -o yaml`

```yaml
apiVersion: v1
items:
```


* `kubectl logs <ndm daemon pod name> -n openebs`

The gist below includes just two iterations of the loop; it repeats like this indefinitely.

https://gist.github.com/magnetised/c1f2bef4242b663721d87898f8416d65

* `lsblk` from nodes where ndm daemonset is running 

```
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme1n1     259:0    0  128G  0 disk
└─nvme1n1p1 259:4    0  128G  0 part /var/openebs
nvme0n1     259:1    0  128G  0 disk
├─nvme0n1p1 259:2    0    1M  0 part
└─nvme0n1p2 259:3    0  128G  0 part /
```


**Environment:**
- OpenEBS version

`openebs.io/version=3.0.0`
`node-disk-manager:1.7.0`

- Kubernetes version (use `kubectl version`):

```
Client Version: v1.24.2
Kustomize Version: v4.5.4
Server Version: v1.23.6+k0s
```


- Kubernetes installer & version:

k0s version v1.23.6+k0s.0

- Cloud provider or hardware configuration:

AWS EC2 instance
- Type of disks connected to the nodes (eg: Virtual Disks, GCE/EBS Volumes, Physical drives etc)

host root disk `nvme0n1`
OpenEBS data volume (EBS) `nvme1n1` with a single partition `nvme1n1p1` mounted at `/var/openebs`

- OS (e.g. from `/etc/os-release`):

NAME="Red Hat Enterprise Linux" VERSION="8.6 (Ootpa)" ID="rhel" ID_LIKE="fedora" VERSION_ID="8.6" PLATFORM_ID="platform:el8" PRETTY_NAME="Red Hat Enterprise Linux 8.6 (Ootpa)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos" HOME_URL="https://www.redhat.com/" DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/" BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8" REDHAT_BUGZILLA_PRODUCT_VERSION=8.6 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="8.6"

artem-zinnatullin commented 1 year ago

Exact same issue with a vanilla k0s v1.27.2+k0s.0 installation with the OpenEBS extension enabled (openebs/node-disk-manager:1.9.0). It consumes over 60% CPU while idle, with no PVs or anything else provisioned. This is really bad.
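
One way to see whether some device on the node keeps emitting events that NDM reacts to (a sketch; run directly on the node, assumes `udevadm` is available there):

```sh
# Watch udev block-device events in real time; a device that shows up
# repeatedly here is a likely trigger for NDM's rescan loop.
udevadm monitor --udev --subsystem-match=block

# Then inspect the properties of any suspect device.
udevadm info --query=property --name=/dev/<suspect-device>
```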

gervaso commented 10 months ago

Hi, we had the same issue on-premise. It was caused by the presence of `/dev/sr1` on the VM, so I think the filter should be updated to exclude unusable devices like this.
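
For reference, NDM's device filters are driven by its ConfigMap, so devices like this can be excluded without a code change. A minimal sketch (assuming the stock `openebs-ndm-config` name and `node-disk-manager.config` key from the OpenEBS operator YAML; key names can vary between chart versions):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: openebs-ndm-config
  namespace: openebs
data:
  node-disk-manager.config: |
    filterconfigs:
      # The path filter drops devices whose paths match an exclude entry.
      - key: path-filter
        name: path filter
        state: true
        include: ""
        # Using "/dev/sr" (rather than "/dev/sr0") should also cover /dev/sr1.
        exclude: "/dev/loop,/dev/fd0,/dev/sr,/dev/ram,/dev/dm-,/dev/md"
```

NDM reads this config at startup, so the daemonset pods need a restart (e.g. by deleting them) for a change to take effect.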