kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Couldn't collect kubelet_volume_* metrics of broken volume using cephfs. #126475

Open sugaf1204 opened 2 months ago

sugaf1204 commented 2 months ago

What happened?

A broken cephfs volume exists, but no kubelet_volume_* metrics are emitted for it.

Because the kubelet_volume_stats_health_status_abnormal metric is missing, I can't detect the volume's health.

What did you expect to happen?

The kubelet_volume_stats_health_status_abnormal metric should be emitted for broken volumes, for example:

kubelet_volume_stats_health_status_abnormal{namespace="default",persistentvolumeclaim="broken"} 1

How can we reproduce it (as minimally and precisely as possible)?

  1. Enable the CSIVolumeHealth feature gate.
  2. Prepare PVCs that are provisioned by cephfs and are unhealthy.
  3. Retrieve the kubelet metrics and verify that there are no metrics for the broken PVCs.
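Step 3 can be checked mechanically. The following is a minimal sketch, assuming the kubelet metrics have already been fetched as Prometheus text (for example with `kubectl get --raw "/api/v1/nodes/<node>/proxy/metrics"`). The helper name `hasHealthMetric` and the sample input are illustrative only and are not part of Kubernetes.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasHealthMetric reports whether the Prometheus text exposition in metrics
// contains a kubelet_volume_stats_health_status_abnormal sample for the given
// namespace and PVC. It is a hypothetical helper that does plain substring
// matching on label pairs; it is not a full exposition-format parser.
func hasHealthMetric(metrics, namespace, pvc string) bool {
	const name = "kubelet_volume_stats_health_status_abnormal"
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip comments (# HELP / # TYPE) and unrelated metrics.
		if !strings.HasPrefix(line, name+"{") {
			continue
		}
		if strings.Contains(line, fmt.Sprintf("namespace=%q", namespace)) &&
			strings.Contains(line, fmt.Sprintf("persistentvolumeclaim=%q", pvc)) {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative input: only the healthy PVC is reported, which is the
	// symptom described in this issue.
	sample := `# TYPE kubelet_volume_stats_health_status_abnormal gauge
kubelet_volume_stats_health_status_abnormal{namespace="default",persistentvolumeclaim="healthy"} 0
`
	fmt.Println("healthy reported:", hasHealthMetric(sample, "default", "healthy"))
	fmt.Println("broken reported: ", hasHealthMetric(sample, "default", "broken"))
}
```

With the behavior reported here, the check for the broken PVC comes back false even though the volume is unhealthy.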

Anything else we need to know?

In ceph-csi, when an unhealthy volume is encountered, the NodeGetVolumeStatsResponse contains only the VolumeCondition and no Usage.

In Kubernetes, a NodeGetVolumeStatsResponse is accepted only if it contains Usage.

I think Kubernetes should also accept a response that contains only VolumeCondition, without Usage.

https://github.com/ceph/ceph-csi/blob/e6540989a52212cf9b66672b4aa8fde19d037be6/internal/cephfs/nodeserver.go#L787

https://github.com/kubernetes/kubernetes/blob/2a1d4172e22abb6759b3d2ad21bb09a04eef596d/pkg/volume/csi/csi_client.go#L611-L638
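The mismatch described above can be illustrated with a small sketch. The types below are simplified, hypothetical stand-ins for the CSI messages, and neither function is the actual kubelet code or the actual fix; `strictConvert` only models the reported behavior, and `lenientConvert` models the behavior this issue asks for.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified, hypothetical stand-ins for the CSI NodeGetVolumeStats messages.
type VolumeUsage struct{ Available, Total, Used int64 }

type VolumeCondition struct {
	Abnormal bool
	Message  string
}

type StatsResponse struct {
	Usage     []VolumeUsage
	Condition *VolumeCondition
}

// Metrics is what a caller would turn into kubelet_volume_* samples.
type Metrics struct {
	Usage    []VolumeUsage
	Abnormal *bool // nil means no health information was reported
}

// strictConvert models the reported behavior: a response without Usage is
// rejected, so its VolumeCondition never reaches the metrics.
func strictConvert(r StatsResponse) (*Metrics, error) {
	if len(r.Usage) == 0 {
		return nil, errors.New("response has no usage")
	}
	return lenientConvert(r)
}

// lenientConvert models the requested behavior: a response is accepted if it
// carries Usage, a VolumeCondition, or both.
func lenientConvert(r StatsResponse) (*Metrics, error) {
	if len(r.Usage) == 0 && r.Condition == nil {
		return nil, errors.New("response has neither usage nor volume condition")
	}
	m := &Metrics{Usage: r.Usage}
	if r.Condition != nil {
		abnormal := r.Condition.Abnormal
		m.Abnormal = &abnormal
	}
	return m, nil
}

func main() {
	// What ceph-csi is described as returning for an unhealthy volume.
	broken := StatsResponse{Condition: &VolumeCondition{Abnormal: true, Message: "unhealthy"}}

	_, err := strictConvert(broken)
	fmt.Println("strict error:", err)

	m, err := lenientConvert(broken)
	fmt.Println("lenient abnormal:", m != nil && m.Abnormal != nil && *m.Abnormal, "error:", err)
}
```

Under the strict check the condition-only response is dropped entirely, which matches the missing metric; under the lenient check the abnormal flag survives and could be exported.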

Kubernetes version

```console
$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.10
WARNING: version difference between client (1.30) and server (1.28) exceeds the supported minor version skew of +/-1
```

Cloud provider

baremetal

OS version

```console
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux bm-008 5.15.0-112-generic #122-Ubuntu SMP Thu May 23 07:48:21 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
```

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

- rook-ceph-cluster: chart 1.14.5
- rook-ceph: chart 1.14.5

k8s-ci-robot commented 2 months ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

Adarsh-verma-14 commented 2 months ago

/sig node
/sig storage

xing-yang commented 2 months ago

/assign

kannon92 commented 1 month ago

/remove-sig node

kannon92 commented 1 month ago

I think storage is the correct label for this. I'm not sure what node should do in this case.

Madhu-1 commented 1 month ago

This is fixed in https://github.com/kubernetes/kubernetes/pull/127021. @xing-yang, I think we can close this issue.