kubernetes-sigs / azuredisk-csi-driver

Azure Disk CSI Driver

Failed to stage volume to node #2269

Open xiaoping8385 opened 5 months ago

xiaoping8385 commented 5 months ago

What happened: When we run several pods on the k8s cluster, two of the pods, which run on the same node, do not work.

describe pod

root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl describe pod sample-statefulset-1
Name:             sample-statefulset-1
Namespace:        od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf
Priority:         0
Service Account:  stateful-set-sa
Node:             849574ed-eef0-49a3-aef2-a03c1737cd4a/10.0.16.20
Start Time:       Tue, 09 Apr 2024 13:35:44 +0000
Labels:           app=pv-ex
                  controller-revision-hash=sample-statefulset-b77598d95
                  statefulset.kubernetes.io/pod-name=sample-statefulset-1
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    StatefulSet/sample-statefulset
Containers:
  pv-ex:
    Container ID:   
    Image:          gcr.io/cf-pks-releng-a1/pksrelengci/persistent-example:latest
    Image ID:       
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      LOG_FILE:  /usr/share/pv-ex/log.txt
    Mounts:
      /usr/share/pv-ex from www (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2rs44 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  www:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  www-sample-statefulset-1
    ReadOnly:   false
  kube-api-access-2rs44:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  6m21s (x273 over 18h)  kubelet  MountVolume.MountDevice failed for volume "pvc-53087397-597f-41aa-bccb-e2d8ce450626" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount  97s (x482 over 18h)    kubelet  Unable to attach or mount volumes: unmounted volumes=[www], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition

describe pv:
 kubectl describe pv pvc-53087397-597f-41aa-bccb-e2d8ce450626
Name:              pvc-53087397-597f-41aa-bccb-e2d8ce450626
Labels:            <none>
Annotations:       pv.kubernetes.io/migrated-to: disk.csi.azure.com
                   pv.kubernetes.io/provisioned-by: kubernetes.io/azure-disk
                   volume.kubernetes.io/provisioner-deletion-secret-name: 
                   volume.kubernetes.io/provisioner-deletion-secret-namespace: 
Finalizers:        [kubernetes.io/pv-protection external-attacher/disk-csi-azure-com]
StorageClass:      ci-storage
Status:            Bound
Claim:             od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf/www-sample-statefulset-1
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          1Gi
Node Affinity:     
  Required Terms:  
    Term 0:        topology.disk.csi.azure.com/zone in [westus2-1]
Message:           
Source:
    Type:         AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
    DiskName:     pvc-53087397-597f-41aa-bccb-e2d8ce450626
    DiskURI:      /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
    Kind:         Managed
    FSType:       
    CachingMode:  ReadWrite
    ReadOnly:     false
Events:           <none>

kubectl describe volumeattachment csi-5e61ec3524c386fbcdbfa765f2c0b089c359348ef4961e9d4e77ff0134b12aa7
Name:         csi-5e61ec3524c386fbcdbfa765f2c0b089c359348ef4961e9d4e77ff0134b12aa7
Namespace:    
Labels:       <none>
Annotations:  csi.alpha.kubernetes.io/node-id: 849574ed-eef0-49a3-aef2-a03c1737cd4a
API Version:  storage.k8s.io/v1
Kind:         VolumeAttachment
Metadata:
  Creation Timestamp:  2024-04-09T13:35:44Z
  Finalizers:
    external-attacher/disk-csi-azure-com
  Resource Version:  102334
  UID:               15ff9764-3ab6-4c99-b809-f2a26cb08362
Spec:
  Attacher:   disk.csi.azure.com
  Node Name:  849574ed-eef0-49a3-aef2-a03c1737cd4a
  Source:
    Persistent Volume Name:  pvc-53087397-597f-41aa-bccb-e2d8ce450626
Status:
  Attached:  true
  Attachment Metadata:
    LUN:                  2
    Diskiopsreadwrite:    500
    Diskmbpsreadwrite:    60
    Networkaccesspolicy:  AllowAll
    Skuname:              Standard_LRS
Events:                   <none>

csicontroller.log
kubectl logs  csi-azuredisk-controller-b5cd56b5f-85wcc  -n kube-system | grep pvc-53087397-597f-41aa-bccb-e2d8ce450626 
Defaulted container "csi-provisioner" out of: csi-provisioner, csi-attacher, csi-snapshotter, csi-resizer, liveness-probe, azuredisk
I0409 13:35:42.159495       1 controller.go:923] successfully created PV pvc-53087397-597f-41aa-bccb-e2d8ce450626 for PVC www-sample-statefulset-1 and csi volume name /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
I0409 13:35:42.159599       1 controller.go:1442] provision "od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf/www-sample-statefulset-1" class "ci-storage": volume "pvc-53087397-597f-41aa-bccb-e2d8ce450626" provisioned
I0409 13:35:42.169628       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf", Name:"www-sample-statefulset-1", UID:"53087397-597f-41aa-bccb-e2d8ce450626", APIVersion:"v1", ResourceVersion:"102231", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-53087397-597f-41aa-bccb-e2d8ce450626
root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl logs  csi-azuredisk-controller-b5cd56b5f-85wcc  -n kube-system -c azuredisk | grep pvc-53087397-597f-41aa-bccb-e2d8ce450626 
I0409 13:35:39.617879       1 utils.go:78] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.disk.csi.azure.com/zone":"westus2-1","topology.kubernetes.io/zone":"westus2-1"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-2","topology.kubernetes.io/zone":"westus2-2"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-3","topology.kubernetes.io/zone":"westus2-3"}}],"requisite":[{"segments":{"topology.disk.csi.azure.com/zone":"westus2-3","topology.kubernetes.io/zone":"westus2-3"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-2","topology.kubernetes.io/zone":"westus2-2"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-1","topology.kubernetes.io/zone":"westus2-1"}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-53087397-597f-41aa-bccb-e2d8ce450626","parameters":{"csi.storage.k8s.io/pv/name":"pvc-53087397-597f-41aa-bccb-e2d8ce450626","csi.storage.k8s.io/pvc/name":"www-sample-statefulset-1","csi.storage.k8s.io/pvc/namespace":"od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf","kind":"managed","storageaccounttype":"Standard_LRS"},"volume_capabilities":[{"AccessType":{"Mount":{}},"access_mode":{"mode":7}}]}
I0409 13:35:39.618148       1 controllerserver.go:198] begin to create azure disk(pvc-53087397-597f-41aa-bccb-e2d8ce450626) account type(Standard_LRS) rg(obsidian-flier) location(westus2) size(1) diskZone(westus2-1) maxShares(0)
I0409 13:35:39.618168       1 azure_managedDiskController.go:98] azureDisk - creating new managed Name:pvc-53087397-597f-41aa-bccb-e2d8ce450626 StorageAccountType:Standard_LRS Size:1
I0409 13:35:42.158743       1 azure_managedDiskController.go:288] azureDisk - created new MD Name:pvc-53087397-597f-41aa-bccb-e2d8ce450626 StorageAccountType:Standard_LRS Size:1
I0409 13:35:42.158780       1 controllerserver.go:294] create azure disk(pvc-53087397-597f-41aa-bccb-e2d8ce450626) account type(Standard_LRS) rg(obsidian-flier) location(westus2) size(1) tags(map[kubernetes.io-created-for-pv-name:pvc-53087397-597f-41aa-bccb-e2d8ce450626 kubernetes.io-created-for-pvc-name:www-sample-statefulset-1 kubernetes.io-created-for-pvc-namespace:od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf]) successfully
I0409 13:35:42.158843       1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=2.540643032 request="azuredisk_csi_driver_controller_create_volume" resource_group="obsidian-flier" subscription_id="066b2cf2-eca1-4e23-98b4-ce94c72b99b3" source="disk.csi.azure.com" volumeid="/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626" result_code="succeeded"
I0409 13:35:42.158867       1 utils.go:84] GRPC response: {"volume":{"accessible_topology":[{"segments":{"topology.disk.csi.azure.com/zone":"westus2-1"}}],"capacity_bytes":1073741824,"content_source":{"Type":null},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-53087397-597f-41aa-bccb-e2d8ce450626","csi.storage.k8s.io/pvc/name":"www-sample-statefulset-1","csi.storage.k8s.io/pvc/namespace":"od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf","kind":"Managed","requestedsizegib":"1","storageaccounttype":"Standard_LRS"},"volume_id":"/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626"}}
I0409 13:35:44.853034       1 utils.go:78] GRPC request: {"node_id":"849574ed-eef0-49a3-aef2-a03c1737cd4a","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"cachingmode":"ReadWrite","fstype":"","kind":"Managed"},"volume_id":"/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626"}
I0409 13:35:44.891972       1 controllerserver.go:401] GetDiskLun returned: cannot find Lun for disk pvc-53087397-597f-41aa-bccb-e2d8ce450626. Initiating attaching volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a (vmState Succeeded).
I0409 13:35:44.892007       1 controllerserver.go:426] Trying to attach volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a
I0409 13:35:44.892032       1 azure_controller_common.go:255] wait 1000ms for more requests on node 849574ed-eef0-49a3-aef2-a03c1737cd4a, current disk attach: /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
I0409 13:35:45.892758       1 azure_controller_common.go:269] Trying to attach volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 lun 2 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a, diskMap len:1, map[/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourcegroups/obsidian-flier/providers/microsoft.compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626:%!s(*provider.AttachDiskOptions=&{ReadWrite pvc-53087397-597f-41aa-bccb-e2d8ce450626  false 2})]
I0409 13:35:45.892866       1 azure_controller_standard.go:103] azureDisk - update(obsidian-flier): vm(849574ed-eef0-49a3-aef2-a03c1737cd4a) - attach disk list(map[/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourcegroups/obsidian-flier/providers/microsoft.compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626:%!s(*provider.AttachDiskOptions=&{ReadWrite pvc-53087397-597f-41aa-bccb-e2d8ce450626  false 2})])
I0409 13:35:46.399029       1 azure_controller_standard.go:116] azureDisk - update(obsidian-flier): vm(849574ed-eef0-49a3-aef2-a03c1737cd4a) - attach disk list(map[/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourcegroups/obsidian-flier/providers/microsoft.compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626:%!s(*provider.AttachDiskOptions=&{ReadWrite pvc-53087397-597f-41aa-bccb-e2d8ce450626  false 2})]) returned with <nil>
I0409 13:35:56.607788       1 azure_controller_common.go:558] azureDisk - find disk: lun 2 name pvc-53087397-597f-41aa-bccb-e2d8ce450626 uri /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
I0409 13:35:56.607819       1 controllerserver.go:436] Attach operation successful: volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 attached to node 849574ed-eef0-49a3-aef2-a03c1737cd4a.
I0409 13:35:56.607841       1 controllerserver.go:456] attach volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a successfully
I0409 13:35:56.607904       1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=11.715902039 request="azuredisk_csi_driver_controller_publish_volume" resource_group="obsidian-flier" subscription_id="066b2cf2-eca1-4e23-98b4-ce94c72b99b3" source="disk.csi.azure.com" volumeid="/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626" node="849574ed-eef0-49a3-aef2-a03c1737cd4a" result_code="succeeded"
root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl logs csi-azuredisk-node-4ts6c -n kube-system | grep  pvc-53087397-597f-41aa-bccb-e2d8ce450626
Defaulted container "liveness-probe" out of: liveness-probe, node-driver-registrar, azuredisk
root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl logs csi-azuredisk-node-4ts6c -n kube-system -c azuredisk| grep  pvc-53087397-597f-41aa-bccb-e2d8ce450626

CSInode.log 
I0410 06:48:42.302185       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly
I0410 06:48:42.302208       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:0 directly
I0410 06:48:42.302213       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:1 directly
I0410 06:48:42.302218       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:2 directly
I0410 06:48:42.302222       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 4:0:1:0 directly
I0410 06:48:42.302226       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 5:0:0:0 directly
I0410 06:48:43.302264       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly
I0410 06:48:43.302287       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:0 directly
I0410 06:48:43.302292       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:1 directly
I0410 06:48:43.302296       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:2 directly
I0410 06:48:43.302300       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 4:0:1:0 directly
I0410 06:48:43.302304       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 5:0:0:0 directly
I0410 06:48:44.302389       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly
I0410 06:48:44.302414       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:0 directly
I0410 06:48:44.302419       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:1 directly
I0410 06:48:44.302423       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:2 directly
I0410 06:48:44.302426       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 4:0:1:0 directly
I0410 06:48:44.302430       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 5:0:0:0 directly
I0410 06:48:44.729773       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0410 06:48:44.729791       1 utils.go:78] GRPC request: {"publish_context":{"LUN":"2","diskiopsreadwrite":"500","diskmbpsreadwrite":"60","networkaccesspolicy":"AllowAll","skuname":"Standard_LRS"},"staging_target_path":"/var/vcap/data/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ad2407d46e6a6d65ad8f32685e88d02e2ff433ef4fb076f20ad79eefded4d947/globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volue_context":{"cachingmode":"ReadWrite","fstype":"","kind":"Managed"},"volume_id":"/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626"}

What you expected to happen: The pod should be running.

How to reproduce it: Happens occasionally; recreating the worker node usually resolves the issue.

Anything else we need to know?:

Environment:

xiaoping8385 commented 5 months ago

After SSH-ing to the worker node, we can see the devices (sdd, sde) are there, but not staged.

worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:~$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
fd0      2:0    1     4K  0 disk 
sda      8:0    0   403G  0 disk 
├─sda1   8:1    0   4.8G  0 part /home
│                                /
└─sda2   8:2    0 398.2G  0 part /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/host-var-log-antrea/antrea-ovs/1
                                 /var/vcap/data/kubelet/pods/9f022aaf-d7b0-477a-b153-9c59404a403f/volume-subpaths/antrea-config/antrea-controller/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/antrea-agent/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/install-cni/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-agent-tweaker-config/antrea-agent-tweaker/0
                                 /var/tmp
                                 /tmp
                                 /opt
                                 /var/opt
                                 /var/log
                                 /var/vcap/data
sdb      8:16   0    64G  0 disk 
└─sdb1   8:17   0    64G  0 part 
sdc      8:32   0    50G  0 disk 
└─sdc1   8:33   0    50G  0 part /var/vcap/store
sdd      8:48   0     1G  0 disk 
sde      8:64   0     1G  0 disk 
sr0     11:0    1   648K  0 rom  

worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:/sys/bus/scsi/devices$ ls -al
total 0
drwxr-xr-x 2 root root 0 Apr  9 08:53 .
drwxr-xr-x 4 root root 0 Apr  9 08:53 ..
lrwxrwxrwx 1 root root 0 Apr  9 08:53 1:0:0:0 -> ../../../devices/pci0000:00/0000:00:07.1/ata2/host1/target1:0:0/1:0:0:0
lrwxrwxrwx 1 root root 0 Apr  9 09:01 2:0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0/2:0:0:0
lrwxrwxrwx 1 root root 0 Apr  9 13:35 2:0:0:1 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0/2:0:0:1
lrwxrwxrwx 1 root root 0 Apr  9 13:35 2:0:0:2 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0/2:0:0:2
lrwxrwxrwx 1 root root 0 Apr  9 08:53 4:0:1:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host4/target4:0:1/4:0:1:0
lrwxrwxrwx 1 root root 0 Apr  9 08:53 5:0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host5/target5:0:0/5:0:0:0
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host0 -> ../../../devices/pci0000:00/0000:00:07.1/ata1/host0
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host1 -> ../../../devices/pci0000:00/0000:00:07.1/ata2/host1
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host2 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host3 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781a-1e82-4818-a1c3-63d806ec15bb/host3
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host4 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host4
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host5 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host5
lrwxrwxrwx 1 root root 0 Apr  9 08:53 target1:0:0 -> ../../../devices/pci0000:00/0000:00:07.1/ata2/host1/target1:0:0
lrwxrwxrwx 1 root root 0 Apr 10 07:21 target2:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0
lrwxrwxrwx 1 root root 0 Apr  9 08:53 target4:0:1 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host4/target4:0:1
lrwxrwxrwx 1 root root 0 Apr  9 08:53 target5:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host5/target5:0:0
andyzhangx commented 5 months ago

azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly

From the error message, it seems you are missing the udev rule on the node. Follow the guide here to generate the udev rule on the node: https://github.com/kubernetes-sigs/azuredisk-csi-driver/issues/1631#issuecomment-1372299254. Azure VMs should have such a built-in rule by default; is your node type special? @xiaoping8385
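
A quick way to check whether such a rule is present and effective is to look for the symlink tree it creates (a minimal sketch; the rule file name 66-azure-storage.rules and its content come from the #1631 comment linked above, not from this thread):

# check whether any Azure storage udev rule is installed on the node
ls /etc/udev/rules.d/ /usr/lib/udev/rules.d/ 2>/dev/null | grep -i azure
# with the rule in place, data disks appear as stable, LUN-keyed symlinks
ls -l /dev/disk/azure/scsi1/     # e.g. lun2 -> ../../../sdd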

xiaoping8385 commented 5 months ago

There is no such rule, and not all of the nodes have the same issue. For example, we found one node that can mount the PV successfully, and it also has no such built-in rule. On the normal node:


> resize2fs /dev/sdd
resize2fs 1.46.5 (30-Dec-2021)
The filesystem is already 262144 (4k) blocks long.  Nothing to do!

worker/722ad0af-9e98-4c46-b47d-44ba282a5960:~$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
fd0      2:0    1     4K  0 disk 
sda      8:0    0    64G  0 disk 
└─sda1   8:1    0    64G  0 part 
sdb      8:16   0   403G  0 disk 
├─sdb1   8:17   0   4.8G  0 part /home
│                                /
└─sdb2   8:18   0 398.2G  0 part /var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/host-var-log-antrea/antrea-ovs/1
                                 /var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/antrea-config/antrea-agent/0
                                 /var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/antrea-config/install-cni/0
                                 /var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/antrea-agent-tweaker-config/antrea-agent-tweaker/0
                                 /var/tmp
                                 /tmp
                                 /opt
                                 /var/opt
                                 /var/log
                                 /var/vcap/data
sdc      8:32   0    50G  0 disk 
└─sdc1   8:33   0    50G  0 part /var/vcap/store
sdd      8:48   0     1G  0 disk /var/vcap/data/kubelet/pods/fa2d11c4-ef95-47ef-b2b8-7b5fbf8ed279/volumes/kubernetes.io~csi/pvc-684f4e73-681d-4164-97d7-5440bafc950b/mount
                                 /var/vcap/data/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/c4d484f06606c7650a1b33528b67e8221f3c74b68b45db9bd301ad189c62a2eb/globalmount
sr0     11:0    1   648K  0 rom  
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:~$ cd /etc/udev/rules.d/
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:/etc/udev/rules.d$ ls
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:/etc/udev/rules.d$ ls -al
total 8
drwxr-xr-x 2 root root 4096 Apr  7  2022 .
drwxr-xr-x 4 root root 4096 Mar 18 12:14 ..
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:/etc/udev/rules.d$ 

On the problematic node:

worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:~$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
fd0      2:0    1     4K  0 disk 
sda      8:0    0   403G  0 disk 
├─sda1   8:1    0   4.8G  0 part /home
│                                /
└─sda2   8:2    0 398.2G  0 part /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/host-var-log-antrea/antrea-ovs/1
                                 /var/vcap/data/kubelet/pods/9f022aaf-d7b0-477a-b153-9c59404a403f/volume-subpaths/antrea-config/antrea-controller/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/antrea-agent/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/install-cni/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-agent-tweaker-config/antrea-agent-tweaker/0
                                 /var/tmp
                                 /tmp
                                 /opt
                                 /var/opt
                                 /var/log
                                 /var/vcap/data
sdb      8:16   0    64G  0 disk 
└─sdb1   8:17   0    64G  0 part 
sdc      8:32   0    50G  0 disk 
└─sdc1   8:33   0    50G  0 part /var/vcap/store
sdd      8:48   0     1G  0 disk 
sde      8:64   0     1G  0 disk 

worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:/var/vcap/bosh_ssh/bosh_79bac66d3614401# resize2fs /dev/sdd
resize2fs 1.46.5 (30-Dec-2021)
resize2fs: Bad magic number in super-block while trying to open /dev/sdd
Couldn't find valid filesystem superblock.

worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:/etc/udev/rules.d# resize2fs /dev/sde
resize2fs 1.46.5 (30-Dec-2021)
resize2fs: Bad magic number in super-block while trying to open /dev/sde
Couldn't find valid filesystem superblock.

Both nodes are created from the same template.

andyzhangx commented 5 months ago

Here is the data disk parsing logic. The reliable way is to create that udev rule on the node, which creates the device links under /dev/disk/azure:

https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/99a117cfe0ff682d364215f6e6ab9e056a5b6339/pkg/azuredisk/azure_common_linux.go#L120-L194
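
The linked code only falls back to scanning /sys/bus/scsi/devices and matching the LUN when the /dev/disk/azure links are absent, which is the "/dev/disk/azure is not populated" path seen in the node log above. A rough shell illustration of the two lookups (a sketch, not the driver code; LUN 2 is taken from the VolumeAttachment above, and the real fallback also filters on the Msft / Virtual Disk vendor and model):

# preferred path, available once the udev rule is in place
readlink -f /dev/disk/azure/scsi1/lun2                   # -> /dev/sdd (or /dev/sde)
# fallback path: walk the SCSI bus and match the LUN (last digit of H:C:T:L)
for d in /sys/bus/scsi/devices/*:*:*:2/block/*; do echo "/dev/${d##*/}"; done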

xiaoping8385 commented 5 months ago

Who should be responsible for creating the udev rule? I thought the VM was created by the Azure cloud. And that does not explain why some of the nodes work and some don't.

andyzhangx commented 5 months ago

Can you try creating the udev rule on one node and check whether it works? @xiaoping8385 If it works, then I can provide you a DaemonSet that creates the udev rule on every node.

xiaoping8385 commented 5 months ago

So I need to create that rules file (66-azure-storage.rules) as described in the #1631 comment, and apply it?

andyzhangx commented 5 months ago

@xiaoping8385 Just create that rules file and it should take effect automatically; then reschedule the pod with the disk volume onto that node.

xiaoping8385 commented 5 months ago

@andyzhangx It still does not work. I deployed the rules file, deleted the pod, and also restarted csi-azuredisk-node, but the pod is still not running. Would it be possible to have a Zoom call or chat?

andyzhangx commented 5 months ago

@xiaoping8385 Can you run the following command on the node and try again?

This command will reload the udev rules and trigger the creation of any new device nodes that match the rules.

sudo udevadm control --reload-rules && sudo udevadm trigger
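
After reloading, the rule should also apply to the disks that are already attached; a quick verification sketch (assuming the rule from the #1631 comment was installed as /etc/udev/rules.d/66-azure-storage.rules):

ls -l /dev/disk/azure/scsi1/                       # expect LUN-keyed symlinks, e.g. lun2 -> ../../../sdd
udevadm test /sys/block/sdd 2>&1 | grep -i azure   # shows whether the azure rules matched this disk
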
xiaoping8385 commented 5 months ago

It finally works, thanks. One more question: how should we prevent this issue in the future, or do we need an extra deployment? What is the root cause of this issue?

andyzhangx commented 5 months ago

It finally works, thanks. One more question: how should we prevent this issue in the future, or do we need an extra deployment? What is the root cause of this issue?

@xiaoping8385 Per this doc, the Linux VM should have the udev rules by default: https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/linux/troubleshoot-device-names-problems#solution

What is your k8s cluster? Are you manually setting up the k8s cluster on Azure Ubuntu VMs?

xiaoping8385 commented 5 months ago

@andyzhangx Sorry for the late reply. Our k8s cluster is actually created by the BOSH director, which has a CPI that talks to the Azure platform to create the VMs. I don't think there are udev rules in our environments; as I mentioned before, we have two nodes without such a rule, but on one node the volume can be staged successfully while on the other it cannot. Are there ways other than udev rules that would help to stage the volume?

andyzhangx commented 5 months ago

@andyzhangx Sorry for the late reply. Our k8s cluster is actually created by the BOSH director, which has a CPI that talks to the Azure platform to create the VMs. I don't think there are udev rules in our environments; as I mentioned before, we have two nodes without such a rule, but on one node the volume can be staged successfully while on the other it cannot. Are there ways other than udev rules that would help to stage the volume?

@xiaoping8385 Using udev rules is the most reliable way to detect a data disk correctly. Are you able to create a DaemonSet to write the udev rules on every node?
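
Such a DaemonSet typically runs a privileged pod on every node that copies the rules file onto the host and reloads udev. A minimal sketch of what that pod's container could run, assuming hostPID: true, a privileged security context, the host root filesystem mounted at /host, and the rule content mounted at /config from a ConfigMap (all of these are assumptions, not taken from this thread):

# copy the rule onto the host and reload udev via the host namespaces
cp /config/66-azure-storage.rules /host/etc/udev/rules.d/66-azure-storage.rules
nsenter --target 1 --mount --uts --ipc --net --pid -- udevadm control --reload-rules
nsenter --target 1 --mount --uts --ipc --net --pid -- udevadm trigger
sleep infinity   # keep the pod running so the DaemonSet stays healthy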

xiaoping8385 commented 4 months ago

Sometimes, yes, since this issue happens only occasionally.

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 3 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten