kubernetes-sigs / azuredisk-csi-driver

Azure Disk CSI Driver
Apache License 2.0

Mounting disks under an NVMe disk controller on Windows fails #2365

Open Flask opened 2 months ago

Flask commented 2 months ago

What happened: Mounting a managed disk on a VM with an NVMe disk controller fails:

I0620 07:48:36.892166    6464 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0620 07:48:36.892166    6464 utils.go:78] GRPC request: {"publish_context":{"LUN":"0"},"staging_target_path":"\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\disk.csi.azure.com\\3a07bbd56bedf026817504b649086872043fb4a71d1a81b17de2e82d86563b52\\globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ntfs"}},"access_mode":{"mode":7}},"volume_context":{"cachingMode":"ReadOnly","csi.storage.k8s.io/pv/name":"pvc-dcdeeaa3-cd7a-40ff-8e4e-3c3bd2430d7b","csi.storage.k8s.io/pvc/name":"mypod","csi.storage.k8s.io/pvc/namespace":"myns","fsType":"ntfs","kind":"Managed","requestedsizegib":"512","skuName":"Premium_LRS","storage.kubernetes.io/csiProvisionerIdentity":"1718807269317-6827-disk.csi.azure.com"},"volume_id":"/subscriptions/<subscription>/resourceGroups/myrg/providers/Microsoft.Compute/disks/pvc-dcdeeaa3-cd7a-40ff-8e4e-3c3bd2430d7b"}

Warning FailedMount 4m49s (x49 over 89m) kubelet MountVolume.MountDevice failed for volume "pvc-dcdeeaa3-cd7a-40ff-8e4e-3c3bd2430d7b" : rpc error: code = Internal desc = failed to find disk on lun 0. azureDisk - findDiskByLun(0) failed with error(could not find disk id for lun: 0)
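For context on the error: the LUN that the node plugin searches for arrives in the publish_context of the NodeStageVolume request (here "LUN":"0"). The following Go sketch shows how that value can be read from the logged request JSON. The helper lunFromRequest and the trimmed-down stageRequest struct are hypothetical illustrations, not the CSI spec's generated types or the driver's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// stageRequest is a hypothetical, trimmed-down view of a NodeStageVolume
// request: only publish_context is decoded, all other fields are ignored.
type stageRequest struct {
	PublishContext map[string]string `json:"publish_context"`
}

// lunFromRequest decodes the request JSON and returns the LUN as an int.
func lunFromRequest(raw []byte) (int, error) {
	var req stageRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		return 0, fmt.Errorf("decode request: %w", err)
	}
	s, ok := req.PublishContext["LUN"]
	if !ok {
		return 0, fmt.Errorf("publish_context has no LUN key")
	}
	return strconv.Atoi(s)
}

func main() {
	// Shortened copy of the request from the log above.
	raw := []byte(`{"publish_context":{"LUN":"0"},"volume_id":"/subscriptions/<subscription>/resourceGroups/myrg/providers/Microsoft.Compute/disks/pvc-dcdeeaa3-cd7a-40ff-8e4e-3c3bd2430d7b"}`)
	lun, err := lunFromRequest(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("LUN:", lun)
}
```

The LUN itself is present and valid; the failure happens afterwards, when no disk on the node can be matched to that LUN.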

What you expected to happen: the PVC is mounted and made available to the pod.

How to reproduce it: attach an Azure disk to a Windows Kubernetes node of VM size Standard_D4alds_v6.

Anything else we need to know?:

Environment:

andyzhangx commented 2 months ago

Does it always reproduce on the Standard_D4alds_v6 Windows VM SKU?

Flask commented 2 months ago

Hey @andyzhangx, I've tried it 4-5 times with different machines in a VMSS. I think something has changed in how managed disks are attached to these VMs. Maybe this helps:

Managed disk on Standard_D96ads_v5:

get-disk

Number Friendly Name     Serial Number HealthStatus OperationalStatus Total Size Partition Style
------ -------------     ------------- ------------ ----------------- ---------- ---------------
...
11     Msft Virtual Disk               Healthy      Online                512 GB GPT
...
ConvertTo-Json @(Get-Disk | select Number, Location)
[
    ...
    {
        "Number":  11,
        "Location":  "Integrated : Adapter 3 : Port 0 : Target 0 : LUN 0"
    },
    ...

On the Standard_D96alds_v6:

Get-Disk

Number Friendly Name              Serial Number                    HealthStatus OperationalStatus Total Size Partition Style
------ -------------              -------------                    ------------ ----------------- ---------- ---------------
...
12     MSFT NVMe Accelerator v1.0 B91B_DB34_FB4F_48EE_AC80_7234... Healthy      Online                512 GB GPT
ConvertTo-Json @(Get-Disk | select Number, Location)
[
    ...
    {
        "Number":  12,
        "Location":  "Integrated : Adapter 0"
    }
    ...

I've removed the unrelated entries to keep it simple and replaced them with "...".
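The two ConvertTo-Json outputs explain why a LUN-based lookup succeeds on the v5 size and fails on the v6 size. The following Go sketch decodes that JSON and builds a LUN to disk-number map from the trailing "LUN <n>" segment of each Location. It is a hypothetical illustration (lunToDiskNumber and lunSuffix are names made up here, not the driver's findDiskByLun), and it assumes the elided "..." entries are dropped so each input is a complete JSON array.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
)

// diskInfo mirrors the two properties selected by
// ConvertTo-Json @(Get-Disk | select Number, Location).
type diskInfo struct {
	Number   int    `json:"Number"`
	Location string `json:"Location"`
}

// lunSuffix matches the trailing "LUN <n>" segment of a SCSI-style Location.
var lunSuffix = regexp.MustCompile(`LUN (\d+)\s*$`)

// lunToDiskNumber maps LUN -> disk number; disks whose Location carries
// no LUN (such as the NVMe disk on the v6 size) are skipped.
func lunToDiskNumber(raw []byte) (map[int]int, error) {
	var disks []diskInfo
	if err := json.Unmarshal(raw, &disks); err != nil {
		return nil, err
	}
	m := make(map[int]int)
	for _, d := range disks {
		match := lunSuffix.FindStringSubmatch(d.Location)
		if match == nil {
			continue
		}
		lun, err := strconv.Atoi(match[1])
		if err != nil {
			return nil, err
		}
		m[lun] = d.Number
	}
	return m, nil
}

func main() {
	samples := []struct{ size, raw string }{
		{"Standard_D96ads_v5", `[{"Number": 11, "Location": "Integrated : Adapter 3 : Port 0 : Target 0 : LUN 0"}]`},
		{"Standard_D96alds_v6", `[{"Number": 12, "Location": "Integrated : Adapter 0"}]`},
	}
	for _, s := range samples {
		m, err := lunToDiskNumber([]byte(s.raw))
		if err != nil {
			fmt.Println(s.size, "error:", err)
			continue
		}
		if disk, ok := m[0]; ok {
			fmt.Printf("%s: LUN 0 -> disk %d\n", s.size, disk)
		} else {
			fmt.Printf("%s: no disk found for LUN 0\n", s.size)
		}
	}
}
```

Run as-is, it reports disk 11 for the v5 sample and no disk for the v6 sample, which matches the "could not find disk id for lun: 0" failure. Keying on the LUN alone is a simplification: on a real node other disks may report a LUN on a different adapter, so a real lookup would probably also need to take the adapter into account.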

andyzhangx commented 2 months ago

@Flask So on Standard_D96alds_v6, is disk number 12 the managed data disk? The Friendly Name of that disk is MSFT NVMe Accelerator v1.0, and it does not have a LUN mapping like the one on Standard_D96ads_v5, e.g. "Location": "Integrated : Adapter 3 : Port 0 : Target 0 : LUN 0"
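To make the difference between the two Location strings concrete, the following Go sketch splits a Get-Disk Location value into its "<Name> <number>" segments. parseLocation is a hypothetical helper written only for this comparison; it shows that the v6 NVMe disk exposes just an adapter number, so there is no LUN to match against the LUN from publish_context.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLocation extracts "<Name> <number>" segments (Adapter, Port, Target,
// LUN) from a Location string. Segments without a number are ignored, and
// fields missing from the string are simply absent from the map.
func parseLocation(loc string) map[string]int {
	fields := make(map[string]int)
	for _, part := range strings.Split(loc, ":") {
		words := strings.Fields(part)
		if len(words) != 2 {
			continue // e.g. "Integrated" carries no number
		}
		n, err := strconv.Atoi(words[1])
		if err != nil {
			continue
		}
		fields[words[0]] = n
	}
	return fields
}

func main() {
	for _, loc := range []string{
		"Integrated : Adapter 3 : Port 0 : Target 0 : LUN 0", // Standard_D96ads_v5
		"Integrated : Adapter 0",                             // Standard_D96alds_v6 (NVMe)
	} {
		f := parseLocation(loc)
		_, hasLUN := f["LUN"]
		fmt.Printf("%q -> %v, has LUN: %v\n", loc, f, hasLUN)
	}
}
```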

Flask commented 2 months ago

Exactly. The storage class is the same in both cases:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd-ntfs
parameters:
  cachingMode: ReadOnly
  fsType: ntfs
  kind: managed
  skuName: Premium_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

andyzhangx commented 2 months ago

@Flask I think something is wrong with the Windows VM's internal configuration for this VM SKU. Could you file a support ticket with the Azure Windows VM team? Thanks.

andyzhangx commented 2 months ago

On Linux, there should be a udev rule that detects data disks automatically, see https://github.com/kubernetes-sigs/azuredisk-csi-driver/issues/2034#issuecomment-1854095537
I think Windows VMs of this SKU should also have a similar mechanism.

andyzhangx commented 1 month ago

FYI: NVMe disks are already supported on Linux nodes as of the v1.30.3 release. We still need to figure out how to get the <lun, disk-num> mapping on Windows nodes.