kubernetes-sigs / azurefile-csi-driver

Azure File CSI Driver

RemoveSmbGlobalMapping makes Windows node very slow #1925

Open · Rob19999 opened this issue 3 months ago

Rob19999 commented 3 months ago

What happened: Due to RemoveSmbGlobalMapping cleaning up a lot of mounts at the same time (see below), either the pod gets overloaded or the storage account hits some sort of rate limiter, because we're seeing high-latency errors. We tried validating this on the storage account, but everything seems fine there.

Windows nodes can no longer create mounts; a node becomes unusable and needs to be cordoned until the process is done unmounting. No new mounts can be created during this process. We expect this to be the cause.

How to reproduce it: Hard to say. We suspect it's a load issue where we create/delete volumes in a very short time, but we cannot reliably reproduce it. We then hit high latency: at times we see latencies of 10, 30, 160, and 210 seconds multiple times, and nodes become unusable.

Anything else we need to know?: This seems to have been introduced by https://github.com/kubernetes-sigs/azurefile-csi-driver/pull/1847. We also have a case open with Microsoft support, but that is slow going. The ticket was started on 7 June.

Expected start time of the issue: 7 June 2024, 20:00 UTC+. It seems to correlate with the new CSI version.

Storage class: `spec: accessModes:`

StorageClass settings are: provisioner: file.csi.azure.com, parameters: skuName: Standard_LRS, reclaimPolicy: Delete, mountOptions:
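For readability, the StorageClass settings above roughly correspond to the manifest below. This is a reconstruction for illustration only: the metadata name is a placeholder, and the mount options, which are cut off in the original report, are omitted.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-standard      # placeholder name, not from the report
provisioner: file.csi.azure.com
parameters:
  skuName: Standard_LRS
reclaimPolicy: Delete
# mountOptions: listed in the original report but truncated there
```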

Environment:

`I0618 10:20:01.948928 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-d54bf4ed-647d-480a-8de5-3f647ca1126f I0618 10:20:02.184149 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f9d0b87889d2774ec5e434e4f21da3a34b60667e8473ff91641328c209ba2c8f\globalmount I0618 10:20:02.184149 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-ff818ba6-ec9f-4486-80e4-7cc61350f6d5 I0618 10:20:02.422931 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f9d0b87889d2774ec5e434e4f21da3a34b60667e8473ff91641328c209ba2c8f\globalmount I0618 10:20:02.422931 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-b034ea6e-80a0-4b98-a036-a7557c1d2595 I0618 10:20:02.697686 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f9d0b87889d2774ec5e434e4f21da3a34b60667e8473ff91641328c209ba2c8f\globalmount I0618 10:20:02.697686 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-f9f952d8-fd7b-423b-8384-3230cbbef2d2 I0618 10:20:03.701545 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-73aa6b77-becf-49d0-bfcc-203c68b89695 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\68be1105184c280e13b3b149afbfbe50dfe12a8428a23aabbccdb60143461c61\globalmount I0618 10:20:03.702218 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-73aa6b77-becf-49d0-bfcc-203c68b89695### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\68be1105184c280e13b3b149afbfbe50dfe12a8428a23aabbccdb60143461c61\globalmount successfully I0618 10:20:03.702218 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=199.0857827 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-73aa6b77-becf-49d0-bfcc-203c68b89695###" result_code="succeeded" I0618 10:20:03.702218 10816 utils.go:108] GRPC response: {} I0618 10:20:03.989377 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f9d0b87889d2774ec5e434e4f21da3a34b60667e8473ff91641328c209ba2c8f\globalmount I0618 10:20:03.989377 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-373137fb-16fa-4445-9526-fa69d783ec00 I0618 10:20:04.286036 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f9d0b87889d2774ec5e434e4f21da3a34b60667e8473ff91641328c209ba2c8f\globalmount I0618 10:20:04.286036 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-e465606b-ee96-42f4-9108-266751f62cc0 I0618 10:20:04.533291 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e on local path 
C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f9d0b87889d2774ec5e434e4f21da3a34b60667e8473ff91641328c209ba2c8f\globalmount I0618 10:20:04.533291 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-9cc1f701-08df-45fa-9a63-6316ce0f4521 I0618 10:20:04.770168 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f9d0b87889d2774ec5e434e4f21da3a34b60667e8473ff91641328c209ba2c8f\globalmount I0618 10:20:04.770198 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-57d12ee1-f847-44ec-830b-e6375971a791 I0618 10:20:05.068931 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f9d0b87889d2774ec5e434e4f21da3a34b60667e8473ff91641328c209ba2c8f\globalmount I0618 10:20:05.068931 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-0eb2b4ab-9f0d-48e8-81e0-1639f3b8f26a I0618 10:20:05.474311 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f9d0b87889d2774ec5e434e4f21da3a34b60667e8473ff91641328c209ba2c8f\globalmount I0618 10:20:05.474311 10816 smb.go:62] begin to run RemoveSmbGlobalMapping with \.file.core.windows.net\pvc-f3e3de62-8522-473b-99b3-ee77c7492d9c I0618 10:20:06.699287 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-72761240-0540-421b-8dab-fe0715219f8e local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\c12f70176ba6695ba94e730da66a9fa3598f85b824f9855ae55ee54ed85492cf\globalmount I0618 10:20:06.699287 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-72761240-0540-421b-8dab-fe0715219f8e### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\c12f70176ba6695ba94e730da66a9fa3598f85b824f9855ae55ee54ed85492cf\globalmount successfully I0618 10:20:06.699287 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=202.2881918 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-72761240-0540-421b-8dab-fe0715219f8e###" result_code="succeeded" I0618 10:20:06.699287 10816 utils.go:108] GRPC response: {} I0618 10:20:07.722251 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-79ecdf05-1c90-47d4-826a-77f19a8ef701 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\36cadd6f3c6efd50be224333d28fce0715ebf57ccb46583c9f572e61f4b74455\globalmount I0618 10:20:07.722251 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-79ecdf05-1c90-47d4-826a-77f19a8ef701###-suite on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\36cadd6f3c6efd50be224333d28fce0715ebf57ccb46583c9f572e61f4b74455\globalmount successfully I0618 10:20:07.722251 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=203.2887571 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-79ecdf05-1c90-47d4-826a-77f19a8ef701###-suite" result_code="succeeded" I0618 10:20:07.722251 10816 utils.go:108] GRPC response: {} I0618 10:20:08.828625 10816 
safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-4ac50876-5989-405a-9143-5d587fcd2988 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\034c21971c3ececa002bdf34aff6700c4d8158d26011a278e01b72ccda712ece\globalmount I0618 10:20:08.829524 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-4ac50876-5989-405a-9143-5d587fcd2988### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\034c21971c3ececa002bdf34aff6700c4d8158d26011a278e01b72ccda712ece\globalmount successfully I0618 10:20:08.829598 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=204.3185916 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-4ac50876-5989-405a-9143-5d587fcd2988###" result_code="succeeded" I0618 10:20:08.829632 10816 utils.go:108] GRPC response: {} I0618 10:20:09.902257 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-0242f32f-f56b-4cb9-b7da-7226cc26b57c local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\738c2ee924a3cd1c0f7d6b7402f0340b800a969b535303b27ae756a769c1d8c5\globalmount I0618 10:20:09.902678 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-0242f32f-f56b-4cb9-b7da-7226cc26b57c### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\738c2ee924a3cd1c0f7d6b7402f0340b800a969b535303b27ae756a769c1d8c5\globalmount successfully I0618 10:20:09.902702 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=205.3916363 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-0242f32f-f56b-4cb9-b7da-7226cc26b57c###" result_code="succeeded" I0618 10:20:09.902756 10816 utils.go:108] GRPC response: {} I0618 10:20:10.878791 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-3009b498-4ff5-45cc-afe3-9e5b218198f0 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\93f41f081097d72376a47274df2be10a55b0ed09627df13e69671b18909b282d\globalmount I0618 10:20:10.878791 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-3009b498-4ff5-45cc-afe3-9e5b218198f0### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\93f41f081097d72376a47274df2be10a55b0ed09627df13e69671b18909b282d\globalmount successfully I0618 10:20:10.878791 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=206.3627631 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-3009b498-4ff5-45cc-afe3-9e5b218198f0###" result_code="succeeded" I0618 10:20:10.878791 10816 utils.go:108] GRPC response: {} I0618 10:20:11.786121 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-78ae83bf-0534-44ce-a2bf-8e91cf904444 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\bdcbb4137593f2fb676c9215aa83a13d054bf0763edd31695686483cef010f34\globalmount I0618 10:20:11.789218 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-78ae83bf-0534-44ce-a2bf-8e91cf904444### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\bdcbb4137593f2fb676c9215aa83a13d054bf0763edd31695686483cef010f34\globalmount successfully I0618 10:20:11.789218 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=207.2731914 
request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-78ae83bf-0534-44ce-a2bf-8e91cf904444###" result_code="succeeded" I0618 10:20:11.789218 10816 utils.go:108] GRPC response: {} I0618 10:20:12.697301 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-2949f134-77a9-49fe-bbe8-a56869a66da5 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\8bb3100c43ed731f7bf1eb91db32110762dc8812447044cd2a1898c72013c4c8\globalmount I0618 10:20:12.697724 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-2949f134-77a9-49fe-bbe8-a56869a66da5### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\8bb3100c43ed731f7bf1eb91db32110762dc8812447044cd2a1898c72013c4c8\globalmount successfully I0618 10:20:12.697802 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=208.1809329 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-2949f134-77a9-49fe-bbe8-a56869a66da5###" result_code="succeeded" I0618 10:20:12.697835 10816 utils.go:108] GRPC response: {} I0618 10:20:12.946110 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-f9f952d8-fd7b-423b-8384-3230cbbef2d2 on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\c9d4eedbd9b825d6afca7c8ae5b55d2deeea2e1ab125d9502446d5b62c0de9fa\globalmount I0618 10:20:13.202160 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-f9f952d8-fd7b-423b-8384-3230cbbef2d2 on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\c9d4eedbd9b825d6afca7c8ae5b55d2deeea2e1ab125d9502446d5b62c0de9fa\globalmount I0618 10:20:13.476106 10816 smb.go:97] checking remote server path UNC\.file.core.windows.net\pvc-f9f952d8-fd7b-423b-8384-3230cbbef2d2 on local path C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\c9d4eedbd9b825d6afca7c8ae5b55d2deeea2e1ab125d9502446d5b62c0de9fa\globalmount I0618 10:20:14.636814 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-713ee767-8180-4301-ba8d-28493cc53e93 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f94621da65e3f3e84fea189a34121ae5cf00901c29de8fae77900bdd37cffd64\globalmount I0618 10:20:14.636814 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-713ee767-8180-4301-ba8d-28493cc53e93### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f94621da65e3f3e84fea189a34121ae5cf00901c29de8fae77900bdd37cffd64\globalmount successfully I0618 10:20:14.636814 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=210.119928 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-713ee767-8180-4301-ba8d-28493cc53e93###" result_code="succeeded" I0618 10:20:14.636814 10816 utils.go:108] GRPC response: {} I0618 10:20:15.534033 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-2db7726b-6f2b-4356-96ce-23205ca74256 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\79b5697b9072bfb2fbdeb3ade8c4a1518580957fc5a68e2d4ff0ebd333b11a23\globalmount I0618 10:20:15.534636 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-2db7726b-6f2b-4356-96ce-23205ca74256### on 
\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\79b5697b9072bfb2fbdeb3ade8c4a1518580957fc5a68e2d4ff0ebd333b11a23\globalmount successfully I0618 10:20:15.534636 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=211.0177506 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-2db7726b-6f2b-4356-96ce-23205ca74256###" result_code="succeeded" I0618 10:20:15.534636 10816 utils.go:108] GRPC response: {} I0618 10:20:16.526202 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-5a6a954d-5190-4d60-9109-88aa0451fe4b local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\21bfe19e2ba4d47cbb05c6d41a5870b528440959370643cbc690441dbb5d1076\globalmount I0618 10:20:16.526202 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-5a6a954d-5190-4d60-9109-88aa0451fe4b### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\21bfe19e2ba4d47cbb05c6d41a5870b528440959370643cbc690441dbb5d1076\globalmount successfully I0618 10:20:16.526202 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=212.0083226 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-5a6a954d-5190-4d60-9109-88aa0451fe4b###" result_code="succeeded" I0618 10:20:16.526202 10816 utils.go:108] GRPC response: {} I0618 10:20:17.392968 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-44c9f43c-e568-4fc5-a367-54d83780d72d local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\6f61b1c4aff21ed830d9688c61158b783824cd05856cb6b4ea51892d49167d2b\globalmount I0618 10:20:17.393701 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-44c9f43c-e568-4fc5-a367-54d83780d72d### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\6f61b1c4aff21ed830d9688c61158b783824cd05856cb6b4ea51892d49167d2b\globalmount successfully I0618 10:20:17.393803 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=212.8753152 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-44c9f43c-e568-4fc5-a367-54d83780d72d###" result_code="succeeded" I0618 10:20:17.393879 10816 utils.go:108] GRPC response: {} I0618 10:20:18.224804 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-24179808-e2c1-4437-974e-01ebb4f02b13 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\9edcba724428c1bee6be9a30ece697a4e78a7d47d2c29f421d9be77a41207db3\globalmount I0618 10:20:18.225363 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-24179808-e2c1-4437-974e-01ebb4f02b13### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\9edcba724428c1bee6be9a30ece697a4e78a7d47d2c29f421d9be77a41207db3\globalmount successfully I0618 10:20:18.225363 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=213.7069489 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-24179808-e2c1-4437-974e-01ebb4f02b13###" result_code="succeeded" I0618 10:20:18.225363 10816 utils.go:108] GRPC response: {} I0618 10:20:19.136511 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-9fba43bc-d909-4ed8-b35b-8334f31eb153 local path: 
c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\05166299f887278812bc8dce6779991718b760c1e62cd128750ae9880e1c9e95\globalmount I0618 10:20:19.137299 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-9fba43bc-d909-4ed8-b35b-8334f31eb153### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\05166299f887278812bc8dce6779991718b760c1e62cd128750ae9880e1c9e95\globalmount successfully I0618 10:20:19.137299 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=214.6188864 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-9fba43bc-d909-4ed8-b35b-8334f31eb153###" result_code="succeeded" I0618 10:20:19.137299 10816 utils.go:108] GRPC response: {} I0618 10:20:20.060895 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-2d21d178-6878-4ed8-812b-bc0475b5ff6b local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\c99e9e73af00de6731ba0d597b58df22ac4c6051d80b17f91bccfbf5e9b1c5ba\globalmount I0618 10:20:20.061880 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-2d21d178-6878-4ed8-812b-bc0475b5ff6b###-suite on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\c99e9e73af00de6731ba0d597b58df22ac4c6051d80b17f91bccfbf5e9b1c5ba\globalmount successfully I0618 10:20:20.061971 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=215.5428814 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-2d21d178-6878-4ed8-812b-bc0475b5ff6b###-suite" result_code="succeeded" I0618 10:20:20.062000 10816 utils.go:108] GRPC response: {} I0618 10:20:21.028418 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-e27b7c29-471a-438c-8af3-2bac9c24d699 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\ca3534d0553fab117563cb8a57e1428c986890a081e1ad90b6671e14f1b5df60\globalmount I0618 10:20:21.029038 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-e27b7c29-471a-438c-8af3-2bac9c24d699### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\ca3534d0553fab117563cb8a57e1428c986890a081e1ad90b6671e14f1b5df60\globalmount successfully I0618 10:20:21.029038 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=216.4953155 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-e27b7c29-471a-438c-8af3-2bac9c24d699###" result_code="succeeded" I0618 10:20:21.029038 10816 utils.go:108] GRPC response: {} I0618 10:20:21.934535 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f9d0b87889d2774ec5e434e4f21da3a34b60667e8473ff91641328c209ba2c8f\globalmount I0618 10:20:21.935276 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\f9d0b87889d2774ec5e434e4f21da3a34b60667e8473ff91641328c209ba2c8f\globalmount successfully I0618 10:20:21.935364 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=217.4015968 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" 
volumeid="##pvc-e1118b6f-8b8e-4fbe-b680-0671072fbf2e###" result_code="succeeded" I0618 10:20:21.935384 10816 utils.go:108] GRPC response: {} I0618 10:20:22.847634 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-d54bf4ed-647d-480a-8de5-3f647ca1126f local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\283cb9d1568781162f98d129fc9701b7c94d5c357616bbe8d06ae6dead8f97c7\globalmount I0618 10:20:22.857050 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-d54bf4ed-647d-480a-8de5-3f647ca1126f###-suite on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\283cb9d1568781162f98d129fc9701b7c94d5c357616bbe8d06ae6dead8f97c7\globalmount successfully I0618 10:20:22.857050 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=218.3233306 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-d54bf4ed-647d-480a-8de5-3f647ca1126f###-suite" result_code="succeeded" I0618 10:20:22.857050 10816 utils.go:108] GRPC response: {} I0618 10:20:22.998545 10816 utils.go:101] GRPC call: /csi.v1.Node/NodeUnstageVolume I0618 10:20:22.999582 10816 utils.go:101] GRPC call: /csi.v1.Node/NodeUnstageVolume I0618 10:20:22.998545 10816 utils.go:102] GRPC request: {"staging_target_path":"\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\86e0827bb9cbd0427bfabe2ef368b87f24995bcd1f97f071a152786fa1dfb37f\globalmount","volume_id":"##pvc-1517e05b-fdb0-4947-a7c6-da943cf99efa###"} E0618 10:20:22.999582 10816 utils.go:106] GRPC error: rpc error: code = Aborted desc = An operation with the given Volume ID ##pvc-1517e05b-fdb0-4947-a7c6-da943cf99efa### already exists I0618 10:20:22.999582 10816 utils.go:102] GRPC request: {"staging_target_path":"\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\0e8d433c6973fb38384981c32271f2e5299a8b5f828099586f0efad994ba67c2\globalmount","volume_id":"##pvc-b4f7a590-4570-4fc1-a978-05558603dca1###"} E0618 10:20:22.999882 10816 utils.go:106] GRPC error: rpc error: code = Aborted desc = An operation with the given Volume ID ##pvc-b4f7a590-4570-4fc1-a978-05558603dca1### already exists I0618 10:20:23.003864 10816 utils.go:101] GRPC call: /csi.v1.Node/NodeUnstageVolume I0618 10:20:23.003864 10816 utils.go:102] GRPC request: {"staging_target_path":"\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\99ef496ccb72e2f549ef78b842855c22765e1f899df6248b8d815cb36f0eae8d\globalmount","volume_id":"##pvc-a7ac343a-aff1-4a2a-8eec-959855b2b15b###"} E0618 10:20:23.003864 10816 utils.go:106] GRPC error: rpc error: code = Aborted desc = An operation with the given Volume ID ##pvc-a7ac343a-aff1-4a2a-8eec-959855b2b15b### already exists I0618 10:20:23.836056 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-ff818ba6-ec9f-4486-80e4-7cc61350f6d5 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\08afa7f5599b501c3ec032c18fbb1f626a7e12ee8672fa52b05dce745f28e873\globalmount I0618 10:20:23.836814 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-ff818ba6-ec9f-4486-80e4-7cc61350f6d5### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\08afa7f5599b501c3ec032c18fbb1f626a7e12ee8672fa52b05dce745f28e873\globalmount successfully I0618 10:20:23.837362 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=219.2973164 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" 
source="file.csi.azure.com" volumeid="##pvc-ff818ba6-ec9f-4486-80e4-7cc61350f6d5###" result_code="succeeded" I0618 10:20:23.837826 10816 utils.go:108] GRPC response: {} I0618 10:20:24.773724 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-b034ea6e-80a0-4b98-a036-a7557c1d2595 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\d0572836866328f35f4a64fbf26a85c83ef3c903cfd92d82f10c5c0d721ad463\globalmount I0618 10:20:24.773724 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-b034ea6e-80a0-4b98-a036-a7557c1d2595###-suite on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\d0572836866328f35f4a64fbf26a85c83ef3c903cfd92d82f10c5c0d721ad463\globalmount successfully I0618 10:20:24.773724 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=220.2335692 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-b034ea6e-80a0-4b98-a036-a7557c1d2595###-suite" result_code="succeeded" I0618 10:20:24.773724 10816 utils.go:108] GRPC response: {} I0618 10:20:25.605800 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-f9f952d8-fd7b-423b-8384-3230cbbef2d2 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\c9d4eedbd9b825d6afca7c8ae5b55d2deeea2e1ab125d9502446d5b62c0de9fa\globalmount I0618 10:20:25.606589 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-f9f952d8-fd7b-423b-8384-3230cbbef2d2###-suite on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\c9d4eedbd9b825d6afca7c8ae5b55d2deeea2e1ab125d9502446d5b62c0de9fa\globalmount successfully I0618 10:20:25.606659 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=221.0657469 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-f9f952d8-fd7b-423b-8384-3230cbbef2d2###-suite" result_code="succeeded" I0618 10:20:25.606685 10816 utils.go:108] GRPC response: {} I0618 10:20:26.530883 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-373137fb-16fa-4445-9526-fa69d783ec00 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\d1789d276f47d27aa0fbb3d446801109a044ccc5ba7efb1fe475edb4765412ee\globalmount I0618 10:20:26.531513 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-373137fb-16fa-4445-9526-fa69d783ec00### on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\d1789d276f47d27aa0fbb3d446801109a044ccc5ba7efb1fe475edb4765412ee\globalmount successfully I0618 10:20:26.531635 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=221.8119688 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-373137fb-16fa-4445-9526-fa69d783ec00###" result_code="succeeded" I0618 10:20:26.531661 10816 utils.go:108] GRPC response: {} I0618 10:20:27.386371 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-e465606b-ee96-42f4-9108-266751f62cc0 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\7aaf7ebb435bc4e96ca15a1f3d94f31fead5f923b509ecab572a6228a3e96406\globalmount I0618 10:20:27.386371 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-e465606b-ee96-42f4-9108-266751f62cc0###-suite on 
\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\7aaf7ebb435bc4e96ca15a1f3d94f31fead5f923b509ecab572a6228a3e96406\globalmount successfully I0618 10:20:27.386371 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=222.6667772 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-e465606b-ee96-42f4-9108-266751f62cc0###-suite" result_code="succeeded" I0618 10:20:27.386371 10816 utils.go:108] GRPC response: {} I0618 10:20:28.241399 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-9cc1f701-08df-45fa-9a63-6316ce0f4521 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\b5a8f71ac9345abb3846ff099001ef91961525cbeda2ef05a084fd85175e1973\globalmount I0618 10:20:28.241399 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-9cc1f701-08df-45fa-9a63-6316ce0f4521###-suite on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\b5a8f71ac9345abb3846ff099001ef91961525cbeda2ef05a084fd85175e1973\globalmount successfully I0618 10:20:28.241399 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=223.5207706 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-9cc1f701-08df-45fa-9a63-6316ce0f4521###-suite" result_code="succeeded" I0618 10:20:28.241399 10816 utils.go:108] GRPC response: {} I0618 10:20:29.222249 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-57d12ee1-f847-44ec-830b-e6375971a791 local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\426633f8b1f494922897121ccf62104b21279f8a4d3bca2e189b30c83dd1a1e1\globalmount I0618 10:20:29.223610 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-57d12ee1-f847-44ec-830b-e6375971a791###-suite on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\426633f8b1f494922897121ccf62104b21279f8a4d3bca2e189b30c83dd1a1e1\globalmount successfully I0618 10:20:29.223610 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=224.3052176 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-57d12ee1-f847-44ec-830b-e6375971a791###-suite" result_code="succeeded" I0618 10:20:29.223610 10816 utils.go:108] GRPC response: {} I0618 10:20:30.200191 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: \.file.core.windows.net\pvc-0eb2b4ab-9f0d-48e8-81e0-1639f3b8f26a local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\e43c03d50a8520c906a87a68e2869d94c67f0f6beac75b150d5d633223e51b53\globalmount I0618 10:20:30.201492 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-0eb2b4ab-9f0d-48e8-81e0-1639f3b8f26a###-suite on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\e43c03d50a8520c906a87a68e2869d94c67f0f6beac75b150d5d633223e51b53\globalmount successfully I0618 10:20:30.201553 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=225.2751726 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-0eb2b4ab-9f0d-48e8-81e0-1639f3b8f26a###-suite" result_code="succeeded" I0618 10:20:30.201590 10816 utils.go:108] GRPC response: {} I0618 10:20:31.174562 10816 safe_mounter_host_process_windows.go:153] Unmount: remote path: 
\.file.core.windows.net\pvc-f3e3de62-8522-473b-99b3-ee77c7492d9c local path: c:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\2828eb0a20f4e36addf1412311ed2db77c2294bdd1119315fcfbd7cd969ec847\globalmount I0618 10:20:31.174562 10816 nodeserver.go:438] NodeUnstageVolume: unmount volume ##pvc-f3e3de62-8522-473b-99b3-ee77c7492d9c###-suite on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\2828eb0a20f4e36addf1412311ed2db77c2294bdd1119315fcfbd7cd969ec847\globalmount successfully I0618 10:20:31.175100 10816 azure_metrics.go:118] "Observed Request Latency" latency_seconds=226.2482117 request="azurefile_csi_driver_node_unstage_volume" resource_group="" subscription_id="" source="file.csi.azure.com" volumeid="##pvc-f3e3de62-8522-473b-99b3-ee77c7492d9c###-suite" result_code="succeeded" I0618 10:20:31.175135 10816 utils.go:108] GRPC response: {} I0618 10:20:31.556868 10816 smb.go:97] checking remote server path Get-Item : Cannot find path 'C:\var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\ca3534d0553fab117563cb8a57 e1428c986890a081e1ad90b6671e14f1b5df60\globalmount' because it does not exist. At line:1 char:2

Warning FailedMount 48s (x5 over 8m56s) kubelet MountVolume.MountDevice failed for volume 'pvc-0e8add25-432b-49bc-bd80-75f6de173faf' : rpc error: code = DeadlineExceeded desc = context deadline exceeded

The Azurefile CSI logging is as follows and similar to the above.


`I0606 14:11:42.325859 7304 azure_metrics.go:118] 'Observed Request Latency' latency_seconds=120.0105828 request='azurefile_csi_driver_node_stage_volume' resource_group='' subscription_id='' source='file.csi.azure.com' volumeid='pvc-0e8add25-432b-49bc-bd80-75f6de173faf###s' result_code='failed_csi_driver_node_stage_volume' E0606 14:11:42.325981 7304 utils.go:106] GRPC error: rpc error: code = Internal desc = volume(MC_rg-aks-prd_aks-prd_westeurope#fc7a964cdab3c4c3abd74c7#pvc-0e8add25-432b-49bc-bd80-75f6de173faf###force-mortgages) mount \\fc7a964cdab3c4c3abd74c7.file.core.windows.net\pvc-0e8add25-432b-49bc-bd80-75f6de173faf on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\a9d936b9e022f1c614ed0052bc2a9a606172dbeca4bca4d830ec0671b89f5662\globalmount failed with time out

`

andyzhangx commented 3 months ago

Thanks for raising this issue. Removing the SMB mapping is still essential, but I think we could cache the <local path, remote path> mapping in GetRemoteServerFromTarget; that would avoid running the same PowerShell commands over and over. Running a PowerShell command inside the Windows driver is really expensive.
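A cache along those lines could look roughly like the sketch below. This is a hypothetical outline, not the driver's actual implementation; the names remoteServerCache, lookup, and slowLookup are made up for the example.

```go
// Package smb sketches caching the <local path, remote path> mapping so that
// repeated unstage calls don't each have to spawn powershell.exe.
package smb

import "sync"

type remoteServerCache struct {
	mu    sync.Mutex
	paths map[string]string // local globalmount path -> remote server path
}

func newRemoteServerCache() *remoteServerCache {
	return &remoteServerCache{paths: make(map[string]string)}
}

// lookup returns the cached remote path for localPath, or runs slowLookup
// (for example the PowerShell Get-Item query) once and caches its result.
func (c *remoteServerCache) lookup(localPath string, slowLookup func(string) (string, error)) (string, error) {
	c.mu.Lock()
	if remote, ok := c.paths[localPath]; ok {
		c.mu.Unlock()
		return remote, nil
	}
	c.mu.Unlock()

	remote, err := slowLookup(localPath) // expensive: shells out to PowerShell
	if err != nil {
		return "", err
	}

	c.mu.Lock()
	c.paths[localPath] = remote
	c.mu.Unlock()
	return remote, nil
}

// forget drops an entry once its mapping has been removed, so a stale remote
// path is never returned for a reused mount point.
func (c *remoteServerCache) forget(localPath string) {
	c.mu.Lock()
	delete(c.paths, localPath)
	c.mu.Unlock()
}
```

Concurrent callers can still both run the slow lookup once for the same path; the sketch keeps the locking simple rather than holding the mutex across the PowerShell call.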


Rob19999 commented 3 months ago

Thank you for the quick fix. Is there a way we can opt in early to the v1.30.3 release, or go back to v1.30.1? Currently we're having major issues with this. Usually it takes around 4-6 weeks before changes like that become available for our region.

andyzhangx commented 3 months ago

@Rob19999 v1.29.6 also fixes the issue; we are going to roll out v1.29.6 next month. Just email me the config if you want to make a quick fix on your cluster, thx.

Rob19999 commented 3 months ago

We're deployed in West Europe and usually we're later on the rollout roadmap, so I'm trying to puzzle out which release this change will be in. Currently we're on v20240513. v20240609 (https://github.com/Azure/AKS/releases/tag/2024-06-09) is being rolled out at the moment in West Europe, but it does not yet contain this fix. A new release has not been announced yet, and usually when no release is announced it takes at least 4-6 weeks.

CSI driver v1.30.2 was introduced in https://github.com/Azure/AKS/releases/tag/2024-05-13.

I'm unsure what config you would like me to send. Currently our cluster is on Kubernetes version 1.29.4, AKS v2024051, but we have no way of choosing the CSI driver version during creation or via an update command, as far as I'm aware. I will raise this question with Microsoft support, given that AKS is a managed service, but I'm afraid it's pinned to the next vxxxxxx version that includes this driver version.

andyzhangx commented 3 months ago

@Rob19999 the Azure File CSI driver is managed by AKS, and we have a way in the backend to pin your CSI driver version to the fixed patch version if you want; otherwise you need to wait a few weeks.

Rob19999 commented 3 months ago

I would love that, given the issues we have. We're more than willing to test the new version for you. If it still causes issues we can always go back to v1.30.1.

I assume I can raise a support request for this through the Microsoft portal?

andyzhangx commented 3 months ago

> I would love that, given the issues we have. We're more than willing to test the new version for you. If it still causes issues we can always go back to v1.30.1.
>
> I assume I can raise a support request for this through the Microsoft portal?

@Rob19999 that also works, but it would go through a process and take time.

Rob19999 commented 3 months ago

I understand. What is the easiest way to get this rolling? Me sending you our cluster details at your MS email (xiazhang@microsoft.com)? Then I can also send it from my corporate mail address for validation.

andyzhangx commented 3 months ago

> I understand. What is the easiest way to get this rolling? Me sending you our cluster details at your MS email (xiazhang@microsoft.com)? Then I can also send it from my corporate mail address for validation.

nvm, I got your cluster now: aks-prd-rgxxxc5.hcp.westeurope.azmk8s.io, and if you want to mitigate other clusters, just email me, thx

Rob19999 commented 3 months ago

Thank you. I will give it some time to propagate. A new node I added in an existing node pool still pulled mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi:v1.30.2-windows-hp.

Will the pinning of the version disappear when we upgrade the cluster, or do we need to reach out to you to get this changed?

andyzhangx commented 3 months ago

@Rob19999 we will pin the version; if you upgrade to v1.29.5 or v1.30.0, the fixed version is still there. This process will take one or two days, stay tuned.

Rob19999 commented 3 months ago

Nothing has changed for us yet regarding the CSI driver version. If I understand correctly, you have now pinned it to AKS version 1.29.5. Yesterday Release 2024-06-09 became available to use and we installed it, but this version does not contain AKS 1.29.5. A newer release has not yet been announced, we don't know whether it will contain 1.29.5, and even then it will take 4-6 weeks before the rollout is complete in our region.

Am I right in assuming we just need to wait for 1.29.5 to become available?

andyzhangx commented 2 months ago

> Nothing has changed for us yet regarding the CSI driver version. If I understand correctly, you have now pinned it to AKS version 1.29.5. Yesterday Release 2024-06-09 became available to use and we installed it, but this version does not contain AKS 1.29.5. A newer release has not yet been announced, we don't know whether it will contain 1.29.5, and even then it will take 4-6 weeks before the rollout is complete in our region.
>
> Am I right in assuming we just need to wait for 1.29.5 to become available?

@Rob19999 pls check again, azurefile-csi:v1.30.3-windows-hp image is deployed on your cluster now.

Rob19999 commented 2 months ago

I see it now. Thanks for the support.

Rob19999 commented 2 months ago

Good day, we have had the driver running for 3 days now and unfortunately we're still experiencing the same issue.

To test the issue in a more controlled manner we split our deployment over several node pools and started making changes in the way we use PVCs. What did seem to help is phasing out the dynamic PVCs we create with each deployment and instead creating one static (premium storage) PVC a day and using that. While this is workable, and we haven't had issues on this pool for over a week, it is not how it should work.
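For context, a statically provisioned Azure Files volume of the kind mentioned here generally looks like the sketch below, modelled on the driver's static-provisioning examples. The name, size, share, and secret are placeholders, not values from this cluster.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-azurefile-static                      # placeholder
spec:
  capacity:
    storage: 100Gi                               # placeholder size
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  csi:
    driver: file.csi.azure.com
    volumeHandle: "rg#storageaccount#sharename"  # must be unique per share
    volumeAttributes:
      shareName: sharename                       # placeholder share name
    nodeStageSecretRef:
      name: azure-storage-secret                 # secret holding the storage account name and key
      namespace: default
```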

Our generic load is around 150-175 Helm releases a day (deploy/delete), with 1400-2000 deployments mostly having 1 pod, where each release has its own dynamic PVC: the pods have a PersistentVolumeClaim against azurefile-csi with 1Gi of storage, and some deployments have 30Gi or 100Gi. With the PVC change we moved around 40% of this load to a different node pool, which has been stable for around a week now. The other node pool still had 3-4 nodes a day dying.

We also tested with smaller pod counts per node (85 pods, 65 pods, etc.). This does not seem to lessen the issue.

We're now working on changing all our workloads to use as few PVCs as possible, pre-created each day. Other PVCs were already more permanent.

While we have a workaround now, I would still like to assist Microsoft with a more permanent fix, not just for our workload but also for other/future customers of AKS.

Is there anything we can do to help resolve the issue in the driver? I can imagine we cause a big load with our setup, but I also feel Windows should be able to handle this, given it relies on basic SMB functionality that also works on every file server, where these numbers of SMB connections are not very large.

Thanks in advance.

andyzhangx commented 2 months ago

@Rob19999 could you share the csi driver logs on that node again? what's the csi driver version on that node? and how many PVs are mounted on that node in total?

Rob19999 commented 2 months ago

I needed to wait for a crash; here is the information. If it's easier we can also get on a call sometime, and we can save a node for investigation. I redacted all relevant cluster information. To be sure, is there a way to mark this message as internal?

I think the user already removed some of the pods after I ran this command.
kubectl exec -n kube-system -it csi-azurefile-node-win-n6dgd -- powershell
get-smbglobalmapping

Disconnected            \\<redacted>.file.core.windows.net\pvc-44150571-38f1-458e-b438-7919a8353018
Disconnected            \\<redacted>.file.core.windows.net\pvc-a02a46f7-7325-4a51-bbd5-138dae704523
OK                      \\<redacted>.file.core.windows.net\pvc-9a3cacab-3290-42e2-982c-6555c6587df2
Disconnected            \\<redacted>.file.core.windows.net\pvc-7e437168-3a59-4801-b8ab-5f5261e7d29d
Disconnected            \\<redacted>.file.core.windows.net\pvc-36f51116-7143-4454-a7f9-0062a08b3e29
Disconnected            \\<redacted>.file.core.windows.net\pvc-d286d91a-7eab-4965-acb1-d45452bce160
OK                      \\<redacted>.file.core.windows.net\pvc-81161d5d-28b5-46f7-bdd7-9cc119496b25
OK                      \\<redacted>.file.core.windows.net\pvc-105a612d-2207-4202-8fe2-f452649159a5
Disconnected            \\<redacted>.file.core.windows.net\pvc-71aad805-b3a4-4c3d-baff-a7cdfaeafdac
Disconnected            \\<redacted>.file.core.windows.net\pvc-3c7f8f21-eb63-4460-9c6a-d6cbac6582dd
Disconnected            \\<redacted>.file.core.windows.net\pvc-558def79-d041-4263-a49c-8c021d679f96
Disconnected            \\<redacted>.file.core.windows.net\pvc-306793ef-5728-43dd-9ad5-fc7997c5c328
kubectl describe pod csi-azurefile-node-win-mzb8k -n kube-system

Name:                 csi-azurefile-node-win-mzb8k
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      csi-azurefile-node-sa
Node:                 aksnpwin200000f/10.241.13.203
Start Time:           Mon, 08 Jul 2024 17:37:34 +0200
Labels:               app=csi-azurefile-node-win
                      controller-revision-hash=7b6b5fd8fb
                      kubernetes.azure.com/managedby=aks
                      pod-template-generation=36
Annotations:          <none>
Status:               Running
SeccompProfile:       RuntimeDefault
IP:                   10.241.13.203
IPs:
  IP:           10.241.13.203
Controlled By:  DaemonSet/csi-azurefile-node-win
Init Containers:
  init:
    Container ID:  containerd://77ab7ccdef01617fb3121f003ea19ead88691fbbea44c4ed18ae0242811fade1
    Image:         mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi:v1.30.3-windows-hp
    Image ID:      mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi@sha256:30ce602c8928227e3eafe766c99ae970a8dc9eb6dc6a82ed82982bbf7093ac1d
    Port:          <none>
    Host Port:     <none>
    Command:
      powershell.exe
      -c
      New-Item -ItemType Directory -Path C:\var\lib\kubelet\plugins\file.csi.azure.com\ -Force
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 09 Jul 2024 08:39:08 +0200
      Finished:     Tue, 09 Jul 2024 08:39:09 +0200
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ws59v (ro)
Containers:
  node-driver-registrar:
    Container ID:  containerd://734ff0ccfb7db13d1db6a8d24f49563ff9a020505e6571b85a75da2ef0ec1426
    Image:         mcr.microsoft.com/oss/kubernetes-csi/csi-node-driver-registrar:v2.10.1
    Image ID:      mcr.microsoft.com/oss/kubernetes-csi/csi-node-driver-registrar@sha256:b3bbd7a6171bff15eeefd137316fa16415aa6a4c817e5ec609662793093b3526
    Port:          <none>
    Host Port:     <none>
    Command:
      csi-node-driver-registrar.exe
    Args:
      --csi-address=$(CSI_ENDPOINT)
      --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
      --plugin-registration-path=$(PLUGIN_REG_DIR)
      --v=2
    State:          Running
      Started:      Tue, 09 Jul 2024 08:39:14 +0200
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 08 Jul 2024 17:37:47 +0200
      Finished:     Tue, 09 Jul 2024 08:38:54 +0200
    Ready:          True
    Restart Count:  1
    Limits:
      memory:  150Mi
    Requests:
      cpu:     40m
      memory:  40Mi
    Environment:
      KUBERNETES_SERVICE_HOST:       <redacted>.hcp.westeurope.azmk8s.io
      KUBERNETES_PORT:               tcp://<redacted>.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:       tcp://<redacted>.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP_ADDR:  <redacted>.hcp.westeurope.azmk8s.io
      CSI_ENDPOINT:                  unix://C:\\var\\lib\\kubelet\\plugins\\file.csi.azure.com\\csi.sock
      DRIVER_REG_SOCK_PATH:          C:\\var\\lib\\kubelet\\plugins\\file.csi.azure.com\\csi.sock
      PLUGIN_REG_DIR:                C:\\var\\lib\\kubelet\\plugins_registry\\
      KUBE_NODE_NAME:                 (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ws59v (ro)
  azurefile:
    Container ID:  containerd://b8b3ab85cd2c5efb74eef06039d88fb97130272f0725c4ab43fc2969ee70213c
    Image:         mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi:v1.30.3-windows-hp
    Image ID:      mcr.microsoft.com/oss/kubernetes-csi/azurefile-csi@sha256:30ce602c8928227e3eafe766c99ae970a8dc9eb6dc6a82ed82982bbf7093ac1d
    Port:          <none>
    Host Port:     <none>
    Command:
      azurefileplugin.exe
    Args:
      --v=5
      --endpoint=$(CSI_ENDPOINT)
      --nodeid=$(KUBE_NODE_NAME)
      --enable-windows-host-process=true
    State:          Running
      Started:      Tue, 09 Jul 2024 08:39:15 +0200
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Mon, 08 Jul 2024 17:37:50 +0200
      Finished:     Tue, 09 Jul 2024 08:38:54 +0200
    Ready:          True
    Restart Count:  1
    Environment:
      KUBERNETES_SERVICE_HOST:       <redacted>.hcp.westeurope.azmk8s.io
      KUBERNETES_PORT:               tcp://<redacted>.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP:       tcp://<redacted>.hcp.westeurope.azmk8s.io:443
      KUBERNETES_PORT_443_TCP_ADDR:  <redacted>.hcp.westeurope.azmk8s.io
      AZURE_CREDENTIAL_FILE:         C:\k\azure.json
      CSI_ENDPOINT:                  unix://C:\\var\\lib\\kubelet\\plugins\\file.csi.azure.com\\csi.sock
      KUBE_NODE_NAME:                 (v1:spec.nodeName)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ws59v (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  kube-api-access-ws59v:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 :NoExecute op=Exists
                             :NoSchedule op=Exists
                             CriticalAddonsOnly op=Exists
Events:                      <none>

Logging: see the attachment. The node seems to have died around 13:00, although I am seeing timeouts a couple of hours before that. csi-azurefile-node-win-mzb8.log

daanroeterink commented 2 months ago

Hello,

Colleague of @Rob19999 here. What we notice when a node is going "dead" is that the command Get-SmbGlobalMapping takes a long time to respond or doesn't respond at all. Do you know if the WMI layer that PowerShell uses does some sort of locking on the node? That would explain the seemingly random "timeouts" we see in the logging.
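As a rough way to confirm that symptom on an affected node (a diagnostic sketch, not something the driver itself runs), one can time the cmdlet and break the mappings down by status:

```powershell
# How long does a single Get-SmbGlobalMapping call take on this node?
(Measure-Command { Get-SmbGlobalMapping | Out-Null }).TotalSeconds

# How many mappings are there, and in what state (OK vs Disconnected)?
Get-SmbGlobalMapping | Group-Object -Property Status | Select-Object Name, Count
```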

andyzhangx commented 2 months ago

I have disabled remove-smb-mount-on-windows (--remove-smb-mount-on-windows=false) in your cluster, could you check again? thx

Rob19999 commented 2 months ago

Currently we're outside of working hours, so not much is happening on the cluster. The --remove-smb-mount-on-windows feature was added after we made a support request (support ticket 2403070050001502) at Microsoft to resolve nodes breaking after a while, usually after 14 days or so, or when they reached around 701 SMB global shares.

Back then we got the error below. I will see if it returns, or if I can force it with a certain number of deploys. Given the change we made on our end to reduce PVC mounts, it will be harder to reach that number.

MountVolume.MountDevice failed for volume 'pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14' : rpc error: code = Internal desc = volume(mc_<redacted>-<redacted>_westeurope#fc7a964cdab3c4c3abd74c7#pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14###<redacted>) mount \\\\<redacted>.file.core.windows.net\\pvc-f12a9f91-62ff-4d4b-9ff4-5d1dbe3bde14 on \\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\file.csi.azure.com\\e0cef412d93e12842f12522496d30f960b5440e97bb4bc578a01da37d85dd7a1\\globalmount failed with NewSmbGlobalMapping failed. output: 'New-SmbGlobalMapping : Not enough memory resources are available to process this command. \\r\\nAt line:1 char:190\\r\\n+ ... ser, $PWord;New-SmbGlobalMapping -RemotePath $Env:smbremotepath -Cred ...\\r\\n+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\\r\\n + CategoryInfo : ResourceUnavailable: (MSFT_SmbGlobalMapping:ROOT/Microsoft/...mbGlobalMapping) [New-SmbG \\r\\n lobalMapping], CimException\\r\\n + FullyQualifiedErrorId : Windows System Error 8,New-SmbGlobalMapping\\r\\n \\r\\n', err: exit status 1

andyzhangx commented 2 months ago

@Rob19999 so do you want to keep the remove-smb-mount-on-windows feature or not? Creating 701 SMB global shares on one node is crazy. I will try to find out how to make the remove-smb-mount-on-windows feature use fewer resources, but that will take time.

Rob19999 commented 2 months ago

Let's keep remove-smb-mount-on-windows disabled for now. Without this functionality it was more stable for us. We will monitor the number of connections on the nodes, and when they reach 600 we will remove them.

Just to be clear, since you mentioned resources: we got the 'New-SmbGlobalMapping : Not enough memory resources are available to process this command' error on v1.30.0, where remove-smb-mount-on-windows was not yet implemented. This error started showing up when we reached the 701 SMB global shares. Most of the 701 connections were in a disconnected state back then, due to them not being removed.

I can imagine that not many clusters reach 701 SMB global shares; it depends on how often you upgrade your node images and whether they're always running. But it can create seemingly random node crashes.
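A minimal sketch of that manual cleanup, assuming the disconnected mappings belong to pods that are already gone and are safe to drop (run on the node, or via kubectl exec into the csi-azurefile-node-win pod as shown earlier):

```powershell
# Count the current SMB global mappings on this node.
$count = (Get-SmbGlobalMapping | Measure-Object).Count
Write-Output "Current SMB global mappings: $count"

# Once the count gets high, drop the mappings that are no longer connected.
Get-SmbGlobalMapping |
    Where-Object { $_.Status -eq 'Disconnected' } |
    ForEach-Object { Remove-SmbGlobalMapping -RemotePath $_.RemotePath -Force }
```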

Rob19999 commented 2 months ago

We still have nodes crashing at the moment with SMB errors: one yesterday and one today. It's a lot better without remove-smb-mount. We get this error without hitting 500 SMB mounts. I don't expect anything from you, but I just want to provide you with as much information as possible. See the full logs in the attachment.

The first error happened at:

```
06:05:49.377083 7780 utils.go:106] GRPC error: rpc error: code = Internal desc = volume(##pvc-d7735dfb-d4de-46ee-9087-b4b4f97f9be0###--suite) mount \.file.core.windows.net\pvc-d7735dfb-d4de-46ee-9087-b4b4f97f9be0 on \var\lib\kubelet\plugins\kubernetes.io\csi\file.csi.azure.com\8b7e8149f0c5af6863b83e652d281b9132a12a9ab63e55ad56bcbf18d14d2760\globalmount failed with NewSmbGlobalMapp failed. output: "", err: exit status 0xc0000142 Please refer to http://aka.ms/filemounterror for possible causes and solutions for mount errors.
```

[azurefile.log](https://github.com/user-attachments/files/16264155/azurefile.log)