Here is the pod spec from the vsphere-csi-node-windows daemonset:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cattle.io/timestamp: '2023-03-15T19:46:50Z'
    cni.projectcalico.org/containerID: 9066348f5daa16f80f7127c681c77f7eee6fa0e5453b21d72c976a81d5cde6be
    cni.projectcalico.org/podIP: 10.42.122.138/32
    cni.projectcalico.org/podIPs: 10.42.122.138/32
    kubernetes.io/psp: global-unrestricted-psp
  creationTimestamp: '2023-03-15T19:46:52Z'
  generateName: vsphere-csi-node-windows-
  labels:
    app: vsphere-csi-node-windows
    controller-revision-hash: 6cbdc7fbfc
    pod-template-generation: '4'
    role: vsphere-csi-windows
  managedFields:
    <redacting managedFields to shorten>
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
          - key: metadata.name
            operator: In
            values:
            - hybww1
  containers:
  - args:
    - '--v=5'
    - '--csi-address=$(ADDRESS)'
    - '--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)'
    env:
    - name: ADDRESS
      value: unix://C:\\csi\\csi.sock
    - name: DRIVER_REG_SOCK_PATH
      value: \\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock
    image: rancher/mirrored-sig-storage-csi-node-driver-registrar:v2.5.1
    imagePullPolicy: IfNotPresent
    livenessProbe:
      exec:
        command:
        - /csi-node-driver-registrar.exe
        - >-
          --kubelet-registration-path=C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock
        - '--mode=kubelet-registration-probe'
      failureThreshold: 3
      initialDelaySeconds: 3
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: node-driver-registrar
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /registration
      name: registration-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-4fgrx
      readOnly: true
  - args:
    - '--fss-name=internal-feature-states.csi.vsphere.vmware.com'
    - '--fss-namespace=$(CSI_NAMESPACE)'
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: CSI_ENDPOINT
      value: unix://C:\\csi\\csi.sock
    - name: MAX_VOLUMES_PER_NODE
      value: '0'
    - name: X_CSI_MODE
      value: node
    - name: X_CSI_SPEC_REQ_VALIDATION
      value: 'false'
    - name: X_CSI_SPEC_DISABLE_LEN_CHECK
      value: 'true'
    - name: LOGGER_LEVEL
      value: PRODUCTION
    - name: X_CSI_LOG_LEVEL
      value: DEBUG
    - name: CSI_NAMESPACE
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.namespace
    - name: NODEGETINFO_WATCH_TIMEOUT_MINUTES
      value: '1'
    image: rancher/mirrored-cloud-provider-vsphere-csi-release-driver:v2.6.2
    imagePullPolicy: IfNotPresent
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: healthz
        scheme: HTTP
      initialDelaySeconds: 10
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 5
    name: vsphere-csi-node
    ports:
    - containerPort: 9808
      name: healthz
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: C:\csi
      name: plugin-dir
    - mountPath: C:\var\lib\kubelet
      name: pods-mount-dir
    - mountPath: \\.\pipe\csi-proxy-volume-v1
      name: csi-proxy-volume-v1
    - mountPath: \\.\pipe\csi-proxy-filesystem-v1
      name: csi-proxy-filesystem-v1
    - mountPath: \\.\pipe\csi-proxy-disk-v1
      name: csi-proxy-disk-v1
    - mountPath: \\.\pipe\csi-proxy-system-v1alpha1
      name: csi-proxy-system-v1alpha1
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-4fgrx
      readOnly: true
  - args:
    - '--v=4'
    - '--csi-address=/csi/csi.sock'
    image: rancher/mirrored-sig-storage-livenessprobe:v2.7.0
    imagePullPolicy: IfNotPresent
    name: liveness-probe
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /csi
      name: plugin-dir
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-4fgrx
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: hybww1
  nodeSelector:
    kubernetes.io/os: windows
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: vsphere-csi-node
  serviceAccountName: vsphere-csi-node
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/controlplane
    value: 'true'
  - effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
    operator: Exists
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
  volumes:
  - hostPath:
      path: C:\var\lib\kubelet\plugins_registry\
      type: Directory
    name: registration-dir
  - hostPath:
      path: C:\var\lib\kubelet\plugins\csi.vsphere.vmware.com\
      type: DirectoryOrCreate
    name: plugin-dir
  - hostPath:
      path: \var\lib\kubelet
      type: Directory
    name: pods-mount-dir
  - hostPath:
      path: \\.\pipe\csi-proxy-disk-v1
      type: ''
    name: csi-proxy-disk-v1
  - hostPath:
      path: \\.\pipe\csi-proxy-volume-v1
      type: ''
    name: csi-proxy-volume-v1
  - hostPath:
      path: \\.\pipe\csi-proxy-filesystem-v1
      type: ''
    name: csi-proxy-filesystem-v1
  - hostPath:
      path: \\.\pipe\csi-proxy-system-v1alpha1
      type: ''
    name: csi-proxy-system-v1alpha1
  - name: kube-api-access-4fgrx
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
I see named pipes being used for some of the paths, but what confuses me is why a unix-socket-type mount is being used for the csi.sock path. I did not think this was supported on Windows? https://kubernetes.io/docs/tasks/configure-pod-container/create-hostprocess-pod/#limitations
I did not create this daemonset - it was created when deploying from Helm as described above. Am I doing something wrong here?
We'll take a look at this; it does look like at least one of the socket paths is misconfigured on Windows nodes.
For the record, Windows has supported AF_UNIX sockets since Windows 10 (circa 2018).
Thanks @brandond! Much appreciated.
I will be out of town all next week, so I may not be able to respond then, but I will try to answer any questions and be available for testing etc as soon as possible.
Our QA team (including @VestigeJ) has access to a lab where they should be able to try to replicate this.
Hi @brandond and @VestigeJ,
Just wanted to let you both know I am back from being out of town and will be available again for anything you may need. Were you able to replicate the issue in your lab environment?
I've spoken directly with Jason on this to see if he can provide further information about his environment as I was not able to reproduce this in our vsphere cluster. Waiting to hear back before I close
Apologies - I have been pretty heads down on another project. Sent my response this morning and will be available today for collaboration.
@VestigeJ or @JLH993 Any updates on the status of this issue? Moving to the Backlog for now until we can confirm.
@caroline-suse-rancher I've tried updating the RKE2 version of the cluster and adding new Windows nodes, but haven't been successful yet. I was messaging @VestigeJ on Slack and we tested a few things, but haven't found a solution yet.
Hi, just wanted to see if there had been any updates on this. The node-driver-registrar is still in a CrashLoopBackOff state and I have not been able to successfully register a Windows node using the rancher-vsphere-csi chart. If there is any further information I can provide that may be of use, I am happy to work with someone to gather it.
I am also available if anyone needs me to run any tests, etc.
I was not able to reproduce the original issue in our vSphere cluster, which led me to believe there was something on the vSphere side of the fence that wasn't functioning as intended. The additional configuration on the vSphere side is where our environments drifted far enough apart to prevent me from taking it further.
@VestigeJ - I have an update on this one... I had to put this on the shelf and come back to it. I was able to get the vsphere-csi-node-windows daemonset running this morning. Details below:
I went back to try to understand the error a bit more. I think I was getting misled by seeing that the csi.sock file existed on the Windows node's filesystem:
PS C:\var\lib\kubelet\plugins\csi.vsphere.vmware.com> ls
Directory: C:\var\lib\kubelet\plugins\csi.vsphere.vmware.com
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 8/8/2023 9:40 PM 0 csi.sock
The path shown in the logs looked a bit odd as I noticed the backslashes seemed to be escaped twice (this turned out to be irrelevant, but it led me to find a solution.)
transport: Error while dialing dial unix \\\\var\\\\lib\\\\kubelet\\\\plugins\\\\csi.vsphere.vmware.com\\\\csi.sock: connect: A socket operation was attempted to an unreachable network.
Testing locally on the node itself, I could not list the csi.sock file using the path configured in the daemonset for DRIVER_REG_SOCK_PATH, which was set to \\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock. After updating the path in the daemonset to C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock, the pods were able to go into a running state.
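For reference, here is roughly what the corrected env block on the node-driver-registrar container looks like after that edit (a minimal sketch pulled from the pod spec above, showing only the relevant fields):

    # node-driver-registrar container in the vsphere-csi-node-windows daemonset;
    # everything else in the container spec is unchanged.
    env:
    - name: ADDRESS
      value: unix://C:\\csi\\csi.sock
    - name: DRIVER_REG_SOCK_PATH
      # previously \\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock;
      # the missing C: drive prefix is what kept registration from working
      value: C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock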
Here are the logs after the change:
I0809 14:37:47.682853 1700 main.go:167] Version: v2.7.0
2023-08-09T14:37:47.686514000+01:00 I0809 14:37:47.682853 1700 main.go:168] Running node-driver-registrar in mode=registration
2023-08-09T14:37:47.686514000+01:00 I0809 14:37:47.686514 1700 main.go:192] Attempting to open a gRPC connection with: "unix://C:\\\\csi\\\\csi.sock"
2023-08-09T14:37:47.687116600+01:00 I0809 14:37:47.686514 1700 connection.go:154] Connecting to unix://C:\\csi\\csi.sock
I0809 14:37:55.779604 1700 main.go:199] Calling CSI driver to discover driver name
2023-08-09T14:37:55.780553500+01:00 I0809 14:37:55.779604 1700 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
2023-08-09T14:37:55.783858400+01:00 I0809 14:37:55.779604 1700 connection.go:184] GRPC request: {}
2023-08-09T14:37:55.790137000+01:00 I0809 14:37:55.789639 1700 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v3.0.1"}
2023-08-09T14:37:55.790199000+01:00 I0809 14:37:55.789830 1700 connection.go:187] GRPC error: <nil>
2023-08-09T14:37:55.790281100+01:00 I0809 14:37:55.789938 1700 main.go:209] CSI driver name: "csi.vsphere.vmware.com"
2023-08-09T14:37:55.791613600+01:00 I0809 14:37:55.791284 1700 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
2023-08-09T14:37:55.793067100+01:00 I0809 14:37:55.792431 1700 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
2023-08-09T14:37:55.793735200+01:00 I0809 14:37:55.793351 1700 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
2023-08-09T14:37:57.468773300+01:00 I0809 14:37:57.468046 1700 main.go:102] Received GetInfo call: &InfoRequest{}
2023-08-09T14:37:57.471666700+01:00 I0809 14:37:57.471318 1700 main.go:109] "Kubelet registration probe created" path="C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\registration"
I0809 14:37:58.458318 1700 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
Not sure if the last line indicates there is some new error, but at least the pods in the daemonset are running now.
My question now is: how do I correct this path at provisioning time? I am using the rancher2 Terraform provider to create this cluster, and I would like to prevent any manual intervention if possible. I looked at the Helm chart that gets installed and did not see any way to modify this variable.
I've opened a pull request for this change: https://github.com/rancher/charts/pull/2885
Please review when possible. This seems like a bug to me, but maybe there is just some documentation that needs to be updated. If so, please point me in the right direction for information on how to patch the DRIVER_REG_SOCK_PATH environment variable on the Windows node-driver-registrar daemonset during the chart installation.
Hi @caroline-suse-rancher, is there anyone that might be able to look at my PR to see if these changes might be a viable solution? This has been blocking me from completing a project internally.
I don't mind helping out or making some changes to my PR if needed. Just let me know and I will be happy to do so.
I added a comment to your PR, but rancher/charts is not the upstream for the RKE2 chart. rancher/vsphere-charts is the upstream; from there things go to rancher/charts and rancher/rke2-charts.
We can take a look at getting this fixed for the October release cycle.
@brandond - thank you for the review. I like the idea of adding a separate prefix path for Windows in the values file and splitting them out. I have created a new pull request over on rancher/vsphere-charts with these changes; please take a look when you get a chance.
Please let me know if there is anything else you need me to do and I can get it added in.
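To illustrate the idea, the split might look something like this in the chart values (purely a hypothetical sketch; the actual key names and structure in rancher/vsphere-charts may differ - see the PR itself for the real change):

    # Hypothetical values.yaml excerpt - illustrative key names only,
    # not the actual rancher/vsphere-charts schema.
    csiNode:
      prefixPath: ""            # path prefix used when templating the Linux daemonset
      prefixPathWindows: "C:\\" # path prefix used for the Windows daemonset, so that
                                # DRIVER_REG_SOCK_PATH resolves to
                                # C:\var\lib\kubelet\plugins\csi.vsphere.vmware.com\csi.sock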
Hi, I just came across this discussion. In my case it is still stuck in 'Still connecting to unix://C:\csi\csi.sock'.
RKE version: 2.7.6
K8s version: 1.24
Win node: win2022 datacenter
vSphere CSI version: 3.0.1
CSI-proxy version: Running on the win node.
It works great on Linux Worker nodes.
I was just wondering if I should downgrade the Windows OS version to 2019, or just change some ENV value or something in the deployment; I am not sure where I should be looking, honestly. I am new to this.
Hi @sonergzn,
Can you show the full log output of the node-driver-registrar pod on the windows node? The key indicator you would be running into this same issue would be:
transport: Error while dialing dial unix \\\\var\\\\lib\\\\kubelet\\\\plugins\\\\csi.vsphere.vmware.com\\\\csi.sock: connect: A socket operation was attempted to an unreachable network.", restarting registration container.
If you wanted to test a change that would fix the above error, you would need to edit the vsphere-csi-node-windows daemonset to change DRIVER_REG_SOCK_PATH to: C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock
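If it helps, that edit can also be expressed as a strategic merge patch against the daemonset (a sketch; apply it with kubectl patch, or just kubectl edit the daemonset directly as described above, in whatever namespace the chart deployed it into):

    # Patch for the vsphere-csi-node-windows daemonset: only changes
    # DRIVER_REG_SOCK_PATH on the node-driver-registrar container.
    spec:
      template:
        spec:
          containers:
          - name: node-driver-registrar
            env:
            - name: DRIVER_REG_SOCK_PATH
              value: C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock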
Hi @JLH993,
Appreciate your quick reaction.
Here are the logs; not much going on in the node-driver-registrar v2.7.0 container:
I0927 15:00:32.875449 1540 main.go:167] Version: v2.7.0
2023-09-27T15:00:32.876535100+01:00 I0927 15:00:32.875991 1540 main.go:168] Running node-driver-registrar in mode=registration
2023-09-27T15:00:32.876535100+01:00 I0927 15:00:32.875991 1540 main.go:192] Attempting to open a gRPC connection with: "unix://C:\\csi\\csi.sock"
2023-09-27T15:00:32.876535100+01:00 I0927 15:00:32.875991 1540 connection.go:154] Connecting to unix://C:\csi\csi.sock
2023-09-27T15:00:42.885403400+01:00 W0927 15:00:42.885403 1540 connection.go:173] Still connecting to unix://C:\csi\csi.sock
2023-09-27T15:00:52.876287200+01:00 W0927 15:00:52.876287 1540 connection.go:173] Still connecting to unix://C:\csi\csi.sock
I have changed DRIVER_REG_SOCK_PATH to C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock and redeployed. No success.
I haven't noticed any issues related to underlying network connectivity.
@sonergzn hmmm... when you created the cluster, did you select Enable Windows CSI Support under Add-On Config? I am wondering if you are running into a different issue.
One separate issue I experienced before this was with vmtools not knowing what to do about the CNI interfaces. I had to add the following to C:\ProgramData\VMware\VMware Tools\tools.conf:
[guestinfo]
exclude-nics=docker*,veth*,cali*,flan*
Once this is added, vmtools will pick it up within the next 5 seconds. Also make sure that the VM name matches the hostname of the node. This does seem a little different than anything I ran into in regard to the DRIVER_REG_SOCK_PATH issue.
@JLH993 I see. I feel like it is a different issue/topic in my case.
I have indeed enabled that Enable Windows CSI Support during installation.
I didn't check anything related to VMTools. But in my case
@sonergzn - I would go ahead and give windows server 2019 standard a shot to see if there is any difference in your experience. FWIW, I am using windows server 2019 standard core.
You'll want to make sure the hostname and the actual VM name match or the vSphere cloud provider is going to have a hard time identifying the node and that could cause other issues you may not notice yet. This is something I ran into initially as well.
The csi.sock is being created as part of the deployment process and the pod you are looking at the logs for (node-driver-registrar) has a job to register the CSI driver with kubelet. If you wanted to double check that the csi.sock actually exists, you could login to the windows node and go see that it does/does not exist.
With the recent merge of https://github.com/rancher/vsphere-charts/pull/61 I am going to close this issue out and I would suggest opening a new issue to track down what might be causing the behavior you are seeing.
Thank you @brandond for getting this merged in!
Environmental Info:
RKE2 Version: rke2.exe version v1.24.9+rke2r2 (2f4571a879954e1ea8d4560023eaf57c567df737)

Node(s) CPU architecture, OS, and Version:
control-plane: ubuntu 22.04 (8 vCPU / 16GB RAM)
worker-node: ubuntu 22.04 (4 vCPU / 8GB RAM)
worker-node: windows server core 2019 ltsc (8 vCPU / 16GB RAM)

Cluster Configuration:
Describe the bug: After connecting all three nodes to the cluster and letting the CPI/CSI initialize and install, the node-driver-registrar pod on the Windows node is in a CrashLoopBackOff state and cannot register properly. The logs show the following error:
Steps To Reproduce:
Expected behavior: The Windows node-driver-registrar is able to register the node without error.
Actual behavior: The Windows node-driver-registrar fails to register the node; it cannot successfully connect to csi.sock.