rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Unable to complete installation of rancher-vsphere-csi on hybrid cluster #4025

Closed jason-idk closed 1 year ago

jason-idk commented 1 year ago

Environmental Info: RKE2 Version: rke2.exe version v1.24.9+rke2r2 (2f4571a879954e1ea8d4560023eaf57c567df737)

Node(s) CPU architecture, OS, and Version:
control-plane: Ubuntu 22.04 (8 vCPU / 16GB RAM)
worker-node: Ubuntu 22.04 (4 vCPU / 8GB RAM)
worker-node: Windows Server Core 2019 LTSC (8 vCPU / 16GB RAM)

Cluster Configuration:

Describe the bug: After connecting all three nodes to the cluster and letting the CPI/CSI initialize and install, the node-driver-registrar pod on the Windows node is in a CrashLoopBackOff state and cannot register properly. The logs show the following error:

I0316 12:22:40.230678 6952 main.go:166] Version: v2.5.1
I0316 12:22:40.231742 6952 main.go:167] Running node-driver-registrar in mode=registration
I0316 12:22:40.231784 6952 main.go:191] Attempting to open a gRPC connection with: "unix://C:\\\\csi\\\\csi.sock"
I0316 12:22:40.231784 6952 connection.go:154] Connecting to unix://C:\\csi\\csi.sock
W0316 12:22:50.249110 6952 connection.go:173] Still connecting to unix://C:\\csi\\csi.sock
I0316 12:22:54.449428 6952 main.go:198] Calling CSI driver to discover driver name
I0316 12:22:54.451098 6952 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
I0316 12:22:54.451098 6952 connection.go:184] GRPC request: {}
I0316 12:22:54.489559 6952 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v2.6.2"}
I0316 12:22:54.489559 6952 connection.go:187] GRPC error: <nil>
I0316 12:22:54.489559 6952 main.go:208] CSI driver name: "csi.vsphere.vmware.com"
I0316 12:22:54.492531 6952 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
I0316 12:22:54.493901 6952 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
I0316 12:22:54.496468 6952 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
I0316 12:22:54.593692 6952 main.go:102] Received GetInfo call: &InfoRequest{}
I0316 12:22:54.595982 6952 main.go:109] "Kubelet registration probe created" path="\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\registration"
I0316 12:22:56.947417 6952 main.go:120] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:false,Error:RegisterPlugin error -- plugin registration failed with err: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix \\\\var\\\\lib\\\\kubelet\\\\plugins\\\\csi.vsphere.vmware.com\\\\csi.sock: connect: A socket operation was attempted to an unreachable network.",}
E0316 12:22:56.947417 6952 main.go:122] Registration process failed with error: RegisterPlugin error -- plugin registration failed with err: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix \\\\var\\\\lib\\\\kubelet\\\\plugins\\\\csi.vsphere.vmware.com\\\\csi.sock: connect: A socket operation was attempted to an unreachable network.", restarting registration container. 

Steps To Reproduce:

Expected behavior: the Windows node-driver-registrar registers the node without error

Actual behavior: the Windows node-driver-registrar fails to register the node; it cannot successfully connect to csi.sock
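To observe this failure, commands along the following lines can be used. This is a sketch: the kube-system namespace is an assumption (the default for the rancher-vsphere-csi chart), and the label comes from the pod spec later in this thread.

```shell
# List the Windows CSI node pods (namespace is an assumption)
kubectl -n kube-system get pods -l app=vsphere-csi-node-windows -o wide

# Grab the first matching pod and tail the registrar container's last crash
POD=$(kubectl -n kube-system get pods -l app=vsphere-csi-node-windows \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system logs "$POD" -c node-driver-registrar --previous
```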

jason-idk commented 1 year ago

Here is the spec of a pod from the vsphere-csi-node-windows daemonset:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cattle.io/timestamp: '2023-03-15T19:46:50Z'
    cni.projectcalico.org/containerID: 9066348f5daa16f80f7127c681c77f7eee6fa0e5453b21d72c976a81d5cde6be
    cni.projectcalico.org/podIP: 10.42.122.138/32
    cni.projectcalico.org/podIPs: 10.42.122.138/32
    kubernetes.io/psp: global-unrestricted-psp
  creationTimestamp: '2023-03-15T19:46:52Z'
  generateName: vsphere-csi-node-windows-
  labels:
    app: vsphere-csi-node-windows
    controller-revision-hash: 6cbdc7fbfc
    pod-template-generation: '4'
    role: vsphere-csi-windows
  managedFields:
  <redacting managedFields to shorten>
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - hybww1
  containers:
    - args:
        - '--v=5'
        - '--csi-address=$(ADDRESS)'
        - '--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)'
      env:
        - name: ADDRESS
          value: unix://C:\\csi\\csi.sock
        - name: DRIVER_REG_SOCK_PATH
          value: \\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock
      image: rancher/mirrored-sig-storage-csi-node-driver-registrar:v2.5.1
      imagePullPolicy: IfNotPresent
      livenessProbe:
        exec:
          command:
            - /csi-node-driver-registrar.exe
            - >-
              --kubelet-registration-path=C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock
            - '--mode=kubelet-registration-probe'
        failureThreshold: 3
        initialDelaySeconds: 3
        periodSeconds: 10
        successThreshold: 1
        timeoutSeconds: 1
      name: node-driver-registrar
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /csi
          name: plugin-dir
        - mountPath: /registration
          name: registration-dir
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-4fgrx
          readOnly: true
    - args:
        - '--fss-name=internal-feature-states.csi.vsphere.vmware.com'
        - '--fss-namespace=$(CSI_NAMESPACE)'
      env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: CSI_ENDPOINT
          value: unix://C:\\csi\\csi.sock
        - name: MAX_VOLUMES_PER_NODE
          value: '0'
        - name: X_CSI_MODE
          value: node
        - name: X_CSI_SPEC_REQ_VALIDATION
          value: 'false'
        - name: X_CSI_SPEC_DISABLE_LEN_CHECK
          value: 'true'
        - name: LOGGER_LEVEL
          value: PRODUCTION
        - name: X_CSI_LOG_LEVEL
          value: DEBUG
        - name: CSI_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: NODEGETINFO_WATCH_TIMEOUT_MINUTES
          value: '1'
      image: rancher/mirrored-cloud-provider-vsphere-csi-release-driver:v2.6.2
      imagePullPolicy: IfNotPresent
      livenessProbe:
        failureThreshold: 3
        httpGet:
          path: /healthz
          port: healthz
          scheme: HTTP
        initialDelaySeconds: 10
        periodSeconds: 5
        successThreshold: 1
        timeoutSeconds: 5
      name: vsphere-csi-node
      ports:
        - containerPort: 9808
          name: healthz
          protocol: TCP
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: C:\csi
          name: plugin-dir
        - mountPath: C:\var\lib\kubelet
          name: pods-mount-dir
        - mountPath: \\.\pipe\csi-proxy-volume-v1
          name: csi-proxy-volume-v1
        - mountPath: \\.\pipe\csi-proxy-filesystem-v1
          name: csi-proxy-filesystem-v1
        - mountPath: \\.\pipe\csi-proxy-disk-v1
          name: csi-proxy-disk-v1
        - mountPath: \\.\pipe\csi-proxy-system-v1alpha1
          name: csi-proxy-system-v1alpha1
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-4fgrx
          readOnly: true
    - args:
        - '--v=4'
        - '--csi-address=/csi/csi.sock'
      image: rancher/mirrored-sig-storage-livenessprobe:v2.7.0
      imagePullPolicy: IfNotPresent
      name: liveness-probe
      resources: {}
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: /csi
          name: plugin-dir
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-4fgrx
          readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: hybww1
  nodeSelector:
    kubernetes.io/os: windows
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: vsphere-csi-node
  serviceAccountName: vsphere-csi-node
  terminationGracePeriodSeconds: 30
  tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
      operator: Exists
    - effect: NoSchedule
      key: node-role.kubernetes.io/controlplane
      value: 'true'
    - effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
      operator: Exists
    - effect: NoExecute
      key: node-role.kubernetes.io/etcd
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/disk-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/memory-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/pid-pressure
      operator: Exists
    - effect: NoSchedule
      key: node.kubernetes.io/unschedulable
      operator: Exists
  volumes:
    - hostPath:
        path: C:\var\lib\kubelet\plugins_registry\
        type: Directory
      name: registration-dir
    - hostPath:
        path: C:\var\lib\kubelet\plugins\csi.vsphere.vmware.com\
        type: DirectoryOrCreate
      name: plugin-dir
    - hostPath:
        path: \var\lib\kubelet
        type: Directory
      name: pods-mount-dir
    - hostPath:
        path: \\.\pipe\csi-proxy-disk-v1
        type: ''
      name: csi-proxy-disk-v1
    - hostPath:
        path: \\.\pipe\csi-proxy-volume-v1
        type: ''
      name: csi-proxy-volume-v1
    - hostPath:
        path: \\.\pipe\csi-proxy-filesystem-v1
        type: ''
      name: csi-proxy-filesystem-v1
    - hostPath:
        path: \\.\pipe\csi-proxy-system-v1alpha1
        type: ''
      name: csi-proxy-system-v1alpha1
    - name: kube-api-access-4fgrx
      projected:
        defaultMode: 420
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              items:
                - key: ca.crt
                  path: ca.crt
              name: kube-root-ca.crt
          - downwardAPI:
              items:
                - fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
                  path: namespace

I see named pipes being used for some of the paths, but what confuses me is why a Unix socket type mount is being used for the csi.sock path. I did not think this was supported on Windows? https://kubernetes.io/docs/tasks/configure-pod-container/create-hostprocess-pod/#limitations

I did not create this daemonset; it was created when deploying from Helm as described above. Am I doing something wrong here?

brandond commented 1 year ago

We'll take a look at this; it does look like at least one of the socket paths is misconfigured on Windows nodes.

For the record, Windows has supported AF_UNIX sockets since Windows 10 (circa 2018).

jason-idk commented 1 year ago

Thanks @brandond! Much appreciated.

I will be out of town all next week, so I may not be able to respond then, but I will try to answer any questions and be available for testing etc as soon as possible.

brandond commented 1 year ago

Our QA team (including @VestigeJ) has access to a lab where they should be able to try to replicate this.

jason-idk commented 1 year ago

Hi @brandond and @VestigeJ,

Just wanted to let you both know I am back from being out of town and will be available again for anything you may need. Were you able to replicate the issue in your lab environment?

VestigeJ commented 1 year ago

I've spoken directly with Jason on this to see if he can provide further information about his environment, as I was not able to reproduce this in our vSphere cluster. Waiting to hear back before I close.

jason-idk commented 1 year ago

Apologies - I have been pretty heads down on another project. Sent my response this morning and will be available today for collaboration.

caroline-suse-rancher commented 1 year ago

@VestigeJ or @JLH993 Any updates on the status of this issue? Moving to the Backlog for now until we can confirm.

jason-idk commented 1 year ago

@caroline-suse-rancher I've tried updating the RKE2 version of the cluster and adding new Windows nodes, but haven't been successful yet. I was messaging @VestigeJ on Slack and we tested a few things but haven't found a solution yet.

jason-idk commented 1 year ago

Hi, just wanted to see if there had been any updates on this. The node-driver-registrar is still in a CrashLoopBackOff state and I have not been able to successfully register a Windows node using rancher-vsphere-csi. If there is any further information I can provide that may be of use, I am happy to work with someone to gather it.

I am also available if anyone needs me to run any tests.

VestigeJ commented 1 year ago

I was not able to reproduce the original issue in our vSphere cluster, which led me to believe there was something on the vSphere side of the fence that wasn't functioning as intended. The additional configuration on the vSphere side is where our environments drifted far enough apart to prevent me from taking it further.

jason-idk commented 1 year ago

@VestigeJ - I have an update on this one... I had to put this on the shelf and come back to it. I was able to get the vsphere-csi-node-windows daemonset running this morning. Details below:

I went back to try to understand the error a bit more. I think I was misled by the fact that the csi.sock file existed on the Windows node's filesystem:

PS C:\var\lib\kubelet\plugins\csi.vsphere.vmware.com> ls

    Directory: C:\var\lib\kubelet\plugins\csi.vsphere.vmware.com

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
-a----         8/8/2023   9:40 PM              0 csi.sock

The path shown in the logs looked a bit odd, as the backslashes seemed to be escaped twice (this turned out to be irrelevant, but it led me to a solution):

transport: Error while dialing dial unix \\\\var\\\\lib\\\\kubelet\\\\plugins\\\\csi.vsphere.vmware.com\\\\csi.sock: connect: A socket operation was attempted to an unreachable network.

Testing locally on the node itself, I could not list the csi.sock file using the path configured in the daemonset for DRIVER_REG_SOCK_PATH, which was set to \\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock. After updating the path in the daemonset to C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock, the pods were able to go into a running state.
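For reference, the change amounts to a single env entry on the node-driver-registrar container of the vsphere-csi-node-windows daemonset; this sketch reproduces the value from the spec above, with the C: drive prefix being the fix:

```yaml
env:
  # was: \\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock
  - name: DRIVER_REG_SOCK_PATH
    value: C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock
```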

Here are the logs after the change:

I0809 14:37:47.682853    1700 main.go:167] Version: v2.7.0
2023-08-09T14:37:47.686514000+01:00 I0809 14:37:47.682853    1700 main.go:168] Running node-driver-registrar in mode=registration
2023-08-09T14:37:47.686514000+01:00 I0809 14:37:47.686514    1700 main.go:192] Attempting to open a gRPC connection with: "unix://C:\\\\csi\\\\csi.sock"
2023-08-09T14:37:47.687116600+01:00 I0809 14:37:47.686514    1700 connection.go:154] Connecting to unix://C:\\csi\\csi.sock
I0809 14:37:55.779604    1700 main.go:199] Calling CSI driver to discover driver name
2023-08-09T14:37:55.780553500+01:00 I0809 14:37:55.779604    1700 connection.go:183] GRPC call: /csi.v1.Identity/GetPluginInfo
2023-08-09T14:37:55.783858400+01:00 I0809 14:37:55.779604    1700 connection.go:184] GRPC request: {}
2023-08-09T14:37:55.790137000+01:00 I0809 14:37:55.789639    1700 connection.go:186] GRPC response: {"name":"csi.vsphere.vmware.com","vendor_version":"v3.0.1"}
2023-08-09T14:37:55.790199000+01:00 I0809 14:37:55.789830    1700 connection.go:187] GRPC error: <nil>
2023-08-09T14:37:55.790281100+01:00 I0809 14:37:55.789938    1700 main.go:209] CSI driver name: "csi.vsphere.vmware.com"
2023-08-09T14:37:55.791613600+01:00 I0809 14:37:55.791284    1700 node_register.go:53] Starting Registration Server at: /registration/csi.vsphere.vmware.com-reg.sock
2023-08-09T14:37:55.793067100+01:00 I0809 14:37:55.792431    1700 node_register.go:62] Registration Server started at: /registration/csi.vsphere.vmware.com-reg.sock
2023-08-09T14:37:55.793735200+01:00 I0809 14:37:55.793351    1700 node_register.go:92] Skipping HTTP server because endpoint is set to: ""
2023-08-09T14:37:57.468773300+01:00 I0809 14:37:57.468046    1700 main.go:102] Received GetInfo call: &InfoRequest{}
2023-08-09T14:37:57.471666700+01:00 I0809 14:37:57.471318    1700 main.go:109] "Kubelet registration probe created" path="C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\registration"
I0809 14:37:58.458318    1700 main.go:121] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

Not sure if the last line indicates there is some new error, but at least the pods in the daemonset are running now.
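For what it's worth, PluginRegistered:true with an empty Error field in that final line indicates successful registration rather than a new error. One way to double-check (hedged; csinode is a standard Kubernetes resource, and the node name hybww1 comes from the pod spec earlier in this thread):

```shell
# The drivers list should now include csi.vsphere.vmware.com for the Windows node
kubectl get csinode hybww1 -o jsonpath='{.spec.drivers[*].name}'
```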

jasonhensley commented 1 year ago

My question is: how do I correct this path at provisioning time? I am using the rancher2 Terraform provider to create this cluster and I would like to avoid any manual intervention if possible. I was looking at the Helm chart that gets installed and did not see any way to modify this variable.

jason-idk commented 1 year ago

I've opened a pull request for this change: https://github.com/rancher/charts/pull/2885

Please review when possible. This seems like a bug to me, but maybe there is just some documentation that needs to be updated. If so, please point me in the right direction for information on how to patch the DRIVER_REG_SOCK_PATH environment variable on the Windows node-driver-registrar daemonset during the chart installation.

jason-idk commented 1 year ago

Hi @caroline-suse-rancher, is there anyone that might be able to look at my PR to see if these changes might be a viable solution? This has been blocking me from completing a project internally.

I don't mind helping out or making some changes to my PR if needed. Just let me know and I will be happy to do so.

brandond commented 1 year ago

I added a comment to your PR, but rancher/charts is not the upstream for the RKE2 chart. rancher/vsphere-charts is the upstream; from there, changes flow to rancher/charts and rancher/rke2-charts.

We can take a look at getting this fixed for the October release cycle.

jason-idk commented 1 year ago

@brandond - thank you for the review. I like the idea of adding a separate prefix path for Windows in the values file and splitting them out. I have created a new pull request over on rancher/vsphere-charts with these changes; please take a look when you get a chance.

Please let me know if there is anything else you need me to do and I can get it added in.

sonergzn commented 1 year ago

Hi, I just came across this discussion. In my case it is still stuck in 'Still connecting to unix://C:\csi\csi.sock'.

RKE2 version: 2.7.6
Kubernetes version: 1.24
Windows node: Windows Server 2022 Datacenter
vSphere CSI version: 3.0.1
csi-proxy: running on the Windows node

It works great on Linux Worker nodes.

I was wondering if I should downgrade the Windows OS version to 2019, or just change some env value in the deployment; I'm not sure where I should be looking, honestly. I am new to this.

jason-idk commented 1 year ago

Hi @sonergzn,

Can you show the full log output of the node-driver-registrar pod on the Windows node? The key indicator that you are running into this same issue would be: transport: Error while dialing dial unix \\\\var\\\\lib\\\\kubelet\\\\plugins\\\\csi.vsphere.vmware.com\\\\csi.sock: connect: A socket operation was attempted to an unreachable network.", restarting registration container.

If you want to test a change that fixes the above error, edit the vsphere-csi-node-windows daemonset and change DRIVER_REG_SOCK_PATH to: C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock
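One way to apply that edit without opening an editor, as a sketch; the kube-system namespace and container name are assumptions based on the pod spec earlier in this thread:

```shell
# Point the registrar at the drive-letter-prefixed socket path; the daemonset
# controller then rolls the Windows pods with the new env value.
kubectl -n kube-system set env daemonset/vsphere-csi-node-windows \
  -c node-driver-registrar \
  'DRIVER_REG_SOCK_PATH=C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock'
```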

sonergzn commented 1 year ago

Hi @JLH993,

Appreciate your quick reaction.

Here are the logs; not much going on in the node-driver-registrar 2.7.0 container:

I0927 15:00:32.875449 1540 main.go:167] Version: v2.7.0
2023-09-27T15:00:32.876535100+01:00 I0927 15:00:32.875991 1540 main.go:168] Running node-driver-registrar in mode=registration
2023-09-27T15:00:32.876535100+01:00 I0927 15:00:32.875991 1540 main.go:192] Attempting to open a gRPC connection with: "unix://C:\\csi\\csi.sock"
2023-09-27T15:00:32.876535100+01:00 I0927 15:00:32.875991 1540 connection.go:154] Connecting to unix://C:\csi\csi.sock
2023-09-27T15:00:42.885403400+01:00 W0927 15:00:42.885403 1540 connection.go:173] Still connecting to unix://C:\csi\csi.sock
2023-09-27T15:00:52.876287200+01:00 W0927 15:00:52.876287 1540 connection.go:173] Still connecting to unix://C:\csi\csi.sock

I have changed DRIVER_REG_SOCK_PATH to C:\\var\\lib\\kubelet\\plugins\\csi.vsphere.vmware.com\\csi.sock and redeployed. No success.

I haven't noticed any issues related to underlying network connectivity.

jason-idk commented 1 year ago

@sonergzn hmmm... when you created the cluster, did you select Enable Windows CSI Support under Add-On Config? I am wondering if you are running into a different issue.

One separate issue I experienced before this was with vmtools not knowing what to do about the CNI interfaces. I had to add the following to C:\ProgramData\VMware\VMware Tools\tools.conf:

[guestinfo]
exclude-nics=docker*,veth*,cali*,flan*

After adding this, vmtools picks it up within about 5 seconds. Also make sure that the VM name matches the hostname of the node. This does seem a little different than anything I ran into with regard to the DRIVER_REG_SOCK_PATH issue.

sonergzn commented 1 year ago

@JLH993 I see. I feel like it is a different issue/topic in my case.

I have indeed enabled that Enable Windows CSI Support during installation.

I didn't check anything related to VMTools. But in my case

jason-idk commented 1 year ago

@sonergzn - I would go ahead and give Windows Server 2019 Standard a shot to see if there is any difference in your experience. FWIW, I am using Windows Server 2019 Standard Core.

You'll want to make sure the hostname and the actual VM name match, or the vSphere cloud provider is going to have a hard time identifying the node, which could cause other issues you may not notice yet. This is something I ran into initially as well.

The csi.sock is created as part of the deployment process, and the pod whose logs you are looking at (node-driver-registrar) has the job of registering the CSI driver with the kubelet. If you want to double-check that csi.sock actually exists, you can log in to the Windows node and look for it.

With the recent merge of https://github.com/rancher/vsphere-charts/pull/61, I am going to close this issue out, and I would suggest opening a new issue to track down what might be causing the behavior you are seeing.

Thank you @brandond for getting this merged in!