rancher / rke2

[Release-1.27] - Vsphere-csi updates to 3.1.2-rancher300 #5768

Closed galal-hussein closed 6 months ago

galal-hussein commented 6 months ago

Backport fix for Vsphere-csi updates to 3.1.2-rancher300

rancher-max commented 6 months ago

Validated on release-1.27 branch with commit 191329a6e3faac429ec145523e8d84c1b3be81fe - Scenario 1 failing!

Failing Issue:

After these chart changes, new PVCs fail to get their associated PVs created. This happens on both fresh installs and upgrades. The PVC events are:

Events:
  Type     Reason                Age               From                                                                                                Message
  ----     ------                ----              ----                                                                                                -------
  Normal   Provisioning          29s               csi.vsphere.vmware.com_vsphere-csi-controller-f57fb7df8-phs7b_8aa1a816-399e-4eb4-b950-62629596a402  External provisioner is provisioning volume for claim "default/claim1"
  Warning  ProvisioningFailed    29s               csi.vsphere.vmware.com_vsphere-csi-controller-f57fb7df8-phs7b_8aa1a816-399e-4eb4-b950-62629596a402  failed to provision volume with StorageClass "vsphere-csi-sc": rpc error: code = Unavailable desc = error reading from server: EOF
  Normal   Provisioning          7s                csi.vsphere.vmware.com_vsphere-csi-controller-f57fb7df8-g8wr6_f4f565d6-b559-4129-a08e-02da5bb0c197  External provisioner is provisioning volume for claim "default/claim1"
  Warning  ProvisioningFailed    7s                csi.vsphere.vmware.com_vsphere-csi-controller-f57fb7df8-g8wr6_f4f565d6-b559-4129-a08e-02da5bb0c197  failed to provision volume with StorageClass "vsphere-csi-sc": rpc error: code = Unavailable desc = error reading from server: EOF
  Normal   ExternalProvisioning  6s (x4 over 29s)  persistentvolume-controller                                                                         waiting for a volume to be created, either by external provisioner "csi.vsphere.vmware.com" or manually created by system administrator
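
A quick way to confirm that a claim is stuck, as a minimal sketch assuming kubectl is on the PATH and pointed at the cluster. The helper name pvc_phase is hypothetical and not part of RKE2 or the chart. A claim hit by this failure stays in Pending instead of reaching Bound.

```shell
# Hypothetical helper: print the phase of a PVC (for example Pending or Bound).
# Usage: pvc_phase <claim-name> [namespace]
pvc_phase() {
  kubectl get pvc "$1" -n "${2:-default}" -o jsonpath='{.status.phase}'
}
```

For the claim in this issue: pvc_phase claim1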

I can work around this by setting csi-auth-check: "true", either by manually updating the configmap after install or by including the following in the HelmChartConfig values ahead of time:

    csiAuthCheck:
      enabled: true
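
The manual configmap update could look roughly like the sketch below. It assumes kubectl access to the cluster and uses the configmap name and key that appear in this issue; the function name enable_csi_auth_check is hypothetical. Whether the CSI controller pods need a restart to pick up the change is not confirmed here, and a manual edit to a Helm-managed configmap may be overwritten when the chart is redeployed.

```shell
# Hypothetical helper: set csi-auth-check to "true" in the feature-states
# configmap with a merge patch, leaving the other keys untouched.
enable_csi_auth_check() {
  kubectl patch configmap internal-feature-states.csi.vsphere.vmware.com \
    -n kube-system --type merge \
    -p '{"data":{"csi-auth-check":"true"}}'
}
```

Run enable_csi_auth_check once after install, then re-create the failing PVC to see whether provisioning succeeds.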

Environment Details

Infrastructure

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release | grep PRETTY
PRETTY_NAME="Ubuntu 22.04.3 LTS"

Cluster Configuration:

1 server

Config.yaml:

# /etc/rancher/rke2/config.yaml
write-kubeconfig-mode: 644
cloud-provider-name: "rancher-vsphere"

Additional files

# /var/lib/rancher/rke2/server/manifests/vsphere-values-1.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-cpi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "aa.bb.ccc.dd"
      datacenters: "Datacenter"
      username: "username"
      password: "password"
      credentialsSecret:
        generate: true
    cloudControllerManager:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-csi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "aa.bb.ccc.dd"
      datacenters: "Datacenter"
      username: "username"
      password: "password"
      clusterId: "maxtestcluster1"
      configSecret:
        generate: true
    storageClass:
      datastoreURL: "ds:///vmfs/volumes/redacted/"
    csiController:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"

# /var/lib/rancher/rke2/server/manifests/vsphere-values-2.yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-cpi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "aa.bb.ccc.dd"
      datacenters: "Datacenter"
      username: "username"
      password: "password"
      credentialsSecret:
        generate: true
    cloudControllerManager:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
---
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: rancher-vsphere-csi
  namespace: kube-system
spec:
  valuesContent: |-
    vCenter:
      host: "aa.bb.ccc.dd"
      datacenters: "Datacenter"
      username: "username"
      password: "password"
      clusterId: "maxtestcluster1"
      configSecret:
        generate: true
    storageClass:
      datastoreURL: "ds:///vmfs/volumes/redacted/"
    csiController:
      nodeSelector:
        node-role.kubernetes.io/control-plane: "true"
    topology:
      enabled: true
    multiVcenterCsiTopology:
      enabled: false
    csiAuthCheck:
      enabled: true

# pvcpod.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim1
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: vsphere-csi-sc
  resources:
    requests:
      storage: 1Gi
---
apiVersion: "v1"
kind: "Pod"
metadata:
  name: "basic"
  labels:
    name: "basic"
spec:
  nodeSelector:
    kubernetes.io/os: linux
  containers:
    - name: "basic"
      image: ranchertest/mytestcontainer:unprivileged
      ports:
        - containerPort: 8080
          name: "basic"
      volumeMounts:
        - mountPath: "/data"
          name: "pvol"
  volumes:
    - name: "pvol"
      persistentVolumeClaim:
        claimName: "claim1"

Testing Steps

Scenario 1:

  1. Install RKE2 using vsphere-values-1.yaml
  2. Deploy pvcpod.yaml: kubectl apply -f pvcpod.yaml
  3. Ensure node is using the vsphere provider: k describe node | grep -i providerid (expecting vsphere://<something>)
  4. Ensure all nodes and pods are up and running and not crashlooping
  5. Check logs for multi-vcenter: k logs -n kube-system -l app=vsphere-csi-controller --all-containers | grep -i multi-vcenter-csi-topology (expecting nothing to return)
  6. Ensure configmap has the value set: k get cm -n kube-system internal-feature-states.csi.vsphere.vmware.com -o yaml | grep -i multi (expecting multi-vcenter-csi-topology: "true")
  7. Check helm version to ensure it uses the correct one: helm ls -A (expecting rancher-vsphere-csi-3.1.2-rancher300)
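
Step 7 can be scripted instead of read by eye. The sketch below is one possible approach: it reads helm ls -A output on stdin and succeeds only if a whitespace-separated field exactly matches the expected chart name. The function name has_chart is hypothetical.

```shell
# Hypothetical helper: exit 0 if any field of the `helm ls -A` output on stdin
# equals the expected chart string, otherwise exit 1.
has_chart() {
  awk -v chart="$1" '
    { for (i = 1; i <= NF; i++) if ($i == chart) found = 1 }
    END { exit !found }
  '
}
```

Example: helm ls -A | has_chart rancher-vsphere-csi-3.1.2-rancher300 && echo "chart ok"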

Scenario 2:

  1. Install older version of rke2 (v1.26.14+rke2r1)
  2. Perform steps 2-7 from Scenario 1. For steps 6 and 7, expect no output and rancher-vsphere-csi-3.1.2-rancher101, respectively
  3. Upgrade to version under test
  4. Perform steps 3-7 from Scenario 1.

Scenario 3:

  1. Install RKE2 using vsphere-values-2.yaml
  2. Ensure the values are present in the configmap: k get cm -n kube-system internal-feature-states.csi.vsphere.vmware.com -o yaml (expecting each flag to be true or false as set in vsphere-values-2.yaml)
  3. Perform steps 2-4 from Scenario 1 to ensure everything is up and running correctly
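
To check individual flags rather than scanning the whole configmap, a small filter over the YAML output may help. This is a sketch: feature_state is a hypothetical helper name, and it assumes the flat key: "value" layout that this configmap's data section uses.

```shell
# Hypothetical helper: read configmap YAML on stdin and print the value of
# one feature flag with its surrounding double quotes removed.
feature_state() {
  awk -v key="$1:" '$1 == key { gsub(/"/, "", $2); print $2; exit }'
}
```

Example: k get cm -n kube-system internal-feature-states.csi.vsphere.vmware.com -o yaml | feature_state csi-auth-check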

Replication Results:

Validation Results:

rancher-max commented 6 months ago

Validated on commit bfc28051f45a8d8d2794bdffaf0501491094eecb on release-1.27 that the chart version has been updated to rancher-vsphere-csi-3.1.2-rancher400, and csiAuthCheck is now defaulted to enabled: true. See all defaults now below:

$ k get cm internal-feature-states.csi.vsphere.vmware.com -n kube-system -o yaml
apiVersion: v1
data:
  async-query-volume: "false"
  block-volume-snapshot: "false"
  cnsmgr-suspend-create-volume: "false"
  csi-auth-check: "true"
  csi-migration: "false"
  csi-windows-support: "false"
  improved-csi-idempotency: "false"
  improved-volume-topology: "false"
  list-volumes: "false"
  max-pvscsi-targets-per-vm: "false"
  multi-vcenter-csi-topology: "true"
  online-volume-extend: "false"
  pv-to-backingdiskobjectid-mapping: "false"
  topology-preferential-datastores: "false"
  trigger-csi-fullsync: "false"
  use-csinode-id: "true"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: rancher-vsphere-csi
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-04-19T17:30:01Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: internal-feature-states.csi.vsphere.vmware.com
  namespace: kube-system
  resourceVersion: "622"
  uid: 4a11cb86-0887-4b98-bd46-254c4c61ce8a

$ helm ls -A
NAME                                NAMESPACE   REVISION    UPDATED                                 STATUS      CHART                                       APP VERSION   
rancher-vsphere-cpi                 kube-system 1           2024-04-19 17:30:00.635933258 +0000 UTC deployed    rancher-vsphere-cpi-1.7.001                 1.28.0        
rancher-vsphere-csi                 kube-system 1           2024-04-19 17:29:59.835948691 +0000 UTC deployed    rancher-vsphere-csi-3.1.2-rancher400        3.1.2-rancher4