nutanix / helm

Nutanix Helm Charts repository
https://nutanix.github.io/helm/
MIT License

Problems assigning the storage class with Helm for the CSI storage driver #45

Closed eyanez111 closed 2 years ago

eyanez111 commented 2 years ago

Hello, I am trying to assign a storage class on a Rancher cluster. I followed this process: https://portal.nutanix.com/page/documents/details?targetId=CSI-Volume-Driver-v2_5:CSI-Volume-Driver-v2_5

I saw that there were some fields that I needed to edit: https://github.com/nutanix/helm/blob/nutanix-csi-storage-2.5.0/charts/nutanix-csi-storage/values.yaml

so I downloaded the file and added the requested fields:

prismEndPoint:
username:
password:
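
For reference, a typical install supplying those values could look roughly like the sketch below. The chart repository URL, chart names, and value keys are the ones that appear in this thread and in the linked values.yaml; the angle-bracket placeholders are illustrative, so treat this as a hedged sketch rather than the exact commands that were run.

```shell
# Sketch: add the Nutanix chart repo, install the snapshot chart first
# (the order recommended later in this thread), then the storage chart
# with the Prism credentials supplied as values.
helm repo add nutanix https://nutanix.github.io/helm/
helm repo update

helm install nutanix-csi-snapshot nutanix/nutanix-csi-snapshot \
  -n ntnx-system --create-namespace

helm install nutanix-csi-storage nutanix/nutanix-csi-storage \
  -n ntnx-system \
  --set prismEndPoint=<PRISM-ELEMENT-VIP> \
  --set username=<PRISM-USERNAME> \
  --set password=<PRISM-PASSWORD> \
  --set storageContainer=<STORAGE-CONTAINER-NAME> \
  --set fsType=ext4
```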

The containers start:

kubectl get pods -A | grep csi
ntnx-system   csi-node-ntnx-plugin-ckh2l      3/3   Running   0   12m
ntnx-system   csi-node-ntnx-plugin-dn5br      3/3   Running   0   12m
ntnx-system   csi-node-ntnx-plugin-h4s9c      3/3   Running   0   12m
ntnx-system   csi-node-ntnx-plugin-kzhn7      3/3   Running   0   12m
ntnx-system   csi-node-ntnx-plugin-slzzw      3/3   Running   0   12m
ntnx-system   csi-node-ntnx-plugin-wftpg      3/3   Running   0   12m
ntnx-system   csi-provisioner-ntnx-plugin-0   5/5   Running   0   12m

but when I check the logs I am getting this error:

error: a container name must be specified for pod csi-provisioner-ntnx-plugin-0, choose one of: [csi-provisioner csi-resizer csi-snapshotter ntnx-csi-plugin liveness-probe]

not sure how to fix that or if I am missing anything on the installation.

Thanks Francisco Yanez

subodh01 commented 2 years ago

Try: kubectl -n ntnx-system logs csi-provisioner-ntnx-plugin-0 -c ntnx-csi-plugin
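
As a side note, the container names inside a multi-container pod can be listed with a generic kubectl query (nothing specific to this chart), which is handy before picking a `-c` argument:

```shell
# Sketch: list the containers in the provisioner pod, then tail one of them.
kubectl -n ntnx-system get pod csi-provisioner-ntnx-plugin-0 \
  -o jsonpath='{.spec.containers[*].name}{"\n"}'
kubectl -n ntnx-system logs csi-provisioner-ntnx-plugin-0 -c ntnx-csi-plugin
```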

eyanez111 commented 2 years ago

This is what I got

 kubectl -n ntnx-system logs csi-provisioner-ntnx-plugin-0 -c ntnx-csi-plugin.
error: container ntnx-csi-plugin. is not valid for pod csi-provisioner-ntnx-plugin-0

Apparently that is not valid (the trailing period got included in the container name).

eyanez111 commented 2 years ago

Ok I actually made it work... I just needed to wait after I deployed the pods again:

kubectl -n ntnx-system logs csi-provisioner-ntnx-plugin-0 -c ntnx-csi-plugin
I0112 01:19:26.393641       1 ntnx_driver.go:84] Enabling volume access mode: SINGLE_NODE_WRITER
I0112 01:19:26.393716       1 ntnx_driver.go:84] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0112 01:19:26.393720       1 ntnx_driver.go:84] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0112 01:19:26.393723       1 ntnx_driver.go:94] Enabling controller service capability: CREATE_DELETE_VOLUME
I0112 01:19:26.393727       1 ntnx_driver.go:94] Enabling controller service capability: EXPAND_VOLUME
I0112 01:19:26.393729       1 ntnx_driver.go:94] Enabling controller service capability: CLONE_VOLUME
I0112 01:19:26.393732       1 ntnx_driver.go:94] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0112 01:19:26.393738       1 ntnx_driver.go:104] Enabling node service capability: GET_VOLUME_STATS
I0112 01:19:26.393741       1 ntnx_driver.go:104] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0112 01:19:26.393750       1 ntnx_driver.go:104] Enabling node service capability: EXPAND_VOLUME
I0112 01:19:26.393755       1 ntnx_driver.go:145] Driver: csi.nutanix.com
I0112 01:19:26.393963       1 server.go:98] Listening for connections on address: &net.UnixAddr{Name:"//var/lib/csi/sockets/pluginproxy/csi.sock", Net:"unix"}
2022-01-12T01:19:26.829Z identity.go:23: [INFO] Using default GetPluginInfo
2022-01-12T01:19:26.83Z identity.go:39: [INFO] Using default GetPluginCapabilities
2022-01-12T01:19:26.902Z identity.go:23: [INFO] Using default GetPluginInfo
2022-01-12T01:19:26.902Z identity.go:39: [INFO] Using default GetPluginCapabilities
2022-01-12T01:19:26.943Z identity.go:23: [INFO] Using default GetPluginInfo
2022-01-12T01:19:27.204Z identity.go:23: [INFO] Using default GetPluginInfo

does this mean it will work now?

Thanks Francisco Yanez

eyanez111 commented 2 years ago

I also found a few things when I pointed to other containers:

kubectl -n ntnx-system logs csi-provisioner-ntnx-plugin-0 -c csi-resizer
I0112 01:19:25.823514       1 main.go:90] Version : v1.2.0
I0112 01:19:25.823555       1 feature_gate.go:243] feature gates: &{map[]}
I0112 01:19:25.825050       1 connection.go:153] Connecting to unix:///var/lib/csi/sockets/pluginproxy/csi.sock
I0112 01:19:26.825774       1 common.go:111] Probing CSI driver for readiness
I0112 01:19:26.825792       1 connection.go:182] GRPC call: /csi.v1.Identity/Probe
I0112 01:19:26.825797       1 connection.go:183] GRPC request: {}
I0112 01:19:26.829425       1 connection.go:185] GRPC response: {}
I0112 01:19:26.829475       1 connection.go:186] GRPC error: <nil>
I0112 01:19:26.829484       1 connection.go:182] GRPC call: /csi.v1.Identity/GetPluginInfo
I0112 01:19:26.829487       1 connection.go:183] GRPC request: {}
I0112 01:19:26.829803       1 connection.go:185] GRPC response: {"name":"csi.nutanix.com","vendor_version":"v1.1.0"}
I0112 01:19:26.829845       1 connection.go:186] GRPC error: <nil>
I0112 01:19:26.829852       1 main.go:138] CSI driver name: "csi.nutanix.com"
I0112 01:19:26.829860       1 connection.go:182] GRPC call: /csi.v1.Identity/GetPluginCapabilities
I0112 01:19:26.829863       1 connection.go:183] GRPC request: {}
I0112 01:19:26.830312       1 connection.go:185] GRPC response: {"capabilities":[{"Type":{"Service":{"type":1}}},{"Type":{"VolumeExpansion":{"type":1}}}]}
I0112 01:19:26.830400       1 connection.go:186] GRPC error: <nil>
I0112 01:19:26.830412       1 connection.go:182] GRPC call: /csi.v1.Controller/ControllerGetCapabilities
I0112 01:19:26.830414       1 connection.go:183] GRPC request: {}
I0112 01:19:26.830788       1 connection.go:185] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":9}}},{"Type":{"Rpc":{"type":7}}},{"Type":{"Rpc":{"type":5}}}]}
I0112 01:19:26.830901       1 connection.go:186] GRPC error: <nil>
I0112 01:19:26.831035       1 main.go:166] ServeMux listening at ":9810"
I0112 01:19:26.831184       1 controller.go:251] Starting external resizer csi.nutanix.com
I0112 01:19:26.831424       1 reflector.go:219] Starting reflector *v1.PersistentVolumeClaim (10m0s) from k8s.io/client-go/informers/factory.go:134
I0112 01:19:26.831437       1 reflector.go:255] Listing and watching *v1.PersistentVolumeClaim from k8s.io/client-go/informers/factory.go:134
I0112 01:19:26.831523       1 reflector.go:219] Starting reflector *v1.PersistentVolume (10m0s) from k8s.io/client-go/informers/factory.go:134
I0112 01:19:26.831534       1 reflector.go:255] Listing and watching *v1.PersistentVolume from k8s.io/client-go/informers/factory.go:134
I0112 01:19:26.931397       1 shared_informer.go:270] caches populated
I0112 01:19:26.931479       1 controller.go:291] Started PVC processing "centralized-logging/data-my-cluster-zookeeper-0"
I0112 01:19:26.931493       1 controller.go:291] Started PVC processing "centralized-logging/data-my-cluster-zookeeper-1"
I0112 01:19:26.931497       1 controller.go:291] Started PVC processing "centralized-logging/data-my-cluster-zookeeper-2"
W0112 01:19:26.931523       1 controller.go:318] PV "" bound to PVC centralized-logging/data-my-cluster-zookeeper-2 not found
W0112 01:19:26.931496       1 controller.go:318] PV "" bound to PVC centralized-logging/data-my-cluster-zookeeper-0 not found
W0112 01:19:26.931513       1 controller.go:318] PV "" bound to PVC centralized-logging/data-my-cluster-zookeeper-1 not found
I0112 01:27:27.839571       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolumeClaim total 0 items received
I0112 01:29:08.840254       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolume total 0 items received
I0112 01:29:26.839352       1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
I0112 01:29:26.839428       1 controller.go:291] Started PVC processing "centralized-logging/data-my-cluster-zookeeper-0"
W0112 01:29:26.839438       1 controller.go:318] PV "" bound to PVC centralized-logging/data-my-cluster-zookeeper-0 not found
I0112 01:29:26.839475       1 controller.go:291] Started PVC processing "centralized-logging/data-my-cluster-zookeeper-1"
W0112 01:29:26.839482       1 controller.go:318] PV "" bound to PVC centralized-logging/data-my-cluster-zookeeper-1 not found
I0112 01:29:26.839487       1 controller.go:291] Started PVC processing "centralized-logging/data-my-cluster-zookeeper-2"
W0112 01:29:26.839499       1 controller.go:318] PV "" bound to PVC centralized-logging/data-my-cluster-zookeeper-2 not found
I0112 01:35:46.841463       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolumeClaim total 0 items received
I0112 01:36:19.842247       1 reflector.go:530] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolume total 0 items received
I0112 01:39:26.840519       1 reflector.go:381] k8s.io/client-go/informers/factory.go:134: forcing resync
I0112 01:39:26.840641       1 controller.go:291] Started PVC processing "centralized-logging/data-my-cluster-zookeeper-0"
W0112 01:39:26.840657       1 controller.go:318] PV "" bound to PVC centralized-logging/data-my-cluster-zookeeper-0 not found
I0112 01:39:26.840680       1 controller.go:291] Started PVC processing "centralized-logging/data-my-cluster-zookeeper-1"
I0112 01:39:26.840697       1 controller.go:291] Started PVC processing "centralized-logging/data-my-cluster-zookeeper-2"
W0112 01:39:26.840700       1 controller.go:318] PV "" bound to PVC centralized-logging/data-my-cluster-zookeeper-1 not found
W0112 01:39:26.840707       1 controller.go:318] PV "" bound to PVC centralized-logging/data-my-cluster-zookeeper-2 not found

I then checked those same PVCs that were reported as not found and saw that they are still Pending:


paoc@LAP-FYANEZ:~/csi-driver-nutanix$ kubectl get pvc -n centralized-logging
NAME                          STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-my-cluster-zookeeper-0   Pending                                                     154m
data-my-cluster-zookeeper-1   Pending                                                     154m
data-my-cluster-zookeeper-2   Pending                                                     154m
 kubectl -n ntnx-system logs csi-provisioner-ntnx-plugin-0 -c csi-snapshotter
I0112 01:19:25.941125       1 main.go:87] Version: v3.0.3
I0112 01:19:25.942442       1 connection.go:153] Connecting to unix:///csi/csi.sock
W0112 01:19:26.944268       1 metrics.go:333] metrics endpoint will not be started because `metrics-address` was not specified.
I0112 01:19:26.944288       1 common.go:111] Probing CSI driver for readiness
I0112 01:19:26.945008       1 snapshot_controller_base.go:111] Starting CSI snapshotter
kubectl -n ntnx-system logs csi-provisioner-ntnx-plugin-0 -c liveness-probe
I0112 01:19:27.203386       1 main.go:149] calling CSI driver to discover driver name
I0112 01:19:27.204450       1 main.go:155] CSI driver name: "csi.nutanix.com"
I0112 01:19:27.204467       1 main.go:183] ServeMux listening at ":9807"

And then I checked the logs on the other pods and found these errors:

kubectl -n ntnx-system logs csi-node-ntnx-plugin-hxq2p -c driver-registrar
I0112 01:19:25.464520       1 main.go:113] Version: v2.2.0
I0112 01:19:25.465204       1 main.go:137] Attempting to open a gRPC connection with: "/csi/csi.sock"
I0112 01:19:25.465227       1 connection.go:153] Connecting to unix:///csi/csi.sock
I0112 01:19:26.466657       1 main.go:144] Calling CSI driver to discover driver name
I0112 01:19:26.466683       1 connection.go:182] GRPC call: /csi.v1.Identity/GetPluginInfo
I0112 01:19:26.466688       1 connection.go:183] GRPC request: {}
I0112 01:19:26.469112       1 connection.go:185] GRPC response: {"name":"csi.nutanix.com","vendor_version":"v1.1.0"}
I0112 01:19:26.469168       1 connection.go:186] GRPC error: <nil>
I0112 01:19:26.469174       1 main.go:154] CSI driver name: "csi.nutanix.com"
I0112 01:19:26.469206       1 node_register.go:52] Starting Registration Server at: /registration/csi.nutanix.com-reg.sock
I0112 01:19:26.469352       1 node_register.go:61] Registration Server started at: /registration/csi.nutanix.com-reg.sock
I0112 01:19:26.469399       1 node_register.go:83] Skipping healthz server because HTTP endpoint is set to: ""
I0112 01:19:26.659194       1 main.go:80] Received GetInfo call: &InfoRequest{}
I0112 01:19:26.680026       1 main.go:90] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
kubectl -n ntnx-system logs csi-node-ntnx-plugin-hxq2p -c csi-node-ntnx-plugin
I0112 01:19:25.864461       1 ntnx_driver.go:84] Enabling volume access mode: SINGLE_NODE_WRITER
I0112 01:19:25.864561       1 ntnx_driver.go:84] Enabling volume access mode: MULTI_NODE_READER_ONLY
I0112 01:19:25.864564       1 ntnx_driver.go:84] Enabling volume access mode: MULTI_NODE_MULTI_WRITER
I0112 01:19:25.864567       1 ntnx_driver.go:94] Enabling controller service capability: CREATE_DELETE_VOLUME
I0112 01:19:25.864570       1 ntnx_driver.go:94] Enabling controller service capability: EXPAND_VOLUME
I0112 01:19:25.864573       1 ntnx_driver.go:94] Enabling controller service capability: CLONE_VOLUME
I0112 01:19:25.864575       1 ntnx_driver.go:94] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0112 01:19:25.864579       1 ntnx_driver.go:104] Enabling node service capability: GET_VOLUME_STATS
I0112 01:19:25.864582       1 ntnx_driver.go:104] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0112 01:19:25.864584       1 ntnx_driver.go:104] Enabling node service capability: EXPAND_VOLUME
I0112 01:19:25.864590       1 ntnx_driver.go:145] Driver: csi.nutanix.com
I0112 01:19:25.865024       1 server.go:98] Listening for connections on address: &net.UnixAddr{Name:"//csi/csi.sock", Net:"unix"}
2022-01-12T01:19:26.468Z identity.go:23: [INFO] Using default GetPluginInfo
2022-01-12T01:19:26.66Z node.go:215: [INFO] NodeGetInfo called with req: &csi.NodeGetInfoRequest{XXX_NoUnkeyedLiteral:struct {}{}, XXX_unrecognized:[]uint8(nil), XXX_sizecache:0}
2022-01-12T01:19:26.698Z identity.go:23: [INFO] Using default GetPluginInfo
 kubectl -n ntnx-system logs csi-node-ntnx-plugin-hxq2p -c liveness-probe
I0112 01:19:26.698098       1 main.go:149] calling CSI driver to discover driver name
I0112 01:19:26.699082       1 main.go:155] CSI driver name: "csi.nutanix.com"
I0112 01:19:26.699100       1 main.go:183] ServeMux listening at ":9808"

Thanks Francisco Yanez

subodh01 commented 2 years ago

Please follow the example from here https://github.com/nutanix/csi-plugin/tree/master/example/ABS to test your deployment.
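
For reference, the linked ABS example essentially boils down to creating a StorageClass and then a small PVC against it. A minimal test claim might look like the sketch below (the class name acs-abs matches the example referenced above; adjust it to whatever class you actually created):

```yaml
# Sketch: a test PVC against the acs-abs StorageClass from the linked example.
# If the CSI driver is healthy, this claim should reach the Bound state shortly.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: csi-test-claim
spec:
  storageClassName: acs-abs
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

Apply it with `kubectl apply -f` and watch it with `kubectl get pvc csi-test-claim -w`.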

eyanez111 commented 2 years ago

Sorry, I am a bit confused. Do I need to do this instead of the Helm installation, or after it?

Also, in this example:
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
    name: acs-abs
provisioner: csi.nutanix.com
parameters:
    csi.storage.k8s.io/provisioner-secret-name: ntnx-secret
    csi.storage.k8s.io/provisioner-secret-namespace: kube-system
    csi.storage.k8s.io/node-publish-secret-name: ntnx-secret
    csi.storage.k8s.io/node-publish-secret-namespace: kube-system
    csi.storage.k8s.io/controller-expand-secret-name: ntnx-secret
    csi.storage.k8s.io/controller-expand-secret-namespace: kube-system
    csi.storage.k8s.io/fstype: ext4  --> **how can I determine this?**
    dataServiceEndPoint: 10.5.65.156:3260
    storageContainer: default-container-30293
    storageType: NutanixVolumes  --> **Also how do I know what is the type of my nutanix cluster?**
    #whitelistIPMode: ENABLED
    #chapAuth: ENABLED
allowVolumeExpansion: true
reclaimPolicy: Delete

Thanks, I think we are close. Francisco Yanez

subodh01 commented 2 years ago

You need to do this after deploying the CSI driver using Helm. Spend some time going through https://portal.nutanix.com/page/documents/details?targetId=CSI-Volume-Driver-v2_5:csi-csi-plugin-storage-c.html and https://kubernetes.io/docs/concepts/storage/

tuxtof commented 2 years ago

Hello @eyanez111, you are definitely everywhere :-D

Are you still on your Rancher cluster? How did you install the CSI driver? It is available from the Rancher partner marketplace, and you can create the storage class directly during the configuration with a wizard.

Example: (screenshot attached)
eyanez111 commented 2 years ago

Hello @tuxtof,

Yes, I am pretty active here lately 👍 ... As per your questions:

Are you still on your Rancher cluster? Mainly yes; I am doing a lot of testing to see if we take this cluster to production. We also have Karbon, but those are our dev, SAT and UAT environments.

How did you install the CSI driver? Using this tutorial: https://portal.nutanix.com/page/documents/details?targetId=CSI-Volume-Driver-v2_5:CSI-Volume-Driver-v2_5

I installed it manually using Helm; would it be better if I uninstall it and do it via the marketplace?

I cannot find the marketplace. How can I get there? The UI is a bit different from the one in the Rancher documentation (not sure if the Rancher docs are out of date):

(screenshot attached)

Also, once it is installed (via Helm or via the GUI), do I need to define the StorageClass?

Do you know if I can install Longhorn and whether it will work with Nutanix SSDs?

Thanks for the help really appreciate it :D Francisco Yanez

tuxtof commented 2 years ago

Hello @eyanez111

Yes, for Rancher the best is to use their marketplace.

You will find it under Explore Cluster -> your cluster, then the Apps & Marketplace / Charts menu on the left.

(screenshot attached)

You can define the storage class directly during the install of the CSI storage chart, so there is no need to worry about the syntax.

Using Longhorn may potentially work, but it is a bit redundant because the Nutanix stack already does approximately the same kind of job.

eyanez111 commented 2 years ago

Hi, I cannot install it. I was able to make it work (partially; my pods still get errors with the PVC) via Helm, and I used the same credentials. Here are the parameters I am putting in:

(screenshots attached)

These are the errors: helm-operation-ffgsx_undefined.log

(screenshot attached)

and this is the file with the changes I made: differences-config.txt

Not sure what is wrong there... Is there any way to check whether the cluster is on xfs or ext4? Could that be the problem? I am using ext4.

tuxtof commented 2 years ago

Hi

What is the output of the chart install? I can't debug without it. Did you start with a clean Rancher cluster? If you try to install on top of the one where you already installed manually, it can cause trouble. Concerning ext4 or xfs: it is your choice; it is not related to the cluster or any existing state.

eyanez111 commented 2 years ago

How do I get the output from a fresh install? No, I did a fresh installation 2 days ago for this cluster. Is there a way to add those drivers during the installation?

Thanks Francisco

tuxtof commented 2 years ago

When you launch the chart install, a window opens with the Helm output; give me its content.

No, it is a two-step process: first you install your cluster, then you deploy the driver.

eyanez111 commented 2 years ago

Ah yes, the output is this: helm-operation-ffgsx_undefined.log

thanks

tuxtof commented 2 years ago

Two errors here:

- You ask to install the service monitor, but you did not install the Rancher monitoring stack.
- You did not install the nutanix-csi-snapshot chart before the storage one.
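
If the Rancher monitoring stack is not going to be installed, the corresponding chart value (the same `servicemonitor.enabled` key that shows up later in this thread's `helm get values` output) can simply be left disabled, for example:

```yaml
# Sketch: chart values keeping the ServiceMonitor off while the
# Rancher monitoring stack is not installed.
servicemonitor:
  enabled: false
```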

eyanez111 commented 2 years ago

I did install it, so do I need to remove it? That was the first thing I did; I installed the snapshot chart.

eyanez111 commented 2 years ago

Hello @tuxtof

I think I am getting closer and closer. I found that I was pointing at a StorageClass that I created, while the installation creates a default one named nutanix-volume. So I deleted the app and the PVCs, and set the class name to nutanix-volume instead of acs-abs. Then I checked in the GUI that there were no Mount Options.

(screenshot attached)

After those changes I deployed again and found this:


kubectl describe pvc/data-my-cluster-zookeeper-0 -n centralized-logging
Name:          data-my-cluster-zookeeper-0
Namespace:     centralized-logging
StorageClass:  nutanix-volume
Status:        Bound
Volume:        pvc-b2415b65-f722-4615-9f6e-91f0d83d9b43
Labels:        app.kubernetes.io/instance=my-cluster
               app.kubernetes.io/managed-by=strimzi-cluster-operator
               app.kubernetes.io/name=zookeeper
               app.kubernetes.io/part-of=strimzi-my-cluster
               strimzi.io/cluster=my-cluster
               strimzi.io/kind=Kafka
               strimzi.io/name=my-cluster-zookeeper
Annotations:   pv.kubernetes.io/bind-completed: yes
               strimzi.io/delete-claim: false
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      100Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    my-cluster-zookeeper-0
Events:
  Type    Reason                 Age   From                                                                Message
  ----    ------                 ----  ----                                                                -------
  Normal  Provisioning           16m   com.nutanix.csi_worker-nodes4_68c561d9-f786-4a4c-95ab-cdee76332408  External provisioner is provisioning volume for claim "centralized-logging/data-my-cluster-zookeeper-0"
  Normal  ExternalProvisioning   16m   persistentvolume-controller                                         waiting for a volume to be created, either by external provisioner "com.nutanix.csi" or manually created by system administrator
  Normal  ProvisioningSucceeded  16m   com.nutanix.csi_worker-nodes4_68c561d9-f786-4a4c-95ab-cdee76332408  Successfully provisioned volume pvc-b2415b65-f722-4615-9f6e-91f0d83d9b43

So the PVC is working! Now the pods are the problem:

kubectl get pods -n centralized-logging
NAME                                        READY   STATUS              RESTARTS   AGE
my-cluster-zookeeper-0                      0/1     ContainerCreating   0          17m
my-cluster-zookeeper-1                      0/1     ContainerCreating   0          17m
my-cluster-zookeeper-2                      0/1     ContainerCreating   0          17m
strimzi-cluster-operator-76b49577c5-b62ln   1/1     Running             0          3d5h

The pods are stuck in ContainerCreating. I looked at the logs and the describe output and found this:


Events:
  Type     Reason            Age                  From                    Message
  ----     ------            ----                 ----                    -------
  Warning  FailedScheduling  <unknown>                                    0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling  <unknown>                                    0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled         <unknown>                                    Successfully assigned centralized-logging/my-cluster-zookeeper-0 to worker-nodes6
  Warning  FailedMount       17m                  kubelet, worker-nodes6  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data zookeeper-metrics-and-logging zookeeper-nodes cluster-ca-certs my-cluster-zookeeper-token-hzb9g strimzi-tmp]: timed out waiting for the condition
  Warning  FailedMount       5m45s (x3 over 19m)  kubelet, worker-nodes6  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[cluster-ca-certs my-cluster-zookeeper-token-hzb9g strimzi-tmp data zookeeper-metrics-and-logging zookeeper-nodes]: timed out waiting for the condition
  Warning  FailedMount       3m30s (x2 over 12m)  kubelet, worker-nodes6  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[zookeeper-nodes cluster-ca-certs my-cluster-zookeeper-token-hzb9g strimzi-tmp data zookeeper-metrics-and-logging]: timed out waiting for the condition
  Warning  FailedMount       74s (x4 over 21m)    kubelet, worker-nodes6  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[strimzi-tmp data zookeeper-metrics-and-logging zookeeper-nodes cluster-ca-certs my-cluster-zookeeper-token-hzb9g]: timed out waiting for the condition
  Warning  FailedMount       53s (x19 over 23m)   kubelet, worker-nodes6  MountVolume.SetUp failed for volume "pvc-b2415b65-f722-4615-9f6e-91f0d83d9b43" : rpc error: code = InvalidArgument desc = nutanix: iSCSI portal info is missing for 3d06afb8-231e-4e34-a289-3ce9a8fef581, err: <nil>
paoc@LAP-FYANEZ:~/centralized-logs/strimzi$ kubectl logs pod/my-cluster-zookeeper-0 -n centralized-logging
Error from server (BadRequest): container "zookeeper" in pod "my-cluster-zookeeper-0" is waiting to start: ContainerCreating

It seems this `nutanix: iSCSI portal info is missing` error is what prevents the volumes from mounting.

Thanks for all the help. I am close to finishing this and starting to deploy prod clusters with the Nutanix/Rancher driver! Francisco Yanez

tuxtof commented 2 years ago

Hello Francisco

The prerequisites are missing on your worker nodes; on Ubuntu, for example, they can be set up through a cloud-init runcmd section.
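
A rough cloud-init sketch for that, assuming Ubuntu worker nodes (the package and service names are the standard Ubuntu ones and are an assumption here):

```yaml
#cloud-config
# Sketch: install the iSCSI initiator and NFS client used by the CSI driver,
# then make sure the iscsid service is enabled and running.
runcmd:
  - apt-get update
  - apt-get install -y open-iscsi nfs-common
  - systemctl enable --now iscsid
```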

eyanez111 commented 2 years ago

Hello @tuxtof,

I did install nfs-common; let me update and enable iscsid.

thanks Francisco
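
A quick way to verify the node-side prerequisites on each worker (generic Ubuntu/systemd commands, not specific to this driver):

```shell
# Sketch: confirm the packages are present and iscsid is enabled and running.
dpkg -l open-iscsi nfs-common
systemctl is-enabled iscsid
systemctl is-active iscsid
```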

eyanez111 commented 2 years ago

Hello @tuxtof,

I made the changes manually just to make sure everything is installed, but the pods keep getting stuck in ContainerCreating and I get this:
Events:
  Type     Reason            Age                From                    Message
  ----     ------            ----               ----                    -------
  Warning  FailedScheduling  <unknown>                                  0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling  <unknown>                                  0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled         <unknown>                                  Successfully assigned centralized-logging/my-cluster-zookeeper-0 to worker-nodes3
  Warning  FailedMount       42s (x3 over 43s)  kubelet, worker-nodes3  MountVolume.MountDevice failed for volume "pvc-bf0d9deb-aaf9-417f-948c-53f2ca8ee806" : kubernetes.io/csi: attacher.MountDevice failed to create newCsiDriverClient: driver name com.nutanix.csi not found in the list of registered CSI drivers
  Warning  FailedMount       9s (x4 over 39s)   kubelet, worker-nodes3  MountVolume.SetUp failed for volume "pvc-bf0d9deb-aaf9-417f-948c-53f2ca8ee806" : rpc error: code = InvalidArgument desc = nutanix: iSCSI portal info is missing for 3c7e0a57-982f-4afa-b353-8449c7a74ed5, err: <nil>

any idea of what else would be missing?

thanks Francisco
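
One way to narrow down the `driver name com.nutanix.csi not found in the list of registered CSI drivers` message is to compare the provisioner recorded on the StorageClass and the stuck PV with the drivers actually registered on the cluster; a mismatch between the legacy `com.nutanix.csi` name and the current `csi.nutanix.com` name can point at leftovers from an earlier install. A hedged sketch using generic kubectl queries:

```shell
# Sketch: which CSI drivers are registered, and which provisioner/driver
# the StorageClass and the PersistentVolume actually reference.
kubectl get csidrivers
kubectl get sc -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner
kubectl get pv -o custom-columns=NAME:.metadata.name,DRIVER:.spec.csi.driver
```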

eyanez111 commented 2 years ago

Hello @tuxtof,

I continued to work on it today and found a few videos. Apparently I am missing a service account named attacher:

kubectl get serviceaccounts -A | grep csi
ntnx-system   csi-node-ntnx-plugin   1   4d2h
ntnx-system   csi-provisioner        1   4d2h

Do you think I am missing that, or is that just for an older version (the video is from 2018)? And if I am, how do I install it? The Rancher marketplace did not install it.

thanks Francisco

subodh01 commented 2 years ago

Nutanix CSI driver does not use attacher now. Please open a Nutanix support case to get help with your setup.

tuxtof commented 2 years ago

Hello @eyanez111

The CSI Helm chart from the Rancher marketplace is fully functional and contains all the needed components.

Looking at your logs, I see multiple strange things.

Can you give me the output of the following commands, please:

kubectl get sc -o yaml
kubectl get csidrivers -o yaml
helm list -A

And, in the namespace where you installed the drivers/charts:

kubectl get pods
helm get values nutanix-csi-storage    (you may need to change the name to the one you used for the deployed chart)
helm get values nutanix-csi-snapshot   (you may need to change the name to the one you used for the deployed chart)

eyanez111 commented 2 years ago

Hello @tuxtof,

 Thanks for the help. Let me get you that:

kubectl get sc -o yaml

apiVersion: v1
items:
- allowVolumeExpansion: true
  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    annotations:
      meta.helm.sh/release-name: nutanix-csi-storage
      meta.helm.sh/release-namespace: kube-system
      storageclass.kubernetes.io/is-default-class: "true"
    creationTimestamp: "2022-01-19T18:19:16Z"
    labels:
      app.kubernetes.io/managed-by: Helm
    managedFields:
    - apiVersion: storage.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:allowVolumeExpansion: {}
        f:metadata:
          f:annotations:
            .: {}
            f:meta.helm.sh/release-name: {}
            f:meta.helm.sh/release-namespace: {}
            f:storageclass.kubernetes.io/is-default-class: {}
          f:labels:
            .: {}
            f:app.kubernetes.io/managed-by: {}
        f:parameters:
          .: {}
          f:csi.storage.k8s.io/controller-expand-secret-name: {}
          f:csi.storage.k8s.io/controller-expand-secret-namespace: {}
          f:csi.storage.k8s.io/fstype: {}
          f:csi.storage.k8s.io/node-publish-secret-name: {}
          f:csi.storage.k8s.io/node-publish-secret-namespace: {}
          f:csi.storage.k8s.io/provisioner-secret-name: {}
          f:csi.storage.k8s.io/provisioner-secret-namespace: {}
          f:isSegmentedIscsiNetwork: {}
          f:storageContainer: {}
          f:storageType: {}
        f:provisioner: {}
        f:reclaimPolicy: {}
        f:volumeBindingMode: {}
      manager: Go-http-client
      operation: Update
      time: "2022-01-19T18:19:16Z"
    name: nutanix-volume
    resourceVersion: "2978176"
    selfLink: /apis/storage.k8s.io/v1/storageclasses/nutanix-volume
    uid: 181146c1-d098-48c4-9d2f-934e0502b5dc
  parameters:
    csi.storage.k8s.io/controller-expand-secret-name: ntnx-secret
    csi.storage.k8s.io/controller-expand-secret-namespace: kube-system
    csi.storage.k8s.io/fstype: ext4
    csi.storage.k8s.io/node-publish-secret-name: ntnx-secret
    csi.storage.k8s.io/node-publish-secret-namespace: kube-system
    csi.storage.k8s.io/provisioner-secret-name: ntnx-secret
    csi.storage.k8s.io/provisioner-secret-namespace: kube-system
    isSegmentedIscsiNetwork: "true"
    storageContainer: default-container-67424176636311
    storageType: NutanixVolumes
  provisioner: csi.nutanix.com
  reclaimPolicy: Delete
  volumeBindingMode: Immediate
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

kubectl get csidrivers -o yaml


apiVersion: v1
items:
- apiVersion: storage.k8s.io/v1
  kind: CSIDriver
  metadata:
    annotations:
      meta.helm.sh/release-name: nutanix-csi-storage
      meta.helm.sh/release-namespace: kube-system
    creationTimestamp: "2022-01-19T18:19:16Z"
    labels:
      app.kubernetes.io/managed-by: Helm
    managedFields:
    - apiVersion: storage.k8s.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:meta.helm.sh/release-name: {}
            f:meta.helm.sh/release-namespace: {}
          f:labels:
            .: {}
            f:app.kubernetes.io/managed-by: {}
        f:spec:
          f:attachRequired: {}
          f:podInfoOnMount: {}
          f:volumeLifecycleModes:
            .: {}
            v:"Persistent": {}
      manager: Go-http-client
      operation: Update
      time: "2022-01-19T18:19:16Z"
    name: csi.nutanix.com
    resourceVersion: "2978198"
    selfLink: /apis/storage.k8s.io/v1/csidrivers/csi.nutanix.com
    uid: a7668568-f50d-4737-b777-5286365d92b3
  spec:
    attachRequired: false
    podInfoOnMount: true
    volumeLifecycleModes:
    - Persistent
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

helm list -A

NAME                    NAMESPACE                 REVISION  UPDATED                                  STATUS    CHART                                     APP VERSION
fleet-agent-c-8zc6z     cattle-fleet-system       3         2022-01-11 22:06:51.611545228 +0000 UTC  deployed  fleet-agent-c-8zc6z-v0.0.0+s-62ff0e388e0cc43b5cd041ae6f39561b05d16a9f361b7501177c11349d72b
nutanix-csi-snapshot    kube-system               1         2022-01-19 18:16:39.274791956 +0000 UTC  deployed  nutanix-csi-snapshot-1.0.0                1.0.0
nutanix-csi-storage     kube-system               1         2022-01-19 18:19:15.989482312 +0000 UTC  deployed  nutanix-csi-storage-2.5.0                 2.5.0
rancher-monitoring      cattle-monitoring-system  1         2022-01-14 19:47:41.55555458 +0000 UTC   deployed  rancher-monitoring-100.1.0+up19.0.3       0.50.0
rancher-monitoring-crd  cattle-monitoring-system  1         2022-01-14 19:47:32.119235294 +0000 UTC  deployed  rancher-monitoring-crd-100.1.0+up19.0.3

kubectl get pods -n kube-system
NAME                                              READY   STATUS    RESTARTS   AGE
calico-kube-controllers-655c554569-k72sc          1/1     Running   0          47h
canal-4bb7s                                       2/2     Running   0          47h
canal-7rpc2                                       2/2     Running   0          47h
canal-87wm2                                       2/2     Running   0          2d
canal-bxj6g                                       2/2     Running   0          47h
canal-cplc8                                       2/2     Running   0          47h
canal-hzv4v                                       2/2     Running   0          47h
canal-l6m46                                       2/2     Running   0          47h
canal-mvgwk                                       2/2     Running   0          47h
canal-zdplh                                       2/2     Running   0          47h
coredns-7cc5cfbd77-225fh                          1/1     Running   0          47h
coredns-7cc5cfbd77-fsqfw                          1/1     Running   0          47h
coredns-autoscaler-76f8869cc9-c8fq4               1/1     Running   0          47h
csi-node-ntnx-plugin-488pb                        3/3     Running   0          27m
csi-node-ntnx-plugin-492xt                        3/3     Running   0          27m
csi-node-ntnx-plugin-4svvx                        3/3     Running   0          27m
csi-node-ntnx-plugin-8trfj                        3/3     Running   0          27m
csi-node-ntnx-plugin-dwfcg                        3/3     Running   0          27m
csi-node-ntnx-plugin-tln9v                        3/3     Running   0          27m
csi-provisioner-ntnx-plugin-0                     5/5     Running   0          27m
metrics-server-54788574fd-gnwff                   1/1     Running   0          47h
snapshot-controller-0                             1/1     Running   0          30m
snapshot-validation-deployment-66849d5586-lslqt   1/1     Running   0          30m
snapshot-validation-deployment-66849d5586-zsrvv   1/1     Running   0          30m

helm get values nutanix-csi-storage:

USER-SUPPLIED VALUES:
defaultStorageClass: volume
fsType: ext4
global:
  cattle:
    clusterId: c-8zc6z
    clusterName: cl-prod-nutanix-rancher-cae
    rkePathPrefix: ""
    rkeWindowsPathPrefix: ""
    systemDefaultRegistry: ""
    url: https://rancher.rd.zedev.net
  systemDefaultRegistry: ""
networkSegmentation: true
password: #%$&$*(*)
prismEndPoint: 172.22.4.101
servicemonitor:
  enabled: true
storageContainer: default-container-67424176636311
username: rancher-user
volumeClass: true

These values are correct.

helm get values nutanix-csi-snapshot


USER-SUPPLIED VALUES:
global:
  cattle:
    clusterId: c-8zc6z
    clusterName: cl-prod-nutanix-rancher-cae
    rkePathPrefix: ""
    rkeWindowsPathPrefix: ""
    systemDefaultRegistry: ""
    url: https://rancher.rd.zedev.net
  systemDefaultRegistry: ""

One thing I noticed: you mentioned that I needed to install storage first and snapshot later. I could not do it like that; I was getting errors saying I needed snapshot if I did not install the snapshot chart first.

Thanks again for the help Francisco

tuxtof commented 2 years ago

The correct order is nutanix-snapshot first and nutanix-storage next; that is what I said before, so we agree.

The potential error I see is the networkSegmentation setting: unless you have made the corresponding specific configuration on the Nutanix side, it needs to be set to false.
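
If that flag was set by mistake, it can be flipped on the already-deployed release without retyping the rest of the values; a hedged sketch (release name and namespace as they appear earlier in this thread):

```shell
# Sketch: turn networkSegmentation off on the deployed storage chart,
# reusing the previously supplied values.
helm upgrade nutanix-csi-storage nutanix/nutanix-csi-storage \
  -n kube-system \
  --reuse-values \
  --set networkSegmentation=false
```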

All the other parameters seem OK. Can you confirm that 172.22.4.101 is your Prism Element VIP (not the same one you use for the Rancher node driver, which is Prism Central)?

Last point: be sure to run the entire test on a fresh new cluster, because you have done a lot of experimentation with and without Helm, and I am always afraid some old piece is stuck somewhere.

Let me summarize the entire process:

1. Deploy a fresh Rancher cluster with the correct node dependencies (iSCSI and NFS).
2. Deploy the monitoring extension.
3. Deploy the nutanix-csi-snapshot Helm chart.
4. Deploy the nutanix-csi-storage Helm chart and use the integrated wizard to create the storage class(es).

And that's all; from there you will be able to create PVCs.

eyanez111 commented 2 years ago

ok let me do that from scratch on another cluster and I will report.

Thanks for all the help @tuxtof

eyanez111 commented 2 years ago

Hello @tuxtof,

I just wanted to report back. You were right! Installing the chart manually with Helm, then removing it, then installing it from the marketplace (and, I think, in the wrong order, at least the first time) made it unstable. I followed the instructions just as described:

1. Deploy a fresh Rancher cluster with the correct node dependencies (iSCSI and NFS).
2. Deploy the monitoring extension.
3. Deploy the nutanix-csi-snapshot Helm chart.
4. Deploy the nutanix-csi-storage Helm chart and use the integrated wizard to create the storage class(es).

and it worked. I also worked with support on a new cluster and installed the 2.4 version using kubectl, and that worked too. So it is tested and proven to be a fully working driver.

Thanks so much. I will report once we are done testing and we go live. You have been helping me since the beginning. Francisco Yanez

tuxtof commented 2 years ago

Good news