hpe-storage / csi-driver

A Container Storage Interface (CSI) driver from HPE
https://scod.hpedev.io
Apache License 2.0

OpenShift Virtualization Live Migration Volume Offline? #399

Closed: dan-nawrocki closed this issue 2 months ago

dan-nawrocki commented 2 months ago

I am using CSI driver 2.4.1 with OpenShift 4.12.37, OpenShift Virtualization 4.12.10, and NimbleOS 6.1.2.400. I have configured the live migration settings and am using the ReadWriteMany access mode for Block volumes. The setup appears to work correctly; however, after a live migration, the CSI driver appears to take my volume Offline and the VM moves to the Paused state. When I manually set the volume Online, the VM resumes successfully.

Here are my steps:

  1. Create a VM with a ReadWriteMany PVC. The VM starts successfully.
  2. Perform a live migration.
  3. Observe that the volume is set to Offline and the VM is paused.

Any ideas on how to prevent the volume from being set offline?

datamattsson commented 2 months ago

Could you elaborate on how this VM was created? Have you considered these steps when setting up your OS images?

dan-nawrocki commented 2 months ago

Yes, I have completed the steps you mentioned. That being said, my VM disk was NOT created from the openshift-virtualization-os-images. It's an export of a VM we had set up in Red Hat Virtualization. My steps to get here are a bit convoluted:

  1. Export the disk from RHV.
  2. Import the disk as a PVC ("pvc1") to OpenShift. Note that this was done with version 2.3.0 of the CSI driver and used the RWO access mode.
  3. Upgrade to CSI driver 2.4.1.
  4. Clone pvc1 to a new PVC ("pvc2") with the RWX access mode (see the sketch after this list).
  5. Create a new VM using pvc2.
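
For reference, step 4 amounts to a CSI volume clone with a changed access mode. A minimal sketch of what that manifest could look like, assuming the storage class and size used later in this thread (all names and values are illustrative):

# Hypothetical reconstruction of step 4: clone the RWO PVC into a new
# RWX PVC using the PVC dataSource field. Names and sizes are illustrative.
oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc2
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Block
  storageClassName: nimble-san-sc
  resources:
    requests:
      storage: 64Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: pvc1
EOF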

datamattsson commented 2 months ago

This won't work, as we don't support RWO-to-RWX transformation. We're treating this as a bug at the moment, and it will hopefully be fixed in the next version, a few months out.

You can, however, fix this if you're handy with REST APIs: you need to set multi_initiator: true on the backend volume. I believe there's a Nimble CLI way to do this as well, but I'm not sure.
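
For reference, a minimal sketch of that REST call, assuming the NimbleOS volumes endpoint used later in this thread (host, volume ID, and token are placeholders):

# Flip multi_initiator on the backend volume; host, volume ID, and
# token are placeholders.
curl -k -X PUT \
  -H 'X-Auth-Token: <token>' \
  -d '{"data": {"multi_initiator": true}}' \
  https://<array-mgmt-ip>:5392/v1/volumes/<volume-id>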

dan-nawrocki commented 2 months ago

Could I export my RWO volume and re-import it as RWX?

datamattsson commented 2 months ago

Once RWO, RWX is unfortunately a no-go; the same bug applies.

dan-nawrocki commented 2 months ago

I just updated to CSI driver 2.4.2 and have the same problem.

I've figured out how to set multi_initiator; however, it appears that this parameter only applies to iSCSI volumes. I'm using FC and get this error when setting the flag:

curl -k -H 'X-Auth-Token: REDACTED' https://my-nimble-host:5392/v1/volumes/my-vol-id -d '{"data": {"multi_initiator": true}}' -X PUT
{"messages":[{"code":"SM_http_bad_request","severity":"error","text":"The request could not be understood by the server."},{"code":"SM_unexpected_arg","severity":"error","arguments":{"arg":"multi_initiator"},"text":"Unexpected argument 'multi_initiator'."}]}

I did notice that volumes I manually create on the Nimble have multi_initiator set to true.

I tried to import the disk to a new PVC w/ RWX mode, but multi_initiator was still false on the newly-created Nimble volume too.

virtctl image-upload pvc rwx-test-2 --size=64Gi --image-path=rhel8-template.img.gz --block-volume --storage-class=nimble-san-sc --access-mode ReadWriteMany

It looks like https://github.com/hpe-storage/csi-driver/pull/40/files should create new volumes w/ multi_initiator set to true.

datamattsson commented 2 months ago

This is very strange. I copied and pasted your curl command into my environment, and I can flick multi_initiator back and forth between true and false with no problem on an FC array with a blank, unattached volume. Does your volume have any connections to it? I added a dummy initiator to my volume and that didn't matter.

dan-nawrocki commented 2 months ago

There are no connections on my volume. I get the same error whether or not the volume is online. I can toggle the online state using curl, so it's not an obvious dumb error on my part.

What do you mean by "blank unattached volume"? I have a PVC (rwx) bound to the PV, however, the VM is turned off so the PVC isn't in active use.

I've got a Nimble with NimbleOS 6.1.2.400-1048557-opt in case that matters.

datamattsson commented 2 months ago

Your curl is spot on; I copied and pasted it with no problem. I'm using NimbleOS 6.0.0 in my case, and I can't see why this would've changed. I'll upgrade my array and see where it goes.

datamattsson commented 2 months ago

I updated my array. No change. Can we dig out some logs to see what's actually happening on your cluster?

The CSP logs should reveal some clues: oc logs -n hpe-storage deploy/nimble-csp.

datamattsson commented 2 months ago

I've also now confirmed on OCP 4.14 that the following combination indeed sets multi_initiator: true on the volume, using CSI Operator v2.4.2:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
  name: hpe-standard-fc
parameters:
  csi.storage.k8s.io/controller-expand-secret-name: hpe-backend-fc
  csi.storage.k8s.io/controller-expand-secret-namespace: hpe-storage
  csi.storage.k8s.io/controller-publish-secret-name: hpe-backend-fc
  csi.storage.k8s.io/controller-publish-secret-namespace: hpe-storage
  csi.storage.k8s.io/fstype: xfs
  csi.storage.k8s.io/node-publish-secret-name: hpe-backend-fc
  csi.storage.k8s.io/node-publish-secret-namespace: hpe-storage
  csi.storage.k8s.io/node-stage-secret-name: hpe-backend-fc
  csi.storage.k8s.io/node-stage-secret-namespace: hpe-storage
  csi.storage.k8s.io/provisioner-secret-name: hpe-backend-fc
  csi.storage.k8s.io/provisioner-secret-namespace: hpe-storage
  description: Volume created by the HPE CSI Driver for Kubernetes
  destroyOnDelete: "true"
  accessProtocol: fc
provisioner: csi.hpe.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-first-fc-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 32Gi
  storageClassName: hpe-standard-fc
  volumeMode: Block
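
A quick way to verify the result on the array side is to read the volume back over the same REST endpoint used earlier in this thread; a sketch with placeholder host, volume ID, and token:

# Read the backend volume and check the multi_initiator field;
# host, volume ID, and token are placeholders.
curl -k -H 'X-Auth-Token: <token>' \
  https://<array-mgmt-ip>:5392/v1/volumes/<volume-id> | grep -o '"multi_initiator":[^,]*'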

dan-nawrocki commented 2 months ago

I updated to OpenShift 4.14; same behavior.

I'm starting to think this is a Nimble problem. I created a brand-new volume using the Nimble UI, and I can't seem to get multi_initiator set to true. I've tried various data protection and access options, but nothing changes it.

datamattsson commented 2 months ago

Yeah, I think so too. Nimble support should be able to add some clarity here.

dan-nawrocki commented 2 months ago

I created a ticket, but I think I beat them to the punch :)

It turns out iSCSI has to be enabled, even for FC-only configurations. Once I ran group --edit --iscsi_enabled yes on the Nimble, I could confirm that new PVCs have multi_initiator set correctly. I'm going to assume a software update at some point turned iSCSI off, since my very old volumes are multi-initiator.
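
Condensed, the fix and re-test look roughly like this; the group command is from this comment, and the upload command reuses the names from earlier in the thread (all names and paths are illustrative):

# On the Nimble CLI: enable iSCSI at the group level, which this
# thread found is required even for FC-only configurations.
group --edit --iscsi_enabled yes

# Back on the cluster: re-create the RWX PVC and confirm the new
# backend volume is created with multi_initiator: true.
oc delete pvc rwx-test-2
virtctl image-upload pvc rwx-test-2 --size=64Gi \
  --image-path=rhel8-template.img.gz --block-volume \
  --storage-class=nimble-san-sc --access-mode ReadWriteMany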

A quick test showed that live migration is working now. Thanks for the help!
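
For completeness, the live migration re-test can be driven from the CLI; a sketch with a placeholder VM name:

# Trigger a live migration with the KubeVirt CLI and watch the
# VirtualMachineInstanceMigration object; the VM name is a placeholder.
virtctl migrate <vm-name>
oc get vmim -w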

datamattsson commented 2 months ago

Thanks for confirming. What an obscure finding. We need to document this.