hpe-storage / csi-driver

A Container Storage Interface (CSI) driver from HPE
https://scod.hpedev.io
Apache License 2.0

OpenShift Virtualization Live Migration Volume Offline? #399

Closed: dan-nawrocki closed this issue 2 months ago

dan-nawrocki commented 2 months ago

I am using CSI driver 2.4.1 with OpenShift 4.12.37, OpenShift Virtualization 4.12.10, and NimbleOS 6.1.2.400. I have configured the live migration settings and am using the ReadWriteMany access mode for Block volumes. The setup appears to work correctly; however, after a live migration, the CSI driver appears to take my volume Offline and the VM moves to the Paused state. When I manually set the volume Online, the VM resumes successfully.

Here are my steps:

  1. Create a VM with a ReadWriteMany PVC. The VM starts successfully.
  2. Perform a live migration.
  3. Observe that the volume is set to Offline and the VM is paused.

Any ideas on how to prevent the volume from being set offline?

datamattsson commented 2 months ago

Could you elaborate on how this VM was created? Have you considered these steps when setting up your OS images?

dan-nawrocki commented 2 months ago

Yes, I have completed the steps you mentioned. That being said, my VM disk was NOT created from the openshift-virtualization-os-images. It's an export of a VM we had set up in Red Hat Virtualization. My steps to get here are a bit convoluted:

  1. Export the disk from RHV.
  2. Import the disk as a PVC ("pvc1") to OpenShift. Note that this was done with version 2.3.0 of the CSI driver and used the RWO access mode.
  3. Upgrade to CSI driver 2.4.1.
  4. Clone pvc1 to a new PVC ("pvc2") with the RWX access mode (see the sketch after this list).
  5. Create a new VM using pvc2.
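
For reference, step 4 amounts to a CSI volume clone with a changed access mode. A minimal sketch of what that manifest could look like, assuming the storage class and size used later in this thread (all names and values are illustrative):

# Hypothetical reconstruction of step 4: clone the RWO PVC into a new
# RWX PVC using the PVC dataSource field. Names and sizes are illustrative.
oc apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc2
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Block
  storageClassName: nimble-san-sc
  resources:
    requests:
      storage: 64Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: pvc1
EOF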

datamattsson commented 2 months ago

This won't work, as we don't support RWO-to-RWX transformation. We're treating this as a bug at the moment, and it will hopefully be fixed in the next version, a few months out.

You can, however, fix this if you're handy with REST APIs: you need to set multi_initiator: true on the backend volume. I believe there's a Nimble CLI way to do this as well, but I'm not sure.
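
For reference, a minimal sketch of that REST call, assuming the NimbleOS volumes endpoint used later in this thread (host, volume ID, and token are placeholders):

# Flip multi_initiator on the backend volume; host, volume ID, and
# token are placeholders.
curl -k -X PUT \
  -H 'X-Auth-Token: <token>' \
  -d '{"data": {"multi_initiator": true}}' \
  https://<array-mgmt-ip>:5392/v1/volumes/<volume-id>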

dan-nawrocki commented 2 months ago

Could I export my RWO volume and re-import it as RWX?

datamattsson commented 2 months ago

Once RWO, RWX is unfortunately a no-go; the same bug applies.

dan-nawrocki commented 2 months ago

I just updated to CSI driver 2.4.2 and have the same problem.

I've figured out how to set multi_initiator; however, it appears that this parameter only applies to iSCSI volumes. I'm using FC and get this error when setting the flag:

curl -k -H 'X-Auth-Token: REDACTED' https://my-nimble-host:5392/v1/volumes/my-vol-id -d '{"data": {"multi_initiator": true}}' -X PUT
{"messages":[{"code":"SM_http_bad_request","severity":"error","text":"The request could not be understood by the server."},{"code":"SM_unexpected_arg","severity":"error","arguments":{"arg":"multi_initiator"},"text":"Unexpected argument 'multi_initiator'."}]}

I did notice that volumes I manually create on the Nimble have multi_initiator set to true.

I tried to import the disk to a new PVC w/ RWX mode, but multi_initiator was still false on the newly-created Nimble volume too.

virtctl image-upload pvc rwx-test-2 --size=64Gi --image-path=rhel8-template.img.gz --block-volume --storage-class=nimble-san-sc --access-mode ReadWriteMany

It looks like https://github.com/hpe-storage/csi-driver/pull/40/files should create new volumes w/ multi_initiator set to true.

datamattsson commented 2 months ago

This is very strange. I copied and pasted your curl command into my environment, and I can flick multi_initiator back and forth between true and false with no problem on an FC array with a blank, unattached volume. Does your volume have any connections to it? I added a dummy initiator to my volume and that didn't matter.

dan-nawrocki commented 2 months ago

There are no connections on my volume. I get the same error whether or not the volume is online. I can toggle the online state using curl, so it's not an obvious dumb error on my part.

What do you mean by "blank unattached volume"? I have a PVC (rwx) bound to the PV, however, the VM is turned off so the PVC isn't in active use.

I've got a Nimble with NimbleOS 6.1.2.400-1048557-opt in case that matters.

datamattsson commented 2 months ago

Your curl is spot on; I copied and pasted it with no problem. I'm using NimbleOS 6.0.0 in my case, and I can't see why this would've changed. I'll upgrade my array and see where it goes.

datamattsson commented 2 months ago

I updated my array. No change. Can we dig out some logs to see what's actually happening on your cluster?

The CSP logs should reveal some clues: oc logs -n hpe-storage deploy/nimble-csp.

datamattsson commented 2 months ago

I've also now confirmed on OCP 4.14 that the following combination indeed sets multi_initiator: true on the volume, using CSI Operator v2.4.2:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
  name: hpe-standard-fc
parameters:
  csi.storage.k8s.io/controller-expand-secret-name: hpe-backend-fc
  csi.storage.k8s.io/controller-expand-secret-namespace: hpe-storage
  csi.storage.k8s.io/controller-publish-secret-name: hpe-backend-fc
  csi.storage.k8s.io/controller-publish-secret-namespace: hpe-storage
  csi.storage.k8s.io/fstype: xfs
  csi.storage.k8s.io/node-publish-secret-name: hpe-backend-fc
  csi.storage.k8s.io/node-publish-secret-namespace: hpe-storage
  csi.storage.k8s.io/node-stage-secret-name: hpe-backend-fc
  csi.storage.k8s.io/node-stage-secret-namespace: hpe-storage
  csi.storage.k8s.io/provisioner-secret-name: hpe-backend-fc
  csi.storage.k8s.io/provisioner-secret-namespace: hpe-storage
  description: Volume created by the HPE CSI Driver for Kubernetes
  destroyOnDelete: "true"
  accessProtocol: fc
provisioner: csi.hpe.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-first-fc-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 32Gi
  storageClassName: hpe-standard-fc
  volumeMode: Block
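
A quick way to verify the result on the array side is to read the volume back over the same REST endpoint used earlier in this thread; a sketch with placeholder host, volume ID, and token:

# Read the backend volume and check the multi_initiator field;
# host, volume ID, and token are placeholders.
curl -k -H 'X-Auth-Token: <token>' \
  https://<array-mgmt-ip>:5392/v1/volumes/<volume-id> | grep -o '"multi_initiator":[^,]*'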

dan-nawrocki commented 2 months ago

I updated to OpenShift 4.14; same behavior.

I'm starting to think this is a Nimble problem. I created a brand-new volume using the Nimble UI, and I can't seem to get multi_initiator set to true. I've tried various data protection and access options, but nothing changes it.

datamattsson commented 2 months ago

Yeah, I think so too. Nimble support should be able to add some clarity here.

dan-nawrocki commented 2 months ago

I created a ticket, but I think I beat them to the punch :)

It turns out iSCSI has to be enabled, even for FC-only configurations. Once I ran group --edit --iscsi_enabled yes on the Nimble, I could confirm that new PVCs have multi_initiator set correctly. I'm going to assume a software update at some point turned iSCSI off, since my very old volumes are multi-initiator.
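
Condensed, the fix and re-test look roughly like this; the group command is from this comment, and the upload command reuses the names from earlier in the thread (all names and paths are illustrative):

# On the Nimble CLI: enable iSCSI at the group level, which this
# thread found is required even for FC-only configurations.
group --edit --iscsi_enabled yes

# Back on the cluster: re-create the RWX PVC and confirm the new
# backend volume is created with multi_initiator: true.
oc delete pvc rwx-test-2
virtctl image-upload pvc rwx-test-2 --size=64Gi \
  --image-path=rhel8-template.img.gz --block-volume \
  --storage-class=nimble-san-sc --access-mode ReadWriteMany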

A quick test showed that live migration is working now. Thanks for the help!
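
For completeness, the live migration re-test can be driven from the CLI; a sketch with a placeholder VM name:

# Trigger a live migration with the KubeVirt CLI and watch the
# VirtualMachineInstanceMigration object; the VM name is a placeholder.
virtctl migrate <vm-name>
oc get vmim -w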

datamattsson commented 2 months ago

Thanks for confirming. What an obscure finding. We need to document this.