vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Backup partially failed with csi plugin 0.6.0-rc2 on OVH cluster #6852

Open Arcahub opened 1 year ago

Arcahub commented 1 year ago

name: Bug report
about: Using the Velero 1.12.0 Data Movement feature on an OVH managed cluster causes backups to partially fail with the matching CSI plugin version v0.6.0-rc2, while it was working with v0.5.1.


What steps did you take and what happened: I wanted to test the Data Movement feature. I installed the Velero CLI v1.12.0-rc.2 and ran:

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0-rc.2,velero/velero-plugin-for-csi:v0.6.0-rc.2 \
  --no-default-backup-location \
  --features=EnableCSI \
  --no-secret \
  --use-node-agent

# Create kubernetes secret with s3 credentials
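# A hypothetical sketch of this step: the secret name and key are inferred from
# the --credential "my-cluster-backup=cloud" flag used below, and the local
# credentials file path is illustrative.
kubectl create secret generic my-cluster-backup \
  --namespace velero \
  --from-file=cloud=./ovh-s3-credentials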

# Create velero storage location
velero backup-location create --bucket "${OVH_CLOUD_PROJECT_SERVICE}-my-cluster-backup" --provider aws --config region=gra,s3ForcePathStyle="true",s3Url=https://s3.gra.io.cloud.ovh.net "my-cluster-backup" --credential "my-cluster-backup=cloud"

# Create velero snapshot location
velero snapshot-location create --provider aws --config region=gra,s3ForcePathStyle="true",s3Url=https://s3.gra.io.cloud.ovh.net "my-cluster-backup" --credential "my-cluster-backup=cloud"

# VolumeSnapshotClass for ovh
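# A hedged sketch of such a class: the name is illustrative, the driver matches
# the cinder.csi.openstack.org driver seen later in this issue, and the
# velero.io/csi-volumesnapshot-class label is what the Velero CSI plugin selects.
# deletionPolicy is a required field; Delete is shown here, adjust to your provider's guidance.
cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-cinder-snapclass
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: cinder.csi.openstack.org
deletionPolicy: Delete
EOF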

# Create the backup
velero backup create "my-cluster-backup-${uuid}" --snapshot-move-data --storage-location "my-cluster-backup" --volume-snapshot-locations "my-cluster-backup" --csi-snapshot-timeout 10m

The backup ended in a PartiallyFailed state with the error Fail to wait VolumeSnapshot snapshot handle created for the majority of PVCs. Some PVCs were still backed up successfully while others were not, so I am guessing it's related to some timeout error.

What did you expect to happen:

I expected the backup to work with the rc version of the CSI plugin, since nothing else changed on the cluster except this version.

The following information will help us better understand what's going on:

The bundle extracted from velero debug --backup:

bundle-2023-09-21-11-15-47.tar.gz

Anything else you would like to add:

I tried running a backup with the exact same install commands mentioned before, but changing the CSI plugin version to v0.5.1:

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0-rc.2,velero/velero-plugin-for-csi:v0.5.1 \
  --no-default-backup-location \
  --features=EnableCSI \
  --no-secret \
  --use-node-agent

And it worked without any error. Here is the debug bundle of the working backup with the CSI plugin in version v0.5.1: bundle-2023-09-21-12-20-13.tar.gz

Of course, even though it worked, it is missing the DataUpload part needed for Data Movement, so it is not what I am looking for.

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

Arcahub commented 1 year ago

I took some time to debug while looking at the source code, so here are my investigations in case they can help in any way:

But if it is iterating twice on this loop, it would mean that the first time it was able to successfully get the VolumeSnapshot and reached the Waiting log line... At this point I don't have any more ideas, so I hope a maintainer can help me :angel: .

Lyndon-Li commented 1 year ago

From the log below, the Velero CSI plugin indeed polled the VS twice. The first time, it got the VS successfully, but it failed the second time:

time="2023-09-20T23:34:26Z" level=info msg="Waiting for CSI driver to reconcile volumesnapshot gitea/velero-gitea-shared-storage-55wxb. Retrying in 5s" Backup=bashroom-cluster-backup5 Operation ID=du-7ce94fa2-58d3-447c-85c8-edb2af97b58a.75d4847f-3ae0-43e426984 Source PVC=gitea/gitea-shared-storage VolumeSnapshot=gitea/velero-gitea-shared-storage-55wxb backup=velero/bashroom-cluster-backup5 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/util/util.go:244" pluginName=velero-plugin-for-csi
time="2023-09-20T23:34:31Z" level=error msg="Fail to wait VolumeSnapshot snapshot handle created: failed to get volumesnapshot gitea/velero-gitea-shared-storage-55wxb: volumesnapshots.snapshot.storage.k8s.io \"velero-gitea-shared-storage-55wxb\" not found" Backup=bashroom-cluster-backup5 Operation ID=du-7ce94fa2-58d3-447c-85c8-edb2af97b58a.75d4847f-3ae0-43e426984 Source PVC=gitea/gitea-shared-storage VolumeSnapshot=gitea/velero-gitea-shared-storage-55wxb backup=velero/bashroom-cluster-backup5 cmd=/plugins/velero-plugin-for-csi logSource="/go/src/velero-plugin-for-csi/internal/backup/pvc_action.go:191" pluginName=velero-plugin-for-csi

Perhaps the VS was deleted after the first poll, but I don't know why. I searched the log, and Velero didn't do it: since the DataUpload request had not been created yet, no data mover modules would touch the VS.

@Arcahub Could you check the CSI driver and external snapshot provisioner log to see any clue about the deletion?

Additionally, could you also try a CSI snapshot backup (without data movement) with Velero 1.12 + CSI plugin 0.6.0? You can run this by removing the --snapshot-move-data flag:

velero backup create "my-cluster-backup-${uuid}" --storage-location "my-cluster-backup" --volume-snapshot-locations "my-cluster-backup" --csi-snapshot-timeout 10m

A CSI snapshot backup has a somewhat different workflow from a CSI snapshot data movement backup; let's see whether this is a generic problem related to CSI snapshots or not.

Arcahub commented 1 year ago

Hello @Lyndon-Li, thank you for taking a look at my issue. I can't answer this week, but on Monday I will try to take a look at the logs, try a CSI snapshot backup without data movement, and provide my feedback.

I didn't mention it in my previous post, but the OVH CSI driver is Cinder, in case that helps somehow.

Arcahub commented 1 year ago

Hello @Lyndon-Li, I just tested running the backup without data movement and it failed. The installation was the same, and the command was also the same except without --snapshot-move-data, so the breaking change seems to be in the CSI snapshot handling. Here is the debug bundle:

bundle-2023-10-02-12-33-45.tar.gz

As I said previously, the CSI driver on OVHcloud is Cinder, but I wasn't able to find any logs.

blackpiglet commented 1 year ago

@Arcahub Could you also try to do the same CSI backup with velero/velero:v1.11.1 and velero/velero-plugin-for-csi:v0.5.1?

There was a change in how the VolumeSnapshot resources created during backup are handled. The VolumeSnapshot resources created during backup should be cleaned up, because that prevents the underlying snapshots from being deleted when the VolumeSnapshots or the VolumeSnapshots' namespace are later deleted.

The change introduced in v1.12.0 is that the VolumeSnapshot cleanup logic moved into the CSI plugin. The benefit is that the time-consuming handling of multiple VolumeSnapshots is now done concurrently.

It's possible that with v1.12.0 Velero and the v0.5.1 CSI plugin, neither side performs the VolumeSnapshot cleanup. This is the CSI plugin and Velero server compatibility matrix: https://github.com/vmware-tanzu/velero-plugin-for-csi/tree/main#compatibility
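If it helps to cross-check, the server and plugin images actually deployed can be read from the Velero deployment. This is a minimal sketch; it assumes the default velero namespace and that the plugins were added as init containers by velero install:

kubectl -n velero get deployment velero \
  -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}{.spec.template.spec.initContainers[*].image}{"\n"}'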

Arcahub commented 1 year ago

@blackpiglet

I already tested with velero/velero:v1.11.1 and velero/velero-plugin-for-csi:v0.5.1 back when I was first setting up Velero on my cluster. I re-tested, and the backup was successful. I didn't test the restore, but I saw the CSI snapshots in the OVH web interface. Here is the bundle in case it can help: bundle-2023-10-16-15-23-15.tar.gz

I am currently using file-system backup, since data movement is an essential feature in my case; that is why I am experimenting with CSI data movement, as I would prefer that strategy.

I also tested with the official 1.12.0 release of Velero and velero-plugin-for-csi:v0.6.1, just in case the release fixed something related, but sadly it is still failing. Again, here is the bundle in case it can help: bundle-2023-10-16-15-52-41.tar.gz

blackpiglet commented 1 year ago

@Arcahub Thanks for the detailed information. I couldn't find anything other than the VolumeSnapshot not found error in the partially failed backup.

But I found some things in the successful backup. First, the versions don't seem right there.

Client:
    Version: v1.11.1
    Git commit: bdbe7eb242b0f64d5b04a7fea86d1edbb3a3587c
Server:
    Version: v1.12.0-rc.2
# WARNING: the client version does not match the server version. Please update client

The client version is right, but the server's version is still v1.12.0.

The images used are:

Second, although the backup finished as Completed, no PV data was backed up.

Velero-Native Snapshots: <none included>

Could you please use the v1.11.x version of the Velero CLI to reinstall the Velero environment? Please uninstall the existing environment with the velero uninstall command first.
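A minimal sketch of that sequence, assuming the default velero namespace and reusing the install flags already shown in this issue:

# run with the v1.11.x CLI
velero uninstall
# then re-run the velero install command from the first comment, with
# velero/velero-plugin-for-csi:v0.5.1 and the AWS plugin tag that matches v1.11.x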

To debug further, could you also check the CSI snapshotter pods' logs to find whether there is any information about why the VolumeSnapshots were deleted?

Arcahub commented 1 year ago

@blackpiglet

I am sorry for my mistake; I was using aliases to switch between versions, but they were not expanded in my bash scripts. Here is the bundle of the test with velero/velero:v1.11.1 and velero/velero-plugin-for-csi:v0.5.1: bundle-2023-10-17-11-14-09.tar.gz

The Velero-Native Snapshots field you mentioned is still empty, but I can assure you that the snapshots are appearing in the OVH interface (see the attached screenshot).

I am 100% sure that those snapshots are created and managed by Velero, since there is no other snapshot mechanism currently enabled on this cluster, and when I delete the backup the snapshots are also deleted.

Sadly, as I said before, I am not able to provide the Cinder CSI pods' logs since I just can't access them. When I run kubectl get pods -A -o name on my cluster with the root kubeconfig, here is the output:

Pods list:

pod/argo-server-79d445949-6nwsf
pod/workflow-controller-55bd57fb6d-pngn8
pod/argocd-application-controller-0
pod/argocd-applicationset-controller-7c9cb6785d-hjd4g
pod/argocd-dex-server-69dbdcbf7d-zzdjj
pod/argocd-notifications-controller-f9d4457df-tttlz
pod/argocd-redis-ha-haproxy-7d7c895d48-7rqrg
pod/argocd-redis-ha-haproxy-7d7c895d48-9lvlj
pod/argocd-redis-ha-haproxy-7d7c895d48-frmxn
pod/argocd-redis-ha-server-0
pod/argocd-redis-ha-server-1
pod/argocd-redis-ha-server-2
pod/argocd-repo-server-774ffb985d-25778
pod/argocd-repo-server-774ffb985d-fk4nf
pod/argocd-server-65c96f7d86-dfj6s
pod/argocd-server-65c96f7d86-kpw2z
pod/website-7df55575f8-zcdt7
pod/camel-k-operator-7d66896b75-s5b8c
pod/cert-manager-6ffb79dfdb-sqp7d
pod/cert-manager-cainjector-5fcd49c96-fkffb
pod/cert-manager-webhook-796ff7697b-8f6fl
pod/cert-manager-webhook-ovh-65648fd49-xzrfb
pod/emqx-operator-controller-manager-697f499bb7-kmgzj
pod/gitea-79f968f68c-zgrkt
pod/gitea-postgresql-ha-pgpool-5b967d985f-ht48w
pod/gitea-postgresql-ha-postgresql-0
pod/gitea-postgresql-ha-postgresql-1
pod/gitea-postgresql-ha-postgresql-2
pod/gitea-redis-cluster-0
pod/gitea-redis-cluster-1
pod/gitea-redis-cluster-2
pod/gitea-redis-cluster-3
pod/gitea-redis-cluster-4
pod/gitea-redis-cluster-5
pod/gdrive-files-processing-wip-to-process-69d4bf86b-crv4w
pod/kafka-cluster-entity-operator-59654f75cb-qbjf2
pod/kafka-cluster-kafka-0
pod/kafka-cluster-kafka-1
pod/kafka-cluster-zookeeper-0
pod/strimzi-cluster-operator-695878cfc8-mj7d2
pod/calico-kube-controllers-65b74d475d-jqzl9
pod/canal-c8k4d
pod/canal-cbr9f
pod/canal-xz9t5
pod/coredns-545567dbbc-qvmtq
pod/coredns-545567dbbc-r5nz2
pod/kube-dns-autoscaler-7d57686cf5-vn6sc
pod/kube-proxy-4gbnk
pod/kube-proxy-5ljwh
pod/kube-proxy-hcsjr
pod/metrics-server-59bc47dc74-dw6wd
pod/secrets-store-csi-driver-4tclt
pod/secrets-store-csi-driver-d8wcb
pod/secrets-store-csi-driver-tlr9h
pod/wormhole-7c2k5
pod/wormhole-f988b
pod/wormhole-nrlh4
pod/alertmanager-kube-prometheus-kube-prome-alertmanager-0
pod/kube-prometheus-grafana-747559ff98-mxlkl
pod/kube-prometheus-kube-prome-operator-698dccc59-68qnj
pod/kube-prometheus-kube-state-metrics-cc66d7d4c-894sp
pod/kube-prometheus-prometheus-node-exporter-4x5bp
pod/kube-prometheus-prometheus-node-exporter-cbzhj
pod/kube-prometheus-prometheus-node-exporter-t9thr
pod/prometheus-kube-prometheus-kube-prome-prometheus-0
pod/nginx-ingress-controller-847c4bbdd-6mtj8
pod/keycloak-operator-6b9cf65f87-7x6r2
pod/sso-0
pod/sso-1
pod/sso-2
pod/sso-db-postgresql-ha-pgpool-5444f46c7d-tcxhs
pod/sso-db-postgresql-ha-postgresql-0
pod/sso-db-postgresql-ha-postgresql-1
pod/sso-db-postgresql-ha-postgresql-2
pod/vault-0
pod/vault-1
pod/vault-2
pod/vault-agent-injector-57db6b66cf-gvmzq
pod/vault-csi-provider-4zvdv
pod/vault-csi-provider-cts95
pod/vault-csi-provider-wmxcb
pod/node-agent-7vn95
pod/node-agent-c9brp
pod/node-agent-gccqv
pod/velero-64bdb44f88-8rdr8

OVH might not be managing the CSI driver through pods, or might just be hiding them from users, but I am not able to provide any logs since I don't have access to them. I totally agree that they would help to debug this issue, and at least I can try to contact support to ask for the logs.

Just in case, I re-ran with the official latest release 1.12.0, since I had made the same mistake by not changing the version. It ended with the same PartiallyFailed state as before: bundle-2023-10-17-11-55-03.tar.gz

blackpiglet commented 1 year ago

Thanks for the feedback. I found a pattern among the snapshot-data-moved PVCs. The three PVCs created in namespace sso succeeded, and their StorageClass is csi-cinder-high-speed. The failed PVC's StorageClass is csi-cinder-classic.

Could you check the other failed PVCs' StorageClass settings? And what's the difference between the storage backends of those two StorageClasses?
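For reference, something like the following should list every PVC together with its StorageClass (a hedged example; it assumes cluster-wide read access):

kubectl get pvc --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STORAGECLASS:.spec.storageClassName'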

Backup Item Operations:
  Operation for persistentvolumeclaims gitea/redis-data-gitea-redis-cluster-5:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d1abf0dc-9873-42bc-9659-399d470fdd94.95bfeed4-7089-435508b50
    Items to Update:
                           datauploads.velero.io velero/bashroom-cluster-backup9-949tv
    Phase:                 Failed
    Operation Error:       error to expose snapshot: error to get volume snapshot content: error getting volume snapshot content from API: volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-bab583b4-02cb-4b44-a6ec-5d14fb2f9300" not found
    Progress description:  Failed
    Created:               2023-10-17 11:53:13 +0200 CEST
    Started:               2023-10-17 11:53:13 +0200 CEST
    Updated:               2023-10-17 11:53:13 +0200 CEST
  Operation for persistentvolumeclaims sso/data-sso-db-postgresql-ha-postgresql-0:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d1abf0dc-9873-42bc-9659-399d470fdd94.74bb2aa0-9b19-4afff55dc
    Items to Update:
                           datauploads.velero.io velero/bashroom-cluster-backup9-lcfzs
    Phase:                 Completed
    Progress:              228711229 of 228711229 complete (Bytes)
    Progress description:  Completed
    Created:               2023-10-17 11:53:35 +0200 CEST
    Started:               2023-10-17 11:53:35 +0200 CEST
    Updated:               2023-10-17 11:54:16 +0200 CEST
  Operation for persistentvolumeclaims sso/data-sso-db-postgresql-ha-postgresql-1:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d1abf0dc-9873-42bc-9659-399d470fdd94.5cae76ff-431d-4ee4bc856
    Items to Update:
                           datauploads.velero.io velero/bashroom-cluster-backup9-fvgpr
    Phase:                 Completed
    Progress:              77841062 of 77841062 complete (Bytes)
    Progress description:  Completed
    Created:               2023-10-17 11:53:40 +0200 CEST
    Started:               2023-10-17 11:53:40 +0200 CEST
    Updated:               2023-10-17 11:54:17 +0200 CEST
  Operation for persistentvolumeclaims sso/data-sso-db-postgresql-ha-postgresql-2:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-d1abf0dc-9873-42bc-9659-399d470fdd94.2ea47ffd-8e3c-4cb9a7068
    Items to Update:
                           datauploads.velero.io velero/bashroom-cluster-backup9-q76n9
    Phase:                 Completed
    Progress:              144949922 of 144949922 complete (Bytes)
    Progress description:  Completed
    Created:               2023-10-17 11:53:45 +0200 CEST
    Started:               2023-10-17 11:53:45 +0200 CEST
    Updated:               2023-10-17 11:54:25 +0200 CEST
Arcahub commented 1 year ago

@blackpiglet Sorry for the late reply,

csi-cinder-high-speed is the default StorageClass on the OVH cluster; the only difference is that it is based on SSD storage instead of HDD, for faster IO operations. We mostly use csi-cinder-classic, and in some cases we have csi-cinder-high-speed, either from a loose StorageClass configuration or from deliberately choosing it.

Here is the list of PVCs in the cluster:

PVC list:

NAME                                      STORAGECLASS
data-gitea-postgresql-ha-postgresql-0     csi-cinder-classic
data-gitea-postgresql-ha-postgresql-1     csi-cinder-classic
data-gitea-postgresql-ha-postgresql-2     csi-cinder-classic
gitea-shared-storage                      csi-cinder-high-speed
redis-data-gitea-redis-cluster-0          csi-cinder-classic
redis-data-gitea-redis-cluster-1          csi-cinder-classic
redis-data-gitea-redis-cluster-2          csi-cinder-classic
redis-data-gitea-redis-cluster-3          csi-cinder-classic
redis-data-gitea-redis-cluster-4          csi-cinder-classic
redis-data-gitea-redis-cluster-5          csi-cinder-classic
data-0-kafka-cluster-kafka-0              csi-cinder-high-speed
data-0-kafka-cluster-kafka-1              csi-cinder-high-speed
data-kafka-cluster-zookeeper-0            csi-cinder-high-speed
data-sso-db-postgresql-ha-postgresql-0    csi-cinder-high-speed
data-sso-db-postgresql-ha-postgresql-1    csi-cinder-high-speed
data-sso-db-postgresql-ha-postgresql-2    csi-cinder-high-speed
audit-vault-0                             csi-cinder-classic
audit-vault-1                             csi-cinder-classic
audit-vault-2                             csi-cinder-classic
data-vault-0                              csi-cinder-classic
data-vault-1                              csi-cinder-classic
data-vault-2                              csi-cinder-classic

My interpretation is that the error we are facing is somehow a latency error, or at least a time-related one: high-speed PVCs are more likely to complete or be reachable at the moment Velero makes the API call. But still, we can see that not all high-speed PVCs are successful.

I checked the other bundles I uploaded earlier in this issue and was able to find other PVCs that succeeded, but they were not always using csi-cinder-high-speed.

blackpiglet commented 1 year ago

Thanks. I agree that using a high-speed disk doesn't mean the snapshot creation will succeed. I think we need more information from the CSI driver and snapshot controller to learn why the VolumeSnapshots are deleted.

Arcahub commented 1 year ago

Yeah, I do agree on that. I have created a ticket with OVH support to ask for access to the CSI driver logs and for some help on this issue from their side. I am waiting for an answer from them and will keep you updated.

I also have an on-premise OpenStack installation on my side, so I will try to install a Kubernetes cluster with my own Cinder CSI driver to test whether this issue is related only to OVH or to the Cinder CSI driver in general.

Lyndon-Li commented 11 months ago

@Arcahub See issue #7068. Although the current problem is different from that one, we can troubleshoot it in the same way: collect the snapshot controller pods' logs (there are multiple containers in the snapshot controller pods, so collect the log for each container) before and after the problem happens, and from the logs we will be able to know for sure who deleted the VS.

I think you may not need to contact the CSI driver vendor, because the snapshot controller is a Kubernetes upstream module and its pods should be in the kube-system namespace.
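Something along these lines should collect those logs (a hedged sketch; the pod name differs between distributions, so the grep and the placeholder are illustrative):

kubectl -n kube-system get pods | grep -i snapshot
kubectl -n kube-system logs <snapshot-controller-pod> --all-containers --prefix > snapshot-controller.log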

MrOffline77 commented 5 months ago

I'm running on OVH too, with the same behavior as far as I understand it so far. Anyway, I do have access to the CSI driver logs. At least, I think this is the correct spot you requested the logs from.

On each K8s node there is a container like registry.kubernatine.ovh/public/cinder-csi-plugin-amd64:192 running within an extra containerd namespace. Below you can find the logs of one container as an example. The other ones look the same when the backup starts.

The log below starts together with the Velero backup.

I0515 13:00:21.637381       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0515 13:00:21.662108       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0515 13:00:21.664284       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0515 13:00:21.666238       9 utils.go:81] GRPC call: /csi.v1.Node/NodeStageVolume
I0515 13:00:21.666258       9 nodeserver.go:352] NodeStageVolume: called with args {"publish_context":{"DevicePath":"/dev/sdd"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/cinder.csi.openstack.org/1506deb58c4ea03b0a8c329ac69367dc7252a798cca88c9048bbb936f2c3a55c/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"storage.kubernetes.io/csiProvisionerIdentity":"1715775916462-7155-cinder.csi.openstack.org"},"volume_id":"e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad"}
I0515 13:00:22.333662       9 mount.go:171] Found disk attached as "scsi-0QEMU_QEMU_HARDDISK_e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad"; full devicepath: /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad
I0515 13:00:22.333740       9 mount_linux.go:446] Attempting to determine if disk "/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad" is formatted using blkid with args: ([-p -s TYPE -s PTTYPE -o export /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad])
I0515 13:00:22.342622       9 mount_linux.go:449] Output: "DEVNAME=/dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad\nTYPE=ext4\n"
I0515 13:00:22.342648       9 mount_linux.go:340] Checking for issues with fsck on disk: /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad
I0515 13:00:22.535385       9 mount_linux.go:436] Attempting to mount disk /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad in ext4 format at /var/lib/kubelet/plugins/kubernetes.io/csi/cinder.csi.openstack.org/1506deb58c4ea03b0a8c329ac69367dc7252a798cca88c9048bbb936f2c3a55c/globalmount
I0515 13:00:22.535449       9 mount_linux.go:175] Mounting cmd (mount) with arguments (-t ext4 -o defaults /dev/disk/by-id/scsi-0QEMU_QEMU_HARDDISK_e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad /var/lib/kubelet/plugins/kubernetes.io/csi/cinder.csi.openstack.org/1506deb58c4ea03b0a8c329ac69367dc7252a798cca88c9048bbb936f2c3a55c/globalmount)
I0515 13:00:22.557128       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0515 13:00:22.574137       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0515 13:00:22.575351       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0515 13:00:22.576431       9 utils.go:81] GRPC call: /csi.v1.Node/NodePublishVolume
I0515 13:00:22.576458       9 nodeserver.go:51] NodePublishVolume: called with args {"publish_context":{"DevicePath":"/dev/sdd"},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/cinder.csi.openstack.org/1506deb58c4ea03b0a8c329ac69367dc7252a798cca88c9048bbb936f2c3a55c/globalmount","target_path":"/var/lib/kubelet/pods/1e210d86-15ce-4eee-9132-e237bb237ac0/volumes/kubernetes.io~csi/ovh-managed-kubernetes-8o7qqc-pvc-10a49715-f6f7-4580-95f8-7b9b53b2849a/mount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"csi.storage.k8s.io/ephemeral":"false","csi.storage.k8s.io/pod.name":"nightly-20240515125933-m8dzp","csi.storage.k8s.io/pod.namespace":"velero","csi.storage.k8s.io/pod.uid":"1e210d86-15ce-4eee-9132-e237bb237ac0","csi.storage.k8s.io/serviceAccount.name":"velero","storage.kubernetes.io/csiProvisionerIdentity":"1715775916462-7155-cinder.csi.openstack.org"},"volume_id":"e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad"}
I0515 13:00:22.698135       9 mount_linux.go:175] Mounting cmd (mount) with arguments (-t ext4 -o bind /var/lib/kubelet/plugins/kubernetes.io/csi/cinder.csi.openstack.org/1506deb58c4ea03b0a8c329ac69367dc7252a798cca88c9048bbb936f2c3a55c/globalmount /var/lib/kubelet/pods/1e210d86-15ce-4eee-9132-e237bb237ac0/volumes/kubernetes.io~csi/ovh-managed-kubernetes-8o7qqc-pvc-10a49715-f6f7-4580-95f8-7b9b53b2849a/mount)
I0515 13:00:22.702164       9 mount_linux.go:175] Mounting cmd (mount) with arguments (-t ext4 -o bind,remount,rw /var/lib/kubelet/plugins/kubernetes.io/csi/cinder.csi.openstack.org/1506deb58c4ea03b0a8c329ac69367dc7252a798cca88c9048bbb936f2c3a55c/globalmount /var/lib/kubelet/pods/1e210d86-15ce-4eee-9132-e237bb237ac0/volumes/kubernetes.io~csi/ovh-managed-kubernetes-8o7qqc-pvc-10a49715-f6f7-4580-95f8-7b9b53b2849a/mount)
I0515 13:00:25.378530       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0515 13:00:25.381054       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetVolumeStats
I0515 13:00:25.381068       9 nodeserver.go:478] NodeGetVolumeStats: called with args {"volume_id":"8e9c13ad-01ae-41bd-b37f-244177f2d894","volume_path":"/var/lib/kubelet/pods/bf847b6a-415c-4c9b-b272-e3f60261041d/volumes/kubernetes.io~csi/ovh-managed-kubernetes-8o7qqc-pvc-064bf59e-48fc-4ab9-929e-f269e3013183/mount"}
I0515 13:00:33.363713       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0515 13:00:33.369426       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetVolumeStats
I0515 13:00:33.369441       9 nodeserver.go:478] NodeGetVolumeStats: called with args {"volume_id":"8dcfb96c-9d91-4b7e-bf15-2cbff72fd399","volume_path":"/var/lib/kubelet/pods/fe1b03a7-a4fa-4ff8-a9bb-8cd809ddc46e/volumes/kubernetes.io~csi/ovh-managed-kubernetes-8o7qqc-pvc-82fd4660-1d71-4aac-b026-8880a5abc3ff/mount"}
I0515 13:00:46.774014       9 utils.go:81] GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0515 13:00:46.774055       9 nodeserver.go:269] NodeUnPublishVolume: called with args {"target_path":"/var/lib/kubelet/pods/1e210d86-15ce-4eee-9132-e237bb237ac0/volumes/kubernetes.io~csi/ovh-managed-kubernetes-8o7qqc-pvc-10a49715-f6f7-4580-95f8-7b9b53b2849a/mount","volume_id":"e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad"}
I0515 13:00:47.068446       9 mount_helper_common.go:99] "/var/lib/kubelet/pods/1e210d86-15ce-4eee-9132-e237bb237ac0/volumes/kubernetes.io~csi/ovh-managed-kubernetes-8o7qqc-pvc-10a49715-f6f7-4580-95f8-7b9b53b2849a/mount" is a mountpoint, unmounting
I0515 13:00:47.068479       9 mount_linux.go:266] Unmounting /var/lib/kubelet/pods/1e210d86-15ce-4eee-9132-e237bb237ac0/volumes/kubernetes.io~csi/ovh-managed-kubernetes-8o7qqc-pvc-10a49715-f6f7-4580-95f8-7b9b53b2849a/mount
W0515 13:00:47.073768       9 mount_helper_common.go:129] Warning: "/var/lib/kubelet/pods/1e210d86-15ce-4eee-9132-e237bb237ac0/volumes/kubernetes.io~csi/ovh-managed-kubernetes-8o7qqc-pvc-10a49715-f6f7-4580-95f8-7b9b53b2849a/mount" is not a mountpoint, deleting
I0515 13:00:47.177426       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetCapabilities
I0515 13:00:47.179256       9 utils.go:81] GRPC call: /csi.v1.Node/NodeUnstageVolume
I0515 13:00:47.179290       9 nodeserver.go:418] NodeUnstageVolume: called with args {"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/cinder.csi.openstack.org/1506deb58c4ea03b0a8c329ac69367dc7252a798cca88c9048bbb936f2c3a55c/globalmount","volume_id":"e3ca84b8-e6d5-46de-b1d9-9253df4ab2ad"}
I0515 13:00:47.255762       9 mount_helper_common.go:99] "/var/lib/kubelet/plugins/kubernetes.io/csi/cinder.csi.openstack.org/1506deb58c4ea03b0a8c329ac69367dc7252a798cca88c9048bbb936f2c3a55c/globalmount" is a mountpoint, unmounting
I0515 13:00:47.255798       9 mount_linux.go:266] Unmounting /var/lib/kubelet/plugins/kubernetes.io/csi/cinder.csi.openstack.org/1506deb58c4ea03b0a8c329ac69367dc7252a798cca88c9048bbb936f2c3a55c/globalmount
W0515 13:00:47.324616       9 mount_helper_common.go:129] Warning: "/var/lib/kubelet/plugins/kubernetes.io/csi/cinder.csi.openstack.org/1506deb58c4ea03b0a8c329ac69367dc7252a798cca88c9048bbb936f2c3a55c/globalmount" is not a mountpoint, deleting
I0515 13:02:07.609636       9 utils.go:81] GRPC call: /csi.v1.Node/NodeGetCapabilities

Let me know if you need any further logs from me to assist.

Lyndon-Li commented 5 months ago

@MrOffline77 Actually, we need the external-snapshotter logs, as mentioned in #7068. There are multiple containers, including sidecar containers, and we need the logs from all of them.