vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

failed to take snapshot of the volume -- failed with error: error parsing volume id -- should at least contain two # #8122

Open sivarama-p-raju opened 3 weeks ago

sivarama-p-raju commented 3 weeks ago

What steps did you take and what happened: When a normal scheduled backup runs, the backup completes with the state "PartiallyFailed". On reviewing the backup description, the below error was found repeated many times:

  Velero:    message: /VolumeSnapshotContent snapcontent-b1cf790e-dfd6-4689-9881-ab9d329cea16 has error: Failed to check and update snapshot content: failed to take snapshot of the volume <AZ-RG>-main-dev: "rpc error: code = Internal desc = GetFileShareInfo(<AZ-RG>-main-dev) failed with error: error parsing volume id: \"<AZ-RG>-main-dev\", should at least contain two #"

The volume in question, "<AZ-RG>-main-dev", uses the storage class "azurefile-csi" but is not a dynamically provisioned volume.

Other volumes using the same storage class are dynamically provisioned; their volume handles contain at least two "#", so for those volumes the requirement is met.

Is this really a hard requirement? And does it apply only to volumes using the "azurefile-csi" storage class?

What did you expect to happen:

I expect the backup to complete successfully without these errors.

Anything else you would like to add: This is on an AKS cluster running Kubernetes v1.29.4. We see a similar issue on multiple AKS clusters.

Environment:

blackpiglet commented 3 weeks ago

I'm a little confused about your scenario. If a PVC is not dynamically provisioned, how is it related to StorageClass?

As for the error, I think there is a limitation on the format of the file share ID in the Azure File CSI code: https://github.com/kubernetes-sigs/azurefile-csi-driver/blob/ed0c596cf08226abce9091b18e82e9261ed99131/pkg/azurefile/azurefile.go#L474-L481
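
For reference, the linked GetFileShareInfo splits the volume ID on "#" and rejects anything with fewer than three segments, which is where the "should at least contain two #" message comes from. Roughly, the expected layout is (the field names are an assumption based on the driver's static-provisioning examples, not something stated in this thread):

<resource-group-name>#<storage-account-name>#<file-share-name>[#<optional fields>]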

reasonerjt commented 3 weeks ago

It seems Azure assumes that if a PV is provisioned by azurefile-csi, its volume ID MUST contain at least two "#"s.

sivarama-p-raju commented 3 weeks ago

@blackpiglet

Thank you for your response.

The PV is provisioned statically using a manifest with a specific name, and the PVC then references that PV. This is also a valid way of provisioning a volume, as described in the official documentation here. Here are the manifests used in this case:

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: <pv-name>
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 10Gi
  csi:
    driver: file.csi.azure.com
    nodeStageSecretRef:
      name: file-share-secret
      namespace: <secret-ns>
    readOnly: false
    volumeAttributes:
      resourceGroup: <azure-resource-group>
      shareName: <share-name>
    volumeHandle: <azure-resource-group>-main-dev
  mountOptions:
  - dir_mode=0777
  - file_mode=0777
  - uid=0
  - gid=0
  - mfsymlinks
  - cache=strict
  - nosharesock
  - nobrl
  persistentVolumeReclaimPolicy: Retain
  storageClassName: azurefile-csi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <pvc-name>
  namespace: <pvc-ns>
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: azurefile-csi
  volumeName: <pv-name>
---

Thank you for the link to the Azure File CSI code showing the limitation on the volume handle format.

Could you please let me know your thoughts on what I could do in this case?
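
For comparison, here is a minimal sketch of the same csi stanza with a volumeHandle in a layout the snapshotter can parse. The values are placeholders, the {resource-group}#{storage-account}#{share-name} convention is assumed from the azurefile-csi static-provisioning examples, and this part of a PV spec is immutable, so changing the handle would likely mean recreating the PV object:

  csi:
    driver: file.csi.azure.com
    # assumed layout: <resource-group>#<storage-account>#<file-share-name>
    volumeHandle: <azure-resource-group>#<storage-account-name>#<share-name>
    volumeAttributes:
      resourceGroup: <azure-resource-group>
      shareName: <share-name>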

blackpiglet commented 3 weeks ago

Due to the Azure File CSI snapshotter limitation, and because the volume was not created through CSI provisioning, I don't think we can back up this volume via CSI snapshots. How about using the file system backup instead? https://velero.io/docs/v1.14/file-system-backup/
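
With the opt-in approach described in those docs, a volume is selected for file system backup by annotating the pod that mounts it, for example (names are placeholders):

kubectl -n <pvc-ns> annotate pod/<pod-name> backup.velero.io/backup-volumes=<volume-name>

Here <volume-name> is the volume's name in the pod spec, not the PVC name.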

sivarama-p-raju commented 2 weeks ago

@blackpiglet

Thank you for the update. Yes, I plan to test the filesystem backup method for the problematic volumes.

I plan to annotate the deployments that use these volumes with the annotation below, so that file system backup is used only for those volumes (see the sketch after the annotation):

backup.velero.io/backup-volumes: <volume name>
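
Since the pods are managed by a Deployment, the annotation is usually placed on the pod template so that it survives pod recreation; a minimal sketch (placeholder names, with <volume-name> being the volume's name under the pod spec's volumes list):

spec:
  template:
    metadata:
      annotations:
        backup.velero.io/backup-volumes: <volume-name>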

sivarama-p-raju commented 1 week ago

@blackpiglet

Please note that I annotated the pods with the below annotation:

backup.velero.io/backup-volumes: <volume name>

On running a fresh backup, I see that it completed with the status "PartiallyFailed", with the below errors shown by velero backup describe:

Errors:
  Velero:    name: /<pod-name> message: /Error backing up item error: /daemonset pod not found in running state in node <aks-node>
             name: /<pod-name> message: /Error backing up item error: /daemonset pod not found in running state in node <aks-node>
             name: /<pod-name> message: /Error backing up item error: /daemonset pod not found in running state in node <aks-node>

I'm not sure I understand this error. The pods are not part of a DaemonSet; they are part of a Deployment.

Could you please let me know your thoughts on this?

sseago commented 6 days ago

@sivarama-p-raju Are you running the node agent? You told velero to use fs-backup for those pods, but if you're using fs-backup with kopia (or restic), then you need to run the node agent daemonset. From the error message, either the node agent isn't running at all, or it's not running on the nodes with your pods for some reason.
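
A quick way to confirm, assuming Velero is installed in the velero namespace with the default DaemonSet name:

kubectl -n velero get daemonset node-agent
kubectl -n velero get pods -o wide | grep node-agent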

sivarama-p-raju commented 5 days ago

@sseago Thank you for your reply. We did not have the node agent running on this particular cluster. I enabled deployment of the node agent and triggered a new backup afterwards.
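
For reference, how the node agent gets enabled depends on the install method; a sketch of the two common paths (assumptions, since the exact method used on this cluster isn't stated):

# Velero CLI install
velero install --use-node-agent
# or, in the Helm chart values
deployNodeAgent: true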

There are new errors this time:

Errors:
  Velero:    name: /<pod-name> message: /Error backing up item error: /failed to wait BackupRepository, errored early: backup repository is not ready: error to connect to backup repo: error to connect repo with storage: error to connect to repository: unable to write config file: unable to create config directory: mkdir /home/cnb/udmrepo: read-only file system

On searching further, I found this issue and tried the same workaround.

The backups now complete successfully.

However, I notice that the <ns>-default-kopia-<xxxxx>-maintain-job-.... jobs have now started failing with the below errors:

time="2024-09-07T15:15:07Z" level=info msg="use the storage account URI retrieved from the storage account properties \"https://<storage-account>.blob.core.windows.net/\""
time="2024-09-07T15:15:07Z" level=info msg="auth with Azure AD"
time="2024-09-07T15:15:08Z" level=info msg="use the storage account URI retrieved from the storage account properties \"https://<storage-account>.blob.core.windows.net/\""
time="2024-09-07T15:15:08Z" level=info msg="auth with Azure AD"
time="2024-09-07T15:15:08Z" level=warning msg="active indexes [xn0_05f18fdeda88f2ebfc065202161e93f0-s745559b64bb7df4f12c-c1 xn0_256311ded6016c2ac402604a412c7e79-sb73cf658e60c314812c-c1] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-09-07T15:15:08Z" level=info msg="Start to open repo for maintenance, allow index write on load" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:165"
time="2024-09-07T15:15:08Z" level=info msg="use the storage account URI retrieved from the storage account properties \"https://<storage-account>.blob.core.windows.net/\""
time="2024-09-07T15:15:08Z" level=info msg="auth with Azure AD"
time="2024-09-07T15:15:09Z" level=warning msg="active indexes [xn0_05f18fdeda88f2ebfc065202161e93f0-s745559b64bb7df4f12c-c1 xn0_256311ded6016c2ac402604a412c7e79-sb73cf658e60c314812c-c1] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
time="2024-09-07T15:15:09Z" level=info msg="Succeeded to open repo for maintenance" logSource="pkg/repository/udmrepo/kopialib/lib_repo.go:172"
time="2024-09-07T15:15:09Z" level=error msg="An error occurred when running repo prune" error="failed to prune repo: error to prune backup repo: error to maintain repo: error to run maintenance under mode auto: maintenance must be run by designated user: " error.file="/go/src/github.com/vmware-tanzu/velero/pkg/repository/udmrepo/kopialib/lib_repo.go:219" error.function="github.com/vmware-tanzu/velero/pkg/repository/udmrepo/kopialib.(*kopiaMaintenance).runMaintenance" logSource="pkg/cmd/cli/repomantenance/maintenance.go:72"

I need your help with the below queries:

Thanks a lot in advance.