vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Restore is in PartiallyFailed : stderr=ignoring error for /.snapshot: UtimesNano: read-only file system #5578

Open vfolk30 opened 1 year ago

vfolk30 commented 1 year ago

What steps did you take and what happened: I have a k8s cluster with Velero + Restic. I took a backup of a namespace that has a persistent volume claim; the backup completed properly with Completed status. Next, to restore into the same cluster, I deleted the namespace and restored from S3 object storage. The restore is in PartiallyFailed state.
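Roughly, the steps were equivalent to the following (the exact flags may have differed, and the pod volume may have been opted in via the backup.velero.io/backup-volumes annotation rather than the flag):

velero backup create data-1 --include-namespaces data --default-volumes-to-restic
kubectl delete namespace data
velero restore create --from-backup data-1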

restore_describe_data-1-20221108160411.txt

Name:         data-1-20221108160411
Namespace:    velero
Labels:
Annotations:

Phase:  PartiallyFailed (run 'velero restore logs data-1-20221108160411' for more information)

Total items to be restored:  8
Items restored:              8

Started:    2022-11-08 16:04:11 +0100 CET
Completed:  2022-11-08 16:04:15 +0100 CET

What did you expect to happen:

The in-cluster NFS storage class is exported as RW.

Velero should restore the backup cleanly.

velero restore logs

cat restore_data-1-20221108160411.log
for all restic restores to complete" logSource="pkg/restore/restore.go:551" restore=velero/data-1-20221108160411
time="2022-11-08T15:04:15Z" level=error msg="unable to successfully complete restic restores of pod's volumes" error="pod volume restore failed: error running restic restore, cmd=restic restore --repo=s3:s3-url:10443/s3-poc/restic/data --password-file=/tmp/credentials/velero/velero-restic-credentials-repository-password --cacert=/tmp/cacert-default3812448782 --cache-dir=/scratch/.cache/restic e95771eb --target=., stdout=restoring <Snapshot e95771eb of [/host_pods/605121c3-7da1-4e7d-846f-94e8f9228bad/volumes/kubernetes.io~nfs/pvc-9da69266-e025-4a64-abac-00a172106f29] at 2022-11-08 15:02:14.904234069 +0000 UTC by root@velero> to .\n, stderr=ignoring error for /.snapshot: UtimesNano: read-only file system\nFatal: There were 1 errors\n\n: exit status 1" logSource="pkg/restore/restore.go:1579" restore=velero/data-1-20221108160411

kubectl logs deployment/velero -n velero

time="2022-11-09T11:57:36Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:131"
time="2022-11-09T11:57:36Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:116"
time="2022-11-09T11:58:36Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:131"
time="2022-11-09T11:58:36Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:116"
time="2022-11-09T11:59:36Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:131"
time="2022-11-09T11:59:36Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:116"
time="2022-11-09T12:00:36Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:131"
time="2022-11-09T12:00:36Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:116"
time="2022-11-09T12:01:36Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:131"
time="2022-11-09T12:01:36Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=velero/default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:116"
# velero backup describe data-1 --details 

Name:         data-1
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.21.14
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=21

Phase:  Completed

Errors:    0
Warnings:  0

Namespaces:
  Included:  data
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2022-11-08 16:02:01 +0100 CET
Completed:  2022-11-08 16:02:16 +0100 CET

Expiration:  2022-12-08 16:02:01 +0100 CET

Total items to be backed up:  24
Items backed up:              24

Resource List:
  v1/ConfigMap:
    - data/istio-ca-root-cert
    - data/kube-root-ca.crt
  v1/Event:
    - data/task-pv-claim.1725a3d09fefd735
    - data/task-pv-claim.1725a3e4b6cd9633
    - data/task-pv-claim.1725a3e4b7400472
    - data/task-pv-claim.1725a3e4b7cebed9
    - data/task-pv-pod.1725a3d8255f92ff
    - data/task-pv-pod.1725a3e198697c42
    - data/task-pv-pod.1725a3e730906a34
    - data/task-pv-pod.1725a3e76a4b693b
    - data/task-pv-pod.1725a3e8dcbc5221
    - data/task-pv-pod.1725a3e8dcbcb033
    - data/task-pv-pod.1725a3e8e28123e7
    - data/task-pv-pod.1725a3e8e2814efd
    - data/task-pv-pod.1725a3fa711d8ff4
    - data/task-pv-pod.1725a3fa7ed85e06
    - data/task-pv-pod.1725a3fa7fb66ba9
    - data/task-pv-pod.1725a3fa847bca60
  v1/Namespace:
    - data
  v1/PersistentVolume:
    - pvc-9da69266-e025-4a64-abac-00a172106f29
  v1/PersistentVolumeClaim:
    - data/task-pv-claim
  v1/Pod:
    - data/task-pv-pod
  v1/Secret:
    - data/default-token-tfp4k
  v1/ServiceAccount:
    - data/default

Velero-Native Snapshots: <none included>

Restic Backups:
  Completed:
    data/task-pv-pod: task-pv-storage

Anything else you would like to add:

The storage class is NFS, and it is exported as RW.

Environment:

velero version
Client:
    Version:     v1.9.2
    Git commit:  82a100981cc66d119cf9b1d121f45c5c9dcf99e1
Server:
    Version:     v1.9.2

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

blackpiglet commented 1 year ago

I think this is caused by Restic trying to restore the modification timestamp of the .snapshot directory and failing because the file system is read-only. Below is the piece of Restic code that restores the timestamp. Could you check this directory's permission settings? Adding write permission might make it work.

func (node Node) RestoreTimestamps(path string) error {
    var utimes = [...]syscall.Timespec{
        syscall.NsecToTimespec(node.AccessTime.UnixNano()),
        syscall.NsecToTimespec(node.ModTime.UnixNano()),
    }

    if node.Type == "symlink" {
        return node.restoreSymlinkTimestamps(path, utimes)
    }

    if err := syscall.UtimesNano(path, utimes[:]); err != nil {
        return errors.Wrap(err, "UtimesNano")
    }

    return nil
}
sseago commented 1 year ago

This is NFS. Is root squashing enabled on the volume? If so, then restic running as root won't guarantee write access. To get around this, you will need to set the supplementalGroups on the Restic DaemonSet's SecurityContext to a group which has write access to the volume.
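For illustration, a minimal sketch of that change on the restic DaemonSet (the GID 2000 is only a placeholder; substitute a group that actually has write access to the NFS export):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: restic
  namespace: velero
spec:
  template:
    spec:
      securityContext:
        runAsUser: 0
        # placeholder GID; must match a group with write access to the volume
        supplementalGroups:
        - 2000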

vfolk30 commented 1 year ago

Hello @sseago, the NFS shares are exported with no_root_squash.

@blackpiglet is there any way to exclude a specific directory from the restore or backup?

blackpiglet commented 1 year ago

@vfolk30 No. So far, Velero doesn't support specifying which directories of a volume to back up. I agree that raising the permissions of the Restic DaemonSet could solve the problem.

sseago commented 1 year ago

@vfolk30 OK, if you're using no_root_squash, then there shouldn't be any NFS-related permission issues with the restic pod, assuming it's running as a privileged pod (running as root). If you were using root_squash, you'd need to add supplementalGroups to the restic DaemonSet's SecurityContext to match a GID that has write access to the volume. With root_squash, without that, there's no way the Restic pod will have write access.

vfolk30 commented 1 year ago

@blackpiglet and @sseago I did some more investigation and found the following:

A .snapshot directory appears in each PVC because the backend storage is NetApp, and the directory is owned by root, so Restic is unable to modify it while restoring.

Do you think there is any workaround?

blackpiglet commented 1 year ago

@vfolk30 Could you post the Restic DaemonSet's YAML? If a customized ConfigMap is used, please also post it; it should be a ConfigMap named restic-restore-action-config in the namespace where Velero is installed.

We need to check the Restic DaemonSet's running privilege, user, and group settings.

vfolk30 commented 1 year ago

Hello, please find the restic DaemonSet YAML below. I have tried to remove unnecessary info from it.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
  labels:
    component: velero
  name: restic
  namespace: velero
spec:
  selector:
    matchLabels:
      name: restic
  template:
    metadata:

      labels:
        component: velero
        name: restic
    spec:
      containers:
      - args:
        - restic
        - server
        - --features=
        command:
        - /velero
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: VELERO_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: VELERO_SCRATCH_DIR
          value: /scratch
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /credentials/cloud
        - name: AWS_SHARED_CREDENTIALS_FILE
          value: /credentials/cloud
        - name: AZURE_CREDENTIALS_FILE
          value: /credentials/cloud
        - name: ALIBABA_CLOUD_CREDENTIALS_FILE
          value: /credentials/cloud
        image: velero/velero:v1.9.2
        imagePullPolicy: IfNotPresent
        name: restic
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 500m
            memory: 512Mi
        volumeMounts:
        - mountPath: /host_pods
          mountPropagation: HostToContainer
          name: host-pods
        - mountPath: /scratch
          name: scratch
        - mountPath: /credentials
          name: cloud-credentials
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        runAsUser: 0
      serviceAccount: velero
      serviceAccountName: velero
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pods
          type: ""
        name: host-pods
      - emptyDir: {}
        name: scratch
      - name: cloud-credentials
        secret:
          defaultMode: 420
          secretName: cloud-credentials
Lyndon-Li commented 1 year ago

The .snapshot directory contains the data and metadata for snapshots managed by the NetApp storage. That data/metadata can only be interpreted by NetApp storage tools, so it is useless to Velero, which treats it as plain files. Moreover, the data in the .snapshot directory is primarily delta/changed data between snapshots, so as snapshots accumulate and data is overwritten, .snapshot grows increasingly large; this means Velero spends a lot of time backing up useless data. Finally, as you've already seen, Velero has a problem restoring the data in .snapshot, because the NetApp storage tools do not grant write permission on those data/metadata files. That is understandable, since they don't want third parties to write to them and mess the data up.

Lyndon-Li commented 1 year ago

Considering the above, there are two solutions:

  1. Make Velero skip the .snapshot directory at backup time. At present, Velero doesn't support this.
  2. Hide the .snapshot directory from the NFS client. I found a NetApp doc on this topic (a rough example is sketched below).
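For example, on NetApp ONTAP the snapshot directory can typically be hidden from NFS clients per volume; the exact command depends on the ONTAP version (check the NetApp doc), but it is roughly the following, where the vserver and volume names are placeholders:

volume modify -vserver <vserver_name> -volume <volume_name> -snapdir-access false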
Lyndon-Li commented 1 year ago

@vfolk30 I'm not sure whether approach #2 above fits your environment. As for approach #1, I don't think it is in Velero's near-term plan: Velero will focus on solving generic Kubernetes data-protection problems, both on cloud and on-premises, and storage-specific problems like this one may be treated as low priority.

vfolk30 commented 1 year ago

@Lyndon-Li Thanks for updating the thread. In the future, it would be a good feature for Velero to be able to skip specific files/folders.

maristeslk commented 1 year ago

Considering the above, there are two solutions:

  1. Make Velero skip the .snapshot directory at backup time. At present, Velero doesn't support this.
  2. Hide the .snapshot directory from the NFS client. I found a NetApp doc on this topic.

Approach 2 saved me, thx!!!

piyushjain1804 commented 1 year ago

Hello @Lyndon-Li @vfolk30 @maristeslk, I am also facing the same issue and getting this error related to a read-only file system. Is there any workaround for this problem?

Velero is running on an on-premises k8s cluster with MinIO storage; the storage class is NFS.

time="2023-09-13T10:46:34Z" level=info msg="Run command=restore, stdout=restoring <Snapshot 7b2a12ed of [/host_pods/96dddf92-3cef-4117-85e3-bd35ba4c4f74/volumes/kubernetes.io~nfs/pvc-f7131a1b-939f-453b-962b-d98bacb2cc15] at 2023-09-13 10:25:30.613599075 +0000 UTC by root@velero> to .\n, stderr=ignoring error for /.snapshot: UtimesNano: read-only file system\nFatal: There were 1 errors\n\n" PodVolumeRestore=velero/fsb-postgres-4-20230913114420-ldxs4 controller=PodVolumeRestore logSource="pkg/uploader/provider/restic.go:209" pod=dx-backstage-dev-0/backstage-pg-db-0 restore=velero/fsb-postgres-4-20230913114420 snapshotID=7b2a12ed volumePath="/host_pods/401ad349-6749-42f6-a3ae-ff7393f9ebb1/volumes/kubernetes.io~nfs/pvc-94ef504b-794c-4389-86ff-521f463a8139"

blackpiglet commented 1 year ago

Which provider supplies the NFS storage? The previous issue happened on NetApp, and NetApp can make the directory invisible to Velero via a setting.