Velero CSI volume snapshot - Timeout issues aws plugin

MoZadro commented 2 months ago

What steps did you take and what happened:

Installed latest helm chart 6.0.0 with velero image 1.13.0 and AWS plugin 1.9.2 and CSI plugin 0.7.1.

We are using Velero with CSI, since our cluster is on Hetzner, we are using longhorn storageClass, so we configured velero to work with CSI plugin. And we are using option With moving snapshot data to backup storage (Velero backs-up K8s resources along with snapshot data to backup storage. A volume snapshot will be created for the PVs and the data from snapshot is moved to backup storage using Kopia.)

What did you expect to happen:

Our cluster is on Hetzner, we have cloud and bare metal worker and master nodes, data center location Falkenstein Germany, our bucket is on Cloudflare R2. We expected that backups will be in Completed status, not Failed or PartiallyFailed.

velero config:

backupsEnabled: true
snapshotsEnabled: true

configuration:
  backupStorageLocation:
   - prefix: /*
     bucket: xxx
     provider: aws
     default: true
     config:
       region: auto
       s3ForcePathStyle: true
       s3Url: https://xxx.r2.cloudflarestorage.com
       checksumAlgorithm: ""
       signatureVersion: 'v4'
  uploaderType: kopia
  fsBackupTimeout: 6h
  features: EnableCSI
  defaultBackupStorageLocation: aws
  defaultItemOperationTimeout: 6h
  defaultSnapshotMoveData: true

  volumeSnapshotLocation:
   - snapshotLocation: aws
     config:
       region: auto
       apiTimeout: 5m

deployNodeAgent: true

I increased timeouts on nodeAgents:

'--data-mover-prepare-timeout=2h' I also increased --csi-snapshot-timeout=90m0s on velero backup

The following information will help us better understand what's going on:

We have a different timeout errors on the DataUpload object:


message: timeout on preparing data upload
   phase: Failed

message: >-
     error to expose snapshot: error wait volume snapshot ready: volume snapshot
     is not ready until timeout, errors: []
phase: Failed

message: >-
     error to expose snapshot: error wait volume snapshot ready: error to get
     volumesnapshot xxx/xxx:
     Get
     "https://xxx/apis/snapshot.storage.k8s.io/v1/namespaces/xxx/volumesnapshots/velero-xxx":
     http2: client connection lost
phase: Failed

With these settings, and even less timeouts we had no problems with the gcp plugin and bucket on google. Here we are trying with aws plugin on Cloudflare R2 S3 storage.

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

:+1: for "I would like to see this bug fixed as soon as possible"
:-1: for "There are more important bugs to focus on right now"

allenxu404 commented 1 month ago

Can you help provide us with debug bundle by running command: velero debug --backup <backupname> --restore <restorename>?

reasonerjt commented 1 month ago

Get
     "https://xxx/apis/snapshot.storage.k8s.io/v1/namespaces/xxx/volumesnapshots/velero-xxx":
     http2: client connection lost

This error seems to have nothing to do with the object storage. It looks like a connection issue from velero pod to k8s api-server.

MoZadro commented 1 month ago

Also to mention that we get error even if I configured checksumAlgorithm: ""

failed to put object:
    backupstore/volumes/35/9f/pvc-xxxx/blocks/ce/b7/ceb79853043161b70c0e9e136e8a09e2384e68433b377eba3b3d7c4832acd424.blk
    response: {\n\n} error: AWS Error:  SignatureDoesNotMatch The request
    signature we calculated does not match the signature you provided. Check
    your secret access key and signing method.  <nil>\n403 \n"]

MoZadro commented 1 month ago

Hello @allenxu404 here is bundle from recent PartiallyFailed backup

bundle-2024-04-30-10-32-54.tar.gz

alimoezzi commented 1 month ago

I'm getting this error:

level=error msg="Error uploading log file" backup=full bucket=k3s error="rpc error: code = Unknown
 desc = error putting object backups/full/full-logs.gz: operation error S3: PutObject, https response error StatusCode: 501, RequestID: , HostID: , ap
i error NotImplemented: STREAMING-UNSIGNED-PAYLOAD-TRAILER not implemented" error.file="/go/src/velero-plugin-for-aws/velero-plugin-for-aws/object_sto
re.go:266" error.function="main.(*ObjectStore).PutObject" logSource="pkg/persistence/object_store.go:252" prefix=

I'm using cloudflare R2 but the s3 compatibility doesn't work with velero.

sseago commented 1 month ago

@alimoezzi I found the following issue comment from someone who had the same problem with cloudflare s3 on a different project -- maybe this would resolve it for you too?

https://github.com/hashicorp/terraform/issues/33847#issuecomment-1854605813

"I was able to use a Cloudflare R2 bucket as a s3 backend with terraform 1.6.6 today.

In order to solve the NotImplemented: STREAMING-UNSIGNED-PAYLOAD-TRAILER error I needed to add skip_s3_checksum = true:"

terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0"
    }
  }

  backend "s3" {
    bucket = "terraform-state"
    key    = "project_name/terraform.tfstate"
    endpoints = { s3 = "https://xxxxx.r2.cloudflarestorage.com" }
    region = "us-east-1"

    access_key = "xxxx"
    secret_key = "xxxxx"
    skip_credentials_validation = true
    skip_region_validation = true
    skip_requesting_account_id  = true
    skip_metadata_api_check     = true
    skip_s3_checksum = true
  }
}

alimoezzi commented 1 month ago

@sseago It has no equivalent in velero, I also tried checksumAlgorithm: ""

MoZadro commented 1 month ago

Hello, I tried with older version of Velero Chart 5.2.2, app version 1.12.3 and plugins velero/velero-plugin-for-aws:v1.8.2 and velero/velero-plugin-for-csi:v0.6.3. Status is again PartiallyFailed:

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:              true
Data Mover:                             velero

TTL:  168h0m0s

CSISnapshotTimeout:    1h0m0s
ItemOperationTimeout:  4h0m0s

Backup Item Operations: 133 of 146 completed successfully, 13 failed (specify --details for more information)

bundle-2024-05-08-08-31-46.tar.gz

Velero config:

backupsEnabled: true
snapshotsEnabled: true
priorityClassName: "system-node-critical"

configuration:
  backupStorageLocation:
   - prefix: /*
     bucket: xxx
     provider: aws
     default: true
     config:
       region: auto
       s3ForcePathStyle: true                                                  
       s3Url: https://xxx.r2.cloudflarestorage.com
  uploaderType: kopia
  features: EnableCSI
  defaultBackupStorageLocation: aws
  defaultSnapshotMoveData: true

  volumeSnapshotLocation:
   - snapshotLocation: aws
     provider: aws
     config:
       region: auto
       apiTimeout: 120m

deployNodeAgent: true

Which means there are 13 DataUpload errors, all are with message:

  message: timeout on preparing data upload
  phase: Failed

blackpiglet commented 1 month ago

What is the size of the failed DataUpload's related PVC, for example, sports-aio-uof-adapter-feed-3-pvc? This one failed with timeout, and the timeout happened after 30 mins. Could you please check whether this can be resolved by enlarging the timeout setting? backup.Spec.CSISnapshotTimeout

MoZadro commented 1 month ago

Ok, I increased timeouts:

CSISnapshotTimeout:    1h30m0s
ItemOperationTimeout:  5h0m0s

also configured on node-agents:

- '--data-mover-prepare-timeout=2h'

Latest backup PartiallyFailed

Backup Item Operations: 133 of 146 completed successfully, 13 failed (specify --details for more information)

sourcePVC: xxx-xxx-pvc 100Gi

message: >-
    error to expose snapshot: error wait volume snapshot ready: volume snapshot
    is not ready until timeout, errors: []
  phase: Failed

sourcePVC: xxx-xxx-pvc 60Gi

message: >-
    error to expose snapshot: error wait volume snapshot ready: volume snapshot
    is not ready until timeout, errors: []
  phase: Failed

Need to mention that with same configuration it is working on Google Bucket, but issue with google is that is generating high Egress costs so we changed bucket to Cloudflare R2, and since it doesn't have native plugin like gcp, we are using aws plugin with CF R2 bucket.

MoZadro commented 1 month ago

sourcePVCs which are bigger in size 50Gi+ are Failed:

message: >-
    error to expose snapshot: error wait volume snapshot ready: volume snapshot
    is not ready until timeout, errors: []

node-agent pod logs:

time="2024-05-08T11:20:17Z" level=info msg="Accepting data upload zzzz-44x2t" logSource="pkg/controller/data_upload_controller.go:690"
2024-05-08T13:20:17.884408910+02:00 time="2024-05-08T11:20:17Z" level=info msg="This datauplod has been accepted by s-k8sxxx-xxx-xxx" Dataupload=zzzz-44x2t logSource="pkg/controller/data_upload_controller.go:715"
2024-05-08T13:20:17.884429701+02:00 time="2024-05-08T11:20:17Z" level=info msg="Data upload is accepted" controller=dataupload dataupload=velero/zzzz-44x2t logSource="pkg/controller/data_upload_controller.go:173"
2024-05-08T13:20:17.884476612+02:00 time="2024-05-08T11:20:17Z" level=info msg="Exposing CSI snapshot" logSource="pkg/exposer/csi_snapshot.go:95" owner=zzzz-44x2t
2024-05-08T15:00:17.941498993+02:00 1.715173217941236e+09   ERROR   Reconciler error    {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "dataUpload": {"name":"zzzz-44x2t","namespace":"velero"}, "namespace": "velero", "name": "zzzz-44x2t", "reconcileID": "71ac87a5-fee4-483d-acbb-6240565e1c4c", "error": "error wait volume snapshot ready: volume snapshot is not ready until timeout, errors: []", "errorVerbose": "volume snapshot is not ready until timeout, errors: []\ngithub.com/vmware-tanzu/velero/pkg/util/csi.WaitVolumeSnapshotReady\n\t/go/src/github.com/vmware-tanzu/velero/pkg/util/csi/volume_snapshot.go:78\ngithub.com/vmware-tanzu/velero/pkg/exposer.(*csiSnapshotExposer).Expose\n\t/go/src/github.com/vmware-tanzu/velero/pkg/exposer/csi_snapshot.go:97\ngithub.com/vmware-tanzu/velero/pkg/controller.(*DataUploadReconciler).Reconcile\n\t/go/src/github.com/vmware-tanzu/velero/pkg/controller/data_upload_controller.go:188\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg
time="2024-05-08T13:00:17Z" level=info msg="Reconcile zzzz-44x2t" controller=dataupload dataupload=velero/zzzz-44x2t logSource="pkg/controller/data_upload_controller.go:113"

DataUpload object is first in Accepted state, it stays in that state until it change state to Failed with message:

error to expose snapshot: error wait volume snapshot ready: volume snapshot
    is not ready until timeout, errors: []

Of course with increased timeouts.

blackpiglet commented 1 month ago

The exposing process doesn't involve object storage so the timeout should be the same for the GCP bucket and the CloudFlare R2 environment.

error wait volume snapshot ready: volume snapshot
    is not ready until timeout

Please check the Longhorn CSI driver log and the CSI external-snapshotter log to find more information.

MoZadro commented 1 month ago

The DataUpload object stalls on larger PVCs, for example, 100GB, meaning the InProgress phase takes longer until all bytes are transferred.

Other DataUpload objects remain in the Accepted or Prepared status, waiting in line, while this process is ongoing.

As you probably know Velero created intermediate objects (i.e., pods, PVCs, PVs) in Velero namespace or the cluster scope, they are to help data movers to move data. And they will be removed after the backup completes. But when a larger PVC DataUpload is completed, and Phase is changed from InProgress to Completed, I don't see new pods being created to continue the DataUpload process in the Velero namespace.

From what I can see, it gets stuck in the ContainerCreating status, and new pods are not created. So all those Accepted or Prepared DataUpload objects are after some time Failed since no new pods are created to do the job correctly.

Backup is in Status WaitingForPluginOperations while waiting

Backup Item Operations: 164 of 186 completed successfully, 0 failed (specify --details for more information)

After restarting the node-agent pods, 2 or 3 DataUpload objects are again in the InProgress status, while the rest are marked as Canceled. And of course status of backup is PartiallyFailed.

I'm not sure if data upload to Cloudflare R2 buckets is slower than data upload to buckets on Google,

blackpiglet commented 1 month ago

I see. Please check the node-agent concurrency document. Velero supports to run multiple DataUploads in a node-agent at the same time. https://velero.io/docs/v1.13/node-agent-concurrency/

MoZadro commented 1 month ago

Yes, I found it, will try it.

MoZadro commented 1 month ago

Hello @blackpiglet I created configMap, example:

{
    "loadConcurrency": {
        "globalConfig": 4,
        "perNodeConfig": [
            {
                "nodeSelector": {
                    "matchLabels": {
                        "kubernetes.io/hostname": "s-k8sxxx-xxx-c01n03"
                    }
                },
                "number": 4
            },
            {
                "nodeSelector": {
                    "matchLabels": {
                        "kubernetes.io/hostname": "s-k8sxxx-xxx-c01n05"
                    }
                },
                "number": 4
            }
            }
        ]
    }
}

So just to confirm, we have multiple nodes on cluster, example for node in configMap example above, labels on node:

labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: s-k8sxxx-xxx-c01n03
    topology.kubernetes.io/region: xxx
    topology.kubernetes.io/zone: xxx-xxx

I need to mount created configMap on node-agent daemonset like so:

volumeMounts:
            - mountPath: /credentials
              name: cloud-credentials
            - mountPath: /host_pods
              mountPropagation: HostToContainer
              name: host-pods
            - mountPath: /scratch
              name: scratch
            - mountPath: /etc/velero
              name: node-agent-config-volume

volumes:
        - name: cloud-credentials
          secret:
            defaultMode: 420
            secretName: velero
        - hostPath:
            path: /var/lib/kubelet/pods
            type: ''
          name: host-pods
        - emptyDir: {}
          name: scratch
        - name: node-agent-config-volume
          configMap:
            name: node-agent-configs
            items:
              - key: node-agent-config.json
                path: node-agent-config.json

Can you confirm that this config is ok especially mountPath: /etc/velero part ?

blackpiglet commented 1 month ago

The ConfigMap content looks good. Please notice only the v1.13.x and the main Velero support the node-agent concurrence setting. Another thing is the document has a defect. Please create the ConfigMap with the name node-agent-config. The document mistook it as node-agent-configs. https://github.com/vmware-tanzu/velero/blob/6499444106d0d7891131e12e7df5d3065aa9ca74/pkg/nodeagent/node_agent.go#L38

I will create a PR to address that: #7790.

MoZadro commented 1 month ago

Hello @blackpiglet I created everything, used newest velero helm chart version 6.0.0 app version 1.13.0. Used plugins velero/velero-plugin-for-aws:v1.9.2 and velero/velero-plugin-for-csi:v0.7.1.

I also created node-agent-config configMap, I noticed the mistake, so I referenced configmap correctly.

{
    "loadConcurrency": {
        "globalConfig": 9,
        "perNodeConfig": [
            {
                "nodeSelector": {
                    "matchLabels": {
                        "kubernetes.io/hostname": "s-k8xxx-xxx-c01n03"
                    }
                },
                "number": 8
            },
            {
                "nodeSelector": {
                    "matchLabels": {
                        "kubernetes.io/hostname": "s-k8xxx-xxx-c01n010"
                    }
                },
                "number": 8
            }
        ]
    }
}

This time I excluded namespaces from velero backup with higher pvc's. But again some dataUpload objects are Failed

xxxxx PartiallyFailed 4 0 2024-05-13 11:41:13 +0200 CEST 29d aws <none>

status:
  message: timeout on preparing data upload
  phase: Failed

bundle-2024-05-13-14-26-43.tar.gz

draghuram commented 1 month ago

@MoZadro, If you are seeing the problem only for large volume, it could be because the temporary PVC created from snapshot (to be used as backup source) is not ready in time. This in turn could be due to Longhorn creating multiple replicas and copying data over. I opened https://github.com/longhorn/longhorn/issues/7794 to request a more efficient process in creating PVC from a snapshot. In the meanwhile, #7700 will help as and when it is resolved.

MoZadro commented 1 month ago

Hello, as I mentioned previously i excluded namespaces from velero backup with higher pvc's. But again some dataUpload objects are Failed, so not only on large volumes.

vmware-tanzu / velero

Velero CSI volume snapshot - Timeout issues aws plugin #7742