MoZadro opened this issue 2 months ago
Can you help us by providing a debug bundle? Run: `velero debug --backup <backupname> --restore <restorename>`
```
Get "https://xxx/apis/snapshot.storage.k8s.io/v1/namespaces/xxx/volumesnapshots/velero-xxx": http2: client connection lost
```
This error seems to have nothing to do with the object storage; it looks like a connection issue from the Velero pod to the k8s API server.
Also worth mentioning: we get the error even with `checksumAlgorithm: ""` configured.
```
failed to put object: backupstore/volumes/35/9f/pvc-xxxx/blocks/ce/b7/ceb79853043161b70c0e9e136e8a09e2384e68433b377eba3b3d7c4832acd424.blk
response: {} error: AWS Error: SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your secret access key and signing method. 403
```
Hello @allenxu404, here is the bundle from a recent PartiallyFailed backup.
I'm getting this error:
```
level=error msg="Error uploading log file" backup=full bucket=k3s error="rpc error: code = Unknown desc = error putting object backups/full/full-logs.gz: operation error S3: PutObject, https response error StatusCode: 501, RequestID: , HostID: , api error NotImplemented: STREAMING-UNSIGNED-PAYLOAD-TRAILER not implemented" error.file="/go/src/velero-plugin-for-aws/velero-plugin-for-aws/object_store.go:266" error.function="main.(*ObjectStore).PutObject" logSource="pkg/persistence/object_store.go:252" prefix=
```
I'm using Cloudflare R2, but its S3 compatibility doesn't work with Velero.
@alimoezzi I found the following issue comment from someone who had the same problem with cloudflare s3 on a different project -- maybe this would resolve it for you too?
https://github.com/hashicorp/terraform/issues/33847#issuecomment-1854605813
"I was able to use a Cloudflare R2 bucket as a s3 backend with terraform 1.6.6 today.
In order to solve the NotImplemented: STREAMING-UNSIGNED-PAYLOAD-TRAILER error I needed to add skip_s3_checksum = true:"
```hcl
terraform {
  required_providers {
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0"
    }
  }
  backend "s3" {
    bucket     = "terraform-state"
    key        = "project_name/terraform.tfstate"
    endpoints  = { s3 = "https://xxxxx.r2.cloudflarestorage.com" }
    region     = "us-east-1"
    access_key = "xxxx"
    secret_key = "xxxxx"
    skip_credentials_validation = true
    skip_region_validation      = true
    skip_requesting_account_id  = true
    skip_metadata_api_check     = true
    skip_s3_checksum            = true
  }
}
```
@sseago There is no equivalent setting in Velero; I also tried `checksumAlgorithm: ""`.
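For context, in the velero-plugin-for-aws the checksum behavior is controlled per BackupStorageLocation. A minimal sketch of where that setting goes (bucket, endpoint, and location name are placeholders, and this assumes a plugin version that recognizes the `checksumAlgorithm` key):

```yaml
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws                  # placeholder name
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-bucket        # placeholder
  config:
    region: auto
    s3ForcePathStyle: "true"
    s3Url: https://xxx.r2.cloudflarestorage.com
    checksumAlgorithm: ""    # empty string asks the plugin not to set a checksum algorithm
```

Whether the empty value actually suppresses the streaming-checksum trailer depends on the plugin and AWS SDK versions in use, which is exactly what this thread is probing.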
Hello, I tried an older version of the Velero chart, 5.2.2 (app version 1.12.3), with plugins velero/velero-plugin-for-aws:v1.8.2 and velero/velero-plugin-for-csi:v0.6.3. The status is again PartiallyFailed:
```
Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          true
Data Mover:                  velero
TTL:                         168h0m0s
CSISnapshotTimeout:          1h0m0s
ItemOperationTimeout:        4h0m0s
Backup Item Operations:      133 of 146 completed successfully, 13 failed (specify --details for more information)
```
bundle-2024-05-08-08-31-46.tar.gz
Velero config:
```yaml
backupsEnabled: true
snapshotsEnabled: true
priorityClassName: "system-node-critical"
configuration:
  backupStorageLocation:
    - prefix: /*
      bucket: xxx
      provider: aws
      default: true
      config:
        region: auto
        s3ForcePathStyle: true
        s3Url: https://xxx.r2.cloudflarestorage.com
  uploaderType: kopia
  features: EnableCSI
  defaultBackupStorageLocation: aws
  defaultSnapshotMoveData: true
  volumeSnapshotLocation:
    - snapshotLocation: aws
      provider: aws
      config:
        region: auto
        apiTimeout: 120m
deployNodeAgent: true
```
Which means there are 13 DataUpload errors, all with the message:

```
message: timeout on preparing data upload
phase: Failed
```
What is the size of the failed DataUpload's related PVC, for example, sports-aio-uof-adapter-feed-3-pvc?
This one failed with a timeout, and the timeout happened after 30 minutes. Could you please check whether this can be resolved by enlarging the timeout setting `backup.Spec.CSISnapshotTimeout`?
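For reference, that field lives on the Backup custom resource. A sketch with illustrative values (the backup name is made up; the values mirror the ones tried later in this thread):

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: full                    # illustrative
  namespace: velero
spec:
  csiSnapshotTimeout: 1h30m0s   # raised from Velero's 10m default
  itemOperationTimeout: 5h0m0s
```

The same values can be passed at creation time via the `velero backup create` flags `--csi-snapshot-timeout` and `--item-operation-timeout`.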
Ok, I increased the timeouts:

```
CSISnapshotTimeout: 1h30m0s
ItemOperationTimeout: 5h0m0s
```

and also configured on the node-agents:

```
--data-mover-prepare-timeout=2h
```
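For anyone following along, that flag is passed to the node-agent server process. A minimal sketch of the relevant DaemonSet container fragment (the image tag is illustrative):

```yaml
# Fragment of the velero node-agent DaemonSet pod spec
containers:
  - name: node-agent
    image: velero/velero:v1.13.0   # illustrative tag
    command:
      - /velero
    args:
      - node-agent
      - server
      - --data-mover-prepare-timeout=2h
```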
The latest backup is PartiallyFailed:
Backup Item Operations: 133 of 146 completed successfully, 13 failed (specify --details for more information)
```yaml
sourcePVC: xxx-xxx-pvc   # 100Gi
message: >-
  error to expose snapshot: error wait volume snapshot ready: volume snapshot
  is not ready until timeout, errors: []
phase: Failed

sourcePVC: xxx-xxx-pvc   # 60Gi
message: >-
  error to expose snapshot: error wait volume snapshot ready: volume snapshot
  is not ready until timeout, errors: []
phase: Failed
```
Worth mentioning that the same configuration works with a Google bucket, but Google generates high egress costs, so we switched the bucket to Cloudflare R2. Since R2 doesn't have a native plugin the way GCP does, we are using the AWS plugin with the Cloudflare R2 bucket.
sourcePVCs which are larger in size (50Gi+) are Failed:

```yaml
message: >-
  error to expose snapshot: error wait volume snapshot ready: volume snapshot
  is not ready until timeout, errors: []
```
node-agent pod logs:
```
time="2024-05-08T11:20:17Z" level=info msg="Accepting data upload zzzz-44x2t" logSource="pkg/controller/data_upload_controller.go:690"
2024-05-08T13:20:17.884408910+02:00 time="2024-05-08T11:20:17Z" level=info msg="This datauplod has been accepted by s-k8sxxx-xxx-xxx" Dataupload=zzzz-44x2t logSource="pkg/controller/data_upload_controller.go:715"
2024-05-08T13:20:17.884429701+02:00 time="2024-05-08T11:20:17Z" level=info msg="Data upload is accepted" controller=dataupload dataupload=velero/zzzz-44x2t logSource="pkg/controller/data_upload_controller.go:173"
2024-05-08T13:20:17.884476612+02:00 time="2024-05-08T11:20:17Z" level=info msg="Exposing CSI snapshot" logSource="pkg/exposer/csi_snapshot.go:95" owner=zzzz-44x2t
2024-05-08T15:00:17.941498993+02:00 1.715173217941236e+09 ERROR Reconciler error {"controller": "dataupload", "controllerGroup": "velero.io", "controllerKind": "DataUpload", "dataUpload": {"name":"zzzz-44x2t","namespace":"velero"}, "namespace": "velero", "name": "zzzz-44x2t", "reconcileID": "71ac87a5-fee4-483d-acbb-6240565e1c4c", "error": "error wait volume snapshot ready: volume snapshot is not ready until timeout, errors: []", "errorVerbose": "volume snapshot is not ready until timeout, errors: []\ngithub.com/vmware-tanzu/velero/pkg/util/csi.WaitVolumeSnapshotReady\n\t/go/src/github.com/vmware-tanzu/velero/pkg/util/csi/volume_snapshot.go:78\ngithub.com/vmware-tanzu/velero/pkg/exposer.(*csiSnapshotExposer).Expose\n\t/go/src/github.com/vmware-tanzu/velero/pkg/exposer/csi_snapshot.go:97\ngithub.com/vmware-tanzu/velero/pkg/controller.(*DataUploadReconciler).Reconcile\n\t/go/src/github.com/vmware-tanzu/velero/pkg/controller/data_upload_controller.go:188\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg/internal/controller/controller.go:121\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.2/pkg
time="2024-05-08T13:00:17Z" level=info msg="Reconcile zzzz-44x2t" controller=dataupload dataupload=velero/zzzz-44x2t logSource="pkg/controller/data_upload_controller.go:113"
```
The DataUpload object is first in the Accepted state; it stays there until it changes to Failed with the message:

```
error to expose snapshot: error wait volume snapshot ready: volume snapshot
is not ready until timeout, errors: []
```

This is with the increased timeouts, of course.
The exposing process doesn't involve object storage, so the timeout should be the same for the GCP bucket and the Cloudflare R2 environment.

```
error wait volume snapshot ready: volume snapshot is not ready until timeout
```

Please check the Longhorn CSI driver log and the CSI external-snapshotter log to find more information.
The DataUpload object stalls on larger PVCs, for example 100GB, meaning the InProgress phase takes longer until all bytes are transferred. Other DataUpload objects remain in the Accepted or Prepared status, waiting in line, while this is ongoing.
As you probably know, Velero creates intermediate objects (i.e., pods, PVCs, PVs) in the Velero namespace or at cluster scope; they help the data movers move data and are removed after the backup completes. But when a larger PVC's DataUpload completes and its phase changes from InProgress to Completed, I don't see new pods being created in the Velero namespace to continue the DataUpload process.
From what I can see, a pod gets stuck in the ContainerCreating status and new pods are not created, so after some time all those Accepted or Prepared DataUpload objects end up Failed, since no new pods are created to do the job.
The backup stays in the WaitingForPluginOperations status while waiting:

```
Backup Item Operations: 164 of 186 completed successfully, 0 failed (specify --details for more information)
```

After restarting the node-agent pods, 2 or 3 DataUpload objects are again InProgress, while the rest are marked Canceled. The backup status, of course, ends up PartiallyFailed.
I'm not sure whether data upload to Cloudflare R2 buckets is simply slower than data upload to buckets on Google.
I see. Please check the node-agent concurrency document; Velero supports running multiple DataUploads in a node-agent at the same time: https://velero.io/docs/v1.13/node-agent-concurrency/
Yes, I found it, will try it.
Hello @blackpiglet, I created the ConfigMap, for example:
```json
{
  "loadConcurrency": {
    "globalConfig": 4,
    "perNodeConfig": [
      {
        "nodeSelector": {
          "matchLabels": {
            "kubernetes.io/hostname": "s-k8sxxx-xxx-c01n03"
          }
        },
        "number": 4
      },
      {
        "nodeSelector": {
          "matchLabels": {
            "kubernetes.io/hostname": "s-k8sxxx-xxx-c01n05"
          }
        },
        "number": 4
      }
    ]
  }
}
```
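Note that the JSON as originally pasted had a `}` and `]` swapped near the end. A quick local sanity check before creating the ConfigMap catches slips like that; a minimal sketch (the hostname is this thread's placeholder, and the file name matches the key used later):

```shell
# Write the loadConcurrency config and verify it is well-formed JSON
# before putting it into the node-agent ConfigMap.
cat > node-agent-config.json <<'EOF'
{
  "loadConcurrency": {
    "globalConfig": 4,
    "perNodeConfig": [
      {
        "nodeSelector": {
          "matchLabels": { "kubernetes.io/hostname": "s-k8sxxx-xxx-c01n03" }
        },
        "number": 4
      }
    ]
  }
}
EOF
python3 -m json.tool node-agent-config.json > /dev/null && echo "valid JSON"
```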
So just to confirm: we have multiple nodes in the cluster. For the node used in the ConfigMap example above, the labels on the node are:
```yaml
labels:
  beta.kubernetes.io/arch: amd64
  beta.kubernetes.io/os: linux
  kubernetes.io/arch: amd64
  kubernetes.io/hostname: s-k8sxxx-xxx-c01n03
  topology.kubernetes.io/region: xxx
  topology.kubernetes.io/zone: xxx-xxx
```
I need to mount the created ConfigMap on the node-agent DaemonSet like so:
```yaml
volumeMounts:
  - mountPath: /credentials
    name: cloud-credentials
  - mountPath: /host_pods
    mountPropagation: HostToContainer
    name: host-pods
  - mountPath: /scratch
    name: scratch
  - mountPath: /etc/velero
    name: node-agent-config-volume
volumes:
  - name: cloud-credentials
    secret:
      defaultMode: 420
      secretName: velero
  - hostPath:
      path: /var/lib/kubelet/pods
      type: ''
    name: host-pods
  - emptyDir: {}
    name: scratch
  - name: node-agent-config-volume
    configMap:
      name: node-agent-configs
      items:
        - key: node-agent-config.json
          path: node-agent-config.json
```
Can you confirm that this config is OK, especially the `mountPath: /etc/velero` part?
The ConfigMap content looks good.
Please note that only v1.13.x and the main branch of Velero support the node-agent concurrency setting.
Another thing: the document has a defect. Please create the ConfigMap with the name node-agent-config; the document mistakenly calls it node-agent-configs.
https://github.com/vmware-tanzu/velero/blob/6499444106d0d7891131e12e7df5d3065aa9ca74/pkg/nodeagent/node_agent.go#L38
I will create a PR to address that: #7790.
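Putting the corrected name together, a sketch of a minimal ConfigMap manifest (the loadConcurrency value is illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-agent-config   # note: singular, not node-agent-configs
  namespace: velero
data:
  node-agent-config.json: |
    {
      "loadConcurrency": {
        "globalConfig": 4
      }
    }
```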
Hello @blackpiglet, I created everything using the newest Velero Helm chart version 6.0.0 (app version 1.13.0), with plugins velero/velero-plugin-for-aws:v1.9.2 and velero/velero-plugin-for-csi:v0.7.1.
I also created the node-agent-config ConfigMap; I noticed the mistake, so I referenced the ConfigMap correctly.
```json
{
  "loadConcurrency": {
    "globalConfig": 9,
    "perNodeConfig": [
      {
        "nodeSelector": {
          "matchLabels": {
            "kubernetes.io/hostname": "s-k8xxx-xxx-c01n03"
          }
        },
        "number": 8
      },
      {
        "nodeSelector": {
          "matchLabels": {
            "kubernetes.io/hostname": "s-k8xxx-xxx-c01n010"
          }
        },
        "number": 8
      }
    ]
  }
}
```
This time I excluded the namespaces with larger PVCs from the Velero backup, but again some DataUpload objects are Failed:

```
xxxxx   PartiallyFailed   4   0   2024-05-13 11:41:13 +0200 CEST   29d   aws   <none>

status:
  message: timeout on preparing data upload
  phase: Failed
```
@MoZadro, if you are seeing the problem only for large volumes, it could be because the temporary PVC created from the snapshot (to be used as the backup source) is not ready in time. This in turn could be due to Longhorn creating multiple replicas and copying data over. I opened https://github.com/longhorn/longhorn/issues/7794 to request a more efficient process for creating a PVC from a snapshot. In the meantime, #7700 will help as and when it is resolved.
Hello, as I mentioned previously, I excluded the namespaces with larger PVCs from the Velero backup, but again some DataUpload objects are Failed; so it is not only large volumes.
What steps did you take and what happened:
Installed the latest Helm chart 6.0.0 with Velero image 1.13.0, AWS plugin 1.9.2, and CSI plugin 0.7.1.
We are using Velero with CSI. Since our cluster is on Hetzner, we use the longhorn StorageClass, so we configured Velero to work with the CSI plugin, with the option of moving snapshot data to backup storage (Velero backs up K8s resources along with snapshot data to backup storage: a volume snapshot is created for the PVs, and the data from the snapshot is moved to backup storage using Kopia).
What did you expect to happen:
Our cluster is on Hetzner, with cloud and bare-metal worker and master nodes in the Falkenstein, Germany data center; our bucket is on Cloudflare R2. We expected backups to be in the Completed status, not Failed or PartiallyFailed.
velero config:
I increased the timeouts on the node agents:

```
--data-mover-prepare-timeout=2h
```

I also increased `--csi-snapshot-timeout=90m0s` on the Velero backup.
The following information will help us better understand what's going on:
We get different timeout errors on the DataUpload objects:
With these settings, and even shorter timeouts, we had no problems with the GCP plugin and a bucket on Google. Here we are trying the AWS plugin with Cloudflare R2 S3 storage.