vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Velero CSI volume snapshot using Kopia using excessive egress (downloading) cost #7660

Open MoZadro opened 3 months ago

MoZadro commented 3 months ago

What steps did you take and what happened:

Installed the latest Helm chart (6.0.0) with the Velero image 1.13.0, the GCP plugin 1.9.0, and the CSI plugin 0.7.0.

We are using Velero with CSI. Since our cluster is on Hetzner, we use the Longhorn storageClass, so we configured Velero to work with the CSI plugin. We use the option to move snapshot data to backup storage (Velero backs up K8s resources along with snapshot data to backup storage; a volume snapshot is created for the PVs and the data from the snapshot is moved to backup storage using Kopia).
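For reference, the data-movement setup described above can be sketched as a Velero Schedule with `snapshotMoveData` enabled. The name, namespace, and cron expression below are illustrative assumptions, not taken from the reporter's cluster:

```yaml
# Sketch of a daily Velero Schedule using CSI snapshots with data movement.
# A CSI snapshot is taken first, then its data is uploaded to the backup
# storage location by the Kopia uploader. Names and schedule are illustrative.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup            # hypothetical name
  namespace: velero
spec:
  schedule: "0 2 * * *"         # daily at 02:00
  template:
    includedNamespaces:
      - "*"
    snapshotMoveData: true      # move CSI snapshot data to object storage via Kopia
    storageLocation: default
```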

What did you expect to happen: Our cluster is on Hetzner; we have cloud and bare-metal worker and master nodes in the Falkenstein, Germany data center, and our bucket is in GCP region europe-west1. From the documentation:

To my knowledge, Kopia uses egress in two ways. The first is repository compaction, where Kopia rewrites blobs when cleaning up the repository after you have deleted files; this is done during the daily full maintenance. The second is when Kopia runs snapshot verify, where Kopia downloads the metadata; this is also run daily by Kopia during full maintenance.

Anything else you would like to add:

We saw unreasonably large data transfer from/to the GCP object store, mostly during backup.

"Download Worldwide Destinations" increased when we started using Kopia. Our cluster is not big, we are only testing it, and it still generates this cost.

image

Download Worldwide Destinations costs are almost 100€ a day.

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

blackpiglet commented 3 months ago

The behavior is expected. In Velero's CSI snapshot data mover, during backup, first a CSI snapshot is taken, then Velero uses the CSI snapshot to create a new volume mounted by a temporary pod, and finally Velero uses Kopia to upload the mounted volume data into the backup repository. This should be the reason you saw the ingress traffic. The mechanism is different from standalone Kopia.

I suggest choosing a local bucket to reduce the cost.
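The bucket a backup goes to is selected through the BackupStorageLocation. A minimal sketch, assuming a hypothetical GCS bucket created in a region close to the cluster (for GCS the region is fixed when the bucket is created, not in this CR):

```yaml
# Sketch of a BackupStorageLocation pointing at a nearby bucket.
# The bucket name is a hypothetical example.
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: default
  namespace: velero
spec:
  provider: gcp
  objectStorage:
    bucket: my-velero-backups   # hypothetical bucket, created in a nearby region
```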

Lyndon-Li commented 3 months ago

#7301 could help in this scenario; at present, compression is not enabled for the Kopia repository.

Lyndon-Li commented 3 months ago

@MoZadro Besides the above information, here are some more questions to help us troubleshoot the problem and find a solution:

  1. Do you see this much data downloaded every day, or on a particular day (e.g., after a backup finishes, or after a backup is deleted)?
  2. Please confirm that on the days when you saw the excessive/unreasonable data download, no restore happened.
  3. Did backups run during the days you saw the excessive/unreasonable data download? If so, were there dramatic data changes to the volumes included in the backups?
  4. What does the data being backed up look like? What is the data size? Are the files compressible?
  5. Why do you need to store the backup data cross-cloud?

MoZadro commented 3 months ago

Hello @Lyndon-Li

  1. Every day during backup; we are just doing the regular backup procedure, no manual interventions.
  2. We did not perform any restores; this is backup only, for which we have a daily schedule.
  3. It's basically regular K8s definitions (non-PV data) plus Longhorn PV data for stateful services. However, on setup no. 1 we have live-data objects of about 120 GB, and on setup no. 2 we have live-data objects of about 500 GB but soft-deleted objects of about 5 TB (not really sure what this means). The soft delete policy (for data recovery) was turned on; we have turned it off.
  4. Our setup is located on "non-cloud" infrastructure, on Hetzner, and we're trying to have an off-site backup into a 3rd-party bucket.

Buckets are created with following specifications:

Location type: Region
Default storage class: Standard
Access control: Fine-grained

The following screenshot shows the "Download Worldwide Destinations" cost. On Apr 8 we configured Velero CSI with Kopia (buckets on Google Cloud Platform); you can see the increased daily cost from then on.

image

Lyndon-Li commented 3 months ago

@MoZadro Could you double-confirm whether this download happened during the backup or outside of backup time? Backup should not download much data from the object store. Also, what do you mean by live-data objects and soft-deleted objects? Where do you see them? Do you think they are volume data in the PVs or K8s objects/resources?

MoZadro commented 3 months ago

We see bucket interaction only around the scheduled backup time. The main suspect currently is the verify phase of the backup; we suspect it might generate a high download rate (egress from GCP). The bucket has an option for live data and soft-deleted objects as a protection mechanism; we disabled that just this morning and will see how it behaves.

Lyndon-Li commented 3 months ago

@MoZadro Please help describe the PV data you are backing up, specifically: What is the data size? Does it change dramatically every day (especially during the days you see the excessive data download)? Are the files compressible?

Lyndon-Li commented 3 months ago

@MoZadro

We see only bucket interaction around the scheduled backup time, main suspect currently is verify phase for the backup

Velero doesn't call snapshot verify. We suspect it is caused by repository maintenance, since backup doesn't involve much data download. Can you confirm that the data downloading is not happening during backup?
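If repository maintenance is the suspect, its cadence is visible on the BackupRepository custom resource (e.g. via `kubectl -n velero get backuprepositories -o yaml`). A sketch with illustrative names:

```yaml
# Sketch of a BackupRepository CR; Velero generates one per backed-up
# namespace/location pair. Names and values here are illustrative.
apiVersion: velero.io/v1
kind: BackupRepository
metadata:
  name: myns-default-kopia-abcde   # generated name, hypothetical here
  namespace: velero
spec:
  backupStorageLocation: default
  repositoryType: kopia
  volumeNamespace: myns            # hypothetical namespace being backed up
  maintenanceFrequency: 1h         # how often Velero queues repo maintenance
```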

MoZadro commented 3 months ago

Hello @Lyndon-Li

You can see that the "Download Worldwide Destinations" cost has increased since we configured the Velero CSI plugin to work with Kopia for uploading data to object storage:

image

Within the cluster we have about 180 PVCs of various sizes. The largest are 100Gi, and there are about 25 of them; then we have 5 of 60Gi, and the rest are smaller.

All volumes:

Access Modes: RWO
VolumeMode: Filesystem

Observing Grafana while the backup runs, we notice the following from the node-agent pods:

image

But I wouldn't say it looks like a big download, because the graph shows:

Transmit: around 150-250 MB/s
Receive: around 4-9 MB/s

Observability from Google bucket:

image

Lyndon-Li commented 3 months ago

@MoZadro In the original problem, the data download (egress) caused the charge. However, from your observation above, the node-agent was transferring lots of data outside of backup time, which should cause an ingress charge. So do you think this is a new problem?

MoZadro commented 2 months ago

I'm not sure that the node-agent pods are transferring lots of data outside of backup time. We noticed the excessive egress cost when using Velero with Kopia; as soon as we deleted the buckets and switched to another provider, those costs were gone.

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open 60 days with no activity. Remove the stale label or comment, or this will be closed in 14 days. If a Velero team member has requested logs or more information, please provide the output of the shared commands.

MoZadro commented 2 weeks ago

We are not using Velero with Kopia because of the high egress cost :)

blackpiglet commented 2 weeks ago

unstale