vmware-tanzu / helm-charts

Contains Helm charts for Kubernetes related open source tools
https://vmware-tanzu.github.io/helm-charts/
Apache License 2.0

Velero Pod Replicas & alternative for emptydir #475

Open kkavin opened 1 year ago

kkavin commented 1 year ago

What steps did you take and what happened: The Velero pod was evicted because the disk on the GKE worker nodes was full.

We raised a support ticket with Google Cloud regarding the pod eviction due to the storage issue in the worker node. They reported that:

"Our analysis concluded that the pods are using emptyDir for scratch space. As per the product behavior, this uses storage space from the node's disk. It creates emptyDir volumes from the node's local disk, network storage, or memory-backed file system."

"Following up with the conclusion, we recommend using a Persistent Volume Claim (PVC) instead. This seems necessary because the “velero & restic” pods use a lot of storage. This results in the eviction of the pods."

Following their analysis, we have planned to add persistent storage for the velero and restic pods instead of emptyDir.

We need to know if we can use a GCS bucket for the velero and restic pods. By default, the Helm chart comes with 1 replica. Is it possible to run more than 1 replica? Will Velero work with more than 1 replica?

[Attached images: Velero-Error, velero-Issue]

Environment:

jenting commented 1 year ago

> We need to know if we can use a GCS bucket for the velero and restic pods.

Yes, Velero can work with a GCS bucket. See https://github.com/vmware-tanzu/velero-plugin-for-gcp#setup

> Will Velero work with more than 1 replica?

No. The Velero server does not work with more than 1 replica.

> we have planned to add persistent storage for the velero and restic pods instead of emptyDir.

I have not tried it before, but I think it is doable.

navilg commented 1 year ago

@jenting What data fills the emptyDir path? Doesn't Velero do housekeeping on this path? I think the data under this path is temporary.

kkavin commented 11 months ago

@jenting Can you please let us know what data is stored in /scratch or the emptyDir? We often see the velero pod evicted due to disk pressure, or with an error saying the node was low on ephemeral storage.

jenting commented 11 months ago

@qiuming-best could you help with this issue?

qiuming-best commented 11 months ago

@kkavin The Velero server cannot run with more than 1 replica; it currently has concurrency issues.

The scratch dir is where Restic puts its cache, and the emptyDir is where Velero puts its third-party plugins.

The Restic cache and the third-party plugins are all temporary files, so we didn't put them in a persistent volume.

But for your problem, you could put them in a persistent volume, and it would work.
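For readers who want to see what such a swap amounts to at the pod-spec level, here is a minimal sketch in Python that rewrites a pod spec dictionary. It is an illustration only, not the chart's implementation: the volume names `plugins` and `scratch`, the mount paths, the claim name `velero-scratch`, and the helper `use_pvc_for_volume` are all assumptions made for the example, and the PVC itself would have to be created separately. Only the `emptyDir` and `persistentVolumeClaim.claimName` fields are standard Kubernetes volume fields.

```python
import copy
import json


def use_pvc_for_volume(pod_spec: dict, volume_name: str, claim_name: str) -> dict:
    """Return a copy of pod_spec where the named emptyDir volume is backed by a PVC.

    Hypothetical helper for illustration; the original spec is left untouched.
    """
    patched = copy.deepcopy(pod_spec)
    for volume in patched.get("volumes", []):
        if volume.get("name") == volume_name and "emptyDir" in volume:
            # Drop the node-local emptyDir and reference an existing claim instead.
            del volume["emptyDir"]
            volume["persistentVolumeClaim"] = {"claimName": claim_name}
            return patched
    raise KeyError(f"no emptyDir volume named {volume_name!r}")


# Assumed shape of the Velero pod spec: two emptyDir volumes, one for
# plugins and one for the Restic cache. Names and paths are illustrative.
pod_spec = {
    "containers": [
        {
            "name": "velero",
            "volumeMounts": [
                {"name": "plugins", "mountPath": "/plugins"},
                {"name": "scratch", "mountPath": "/scratch"},
            ],
        }
    ],
    "volumes": [
        {"name": "plugins", "emptyDir": {}},
        {"name": "scratch", "emptyDir": {}},
    ],
}

# "velero-scratch" is a hypothetical claim name.
patched = use_pvc_for_volume(pod_spec, "scratch", "velero-scratch")
print(json.dumps(patched["volumes"], indent=2))
```

Because the container's `volumeMounts` refer to the volume by name, they do not need to change. Since the server runs as a single replica, a claim that only one pod mounts at a time should be enough in this sketch.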

DonghaopengZhu commented 1 week ago

Hi @qiuming-best and @jenting, I just came across the same issue. As you can see, filesystem usage on the node where velero runs spiked within a short time.

[image]

Then the pod was evicted by the kubelet:

"kind":"Pod","namespace":"velero","name":"velero-c4844d876-bvntd","uid":"1960a28a-15e6-44da-ab2b-65bf77616020","apiVersion":"v1","resourceVersion":"452456224"},"reason":"Evicted","message":"The node was low on resource: ephemeral-storage. Threshold quantity: 5119338572, available: 4544316Ki. Container velero was using 211020Ki, request is 0, has larger consumption of ephemeral-storage.

I am wondering why the ephemeral storage consumed by the emptyDir grew so rapidly in such a short period, since I'm sure that neither a restic backup (PV backup) nor an object backup was running. So when does velero or restic write data to the emptyDir?
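The numbers in the eviction message mix plain bytes with `Ki` quantities, which makes them hard to compare by eye. The sketch below converts them to bytes, using only the values quoted in the message. The helper `to_bytes` is hypothetical and handles only integer quantities with a few common suffixes; it is not a full parser for Kubernetes quantities.

```python
import re

# Multipliers for a few Kubernetes quantity suffixes (binary and decimal).
_SUFFIXES = {
    "": 1,
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def to_bytes(quantity: str) -> int:
    """Convert an integer quantity such as '4544316Ki' to bytes (illustrative helper)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti|k|M|G|T)?", quantity.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(number) * _SUFFIXES[suffix or ""]


# Values copied from the eviction message above.
threshold = to_bytes("5119338572")    # eviction threshold on the node
available = to_bytes("4544316Ki")     # ephemeral storage still available
velero_usage = to_bytes("211020Ki")   # usage of the velero container
velero_request = to_bytes("0")        # its ephemeral-storage request

print("available below threshold:", available < threshold)
print("shortfall in bytes:", threshold - available)
print("velero usage in MiB:", round(velero_usage / 1024**2, 1))
print("usage exceeds request:", velero_usage > velero_request)
```

Read this way, the node had dropped below its threshold and the velero container was using roughly 206 MiB with no ephemeral-storage request. The kubelet takes into account whether a pod's usage exceeds its requests when choosing what to evict, so a request of 0 makes the pod an early candidate. This suggests, but does not prove, that the spike itself may have come from elsewhere on the node.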