vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Velero backup partial failure with error: /etcdserver: leader changed #8310

Closed. dbebarta closed this issue 2 days ago.

dbebarta commented 1 month ago

We have an EKS cluster in prod where our Velero backup is failing with the following error:

Errors:
  Velero:    message: /Error listing resources error: /etcdserver: leader changed
  Cluster:    <none>
  Namespaces:
    <namespace-1>:   resource: /trafficsplits message: /Error listing items error: /etcdserver: leader changed
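For completeness, the commands we use to pull the per-resource error details and the full backup log look like this (the backup name is a placeholder):

```sh
# Show the backup status, including the per-namespace resource errors reported above
velero backup describe <backup-name> --details

# Fetch the full backup log from object storage for closer inspection
velero backup logs <backup-name>
```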

We connected with the AWS EKS team. They said there was no etcd leader change, but they did identify a spike in ETCDRequestsReceived activity.

We have seen this issue in prod multiple times. Can someone please help us with this?

blackpiglet commented 1 month ago

This is still related to how the kube-apiserver handles Velero's requests. Could you check the EKS control plane nodes' resource allocation? We also need to know what your backup scenario is and the scale of the EKS cluster.

It would be better to collect the debug bundle with `velero debug` to help investigate this issue.
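For example, something like the following should work (the backup name is a placeholder; the command writes a support tarball into the current directory):

```sh
# Collect the Velero debug bundle scoped to the failed backup
velero debug --backup <backup-name>
```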

dbebarta commented 1 month ago

We have 4 clusters, each with 100+ namespaces, and we back up all the resources in each namespace. Each namespace has 3 PVs and around 21 pods.

The Velero backup is scheduled to run every 12 hours for each cluster.

We have 150+ worker nodes in each cluster.
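For reference, each cluster's 12-hour schedule looks roughly like this (a sketch only; the name, TTL, and namespace selection are placeholders, not our exact manifest):

```yaml
# Velero Schedule that runs a full-cluster backup every 12 hours
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cluster-backup        # placeholder name
  namespace: velero
spec:
  schedule: "0 */12 * * *"    # cron expression: every 12 hours
  template:
    includedNamespaces:
    - "*"                     # back up all namespaces
    snapshotVolumes: true     # include the persistent volumes
    ttl: 720h0m0s             # placeholder retention
```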

Also wanted to point out what AWS told us when we checked with them about the time frame when we saw the issue:

> Per our internal investigations, we analyzed ETCD health and identified that there was a spike in ETCDRequestsReceived activity (spiked to ~4.3K requests). However, overall control plane metrics show no evidence that an etcd leader change occurred. Additionally, we confirmed that no control plane scaling or recycling of control plane nodes occurred during the time frame.

Will share the `velero debug` tar file for the backup after verifying the data.

dbebarta commented 2 days ago

It's resolved. There was a leader election on the backend, which was being reported in the kube logs themselves as "leader changed".
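For anyone hitting something similar: one general way to observe control-plane component leader elections is to watch the coordination leases, along the lines below. This is a sketch of a generic check, not the exact investigation done here.

```sh
# List the leader-election leases held by control-plane components;
# the holderIdentity of a lease changes when a leader election occurs.
kubectl -n kube-system get lease

# Inspect one lease in detail (e.g. the controller manager's)
kubectl -n kube-system describe lease kube-controller-manager
```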