Closed: dbebarta closed this issue 2 days ago
This is still related to how the kube-apiserver handles Velero's requests. Could you check the EKS control plane's resource allocation? We also need to know what your backup scenario is and the scale of the EKS cluster.
It's better to collect the debug bundle with `velero debug` to help investigate this issue.
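For example (the backup name is a placeholder):

```sh
# Collect a debug bundle (Velero logs, backup/restore resources) for the failing backup.
# Replace <backup-name> with the name of the backup that failed.
velero debug --backup <backup-name>

# The command writes a tarball in the current directory that can be attached to this issue.
```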
We have 4 clusters, each with 100+ namespaces, and we back up all the resources in every namespace. Each namespace has 3 PVs and around 21 pods.
The Velero backup is scheduled to run every 12 hours for each cluster.
Each cluster has more than 150 worker nodes.
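For context, the schedule is roughly equivalent to something like the following; the schedule name and flags are illustrative, not our exact config:

```sh
# Per-cluster Velero schedule: back up all namespaces every 12 hours,
# including volume snapshots for the PVs.
velero schedule create all-namespaces-12h \
  --schedule="0 */12 * * *" \
  --include-namespaces '*' \
  --snapshot-volumes
```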
Also wanted to point out that when we checked with AWS at the time we saw the issue, they told us: "Per our internal investigations, we analyzed etcd health and identified a spike in ETCDRequestsReceived activity (to ~4.3K requests). However, overall control plane metrics show no evidence of an etcd leader change during the investigation. Additionally, we confirmed that no control plane scaling or recycling of control plane nodes occurred during the time frame."
We will share the `velero debug` tar file for the backup after verifying the data.
It's resolved; there was a leader election on the backend, which was being reported in the kube logs as a leader change.
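For anyone hitting the same thing, a rough way to check for leader changes; the lease name is the standard kube-controller-manager lease, and the CloudWatch log group follows the usual EKS pattern (adjust for your cluster, and note the logs only exist if control plane logging is enabled):

```sh
# Leader election state is tracked in coordination.k8s.io Leases;
# holderIdentity and leaseTransitions show the current leader and how often it has changed.
kubectl -n kube-system get lease kube-controller-manager -o yaml

# Leader-election events, if still within the event retention window.
kubectl -n kube-system get events --field-selector reason=LeaderElection

# On EKS with control plane logging enabled, the same messages land in CloudWatch.
aws logs filter-log-events \
  --log-group-name "/aws/eks/<cluster-name>/cluster" \
  --filter-pattern '"leader changed"'
```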
We have an EKS cluster in prod where our Velero backup is failing with an error.
We reached out to the AWS EKS team, and they said there wasn't any etcd leader change, but they identified a spike in ETCDRequestsReceived activity.
We have seen this issue in prod multiple times. Can someone please help us look into it?
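We can pull the exact error text with the standard Velero commands below and attach it here; the backup name is a placeholder:

```sh
# Show the failure reason and per-item errors for the failed backup.
velero backup describe <backup-name> --details

# Full backup logs, which include the underlying API errors.
velero backup logs <backup-name>
```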