[SURE-8366] Rancher keeps sending delete request for an already deleted EKS cluster

kkaempf commented 3 months ago

SURE-8366

Issue description:

The customer reports that the eks-operator keeps sending DeleteCluster calls to the AWS API for clusters that have already been deleted from Rancher. Restarting Rancher does not help: the calls continue every 150 seconds.

Business impact:

This is not causing workload outages; it is more of an annoyance for the customer.

Troubleshooting steps:

We had a few calls to trace where the requests were coming from. The ekscc object still contained references to some clusters, but in some of the newer logs the clusters the customer was originally concerned about no longer appeared.

Actual behavior:

The cluster in question was removed from Rancher and EKS. However, Rancher continues to send requests to delete it.

Expected behavior:

When a cluster is removed from Rancher, it should be deleted, and no further delete requests for it should be sent to AWS afterwards.

Files, logs, traces:

(See JIRA)

Additional notes:

It's important to note that the customer has an unusual permissions setup on the AWS side, "for security reasons", which we couldn't get more explanation on. That's why we were seeing those errors in the AWS logs.

See SURE-8366 for the rest of the logs and the list of impacted clusters.

kkaempf commented 2 months ago

Waiting for customer feedback.

mjura commented 2 months ago

Still waiting for customer feedback. A solution was provided, so we could consider closing this.

mjura commented 1 month ago

I have asked the customer to provide the output of the following commands:

kubectl get clusters.management.cattle.io -A
kubectl get clusters.provisioning.cattle.io -A
kubectl get eksclusterconfigs.eks.cattle.io -A

kubectl get clusters.management.cattle.io -A -o yaml
kubectl get clusters.provisioning.cattle.io -A -o yaml
kubectl get eksclusterconfigs.eks.cattle.io -A -o yaml

kubectl logs -n cattle-system eks-operator-ID
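
To make this easier to hand over in one go, the commands above could be wrapped in a small helper that writes each output to its own file. This is only a sketch: the function name `collect_eks_diagnostics`, the `KUBECTL` override variable, and the output file names are made up for illustration, and the pod name still has to be supplied by hand (the `eks-operator-ID` placeholder above).

```shell
#!/usr/bin/env bash
# Sketch: gather the requested diagnostics into one directory.
# KUBECTL and the function name are illustrative assumptions, not part of any tool.

KUBECTL="${KUBECTL:-kubectl}"

collect_eks_diagnostics() {
  local outdir="$1"   # directory the output files are written to
  local pod="$2"      # full eks-operator pod name (the "eks-operator-ID" placeholder)
  local resource

  mkdir -p "$outdir" || return 1

  for resource in clusters.management.cattle.io \
                  clusters.provisioning.cattle.io \
                  eksclusterconfigs.eks.cattle.io; do
    # Summary listing first, then the full YAML for the same resource type.
    "$KUBECTL" get "$resource" -A > "$outdir/$resource.txt" 2>&1
    "$KUBECTL" get "$resource" -A -o yaml > "$outdir/$resource.yaml" 2>&1
  done

  # Operator logs from the cattle-system namespace.
  "$KUBECTL" logs -n cattle-system "$pod" > "$outdir/eks-operator.log" 2>&1
}

# Usage (not run automatically):
#   collect_eks_diagnostics ./sure-8366-diagnostics eks-operator-ID
```

Errors from kubectl are captured in the same files as the output, so a missing resource type or wrong pod name shows up in the collected directory rather than being lost.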

mjura commented 2 weeks ago

We have received confirmation that this can be closed.

rancher / eks-operator