pulumi / pulumi-eks

A Pulumi component for easily creating and managing an Amazon EKS Cluster
https://www.pulumi.com/registry/packages/eks/
Apache License 2.0

Integration Test Flake: DependencyViolation NodeSecurityGroup has a dependent object #1222

Closed. flostadler closed this issue 3 weeks ago.

flostadler commented 2 months ago

We started getting an elevated number of flakes related to the node SecurityGroup failing to be deleted, e.g. https://github.com/pulumi/pulumi-eks/issues/1219.

    aws:ec2:SecurityGroup (example-cluster-py-3-nodeSecurityGroup):
      error: deleting urn:pulumi:p-it-fv-az1210--cluster-py-37ea6979::example-cluster-py::eks:index:Cluster$aws:ec2/securityGroup:SecurityGroup::example-cluster-py-3-nodeSecurityGroup: 1 error occurred:
        * deleting Security Group (sg-0e8b6d9244b79c723): DependencyViolation: resource sg-0e8b6d9244b79c723 has a dependent object
        status code: 400, request id: 1b401e09-4dda-4402-8a4e-95c43e03dbec

There might be missing dependency links between resources (e.g. nodes depending on the node SecurityGroup).
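
For illustration, here is a minimal sketch (resource names and the VPC ID are placeholders, not the component's actual code) of how such a link would be expressed. Pulumi deletes dependents before their dependencies, so anything that depends on the node security group is torn down before the group itself:

    import * as aws from "@pulumi/aws";

    const nodeSecurityGroup = new aws.ec2.SecurityGroup("node-sg", {
        vpcId: "vpc-12345678", // placeholder
    });

    const nodeLaunchTemplate = new aws.ec2.LaunchTemplate("node-lt", {
        // Referencing the group's ID creates an implicit dependency...
        vpcSecurityGroupIds: [nodeSecurityGroup.id],
    }, {
        // ...and dependsOn is the explicit form for links Pulumi cannot infer.
        dependsOn: [nodeSecurityGroup],
    });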

rquitales commented 2 months ago

As a quick workaround, so we don't have to manually re-run failed jobs, perhaps we could just ignore the outcome of the teardown operation?

Linking an older issue that might be of relevance: https://github.com/pulumi/pulumi-eks/issues/194

t0yv0 commented 2 months ago

Poking around here trying to understand this space. It sounds like we cannot completely rule out end-user impact; this may not be just a test-level flake, correct?

It appears that the nodeSecurityGroup that fails to delete is this one. It's defined with revokeRulesOnDelete: true (possibly not relevant).

https://github.com/pulumi/pulumi-eks/blob/master/nodejs/eks/cluster.ts#L1906
https://github.com/pulumi/pulumi-eks/blob/master/nodejs/eks/securitygroup.ts#L92
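
For reference, a condensed sketch of such a declaration; the field values here are placeholders, not the component's actual arguments. revokeRulesOnDelete makes the provider revoke the group's ingress/egress rules before deleting it, which unblocks deletes stuck on rule-level cross-references between groups, but does not help when an ENI still references the group:

    import * as aws from "@pulumi/aws";

    // Sketch only: placeholder values, not the component's real configuration.
    const nodeSecurityGroup = new aws.ec2.SecurityGroup("example-nodeSecurityGroup", {
        vpcId: "vpc-12345678", // placeholder
        // Revoke all rules before deletion to break rule-level cycles.
        revokeRulesOnDelete: true,
        tags: { "kubernetes.io/cluster/example": "owned" },
    });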

There doesn't appear to be any obvious missing source-level dependency; all the objects using the nodeSecurityGroup seem to be deleted before we hit this problem. Unless the deletes are eventually consistent, it seems like this should work OK.

Suspiciously, only ClusterPy is failing; Cluster (Node.js) is not, nor are the Go and C# variants.

What's different about ClusterPy? One difference I spotted is access_entries and an additional IAM role:

https://github.com/pulumi/pulumi-eks/blob/master/examples/cluster-py/__main__.py#L57

The older issues point to ENIs (network interfaces) sporadically causing the problem, but the fact that it only surfaces for Python makes me suspect it's something else.

Perhaps enabling CloudTrail in the integration account and chasing down the exact dependent object would be worthwhile?
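
A sketch of what that chase could look like, assuming CloudTrail is enabled in the account; the helper name is made up, and the group ID is the one from the error log above:

    import { CloudTrailClient, LookupEventsCommand } from "@aws-sdk/client-cloudtrail";

    // Look up recent management events that touched the stuck security group
    // to identify what created the dependent object.
    async function traceSecurityGroup(securityGroupId: string): Promise<void> {
        const client = new CloudTrailClient({});
        const result = await client.send(new LookupEventsCommand({
            LookupAttributes: [{ AttributeKey: "ResourceName", AttributeValue: securityGroupId }],
            MaxResults: 50,
        }));
        for (const event of result.Events ?? []) {
            console.log(event.EventTime, event.EventName, event.Username);
        }
    }

    traceSecurityGroup("sg-0e8b6d9244b79c723").catch(console.error);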

rquitales commented 2 months ago

The test flake isn't limited to just the ClusterPy test. I've noticed other tests failing for the same reason in the past, sometimes during PR pre-submit CI runs.

Examples:
https://github.com/pulumi/pulumi-eks/actions/runs/9559840213/job/26351075166 (StorageClasses, Node.js)
https://github.com/pulumi/pulumi-eks/actions/runs/8948485838/job/24581728839 (Oidcclaim, Node.js)

The last time I took a quick look into this, I came across several relevant issues in the upstream AWS Terraform provider and in eksctl. Mainly, ENIs created by resources within the cluster (and thus outside of Pulumi's management) are not properly deleted when the cluster is terminated. When I poked around in the AWS console once after a failed test, it was indeed a dangling ENI that was the problem.
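
A diagnostic sketch of that check, so it doesn't require clicking through the console; the helper name is made up, and the group ID is the one from the error log above:

    import { EC2Client, DescribeNetworkInterfacesCommand } from "@aws-sdk/client-ec2";

    // After a failed destroy, list the ENIs that still reference the security
    // group named in the DependencyViolation error.
    async function findDanglingEnis(securityGroupId: string): Promise<void> {
        const client = new EC2Client({});
        const result = await client.send(new DescribeNetworkInterfacesCommand({
            Filters: [{ Name: "group-id", Values: [securityGroupId] }],
        }));
        for (const eni of result.NetworkInterfaces ?? []) {
            console.log(eni.NetworkInterfaceId, eni.Status, eni.Description);
        }
    }

    findDanglingEnis("sg-0e8b6d9244b79c723").catch(console.error);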

A potential solution might be that when we destroy an EKS cluster, we first delete any workloads on the cluster, to ensure that external resources (e.g. ENIs) can also be deleted by the on-cluster controllers.

Ref: https://github.com/eksctl-io/eksctl/issues/4214

flostadler commented 2 months ago

I don't think we'd need to delete all workloads. The culprit here seems to be the VPC CNI (i.e. the aws-node DaemonSet). It should be enough to ensure that's deleted before the NodeGroups.

Right now I don't think we can enforce that within the provider. What might work is leveraging the kubelet's graceful shutdown feature. It can be activated as described here: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#what-steps-can-be-taken-to-mitigate-this-issue. But this is very custom and involves using custom launch templates.
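
As a rough illustration of that approach, a sketch only: it assumes an EKS-optimized Amazon Linux AMI whose bootstrap reads /etc/kubernetes/kubelet/kubelet-config.json, and the resource name, timings, and script are adapted from the linked FAQ rather than a tested recipe:

    import * as aws from "@pulumi/aws";

    // A custom launch template whose user data enables the kubelet's graceful
    // shutdown before the node joins the cluster. Values are illustrative.
    const userData = `MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="==BOUNDARY=="

    --==BOUNDARY==
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    # Give systemd enough inhibitor delay for the kubelet to drain pods.
    echo -e "InhibitDelayMaxSec=45\\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    # Merge graceful-shutdown settings into the kubelet config read by the
    # EKS AMI bootstrap.
    cfg=/etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq '.shutdownGracePeriod="45s" | .shutdownGracePeriodCriticalPods="15s"' $cfg)" > $cfg
    --==BOUNDARY==--
    `;

    // EC2 expects launch template user data to be base64-encoded.
    const nodeLaunchTemplate = new aws.ec2.LaunchTemplate("graceful-shutdown-lt", {
        userData: Buffer.from(userData).toString("base64"),
    });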

Karpenter (#1114) is not affected by this because it has native cordon & drain capabilities.

flostadler commented 2 months ago

I also did some digging into whether those ENIs eventually get cleaned up, but they won't, because they're essentially dangling. They'll need to be cleaned up manually.

t0yv0 commented 2 months ago

Thank you all for the context!

I've opened https://github.com/pulumi/pulumi-eks/issues/1226 to track the end-user issue and collect upvotes. I understand from the discussion that deleting an eks.Cluster is not a frequent operation and we are just asking users to retry; that might explain why this issue has received less attention over time.

What would be the recommended way to work around the test flakiness in our CI?

t0yv0 commented 2 months ago

We could also just detect this particular failure and ignore it for the purposes of CI, relying on the account cleanup. Thoughts?
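
One hypothetical shape for that detection, sketched in TypeScript rather than the repo's actual Go test harness; the function name and matching pattern are illustrative:

    import { execSync } from "node:child_process";

    // Run the destroy step and treat only this known DependencyViolation flake
    // as a soft failure, relying on the account sweeper to clean up what's left.
    function destroyIgnoringKnownFlake(stackDir: string): void {
        try {
            execSync("pulumi destroy --yes", { cwd: stackDir, stdio: "pipe" });
        } catch (err: any) {
            const output = `${err.stdout ?? ""}${err.stderr ?? ""}`;
            if (/DependencyViolation[\s\S]*has a dependent object/.test(output)) {
                console.warn("ignoring known DependencyViolation teardown flake");
                return;
            }
            throw err; // any other destroy failure still fails the job
        }
    }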

t0yv0 commented 1 month ago

This interfered with release attempts today.

pulumi-bot commented 1 month ago

This issue has been addressed in PR #1277 and shipped in release v2.7.7.

corymhall commented 1 month ago

Just saw this again on a new test (OidcIam) https://github.com/pulumi/pulumi-eks/issues/1288

t0yv0 commented 1 month ago

I've not enrolled that one in the "ignore destroy errors" treatment; perhaps we can do that, or just enroll all of them.

flostadler commented 4 weeks ago

IMO we should enroll all of them. Just checking though: will the dangling ENIs get cleaned up? If not, we're accumulating costs (public IPs have a usage charge associated with them) and will eventually hit the account quota for ENIs.

t0yv0 commented 4 weeks ago

We have a sweeper job that cleans up test accounts; it needs to be strong enough to cover for inadequate test cleanup. I'd assume it handles this OK.

t0yv0 commented 4 weeks ago

I'll enroll all of them and send a PR.

pulumi-bot commented 2 weeks ago

This issue has been addressed in PR #1296 and shipped in release v2.7.9.