mlevit / aws-auto-cleanup

Programmatically delete AWS resources based on an allowlist and time to live (TTL) settings
MIT License

Provisioned Kafka clusters are detected / deleted, but serverless clusters are not #113

Closed. atqhg23 closed this issue 2 years ago.

atqhg23 commented 2 years ago

Describe the bug
Serverless Kafka clusters are not detected / deleted; only provisioned clusters are detected / deleted.

To Reproduce
Steps to reproduce the behavior:

  1. Go to the MSK console
  2. Create a serverless Kafka cluster
  3. Create a provisioned Kafka cluster
  4. Run the cleanup in dry run or destroy mode
  5. Check the execution log to see the Kafka cluster that was detected / deleted

Expected behavior
Serverless Kafka clusters should be detected / deleted as well.


Additional context
I've been trying to troubleshoot the issue on my end, and it appears to be related to the Kafka paginator that's being used: list_clusters only detects provisioned clusters, whereas list_clusters_v2 detects both serverless and provisioned clusters.

I haven't been able to get this working on my end when changing the paginator, though.
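
For illustration, a minimal sketch of the paginator swap described above (standalone example code, not the project's kafka_cleanup.py; the client variable and field handling are assumptions based on the public MSK API):

import boto3

# Sketch only: list_clusters_v2 returns both provisioned and serverless MSK
# clusters, while list_clusters returns provisioned clusters only. Field names
# follow the public ListClustersV2 response shape.
kafka = boto3.client('kafka')

paginator = kafka.get_paginator('list_clusters_v2')
for page in paginator.paginate():
    for cluster in page.get('ClusterInfoList', []):
        print(cluster['ClusterType'], cluster['ClusterName'])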

mlevit commented 2 years ago

Thanks for the info. Can you clone and test the list-clusters-v2 branch? https://github.com/servian/aws-auto-cleanup/tree/list-clusters-v2

mlevit commented 2 years ago

Bump @atqhg23

atqhg23 commented 2 years ago

Sorry for the delay, I was out for a few. These are actually the changes I initially tested when trying to get the serverless Kafka clusters detected / deleted as well.

With these changes, neither provisioned nor serverless clusters are shown in the execution log; the execution log is empty.

I am seeing this error in the app lambda logs: [ERROR] 'list_clusters_v2' (kafka_cleanup.py, clusters(), line 51)

mlevit commented 2 years ago

Hmmm, interesting. Any other errors accompanying that one? When I first commented back, I hadn't added the right IAM permissions for list_clusters_v2; that was fixed shortly after. Make sure you pull the latest change from that branch.

My logs are:

[DEBUG] Kafka Cluster 'marat-test' was created 3 days ago (less than TTL setting) and has not been deleted. (kafka_cleanup.py, clusters(), line 81)
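
For context, a rough sketch of the TTL decision that log line reflects (illustrative only; the real kafka_cleanup.py logic and settings are not shown in this thread, and ttl_days is a made-up value):

from datetime import datetime, timedelta, timezone

# Illustrative TTL check: a cluster created more recently than the TTL setting
# is logged but not deleted.
ttl_days = 7  # hypothetical TTL setting
creation_time = datetime.now(timezone.utc) - timedelta(days=3)

age = datetime.now(timezone.utc) - creation_time
if age < timedelta(days=ttl_days):
    print('created less than TTL ago, not deleting')
else:
    print('older than TTL, eligible for deletion')
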
atqhg23 commented 2 years ago

Yeah I did replace the kafka:ListClusters permission with kafka:ListClustersV2. Here are all the logs that mention Kafka:

[DEBUG] Started cleanup of Kafka Clusters. (kafka_cleanup.py, clusters(), line 33)

[ERROR] Could not list all Kafka Clusters. (kafka_cleanup.py, clusters(), line 50)

[ERROR] 'list_clusters_v2' (kafka_cleanup.py, clusters(), line 51)

I could not get this working. Really not sure what's causing the issue. Will take another look at it tomorrow.

I attached the output from running aws kafka list-clusters and aws kafka list-clusters-v2: list-clusters.txt list-clusters-v2.txt

mlevit commented 2 years ago

I don't see why this would fail. The API call is correct, and the three required fields (ClusterName, ClusterArn, and CreationTime) are exactly the same between the two API calls.
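
A small illustrative comparison of the two calls (standalone sketch, assuming default AWS credentials; response field names per the public MSK API docs):

import boto3

kafka = boto3.client('kafka')

# Both operations return clusters under ClusterInfoList, and each entry carries
# ClusterName, ClusterArn and CreationTime, so the parsing side of the cleanup
# should not need to change when the paginator is swapped.
for op in ('list_clusters', 'list_clusters_v2'):
    result = kafka.get_paginator(op).paginate().build_full_result()
    for cluster in result.get('ClusterInfoList', []):
        print(op, cluster['ClusterName'], cluster['ClusterArn'], cluster['CreationTime'])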

atqhg23 commented 2 years ago

Looked more into this today and still could not get it working. Opened a support case with AWS as well and they're not seeing any issues on their end.

Got a recommendation to create another Lambda specifically to list the clusters with the code snippet below, and this one worked, which most likely points to the issue being somewhere in my setup.

I will just leave provisioned clusters enabled for now, and will dig into this later on. Appreciate the help with this. Should be good to resolve this issue and merge the PR. Thanks again.

Code snippet that did end up working, using the list_clusters_v2 paginator to list Kafka clusters:

import json
import boto3

client_kafka = boto3.client('kafka')


def lambda_handler(event, context):
    # Paginate through all MSK clusters (provisioned and serverless)
    # using the ListClustersV2 operation and print each page.
    paginator = client_kafka.get_paginator('list_clusters_v2')
    page_iterator = paginator.paginate()

    for page in page_iterator:
        print(page)

    return 0

mlevit commented 2 years ago

Thanks for coming back @atqhg23. I haven't seen the issue in my account, so I'm assuming there's something not right there, potentially SCPs? Keep me posted anyway.

atqhg23 commented 2 years ago

The weird thing is that it worked when using that code snippet to list the clusters with the V2 paginator, and I was using the same role. But yeah, I will take a look at the account-level settings to see if it's something there.

Will do, thanks

atqhg23 commented 2 years ago

Wanted to post an update for this. Was able to get this working after AWS Support found the issue. Our cleanup setup is configured differently and was outdated: it was using Boto3 version 1.16.25, when the version needs to be 1.20.18 or greater to support list_clusters_v2. Updating the Python dependencies bundle to the following got this working:

boto3==1.21.25
botocore==1.24.25
dynamodb_json==1.3
func-timeout==4.3.5
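
As a quick sanity check for this kind of failure (illustrative snippet, not part of the project), one can confirm whether the installed boto3/botocore even knows about the operation before suspecting IAM:

import boto3

# Older botocore releases do not include ListClustersV2 in the Kafka service
# model at all, so get_paginator('list_clusters_v2') fails regardless of the
# IAM permissions attached to the role.
print('boto3 version:', boto3.__version__)
kafka = boto3.client('kafka')
print('ListClustersV2 available:', 'ListClustersV2' in kafka.meta.service_model.operation_names)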

One thing to add is that access to ec2:DeleteVpcEndpoints is needed to be able to delete the serverless clusters. I created a PR to add this.
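
For reference, an illustrative IAM statement covering the two permissions mentioned in this thread (the shape and resource scoping are assumptions, not copied from the project's deployment templates):

{
    "Effect": "Allow",
    "Action": [
        "kafka:ListClustersV2",
        "ec2:DeleteVpcEndpoints"
    ],
    "Resource": "*"
}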

Thanks again for the help with this issue.