ToolkitCleaner: CleanObjects operation times out after 300 seconds

Dzhuneyt commented 1 week ago

I tried using the ToolkitCleaner construct in our staging environment, which has been shared by 5 or so developers for the past 2+ years.

Obviously, that account has accumulated lots of junk in terms of S3 artifacts and ECR images in the Bootstrap stack.

I use the construct in the following way:

new ToolkitCleaner(this, 'ToolkitCleaner', {
    // Use the dryRun prop to only output the number of assets and total size that would be deleted but without actually deleting assets.
    dryRun: true,
    // Do not delete assets created in the last 30 days even if unused
    retainAssetsNewerThan: Duration.days(30),
});

After invoking the State Machine manually, I get the following error:

RequestId: af66aa63-3b46-4385-ac8c-1077d3b9d14c Error: Task timed out after 300.00 seconds

jogold commented 1 week ago

Hi @Dzhuneyt,

Does it ultimately succeed if you rerun the state machine multiple times?

Dzhuneyt commented 1 week ago

I've run it with the Dry Run flag just to see how much space would be reclaimed, so I doubt a re-run will fix it in this case because it's just repeating the same operation.

I will try now without the dry run flag, which I suspect might get it to work after a few retries.

jogold commented 1 week ago

I've run it with the Dry Run flag just to see how much space would be reclaimed, so I doubt a re-run will fix it in this case because it's just repeating the same operation.

Of course.

Maybe I can expose an option to set the Lambda timeout here.

Dzhuneyt commented 1 week ago

I tried re-running a couple of times without the dryRun flag (so it actually deletes resources until it hits the 300-second timeout). But it still seems to time out in that scenario.

Is it possible that the Lambda for cleaning actually does some "data collecting" first, before going forward with deletions of these assets? If that's the case, it's possible that it times out during this first phase, so it never actually deletes anything, not making the job of subsequent re-runs faster.

Dzhuneyt commented 1 week ago

Either way, letting consumers of the ToolkitCleaner construct configure the timeout, or raising the default to 15 minutes, would both be good solutions.
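A minimal sketch of how a configurable timeout could be validated, assuming a hypothetical `cleanObjectsTimeoutSeconds` option (this name is made up for illustration and is not an existing ToolkitCleaner prop). The 300-second default matches the timeout in the error above, and 900 seconds (15 minutes) is the maximum timeout Lambda allows.

```typescript
// Hypothetical option shape; the real construct may expose this differently.
interface CleanerTimeoutOptions {
  cleanObjectsTimeoutSeconds?: number;
}

const DEFAULT_TIMEOUT_SECONDS = 300; // the timeout seen in the error above
const LAMBDA_MAX_TIMEOUT_SECONDS = 900; // 15 minutes, the Lambda maximum

// Returns the timeout to use, falling back to the default when none is given
// and rejecting values that Lambda would not accept.
function resolveTimeoutSeconds(options: CleanerTimeoutOptions = {}): number {
  const requested = options.cleanObjectsTimeoutSeconds ?? DEFAULT_TIMEOUT_SECONDS;
  if (
    !Number.isInteger(requested) ||
    requested < 1 ||
    requested > LAMBDA_MAX_TIMEOUT_SECONDS
  ) {
    throw new RangeError(
      `timeout must be an integer between 1 and ${LAMBDA_MAX_TIMEOUT_SECONDS} seconds, got ${requested}`,
    );
  }
  return requested;
}
```

With this shape, `resolveTimeoutSeconds()` keeps today's behavior and `resolveTimeoutSeconds({ cleanObjectsTimeoutSeconds: 900 })` gives the 15-minute variant.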

jogold commented 1 week ago

Is it possible that the Lambda for cleaning actually does some "data collecting" first, before going forward with deletions of these assets? If that's the case, it's possible that it times out during this first phase, so it never actually deletes anything, not making the job of subsequent re-runs faster.

It lists up to 1,000 object versions, deletes the ones that satisfy the deletion conditions, and then lists again, looping until there are no more object versions to list; see https://github.com/jogold/cloudstructs/blob/master/src/toolkit-cleaner/clean-objects.lambda.ts#L16. I expect that after 5 minutes some of your objects should be deleted.
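The loop described above can be sketched roughly as follows. This is not the code from the linked file: `VersionStore` and `InMemoryStore` are hypothetical stand-ins for the S3 calls, and `shouldDelete` is a placeholder for the real deletion conditions. It only illustrates why a timed-out run still makes progress: deletions happen page by page, not after one big collection phase.

```typescript
// Simplified, hypothetical shape of one stored object version.
interface ObjectVersion {
  key: string;
  versionId: string;
  lastModified: Date;
}

interface ListPage {
  versions: ObjectVersion[];
  nextMarker?: string; // undefined when there is nothing left to list
}

// Hypothetical stand-in for the S3 list and delete calls.
interface VersionStore {
  listVersions(marker: string | undefined, maxKeys: number): Promise<ListPage>;
  deleteVersions(versions: ObjectVersion[]): Promise<void>;
}

// List one page, delete what qualifies, then list again until no pages remain.
async function cleanObjects(
  store: VersionStore,
  shouldDelete: (v: ObjectVersion) => boolean,
  pageSize = 1000,
): Promise<number> {
  let deleted = 0;
  let marker: string | undefined;
  do {
    const page = await store.listVersions(marker, pageSize);
    const doomed = page.versions.filter(shouldDelete);
    if (doomed.length > 0) {
      await store.deleteVersions(doomed);
      deleted += doomed.length;
    }
    marker = page.nextMarker;
  } while (marker !== undefined);
  return deleted;
}

const idOf = (v: ObjectVersion): string => `${v.key}#${v.versionId}`;

// In-memory test double so the loop can be exercised without AWS.
class InMemoryStore implements VersionStore {
  private items: ObjectVersion[];

  constructor(items: ObjectVersion[]) {
    this.items = [...items].sort((a, b) => (idOf(a) < idOf(b) ? -1 : 1));
  }

  get size(): number {
    return this.items.length;
  }

  async listVersions(marker: string | undefined, maxKeys: number): Promise<ListPage> {
    // Marker-based paging stays correct even when earlier items were deleted.
    const remaining =
      marker === undefined ? this.items : this.items.filter((v) => idOf(v) > marker);
    const versions = remaining.slice(0, maxKeys);
    const nextMarker =
      remaining.length > maxKeys ? idOf(versions[versions.length - 1]) : undefined;
    return { versions, nextMarker };
  }

  async deleteVersions(versions: ObjectVersion[]): Promise<void> {
    const doomed = new Set(versions.map(idOf));
    this.items = this.items.filter((v) => !doomed.has(idOf(v)));
  }
}
```

For example, passing a predicate such as `(v) => v.lastModified.getTime() < cutoff.getTime()` would mimic an age-based condition; the real conditions presumably also check whether an asset is still in use.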

Dzhuneyt commented 1 week ago

I also confirm that the total size of the S3 bucket drops by 2-5 GB after every run of the State Machine. This means it is actually deleting stuff, it's just too much stuff to delete. The total size of the bucket right now is 50GB+.
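As a rough back-of-the-envelope estimate, taking 50 GB as the bucket size and assuming each run keeps removing 2-5 GB (both simplifications; assets that are in use or newer than the retention window are kept, so the real number of runs would be lower):

```typescript
// Rough estimate only: figures taken from the numbers reported above.
const bucketSizeGb = 50;
const perRunGb = { min: 2, max: 5 };

// Best case: every run removes the maximum; worst case: the minimum.
const fewestRuns = Math.ceil(bucketSizeGb / perRunGb.max);
const mostRuns = Math.ceil(bucketSizeGb / perRunGb.min);

console.log(`${fewestRuns} to ${mostRuns} runs`); // prints "10 to 25 runs"
```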

Just thinking out loud - rather than bumping up the timeout of that Lambda, another option could be to move the loop into the State Machine itself, so it chunks the big array of S3 objects and passes them to a downstream Lambda that works on X items at a time, making the State Machine split the big array and call that Lambda repeatedly.
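A minimal sketch of the chunking idea, in plain TypeScript. In Step Functions the fan-out itself would more likely be expressed with a Map state; here `runInBatches` and its `worker` callback are hypothetical stand-ins for the State Machine loop and the downstream Lambda.

```typescript
// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError(`batch size must be a positive integer, got ${size}`);
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Stand-in for the State Machine loop: one worker invocation per batch,
// so no single invocation has to handle the whole list.
// Returns the number of invocations made.
async function runInBatches<T>(
  items: readonly T[],
  batchSize: number,
  worker: (batch: T[]) => Promise<void>,
): Promise<number> {
  const batches = chunk(items, batchSize);
  for (const batch of batches) {
    await worker(batch);
  }
  return batches.length;
}
```

Each worker call then only has to finish its own batch within the Lambda timeout, which is the point of moving the loop out of the function.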

jogold commented 1 week ago

Just thinking out loud - rather than bumping up the timeout of that Lambda, another option could be to move the loop into the State Machine itself, so it chunks the big array of S3 objects and passes them to a downstream Lambda that works on X items at a time, making the State Machine split the big array and call that Lambda repeatedly.

Will have a look at this. Reopening here.

jogold commented 1 week ago

I also confirm that the total size of the S3 bucket drops by 2-5 GB after every run of the State Machine. This means it is actually deleting stuff, it's just too much stuff to delete. The total size of the bucket right now is 50GB+

@Dzhuneyt how is it now with a 15-minute timeout?

Dzhuneyt commented 1 week ago

Unfortunately, I can no longer easily reproduce it: after plenty of re-runs of the State Machine, I was able to purge a lot of junk assets, so the operation now completes without any timeouts, even with the previous timeout configuration.

Reproduction would mean I will have to artificially inflate the junk in the assets S3 bucket for CDK, which is not ideal.

Sorry about the dead end. I'm sure other people (or my future self) will find that configurable timeout beneficial.

jogold / cloudstructs

ToolkitCleaner: CleanObjects operation times out after 300 seconds #290