Dzhuneyt opened 1 week ago
Hi @Dzhuneyt,
Does it ultimately succeed if you rerun the state machine multiple times?
I ran it with the dry-run flag just to see how much space would be reclaimed, so I doubt a re-run will fix it in this case; it just repeats the same operation.
I will try now without the dry run flag, which I suspect might get it to work after a few retries.
> I ran it with the dry-run flag just to see how much space would be reclaimed, so I doubt a re-run will fix it in this case; it just repeats the same operation.
Of course.
Maybe I can expose an option to set the Lambda timeout here.
I tried re-running a couple of times even without the dryRun flag (so it actually deletes resources until it hits the 300-second timeout), but it seems to time out even in that scenario.
Is it possible that the cleaning Lambda first does some "data collecting" before going forward with deleting these assets? If so, it may time out during this first phase and never actually delete anything, so subsequent re-runs are no faster.
Either way, letting consumers of the ToolkitCleaner construct configure the timeout, or raising the default to 15 minutes, would both be good solutions.
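A configurable timeout with a 15-minute default could be resolved like this (a sketch only; `cleanObjectsTimeoutSeconds` is a hypothetical prop name, not something cloudstructs is confirmed to expose):

```typescript
// Hypothetical props sketch: consumers may override the Lambda timeout,
// falling back to a 15-minute default (values in seconds for simplicity).
interface ToolkitCleanerProps {
  cleanObjectsTimeoutSeconds?: number;
}

const DEFAULT_TIMEOUT_SECONDS = 15 * 60;

// Return the consumer's override when present, otherwise the default.
function resolveTimeout(props: ToolkitCleanerProps = {}): number {
  return props.cleanObjectsTimeoutSeconds ?? DEFAULT_TIMEOUT_SECONDS;
}
```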
> Is it possible that the cleaning Lambda first does some "data collecting" before going forward with deleting these assets? If so, it may time out during this first phase and never actually delete anything, so subsequent re-runs are no faster.
It lists up to 1,000 object versions, deletes the ones that satisfy the deletion conditions, and then loops until there are no more object versions to list (by listing again, etc.), see https://github.com/jogold/cloudstructs/blob/master/src/toolkit-cleaner/clean-objects.lambda.ts#L16. I'd expect that after 5 minutes some of your objects should be deleted.
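The control flow described can be sketched like this (a simplified illustration, not the actual clean-objects.lambda.ts; an in-memory interface stands in for the S3 API so the loop itself is visible):

```typescript
// One page of an object-version listing, mimicking S3's paginated API.
interface VersionPage {
  versions: string[];      // object version ids in this page (up to 1,000)
  nextKeyMarker?: string;  // present when more pages remain
}

// Minimal stand-in for the S3 client calls the Lambda makes.
interface ObjectStore {
  listVersions(marker?: string): VersionPage;
  deleteVersions(ids: string[]): void;
}

// List a page, delete the versions matching the predicate, and loop
// until a page comes back without a continuation marker.
function cleanObjects(store: ObjectStore, shouldDelete: (id: string) => boolean): number {
  let deleted = 0;
  let marker: string | undefined;
  do {
    const page = store.listVersions(marker);
    const toDelete = page.versions.filter(shouldDelete);
    if (toDelete.length > 0) {
      store.deleteVersions(toDelete);
      deleted += toDelete.length;
    }
    marker = page.nextKeyMarker;
  } while (marker !== undefined);
  return deleted;
}
```

Because deletions happen inside the loop, even a run that hits the 5-minute timeout should have removed the versions from the pages it already processed.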
I also confirm that the total size of the S3 bucket drops by 2-5 GB after every run of the State Machine. This means it is actually deleting stuff, it's just too much stuff to delete. The total size of the bucket right now is 50GB+.
Just thinking out loud - rather than bumping up the timeouts of that Lambda, another option could be to lift the loop into the State Machine itself: chunk the big array of S3 objects and pass the chunks to a downstream Lambda that works on X items at a time, letting the State Machine split the big array and invoke that Lambda iteratively.
> Just thinking out loud - rather than bumping up the timeouts of that Lambda, another option could be to lift the loop into the State Machine itself: chunk the big array of S3 objects and pass the chunks to a downstream Lambda that works on X items at a time, letting the State Machine split the big array and invoke that Lambda iteratively.
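The splitting step proposed above is just fixed-size batching; a minimal sketch (names are illustrative, not the library's API):

```typescript
// Split an array into consecutive batches of at most `size` items,
// so a Map state can fan each batch out to a worker Lambda.
function chunk<T>(items: T[], size: number): T[][] {
  if (size <= 0) throw new Error("chunk size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Step Functions' Map state (and Distributed Map for very large inputs) natively iterates over an array like this, so the orchestration itself would not need a recursive Lambda.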
Will have a look at this. Reopening here.
> I also confirm that the total size of the S3 bucket drops by 2-5 GB after every run of the State Machine. This means it is actually deleting stuff, it's just too much stuff to delete. The total size of the bucket right now is 50GB+.
@Dzhuneyt how is it now with a 15-minute timeout?
Unfortunately, I can no longer easily reproduce, because after plenty of re-runs of the State Machine, I was able to purge a lot of junk assets, so now the operation completes without any timeouts, even with the previous timeout configuration.
Reproducing it would mean artificially inflating the junk in the CDK assets S3 bucket, which is not ideal.
Sorry about the dead end. I'm sure other people (or my future self) will find that configurable timeout beneficial.
I tried using the ToolkitCleaner construct in our staging environment, which has been shared by 5 or so developers for the past 2+ years.
Obviously, that account has accumulated lots of junk in terms of S3 artifacts and ECR images in the Bootstrap stack.
I use the construct in the following way:
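(The original snippet didn't survive in this thread. As a hedged sketch of typical usage, assuming a CDK TypeScript app with aws-cdk-lib and cloudstructs installed; the `dryRun` prop is shown only because the thread mentions running with the dry-run flag, and may not match the reporter's actual configuration:)

```typescript
// Sketch only: a typical ToolkitCleaner instantiation in a CDK stack.
// The exact props used in this issue were not preserved.
import { App, Stack } from 'aws-cdk-lib';
import { ToolkitCleaner } from 'cloudstructs';

const app = new App();
const stack = new Stack(app, 'CleanerStack');

new ToolkitCleaner(stack, 'ToolkitCleaner', {
  dryRun: true, // report reclaimable space without deleting anything
});
```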
After invoking the State Machine manually, I get the following error: