mlevit / aws-auto-cleanup

Programmatically delete AWS resources based on an allowlist and time to live (TTL) settings
MIT License

Web - Resources Scheduled for Termination View hanging with large csv #121

Closed kjblanchard closed 1 year ago

kjblanchard commented 1 year ago

Hello, thanks for the work on this project, it's great!

Currently I'm running into an issue viewing the webpage when we have a large number of items on the termination list (~20k+).

What happens is that when you click on the box to view the resources, it pulls up the modal and starts spinning as usual, then hangs and never finishes. It does seem to fire off the Lambda function, and that runs successfully, however the result is never presented on the webpage. If you wait a while, occasionally I've gotten an error:

Aw, Snap!

Something went wrong while displaying this webpage.

Error code: Out of Memory

or in the chrome console:

index.js:198 GET https://amend.. 2022%2F07%2Faws_resources_scheduled_for_termination_2022_07_07_17_30_04.csv net::ERR_FAILED 504

We run this on multiple accounts, and most of the accounts are fine; however, we have a few accounts with an excessive number of log groups (the CSV file lists over 18k for one account, and up to 69k for another).

At first I thought it was the Lambdas' memory, but increasing that didn't help. I also tried increasing the ECS task's memory, and that didn't seem to help either. Currently our workaround has been to disable the specific service with all of the resources (log groups) for the affected accounts.

Not sure yet if this is a limitation of the S3 website or something else. I'm currently looking into it and also inquiring with AWS about S3 website limitations.

Thanks and take care!


mlevit commented 1 year ago

Hey @kjblanchard. Thanks for raising this issue. I didn't think I'd ever see an execution log of that size :) but here we are.

I've made a test execution log of 20K records and whilst it does take time to load from S3 and render in Chrome, I can see that anything larger will probably kill the browser.

Let me look into this a little bit longer to see if we can do something with the DataTable that displays the rows.

mlevit commented 1 year ago

So... played around with this today. I think I managed to make it work using DataTables' native functionality. Instead of creating the table in HTML and then converting that table to a DataTable, we instead pass the data to DataTables, which only renders the records visible to the user. This should hopefully prevent Chrome from hitting the out-of-memory error it did before.

Could you do me a favour and pull the change from https://github.com/servian/aws-auto-cleanup/tree/large-execution-log-support and test it for me?

kjblanchard commented 1 year ago

Awesome! Thanks a lot for looking into it. I'll test it out shortly.

kjblanchard commented 1 year ago

Hey @mlevit, thanks so much for the commits. Sorry it took me a little while to test, as we use a much older version here. It seems like I'm not getting the out-of-memory error anymore, but I'm now getting a CORS issue (only on the accounts with large CSVs, which seems strange). I've tried a few different CORS policies on S3 but haven't found the trick just yet. Still trying out different things.

Access to fetch at 'https://redacted/execution/2022%2F07%2Faws_resources_scheduled_for_termination_2022_07_11_13_48_15.csv' from origin 'https://redacted' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource. If an opaque response serves your needs, set the request's mode to 'no-cors' to fetch the resource with CORS disabled.

kjblanchard commented 1 year ago

I also noticed this error in the "default-log-execution-read-api" Lambda CloudWatch logs:

[ERROR] [1657552536217] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 413.

According to the following page, it seems this might be a limitation of Lambda's invocation payload size. My CSV file is just above 6 MB. https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html
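For reference, the 6 MB quota linked above applies to the request and response payloads of synchronous invocations. A minimal sketch of how one might estimate whether a JSON response body would trip that limit before returning it (the function name here is hypothetical, not part of the project):

```python
import json

# Approximate Lambda synchronous invocation response payload quota (~6 MB).
LAMBDA_RESPONSE_LIMIT_BYTES = 6 * 1024 * 1024

def response_too_large(body) -> bool:
    """Estimate whether a JSON-serialised response body would exceed the quota."""
    return len(json.dumps(body).encode("utf-8")) > LAMBDA_RESPONSE_LIMIT_BYTES
```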

mlevit commented 1 year ago

Hmmm, depending on how old your previous instance was, I would suggest you update all three modules: app, api, and web. You shouldn't get any CORS issues, tbh.

Regarding the 6 MB response limit, that is still in place. A 20K-record file was about 3 MB for me, so anything larger than, say, 35-40K records would probably exceed that limit. There's not much I can do about that without re-architecting the whole app. The only solution would be to not clean CloudWatch Logs, as you've been doing already. Sorry about that.

mlevit commented 1 year ago

@kjblanchard I just pushed a change https://github.com/servian/aws-auto-cleanup/commit/0827dc62cdc1b15503421f68317f2fa3476ca9de that should hopefully solve this issue.

Basically, to work around Lambda's 6 MB response payload limitation, I've compressed the output payload using zlib for execution logs greater than 10K lines. We then decompress using zlib on the frontend. I've managed to work with a file of over 100K lines, around 15 MB in S3... so hopefully this should help you too.
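For illustration, a minimal sketch of the compress-then-encode approach described above, assuming the Lambda zlib-compresses the serialised records and base64-encodes them so they fit in a JSON response, with the frontend reversing both steps via a zlib-compatible JavaScript library. The field names and threshold handling here are hypothetical, not the project's actual code:

```python
import base64
import json
import zlib

# Per the comment above: only compress when the execution log is large.
COMPRESSION_THRESHOLD_ROWS = 10_000

def build_log_response(records):
    """Return execution-log records, zlib-compressing them when the log is large."""
    if len(records) > COMPRESSION_THRESHOLD_ROWS:
        compressed = zlib.compress(json.dumps(records).encode("utf-8"))
        return {
            "is_compressed": True,  # hypothetical flag the frontend would check
            "body": base64.b64encode(compressed).decode("ascii"),
        }
    return {"is_compressed": False, "body": records}
```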

Let me know how you go.

kjblanchard commented 1 year ago

Perfect, you rock.

I've grabbed that commit and pushed it out to our accounts, and I'm able to get the CSV file now when attempting to view the list on the page, thanks for adding zlib compression! I'm getting an Error code: RESULT_CODE_HUNG, though, when it tries to display the results. It seems like my CPU gets pegged at 100% beforehand and memory starts to fill. I saw there were a couple of commits after 0827dc6, so I'm going to go ahead and try adding those commits, see if that fixes it, and update here.

Update: I added in the later commits and get similar errors (Error code: Out of Memory). Not sure if it helps, but I did grab a heap dump while debugging when the memory usage starts to jump; I can send that if needed. Stepping through the debugger, memory starts to be exhausted after the getExecutionLog() function. The step after that goes into Vue, and then the memory usage climbs.

Again, thanks a lot for everything!! Was yours loading fine with the 15 MB files? I can spend a little more time looking at this tomorrow. Also, let me know if there's anything specific I can look at to get better info.

mlevit commented 1 year ago

@kjblanchard I know this may be hard for you to do, but do you think you could share one of your execution logs with me? De-identify everything in there and if you'd like, you can email me a link directly instead of posting it here (mlevit at gmail dot com).

Your logs are different to mine and I'd love to test with yours directly.

mlevit commented 1 year ago

OK so I've pushed more log processing into the backend with this commit https://github.com/servian/aws-auto-cleanup/commit/bc6284e45b0de547fa9a6f4aff0e6b7d0d956067.

Previously, the frontend would receive the log, calculate the stats, and create a smaller log list with just the columns necessary for the table. Instead of doing all that in the frontend, I've moved it into the Lambda and returned it in the response JSON. This should hopefully 🤞 reduce the strain on the browser.
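As a rough sketch of that server-side processing (the column names, response shape, and statistics calculation here are assumptions for illustration, not the actual implementation):

```python
from collections import Counter

# Hypothetical subset of columns the frontend table actually displays.
TABLE_COLUMNS = ("service", "resource", "resource_id", "action", "timestamp")

def summarise_log(rows):
    """Pre-compute the table rows and summary stats in the Lambda, not the browser."""
    statistics = Counter(row.get("action", "unknown") for row in rows)
    table_rows = [[row.get(column, "") for column in TABLE_COLUMNS] for row in rows]
    return {
        "header": list(TABLE_COLUMNS),
        "body": table_rows,
        "statistics": dict(statistics),
    }
```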

Give it a go and see how you go.

mlevit commented 1 year ago

@kjblanchard any chance you've had time to test this out?

kjblanchard commented 1 year ago

Hey, sorry about not responding here. I've added in the latest changes and am still seeing the "out of memory" issue when parsing through it. I went ahead and de-identified the CSV file, but I'm just going through "approvals" to be able to send it out. Thanks for reaching out about it.

kjblanchard commented 1 year ago

@mlevit Finally was able to get one. Will send it to your email now. Thanks again for everything