matanolabs / matano

Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS
https://matano.dev
Apache License 2.0

large file sizes causing OOMKills and timeouts #155

Open timcosta opened 1 year ago

timcosta commented 1 year ago

hi all! i'm investigating using matano for some log ingestion, and some of the ALB log files i'm looking at are extremely large - 100MB compressed, multiple GB decompressed. we're running into memory exhaustion issues, even after manually adjusting limits in the console to the maximum of 10240MB. this happens in multiple lambdas, most notably the transformer and writer.

the specific issue we're seeing in the writer is that it logs INFO lake_writer: Starting 25 downloads from S3 and then 20s later it's killed by lambda for exceeding 10240 MB of memory used. can this number (25) be tuned, or adjusted to take file size into account?

the transformer and databatcher issues we were able to resolve by increasing the timeout and memory, which should be covered by https://github.com/matanolabs/matano/issues/85 when it's included. i may be able to contribute this depending on how our discovery goes, but not sure how long it would be until that could happen.

from the investigation i've done into this problem for a custom processing solution, the "best" resolutions appear to be either loading the data and processing it as a stream rather than loading it all into memory at once, or having some sort of pre-processor that splits large files into smaller chunks before they reach the loader.

do y'all have any thoughts on the best path forward here, or if matano would ever consider handling situations like this where the inputs/batches cannot be processed due to size?

Samrose-Ahmed commented 1 year ago

Hi thanks for the issue.

Generally we don't recommend ingesting such large files, but this should be possible; the lake writer logic just needs to be modified to be a bit more intelligent and file-size aware. Optimal parquet sizes are 100-500MB, so it shouldn't need to bring more than that into memory at a time.

I'd also like to see why it's ending up with that much data in lake writer and not flushing earlier, let me do some testing and update.

timcosta commented 1 year ago

awesome, thanks! this is the managed AWS_ELB ingestion pipeline using files written directly by the ALB. what would you recommend in a situation like this? a pre-processor we write that splits these files into smaller ones before putting them into a bucket matano ingests from?

Samrose-Ahmed commented 1 year ago

Splitting would work but we would probably want to support it out of the box in this case.

I will take a closer look at the code, you can watch this issue.

Samrose-Ahmed commented 1 year ago

Were you able to test this out? I tested 2GB uncompressed ALB logs in the linked PR.