matanolabs / matano

Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS
https://matano.dev
Apache License 2.0
1.46k stars, 99 forks

Make transformer able to handle larger files by streaming #156

Closed Samrose-Ahmed closed 1 year ago

Samrose-Ahmed commented 1 year ago

The transformer can now handle larger files by buffering data and flushing it to S3 as it goes.

Memory usage stays low, and large files are handled at the cost of a longer processing time.
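As a rough illustration of the buffer-and-flush idea (not the code in this PR), here is a minimal Python sketch. The names `stream_transform`, `transform`, `flush`, and `max_buffer_bytes` are hypothetical; `flush` stands in for whatever writes a chunk to S3, for example one part of a multipart upload.

```python
from typing import Callable, Iterable


def stream_transform(
    records: Iterable[bytes],
    transform: Callable[[bytes], bytes],
    flush: Callable[[bytes], None],
    max_buffer_bytes: int = 8 * 1024 * 1024,
) -> int:
    """Transform records one at a time, flushing whenever the buffer fills.

    Peak memory is bounded by roughly max_buffer_bytes plus one record,
    regardless of the total input size. Returns the number of flushes.
    """
    buf = bytearray()
    flushes = 0
    for record in records:
        buf += transform(record)
        if len(buf) >= max_buffer_bytes:
            flush(bytes(buf))  # hypothetical sink, e.g. one uploaded part
            buf.clear()
            flushes += 1
    if buf:
        # Flush whatever is left once the input is exhausted.
        flush(bytes(buf))
        flushes += 1
    return flushes
```

The trade-off matches the description above: memory is capped by the buffer size, but a single invocation still has to walk the whole file, so run time grows with input size.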

Timeouts are still an issue. A better approach is to pre-split the work (even virtually, using byte offsets) so that each unit of work has a predictable data size. That's a more involved change that will likely not be worked on right now, but this code is still useful.
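For reference, virtual splitting by byte offsets could look roughly like the sketch below. It assumes newline-delimited input and uses the common convention that a line belongs to the range containing its first byte. The names `split_ranges` and `read_lines_in_range` are hypothetical and not part of Matano; against S3, the seeks would presumably become ranged reads.

```python
import io
from typing import BinaryIO, List, Tuple


def split_ranges(size: int, chunk: int) -> List[Tuple[int, int]]:
    """Plan half-open byte ranges [start, end) covering `size` bytes."""
    return [(s, min(s + chunk, size)) for s in range(0, size, chunk)]


def read_lines_in_range(f: BinaryIO, start: int, end: int) -> List[bytes]:
    """Return the lines whose first byte lies in [start, end).

    A worker skips the partial line at the start of its range (the previous
    range owns it) and reads past `end` to finish its own last line.
    """
    if start == 0:
        f.seek(0)
    else:
        # Step back one byte so that a line starting exactly at `start`
        # is kept, then discard up to and including the next newline.
        f.seek(start - 1)
        f.readline()
    lines = []
    while f.tell() < end:
        line = f.readline()
        if not line:  # end of file
            break
        lines.append(line)
    return lines


data = b"aa\nbbbb\nc\ndddddd\n"
ranges = split_ranges(len(data), 4)
pieces = [read_lines_in_range(io.BytesIO(data), s, e) for s, e in ranges]
```

Each range can be processed independently, and every line is read exactly once across all ranges, so the work per invocation is bounded by the chunk size plus at most one trailing line.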

Related: #155