matanolabs / matano

Open source security data lake for threat hunting, detection & response, and cybersecurity analytics at petabyte scale on AWS
https://matano.dev
Apache License 2.0

Feature Request: Add options pertaining to snapshot expire schedule as part of config #151

Open rams3sh opened 1 year ago

rams3sh commented 1 year ago

As discussed on Discord, some community members, including me, have been facing inconsistent timeouts and errors during the snapshot expiry process.

There seems to be a bug in Athena, and I have raised a case with AWS about it. In parallel, to work around the timeouts, I experimented with changing both the schedule and the Step Functions timeout from 1 hour to 30 minutes, and it worked well for me. With this new setting and my volume of logs, the expiry runs for about 16 minutes on average before failing with ICEBERG_VACUUM_MORE_RUNS_NEEDED, and the subsequent query then executes successfully in about 30 seconds on average.

It would be helpful if the schedule and the Step Functions timeout were exposed in the config, so that users can find the sweet spot where expiry works as expected for the volume of logs they ingest. This would also help in managing the Athena issue until it is resolved.
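Until such a config option exists, the timeout change described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the timeout is the top-level `TimeoutSeconds` field of the state machine's Amazon States Language definition (Matano may set it on an individual state instead), and the function names and the ARN are hypothetical. The client is expected to be a boto3 Step Functions client.

```python
import json


def with_execution_timeout(definition_json: str, timeout_seconds: int) -> str:
    """Return a copy of an ASL definition with the top-level TimeoutSeconds set.

    Assumption: the timeout of interest is the execution-level TimeoutSeconds,
    not a per-state timeout.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    definition = json.loads(definition_json)
    definition["TimeoutSeconds"] = timeout_seconds
    return json.dumps(definition)


def set_state_machine_timeout(sfn_client, state_machine_arn: str, timeout_seconds: int) -> str:
    """Read the current definition, change only the timeout, and write it back.

    sfn_client is expected to behave like boto3.client("stepfunctions").
    """
    current = sfn_client.describe_state_machine(stateMachineArn=state_machine_arn)
    updated = with_execution_timeout(current["definition"], timeout_seconds)
    sfn_client.update_state_machine(
        stateMachineArn=state_machine_arn,
        definition=updated,
    )
    return updated


# Example usage (the ARN is a placeholder; look up the real one in your account):
#   import boto3
#   set_state_machine_timeout(
#       boto3.client("stepfunctions"),
#       "arn:aws:states:REGION:ACCOUNT_ID:stateMachine:YOUR_STATE_MACHINE",
#       30 * 60,
#   )
```

A change made this way lives outside the Matano deployment, so it will likely be reverted the next time the stack is deployed.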

B161851 commented 4 months ago

@rams3sh I am also facing the same issue, but I didn't understand what you are suggesting. Can you please tell me how to avoid the timeout errors while running the vacuum from the Step Function? Please post a code snippet showing how to add the timeout variable.

rams3sh commented 4 months ago

@B161851 There is an inherent issue in Athena that causes these timeouts.

As a workaround, I manually updated the schedule of the EventBridge rule that triggers the VACUUM command, since the Matano config has no parameter for setting it from the CLI. Shortening the interval between VACUUM runs means less data accumulates for each run to clean up. Please note that this is only a temporary fix. A permanent fix can only come from the Athena team at AWS.

As you add more data sources, you may start seeing the timeout again, even with the shorter interval.
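For reference, the manual schedule change described above can also be scripted. The following is a minimal sketch, not something Matano provides: the function name and the rule name are hypothetical, and the client is expected to be a boto3 EventBridge client. Because `put_rule` replaces an existing rule's properties rather than merging them, the sketch reads the rule first and carries the other fields over. Targets attached to the rule are not affected by `put_rule`.

```python
def set_rule_schedule(events_client, rule_name: str, schedule_expression: str) -> dict:
    """Change only the schedule of an existing EventBridge rule.

    events_client is expected to behave like boto3.client("events").
    Returns the parameters that were passed to put_rule.
    """
    rule = events_client.describe_rule(Name=rule_name)

    params = {"Name": rule_name, "ScheduleExpression": schedule_expression}
    # Carry over existing properties so put_rule does not reset them.
    for key in ("State", "Description", "RoleArn", "EventBusName"):
        if rule.get(key):
            params[key] = rule[key]

    events_client.put_rule(**params)
    return params


# Example usage (the rule name is a placeholder; find the real one in the
# EventBridge console for your Matano deployment):
#   import boto3
#   set_rule_schedule(boto3.client("events"), "YOUR_EXPIRE_SNAPSHOTS_RULE", "rate(30 minutes)")
```

As with any change made outside the deployment, the next Matano deploy will likely restore the original schedule, so this may need to be reapplied.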