After getting the monitoring solution working in our infrastructure with a few event log file types, we added the full list of event log files we’re interested in. After doing so, the event log file monitoring process started crashing with OOMKilled (meaning it exceeded its allocated memory on Kubernetes). At that point the service was running with a maximum memory allocation of 2GB of RAM, which was consumed roughly one minute after application startup.
After increasing the RAM allocation to 3GB, I no longer saw OOMKilled; however, the pod would still crash with no exception message (corresponding Kubernetes event: Back-off restarting failed container).
Desired Behavior
Description from customer
The CSV parsing needs to have more memory awareness baked into it. For example, once the log object reaches X% of available memory, we should call the post_logs function to POST the logs to New Relic instead of waiting for the whole CSV file to be parsed.
Additionally, the log array needs to be cleared between iterations to release memory, as sketched below.
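A minimal sketch of that idea follows. Only post_logs comes from the existing code; the memory-threshold helper, the psutil dependency, the MEMORY_LIMIT_BYTES environment variable, and process_rows are assumptions introduced here for illustration.

```python
import os
import psutil  # assumed dependency, used to inspect the process's resident memory

# Hypothetical thresholds: flush once the process uses 50% of its memory limit.
MEMORY_FRACTION_LIMIT = 0.5
MEMORY_LIMIT_BYTES = int(os.environ.get("MEMORY_LIMIT_BYTES", 2 * 1024**3))


def memory_pressure_high() -> bool:
    """Return True when the process RSS crosses the configured threshold."""
    rss = psutil.Process().memory_info().rss
    return rss > MEMORY_FRACTION_LIMIT * MEMORY_LIMIT_BYTES


def process_rows(rows, post_logs):
    """Accumulate parsed CSV rows and flush them to New Relic as soon as
    memory pressure is detected, instead of waiting for the whole file."""
    logs = []
    for row in rows:
        logs.append(row)
        if memory_pressure_high():
            post_logs(logs)  # flush a partial batch early
            logs.clear()     # drop references so the memory can be reclaimed
    if logs:
        post_logs(logs)      # flush whatever remains at the end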
Possible solution:
The current implementation doesn’t take into account how huge Salesforce event log files can be. As Scott also highlighted, we appear to be reading entire CSVs into memory, which likely contributes to the OOM issues.
It also seems that download_response downloads the whole event log file CSV directly into memory before we even call parse_csv. The parse_csv output then allocates even more memory for the very same event log file when it is stored in csv_rows. If I read that correctly, we are effectively double-allocating memory for each event log file we process. Lastly, we appear to execute download_response and parse_csv for all event log files supplied by the user before calling NewRelic.post_logs. Streaming the downloaded content and processing it in an end-to-end pipeline, as proposed by Scott, should ease that; a rough sketch follows.
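Below is one way the streaming pipeline could look, assuming the download goes through the requests library. Everything here except post_logs (stream_event_log, the BATCH_SIZE value, and the use of iter_lines plus csv.DictReader) is an assumption for illustration, not a description of the existing code.

```python
import csv
import requests

BATCH_SIZE = 5000  # assumed flush threshold; tune to the pod's memory budget


def stream_event_log(url: str, session: requests.Session, post_logs) -> None:
    """Stream a Salesforce event log CSV and post rows in fixed-size batches,
    so the full file is never held in memory, neither as the raw response
    body nor as a fully parsed row list."""
    with session.get(url, stream=True) as response:
        response.raise_for_status()
        # iter_lines yields decoded lines lazily instead of buffering the body
        lines = response.iter_lines(decode_unicode=True)
        reader = csv.DictReader(lines)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) >= BATCH_SIZE:
                post_logs(batch)  # ship this chunk to New Relic now
                batch = []
        if batch:
            post_logs(batch)      # ship the final partial chunk
```

Authentication headers and retry handling are omitted from the sketch; the point is that download, parse, and post become one pipeline per file rather than three full-file stages run back to back.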