Related to https://github.com/internetstandards/Internet.nl/issues/1395
During batch result report generation the result is stored in a variable before being written to a file: https://github.com/internetstandards/Internet.nl/blob/9e4d2502d029fba96cf5d82f4409bb944d60df19/interface/batch/util.py#L270-L274
For batch requests with 5000 domains this results in a memory usage of 1GB; for 10k domains, almost 2GB, etc. This requires the worker performing the task to have that much memory available for the short time it takes to generate the report. Furthermore, this memory is retained by the worker until the next report generation runs, at which point it is reused but never freed.
Suggest refactoring the generation logic to write the report file to disk in a streaming fashion in `gather_batch_results`, eliminating the `dom_results` variable (https://github.com/internetstandards/Internet.nl/blob/9e4d2502d029fba96cf5d82f4409bb944d60df19/interface/batch/util.py#L291), which holds the bulk of the memory used. Because `dom_results` (the `domains` field in the report, https://github.com/internetstandards/Internet.nl/blob/9e4d2502d029fba96cf5d82f4409bb944d60df19/interface/batch/util.py#L340) is a dictionary/object, existing JSON encoders might not be able to handle it in a streaming manner. But since the report structure is simple enough, it might be best to write a custom encoder, or to write the JSON directly without any encoder/library.