internetstandards / Internet.nl

Internet standards compliance test suite
https://internet.nl

Refactor batch generation logic to allow large reports to be generated #1444

Open aequitas opened 1 week ago

aequitas commented 1 week ago

Related to https://github.com/internetstandards/Internet.nl/issues/1395

During batch result report generation the result is stored in a variable before being written to a file: https://github.com/internetstandards/Internet.nl/blob/9e4d2502d029fba96cf5d82f4409bb944d60df19/interface/batch/util.py#L270-L274

For a batch request with 5,000 domains this results in a memory usage of 1 GB, for 10,000 domains almost 2 GB, and so on. The worker performing this task therefore needs that much memory available for the short time it takes to generate the reports. Furthermore, the worker retains this memory until the next report generation runs, at which point the memory is reused but not freed.

I suggest refactoring the generation logic to write the report file to disk in a streaming fashion in gather_batch_results. This would eliminate the dom_results variable (https://github.com/internetstandards/Internet.nl/blob/9e4d2502d029fba96cf5d82f4409bb944d60df19/interface/batch/util.py#L291), which accounts for the bulk of the memory used.

Because dom_results (the domains field in the report, https://github.com/internetstandards/Internet.nl/blob/9e4d2502d029fba96cf5d82f4409bb944d60df19/interface/batch/util.py#L340) is a dictionary/object, existing JSON encoders might not be able to handle it in a streaming manner. But because the report structure is simple enough, it might be best to write a custom encoder or to write the JSON directly, without any encoder/library.
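
A minimal sketch of what writing the JSON directly could look like, to illustrate the idea rather than propose the final code. It is not taken from interface/batch/util.py: the function name `write_report_streaming`, the `header` argument, the `api_version` key and the `fake_results` generator are all hypothetical. The only assumption carried over from the report is that it is a JSON object with a `domains` object holding one entry per domain. Each domain entry is encoded with the standard `json.dumps` and written immediately, so only one domain result is held in memory at a time.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator, TextIO, Tuple


def write_report_streaming(
    out: TextIO,
    header: Dict[str, Any],
    domain_results: Iterable[Tuple[str, Dict[str, Any]]],
) -> None:
    """Write a report object to `out` one domain at a time.

    `header` holds the small top-level fields (hypothetical; it must not
    contain a "domains" key). `domain_results` yields (name, result)
    pairs lazily, e.g. from a database cursor.
    """
    out.write("{")
    # Small top-level fields are encoded with the standard library.
    for key, value in header.items():
        out.write(f"{json.dumps(key)}: {json.dumps(value)}, ")
    # The large "domains" object is written by hand, entry by entry.
    out.write('"domains": {')
    first = True
    for name, result in domain_results:
        if not first:
            out.write(", ")
        first = False
        out.write(f"{json.dumps(name)}: {json.dumps(result)}")
    out.write("}}")


def fake_results() -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Hypothetical stand-in for the per-domain results of a batch request."""
    for i in range(3):
        yield f"example{i}.nl", {"status": "ok", "score": 90 + i}


# In-memory buffer for demonstration; a real report would use open(path, "w").
buf = io.StringIO()
write_report_streaming(buf, {"api_version": "2.0"}, fake_results())
report = json.loads(buf.getvalue())  # the output parses as ordinary JSON
```

Because the output is plain JSON, consumers of the report would not need to change. The trade-off is that the surrounding structure (braces, commas, the `domains` key) is maintained by hand and would need to stay in sync with the report format.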