nextflow-io / nf-prov

Render large JSON files in a memory-efficient way #7

Open bentsherman opened 11 months ago

bentsherman commented 11 months ago

The usual pattern for rendering a JSON file is to build the equivalent data structure in Groovy code, serialize it to a JSON string, and write the entire string to a file. For large runs with thousands of tasks, that string could become large enough to make Nextflow run out of memory.

First we need to evaluate whether this is a real problem in practice. Do some large runs and see how large the resulting prov reports are. If they get into the 100 MB - 1 GB range, then we should probably optimize the rendering code.

The memory-efficient approach is to stream the JSON output directly to the file in pieces, so that we never hold the entire report in memory and memory usage does not grow with the number of tasks / outputs.
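
To make the idea concrete, here is a rough, language-agnostic sketch in Python (not nf-prov code). It assumes a hypothetical report shape of `{"tasks": [...]}`; the names `write_report_streaming`, `fake_tasks`, and `round_trip` are made up for illustration. Only one task record is serialized at a time, so peak memory should depend on the largest single task rather than on the number of tasks. In Groovy, something like `groovy.json.StreamingJsonBuilder` writing to a `Writer` would probably play the same role, but that has not been tried here.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable, Iterator, TextIO


def write_report_streaming(out: TextIO, tasks: Iterable[Dict[str, Any]]) -> int:
    """Write {"tasks": [...]} to `out` one task at a time; return the task count.

    The report layout is an assumption for this sketch, not the real BCO schema.
    """
    count = 0
    out.write('{"tasks": [')
    for task in tasks:
        if count > 0:
            out.write(", ")
        # Only this single task is serialized here; the full report
        # never exists as one string in memory.
        json.dump(task, out)
        count += 1
    out.write("]}")
    return count


def fake_tasks(n: int) -> Iterator[Dict[str, Any]]:
    """Hypothetical task records, generated lazily so they are never all held at once."""
    for i in range(n):
        yield {"id": i, "name": f"task_{i}", "outputs": [f"out_{i}.txt"]}


def round_trip(n: int) -> Dict[str, Any]:
    """Stream n fake tasks to a temporary file, then parse it back to check validity."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "report.json")
        with open(path, "w", encoding="utf-8") as out:
            write_report_streaming(out, fake_tasks(n))
        with open(path, encoding="utf-8") as f:
            return json.load(f)


if __name__ == "__main__":
    report = round_trip(1000)
    print(len(report["tasks"]))
```

The trade-off is that the writer has to emit the surrounding brackets and separators itself (or rely on a streaming builder to do so), which is more error-prone than serializing one finished data structure.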

bentsherman commented 11 months ago

rnaseq-nf BCO is ~7 KB

nf-core/rnaseq (-profile test) BCO is ~423 KB