[BUG] Memory issues with `anvi-estimate-metabolism` on large metagenomes in `--metagenome-mode`

ivagljiva commented 7 months ago

Short description of the problem

This bug report is based on a recent question on Discord (see thread here) regarding the memory issues that people have run into when using anvi-estimate-metabolism on large metagenome samples in metagenome mode (ie, contig-level estimates). The question particularly highlighted issues when running in 'multi-mode' (ie, the -M parameter), but the memory issue could also apply to individual samples.

Besides the inconvenience of the program crashing, this memory issue could also result in incomplete results in the output (which is generated for one contig at a time, so if the crash happens before all contigs have been processed, then the estimates for the rest of the contigs will be missing from the output file).

anvi'o version

Reported for v8, but likely also applies to v7.

Detailed description of the issue

anvi-estimate-metabolism keeps a couple of big dictionaries in memory to store modules data (and KO annotation data) as it's estimating on a given input. When it runs in 'metagenome-mode' (ie, with the --metagenome-mode parameter), these dictionaries are even larger because we store information for each contig separately. With very large input metagenomes (ie, large number of contigs), the memory usage can get so big that the program crashes or is killed by the kernel for using too many resources.

The high memory usage is coming from the function estimate_for_contigs_db_for_metagenome(), in which we do this for every contig:

metagenome_metabolism_superdict[contig] = metabolism_dict_for_contig
metagenome_ko_superdict[contig] = ko_dict_for_contig

These two dictionaries are later returned to the driver function estimate_metabolism(), which in turn decides whether or not to return those dictionaries to any function that called it. That return only happens if the programmer asks for it, or if we are running in 'multi-mode' with the --matrix-format flag (in which case a subset of the data is returned for use in making the matrix once all inputs have been individually processed).

In the issue description from the Discord thread, metagenome mode was used with the -M parameter to run on multiple samples at a time, but without the --matrix-format flag. The memory crash happened while working on the first sample, but because we suppress the progress output on individual contigs when running on multiple samples, we don't know which contig it crashed at (unless --debug is used). This collectively means that a) the code never made it to the point where the data from the initial sample was removed to make room for the next sample and b) we don't know if all contigs were even finished processing, meaning that the output obtained so far for the initial sample could be incomplete.

Current workarounds

I've advised people experiencing memory issues to try a combination of the following:

1) Try to avoid using --metagenome-mode if you don't really need it, since keeping data for each individual contig around is especially resource intensive.
2) Run on metagenome samples one at a time rather than using -M with a metagenomes file. If those jobs are able to finish (a guarantee that their results are complete), then it is possible to combine the individual results together later, since each sample/contig is independent.
3) If you can run it on an HPC, do that, and try to give each job more memory than allocated by default (if possible).
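For the second workaround, combining the per-sample results afterwards can be as simple as concatenating the output tables. Here is a rough sketch of that, assuming each single-sample run finished and wrote a tab-delimited file with the same columns. The function name, the added `sample` column, and the file paths are all hypothetical, not something anvi'o provides:

```python
import csv
from typing import Dict


def combine_sample_outputs(sample_files: Dict[str, str], combined_path: str) -> int:
    """Concatenate per-sample tab-delimited files into one table.

    sample_files maps a sample name (hypothetical) to the path of that
    sample's output file. A 'sample' column is prepended to every row.
    Returns the number of data rows written.
    """
    rows_written = 0
    writer = None
    with open(combined_path, "w", newline="") as out:
        for sample, path in sample_files.items():
            with open(path, newline="") as handle:
                reader = csv.DictReader(handle, delimiter="\t")
                for row in reader:
                    if writer is None:
                        # Take the header from the first file; later files
                        # are assumed to have the same columns.
                        fieldnames = ["sample"] + list(reader.fieldnames)
                        writer = csv.DictWriter(
                            out,
                            fieldnames=fieldnames,
                            delimiter="\t",
                            lineterminator="\n",
                        )
                        writer.writeheader()
                    writer.writerow({"sample": sample, **row})
                    rows_written += 1
    return rows_written
```

Rows are streamed one at a time, so the combining step itself does not need to hold any sample in memory. If a later file has columns the first one lacks, `csv.DictWriter` raises a `ValueError` rather than silently dropping them.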

Possible solutions

I think it would help to include another return_superdicts flag variable as an argument to the estimate_for_contigs_db_for_metagenome() function. If we don't really need to return those two dictionaries (ie, long format output), we set that flag to False and we don't even bother to keep any of that data in memory. If we do need to return it (ie, matrix format, or JSON output), well, then there is really nothing we can do.
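To make the idea concrete, here is a minimal sketch of the flag, not the actual anvi'o code. Only `return_superdicts` and the two superdict names come from the description above; the function `estimate_for_contigs` and the `estimate_one` and `write_row` callbacks are hypothetical stand-ins for the per-contig estimation and the long-format output step:

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def estimate_for_contigs(
    contigs: Iterable[str],
    estimate_one: Callable[[str], Tuple[dict, dict]],
    write_row: Callable[[str, dict, dict], None],
    return_superdicts: bool = False,
) -> Tuple[Optional[Dict[str, dict]], Optional[Dict[str, dict]]]:
    """Hypothetical stand-in for the per-contig loop in metagenome mode."""
    # Only allocate the big dictionaries when the caller needs them back.
    metagenome_metabolism_superdict = {} if return_superdicts else None
    metagenome_ko_superdict = {} if return_superdicts else None

    for contig in contigs:
        metabolism_dict_for_contig, ko_dict_for_contig = estimate_one(contig)

        # Output is produced immediately, one contig at a time.
        write_row(contig, metabolism_dict_for_contig, ko_dict_for_contig)

        if return_superdicts:
            metagenome_metabolism_superdict[contig] = metabolism_dict_for_contig
            metagenome_ko_superdict[contig] = ko_dict_for_contig
        # With the flag off, nothing keeps a reference to the per-contig
        # dictionaries, so they can be freed once the names are rebound
        # on the next iteration.

    return metagenome_metabolism_superdict, metagenome_ko_superdict
```

With the flag off, memory use should stay roughly flat regardless of the number of contigs, since at most one contig's data is alive at a time. With the flag on, the behavior is the same as before.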

I also want to update the help output to explicitly indicate that --metagenome-mode will perform contig-level estimation so that people don't accidentally choose that option just because they happen to be working with metagenomes. AND, it would probably be a good idea to add a warning about --metagenome-mode memory issues at the start of the program (perhaps referencing this issue).

ivagljiva commented 7 months ago

A test (pre-fix)

I want to reproduce this issue on a large metagenome so that I can use that to test my fix. I happen to have access to the public ocean metagenomes from this paper, which I have used previously for testing anvi-estimate-metabolism, so I picked the largest contigs database from the set (sample N27, which has 103,978 contigs) and ran the following command via SLURM on an HPC with 20 GB of memory allocated to the job:

anvi-estimate-metabolism -c $N27_DB_PATH --metagenome-mode --output-modes modules --debug

And, as expected, SLURM gave me this error after processing 4,392 of the contigs (it ran for about 30 min before being killed):

error: Detected 1 oom_kill event in StepId=35598057.batch. Some of the step tasks have been OOM Killed.

In theory, my proposed solution would avoid this error, because we would clear the contig-specific data from memory after processing a given contig. I will open a branch, implement the solution, and test again with the same command to see if it works.

ivagljiva commented 7 months ago

The fix worked

I implemented the proposed solution, and ran the same test from above on sample N27, and it worked. It took 6.5 hours for the job to complete, but it worked 😇

✓ anvi-estimate-metabolism took 6:28:52.487417

I will update the help output for metagenome mode and add a warning about high memory usage when running in metagenome mode with --matrix-format or JSON output, and then I will make a PR.

ivagljiva commented 7 months ago

Addressed with https://github.com/merenlab/anvio/pull/2229 :)

meren commented 7 months ago

Thank you for this, Iva!
