merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] Gigantic files created by diamond search #1713

Open genomesandMGEs opened 3 years ago

genomesandMGEs commented 3 years ago

Short description of the problem

I created a genomes storage consisting of ~2k genomes. When I run anvi-pan-genome, the job exceeds my disk quota on the cluster and fails. It creates two gigantic files: 'diamond-search-results.txt.ununiqued' (~2TB) and 'diamond-search-results.txt' (~26GB). Is there a way to limit the size of the intermediate files? Here's the command I ran:

anvi-pan-genome -g PSAE2009-GENOMES.db -n PSAE2009 -T 32 --exclude-partial-gene-calls --mcl-inflation 10 -o PSAE2009_pangenome

anvi'o version

Anvi'o .......................................: hope (v7)
Profile database .............................: 35
Contigs database .............................: 20
Pan database .................................: 14
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 2
tRNA-seq database ............................: 1

System info

Using WSL, and installed with conda.

Detailed description of the issue

I discussed this with Meren on the merenlab webpage, and two options came up: using mmseqs for large pangenome analyses, or adjusting the default parameters so that DIAMOND does not report every single hit.
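To illustrate the second option, here is a minimal Python sketch of what "not reporting every single hit" means: it keeps only the best few hits per query from a tab-separated hits stream. It is not what anvi'o or DIAMOND do internally. The function name `cap_hits_per_query` and the `max_hits` parameter are made up for this example, and it assumes a BLAST-style tabular layout where the query ID is the first column, the bit score is the last column, and all hits for one query sit on consecutive lines.

```python
import heapq
from itertools import groupby


def cap_hits_per_query(lines, max_hits):
    """Yield at most `max_hits` best-scoring hit lines for each query.

    Assumptions (not taken from anvi'o): tab-separated input, query ID in
    the first column, bit score in the last column, and hits for the same
    query on consecutive lines.
    """
    # Split each non-empty line into columns, lazily, so the whole file
    # never has to fit in memory.
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())

    # groupby only merges *adjacent* rows with the same query ID.
    for _, group in groupby(rows, key=lambda row: row[0]):
        # Keep the top `max_hits` rows of this query, highest bit score first.
        best = heapq.nlargest(max_hits, group, key=lambda row: float(row[-1]))
        for row in best:
            yield "\t".join(row)
```

Passing an open file handle as `lines` streams the file one query at a time. A post-hoc filter like this only shows the idea, though: the huge file would already be on disk by then, so the cap has to be applied at search time to actually save space. DIAMOND has a `--max-target-seqs` option for that, but whether and how anvi-pan-genome exposes it is not stated in this thread.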

meren commented 3 years ago

Thank you for reporting this! Notes for anyone who is interested in working on this issue: