Status: Open. Opened by genomesandMGEs 3 years ago.
Thank you for reporting this! Notes for anyone who is interested in working on this issue:
--e-value flag in the anvi-pan-genome program to eliminate the vast majority of extremely weak hits prior to the MCL step.
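For anyone picking this up, here is a rough sketch of what an e-value cutoff would do to the search results before they reach MCL. It assumes the results are in the default 12-column tabular layout (BLAST-style output format 6), where the e-value is the 11th column; the function name, the 1e-5 threshold, and the two example hits are made up for illustration and are not taken from anvi'o.

```python
import io

# Assumption: default 12-column tabular layout, where 'evalue' is the
# 11th column (zero-based index 10).
EVALUE_COLUMN = 10


def filter_hits_by_evalue(lines, max_evalue=1e-5):
    """Yield only the hit lines whose e-value is at or below max_evalue.

    Hypothetical helper for illustration; not part of anvi'o.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= EVALUE_COLUMN:
            continue  # skip blank or malformed lines
        if float(fields[EVALUE_COLUMN]) <= max_evalue:
            yield line


# Two made-up hits: one strong, one extremely weak.
example = io.StringIO(
    "g1\tg2\t98.0\t300\t6\t0\t1\t300\t1\t300\t1e-150\t550\n"
    "g1\tg3\t25.0\t40\t30\t0\t1\t40\t5\t44\t2.5\t20.1\n"
)
kept = list(filter_hits_by_evalue(example))
```

Because the function is a generator, it can stream over a very large results file line by line without loading it into memory. Filtering after the fact does not stop the big file from being written in the first place, though, so the cutoff would be most useful if applied during the search itself.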
Short description of the problem
I created a genomes storage consisting of ~2k genomes, and when I tried to run anvi-pan-genome, it exceeded my disk quota on the cluster and the job failed. It creates a gigantic file 'diamond-search-results.txt.ununiqued' (~2TB) as well as 'diamond-search-results.txt' (~26GB). Is there a way to limit the size of the intermediate files? Here's the command I ran:
anvi-pan-genome -g PSAE2009-GENOMES.db -n PSAE2009 -T 32 --exclude-partial-gene-calls --mcl-inflation 10 -o PSAE2009_pangenome
anvi'o version
Anvi'o .......................................: hope (v7)
Profile database .............................: 35
Contigs database .............................: 20
Pan database .................................: 14
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 2
tRNA-seq database ............................: 1
System info
Using WSL, and installed with conda.
Detailed description of the issue
After I discussed this with Meren on the merenlab webpage, two options were proposed: using mmseqs for large pangenome analyses, or adjusting the default parameters so that DIAMOND does not report every single hit.
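To make the second option concrete, here is a minimal sketch of a DIAMOND search command that sets an e-value cutoff at search time. This is not how anvi-pan-genome invokes DIAMOND internally; the helper name, file names, and default threshold are hypothetical. The `--evalue`, `--max-target-seqs`, `--outfmt`, and `--threads` options are standard `diamond blastp` options.

```python
def build_diamond_command(query_fasta, db_path, out_path,
                          max_evalue=1e-5, threads=32,
                          max_target_seqs=None):
    """Build an argument list for 'diamond blastp' with an e-value cutoff.

    Hypothetical helper for illustration; not part of anvi'o.
    """
    cmd = [
        "diamond", "blastp",
        "--query", query_fasta,
        "--db", db_path,
        "--out", out_path,
        "--outfmt", "6",                 # tabular output
        "--evalue", str(max_evalue),     # drop hits weaker than this
        "--threads", str(threads),
    ]
    if max_target_seqs is not None:
        # Caution: with ~2k genomes, a cap below the number of genomes
        # could discard genuine homologs and split gene clusters.
        cmd += ["--max-target-seqs", str(max_target_seqs)]
    return cmd


# File names below are placeholders.
cmd = build_diamond_command("genes.faa", "genes.dmnd",
                            "diamond-search-results.txt")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files. Whether a given cutoff shrinks the intermediate files enough for a ~2k genome dataset would need to be tested on the data.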