soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

mmseqs v 14-7e284 bioconda hangs indefinitely on search #625

Open gbouras13 opened 2 years ago

gbouras13 commented 2 years ago

Hi,

I have written a tool called pharokka (https://github.com/gbouras13/pharokka) that annotates bacteriophage genomes. Pharokka uses mmseqs2 to match predicted CDS against the PHROGs (https://phrogs.lmge.uca.fr), CARD and VFDB databases, following the method described at https://phrogs.lmge.uca.fr/READMORE.php. Mmseqs2 is amazingly fast, especially for large input metaviromes, so it's a brilliant choice for this step - thanks for developing it!

I am coming across a problem with the new version released 10 days ago.

With v 13.4511, the 3 mmseqs2 searches take approximately 5-10 minutes in total, depending on input, architecture and threads used. However, since mmseqs2 v14-7e284 was released, users of pharokka have reported that the mmseqs2 step hangs indefinitely (20+ hours) when pharokka is installed with bioconda alongside mmseqs2 v14-7e284. I have also replicated the issue on my own machines.

The relevant lines in pharokka are 358-369:

https://github.com/gbouras13/pharokka/blob/3b8f7ae207b367366765f482c9dce1dd2cccee80/bin/processes.py#L358

create target db

"mmseqs", "createdb", os.path.join(out_dir, amino_acid_fasta), os.path.join(target_db_dir, "target_seqs")

search

"mmseqs", "search", "-e", evalue, os.path.join(phrog_db_dir, "phrogs_profile_db"), os.path.join(target_db_dir, "target_seqs"), os.path.join(mmseqs_dir, "results_mmseqs"), tmp_dir, "-s", "8.5", "--threads", threads

tsv output

"mmseqs", "createtsv", os.path.join(phrog_db_dir, "phrogs_profile_db"), os.path.join(target_db_dir, "target_seqs"), os.path.join(mmseqs_dir, "results_mmseqs"), os.path.join(out_dir,"mmseqs_results.tsv"), "--full-header", "--threads", threads

Lines 458-469 and 496-507 apply the same method to the CARD and VFDB databases.
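For readers who want to reproduce the three calls outside pharokka, here is a minimal sketch of how the argument lists above could be assembled and run with a timeout, so a hang surfaces as an error instead of blocking for hours. The function names (`build_mmseqs_commands`, `run_with_timeout`) and the `db_dir`/`profile_db` parameters are hypothetical and are not taken from pharokka's `processes.py`; only the mmseqs subcommands and flags are copied from the lines quoted above.

```python
import os
import subprocess


def build_mmseqs_commands(out_dir, amino_acid_fasta, db_dir, profile_db,
                          target_db_dir, mmseqs_dir, tmp_dir,
                          evalue, threads):
    """Return the createdb, search and createtsv argument lists, in order.

    Hypothetical helper: it only mirrors the argument order quoted in the issue.
    """
    query = os.path.join(db_dir, profile_db)  # e.g. "phrogs_profile_db"
    target = os.path.join(target_db_dir, "target_seqs")
    result = os.path.join(mmseqs_dir, "results_mmseqs")
    tsv = os.path.join(out_dir, "mmseqs_results.tsv")
    return [
        # 1. build the target database from the predicted proteins
        ["mmseqs", "createdb", os.path.join(out_dir, amino_acid_fasta), target],
        # 2. search the profile database against the target database
        ["mmseqs", "search", "-e", str(evalue), query, target, result, tmp_dir,
         "-s", "8.5", "--threads", str(threads)],
        # 3. convert the result database to a TSV file
        ["mmseqs", "createtsv", query, target, result, tsv,
         "--full-header", "--threads", str(threads)],
    ]


def run_with_timeout(commands, timeout_s):
    """Run each command in turn; subprocess.TimeoutExpired is raised on a hang."""
    for cmd in commands:
        subprocess.run(cmd, check=True, timeout=timeout_s)
```

Every argument is converted to a string because `subprocess.run` expects string arguments; passing an integer thread count directly would raise a `TypeError`. The timeout value is left to the caller, since a sensible limit depends on input size and hardware.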

Expected Behavior

Command should take 5-10 minutes to run.

Current Behavior

Command hangs indefinitely. As you can see in the log file for v14-7e284 the prefilter step took 36 minutes, then the prefiltering scores calculation hung for 20 hours until the program was killed.

I have attached 2 log files - one for each version of mmseqs2. The log files include all the mmseqs2 output written to stdout.

Steps to Reproduce (for bugs)

conda create -n pharokkaenv pharokka mmseqs2==14.7e284
conda activate pharokkaenv
install_databases.py -d
pharokka.py -i lambda.fasta -o lambda_out -t 8

Input file attached

MMseqs Output (for bugs)

Log files attached with "correct" output (13.4511) showing mmseqs2 run 3 times takes approximately 6 mins, vs 14.7e284 which takes 36 minutes to prefilter on the first step, then hangs (for 20 hours).

Context

Your Environment

I have tested this on macOS (Intel and M1) and on Ubuntu Linux, all with bioconda installations, and I get the same issue on each.

lambda.fasta.txt pharokka_mmseqs2_13.45111.log pharokka_mmseqs2_14_7e284.log

George

milot-mirdita commented 2 years ago

Hi George,

I think the issue is that the internal MMseqs2 profile format has changed, so the phrogs database needs to be rebuilt with the newest MMseqs2 version. I think we can download the MSAs and convert them directly to a profile database within the databases module.

How/where do you download the phrogs in pharokka?

milot-mirdita commented 2 years ago

Essentially something like this:

aria2c -x 16 https://phrogs.lmge.uca.fr/downloads_from_website/MSA_phrogs.tar.gz
mmseqs tar2db MSA_phrogs.tar.gz phrogs_msa --output-dbtype 11 --tar-include '.+\.fma$'
mmseqs msa2profile phrogs_msa phrogs_prof
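
Since pharokka drives mmseqs2 from Python, the same three steps could be scripted roughly as follows. This is only a sketch: `rebuild_commands` and `rebuild_profiles` are hypothetical names, the commands and flags are copied unchanged from the lines above, and it assumes `aria2c` and `mmseqs` are on the PATH and that `workdir` already exists.

```python
import subprocess

MSA_URL = "https://phrogs.lmge.uca.fr/downloads_from_website/MSA_phrogs.tar.gz"


def rebuild_commands():
    """Return the download, tar2db and msa2profile argument lists, in order."""
    return [
        # download the PHROGs MSA archive
        ["aria2c", "-x", "16", MSA_URL],
        # import the .fma alignments from the archive into an MSA database
        ["mmseqs", "tar2db", "MSA_phrogs.tar.gz", "phrogs_msa",
         "--output-dbtype", "11", "--tar-include", r".+\.fma$"],
        # convert the MSA database to a profile database
        ["mmseqs", "msa2profile", "phrogs_msa", "phrogs_prof"],
    ]


def rebuild_profiles(workdir):
    """Run the three steps inside workdir, stopping at the first failure."""
    for cmd in rebuild_commands():
        subprocess.run(cmd, check=True, cwd=workdir)
```

Running the commands with `cwd=workdir` keeps the relative file names from the shell version intact, so no extra path flags are needed.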

Also, @ClovisG, it would be great if you guys could take a look at updating the phrogs profiles. Or better yet, only offer MSAs and instructions on how to build profiles, in case the format changes again in the future.

gbouras13 commented 2 years ago

Hi Milot,

Thanks for the rapid reply!

I download the mmseqs2-formatted database from this link: https://phrogs.lmge.uca.fr/downloads_from_website/phrogs_mmseqs_db.tar.gz, which is listed on this page: https://phrogs.lmge.uca.fr/READMORE.php

The MSAs and other formats are available at the bottom of this page: https://phrogs.lmge.uca.fr (specifically https://phrogs.lmge.uca.fr/downloads_from_website/MSA_phrogs.tar.gz)

I'll absolutely give this a crack and let you know how I go.

George