Open gbouras13 opened 2 years ago
Hi George,
I think the issue is that the internal MMseqs2 profile format has changed, so the PHROGs database needs to be rebuilt with the newest MMseqs2 version. I think we can download the MSAs and convert them directly to a profile database within the databases module.
How/where do you download the PHROGs in pharokka?
Essentially something like this:
aria2c -x 16 https://phrogs.lmge.uca.fr/downloads_from_website/MSA_phrogs.tar.gz
mmseqs tar2db MSA_phrogs.tar.gz phrogs_msa --output-dbtype 11 --tar-include '.+\.fma$'
mmseqs msa2profile phrogs_msa phrogs_prof
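As a rough sketch of how these three steps could be scripted from Python (pharokka's language), the commands above can be kept as argument lists and run in order. The function names, the dry-run switch, and the default database names passed as parameters are illustrative assumptions, not pharokka's actual code; the commands themselves are copied from the lines above.

```python
import subprocess

# URL taken from the post above.
MSA_URL = "https://phrogs.lmge.uca.fr/downloads_from_website/MSA_phrogs.tar.gz"


def profile_rebuild_commands(archive="MSA_phrogs.tar.gz",
                             msa_db="phrogs_msa",
                             profile_db="phrogs_prof"):
    """Return the three commands from the post as argument lists.

    Hypothetical helper: the parameter names are made up for illustration.
    """
    return [
        # Download the MSA archive with 16 connections.
        ["aria2c", "-x", "16", MSA_URL],
        # Convert the .fma files inside the tarball into an MSA database.
        ["mmseqs", "tar2db", archive, msa_db,
         "--output-dbtype", "11", "--tar-include", r".+\.fma$"],
        # Build a profile database from the MSA database.
        ["mmseqs", "msa2profile", msa_db, profile_db],
    ]


def rebuild_profiles(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in profile_rebuild_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    rebuild_profiles(dry_run=True)
```

Passing the arguments as lists avoids shell quoting problems with the `--tar-include` regular expression.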
Also, @ClovisG, it would be great if you guys could take a look at updating the PHROGs profiles. Or better yet, only offer MSAs and instructions on how to build profiles, in case the format changes again in the future.
Hi Milot,
Thanks for the rapid reply!
I download the MMseqs2-formatted database from this link: https://phrogs.lmge.uca.fr/downloads_from_website/phrogs_mmseqs_db.tar.gz, which is found on this site: https://phrogs.lmge.uca.fr/READMORE.php
The MSAs and other formats are available at the bottom of this link https://phrogs.lmge.uca.fr (specifically https://phrogs.lmge.uca.fr/downloads_from_website/MSA_phrogs.tar.gz)
I'll absolutely give this a crack and let you know how I go.
George
Hi,
I have written a tool called pharokka (https://github.com/gbouras13/pharokka) that annotates bacteriophage genomes. Pharokka uses MMseqs2 to match predicted CDS to the PHROGs (https://phrogs.lmge.uca.fr), CARD and VFDB databases, following the method at https://phrogs.lmge.uca.fr/READMORE.php. MMseqs2 is amazingly fast, especially for large input metaviromes, so it's a brilliant choice for this search - thanks for developing it!
I am coming across a problem with the new version released 10 days ago.
With v13.45111, the 3 MMseqs2 searches take approximately 5-10 minutes in total, depending on input, architecture and threads used. However, since MMseqs2 v14-7e284 was released, users of pharokka have reported that the MMseqs2 step hangs indefinitely (20+ hours) when pharokka is installed with MMseqs2 v14-7e284 from bioconda. I have replicated the issue on my own machines.
The relevant lines in pharokka are 358-369:
https://github.com/gbouras13/pharokka/blob/3b8f7ae207b367366765f482c9dce1dd2cccee80/bin/processes.py#L358
# create target db
"mmseqs", "createdb", os.path.join(out_dir, amino_acid_fasta), os.path.join(target_db_dir, "target_seqs")
# search
"mmseqs", "search", "-e", evalue, os.path.join(phrog_db_dir, "phrogs_profile_db"), os.path.join(target_db_dir, "target_seqs"), os.path.join(mmseqs_dir, "results_mmseqs"), tmp_dir, "-s", "8.5", "--threads", threads
# tsv output
"mmseqs", "createtsv", os.path.join(phrog_db_dir, "phrogs_profile_db"), os.path.join(target_db_dir, "target_seqs"), os.path.join(mmseqs_dir, "results_mmseqs"), os.path.join(out_dir, "mmseqs_results.tsv"), "--full-header", "--threads", threads
Lines 458-469 and 496-507 apply the same method to the CARD and VFDB databases.
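To make the three calls above easier to reproduce outside pharokka, here is a minimal sketch that assembles them as argument lists and runs each under a timeout, so a hang shows up as a failure instead of blocking for hours. The function names, parameters, and the timeout wrapper are assumptions added for illustration; only the MMseqs2 arguments come from the excerpt above, and this is not how processes.py is actually organised.

```python
import os
import subprocess
import sys


def phrog_search_commands(fasta, phrog_db_dir, target_db_dir, mmseqs_dir,
                          out_dir, tmp_dir, evalue, threads):
    """Build the createdb, search and createtsv calls as argument lists.

    Hypothetical helper: `fasta` is the full path to the amino acid FASTA.
    """
    profile_db = os.path.join(phrog_db_dir, "phrogs_profile_db")
    target_db = os.path.join(target_db_dir, "target_seqs")
    results = os.path.join(mmseqs_dir, "results_mmseqs")
    tsv = os.path.join(out_dir, "mmseqs_results.tsv")
    return [
        ["mmseqs", "createdb", fasta, target_db],
        ["mmseqs", "search", "-e", str(evalue), profile_db, target_db,
         results, tmp_dir, "-s", "8.5", "--threads", str(threads)],
        ["mmseqs", "createtsv", profile_db, target_db, results, tsv,
         "--full-header", "--threads", str(threads)],
    ]


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return False if it exceeds timeout_s seconds.

    subprocess.run kills the child process before raising TimeoutExpired.
    A non-zero exit status still raises CalledProcessError (check=True).
    """
    try:
        subprocess.run(cmd, check=True, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        return False


if __name__ == "__main__":
    # Harmless demonstration with the Python interpreter instead of mmseqs.
    print(run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30))
```

A timeout somewhat above the 5-10 minutes reported for the older version would be one way to flag the hang described here; the exact value is a judgment call.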
Expected Behavior
Command should take 5-10 minutes to run.
Current Behavior
Command hangs indefinitely. As you can see in the log file for v14-7e284, the prefilter step took 36 minutes, and then the prefiltering scores calculation hung for 20 hours until the program was killed.
I have attached 2 log files - one for each version of mmseqs2. The log files include all the mmseqs2 output written to stdout.
Steps to Reproduce (for bugs)
conda create -n pharokkaenv pharokka mmseqs2==14.7e284
conda activate pharokkaenv
install_databases.py -d
pharokka.py -i lambda.fasta -o lambda_out -t 8
Input file attached
MMseqs Output (for bugs)
Log files attached: the "correct" output (13.45111) shows the 3 MMseqs2 runs taking approximately 6 minutes in total, whereas 14.7e284 takes 36 minutes to prefilter on the first step, then hangs (for 20 hours).
Context
Your Environment
I have tested this on macOS (Intel and M1) and on Ubuntu Linux environments with bioconda installations. I get the same issue on all of them.
lambda.fasta.txt pharokka_mmseqs2_13.45111.log pharokka_mmseqs2_14_7e284.log
George