Closed nick-youngblut closed 5 years ago
Which MMseqs2 version are you using? @milot-mirdita, do you know which version you used for the taxonomy workflow?
My recommendation would be to use the workflow explained here: https://github.com/soedinglab/mmseqs2/wiki#taxonomy-assignment-using-mmseqs-taxonomy
mmseqs createtaxdb "${DB}" tmp
mmseqs taxonomy "${SEQDB}" "${DB}" "${TMPOUT}/taxa_db" "${TMPOUT}/tmp_lca" \
    --start-sens 1 -s 6 --sens-steps 3 --lca-ranks "phylum:superphylum:subkingdom:kingdom:superkingdom"
I'm using mmseqs2 7.4e23d h21aa3a5_1 bioconda. Thanks for the suggestion! Your suggested method is much simpler than all of the steps used in the taxonomy.sh file. So you think removeStopCodon and the other steps in the taxonomy.sh workflow are not necessary?
Sorry to bug you about this, but I'm just trying to determine the best way to get a taxonomy for my Plass-assembled sequences.
You are not bugging us. Thanks for trying Plass and MMseqs2! I would remove the stop codons if you want to map the reads back to the assemblies. We consider alignment coverage for mapping, and stop codons decrease the read mapping rate because the '*' cannot be aligned.
Be aware that this taxonomy search can take quite a long time. You can speed it up by decreasing the sensitivity to -s 3.
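For readers who only need the stop-codon removal mentioned above, here is a minimal sketch. It is not the removeStopCodon step from taxonomy.sh; it is a plain awk substitute that assumes the assembly is a protein FASTA file in which stop codons appear as '*' on sequence lines. The function name and the file names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: strip '*' stop-codon symbols from a protein FASTA stream.
# Assumption: '*' only needs removing from sequence lines; header
# lines (starting with '>') are passed through unchanged.
strip_stop_codons() {
    awk '/^>/ { print; next } { gsub(/\*/, ""); print }'
}

# Hypothetical usage; adjust the file names to your own Plass output:
# strip_stop_codons < plass_assembly.faa > plass_assembly.nostop.faa
```

The function reads from standard input and writes to standard output, so it can sit in a pipeline before the sequence database is built.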
plass and mmseqs are great! I'm really liking how well they scale for large datasets. Thanks for the clarification on the codon removal! I didn't see that in the mmseqs docs, but I probably just missed it.
Do you recommend using uniclust90_2017_10 for the taxonomy (as in the taxonomy.sh workflow), or is a different db better for general taxonomic classification of Plass-assembled sequences from metagenome samples (sorry if that's in the docs and I missed it too)?
The MMseqs2 documentation lags behind development. We need to change this.
I think uniclust90 is a great compromise between speed and accuracy. UniProt simply has too much redundancy.
When running taxonomy.sh (after making changes to get it fully running), I found that mmseqs convertkb generates millions of "Could not find accession" warnings. Is this expected? I'm guessing that there is not a full overlap between the KB and DB.lookup files, but millions of warnings is troubling.
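One way to gauge how serious the warnings are is to count how many distinct accessions they cover. The sketch below assumes the convertkb output (stdout and stderr) was redirected to a log file, that each warning sits on its own line, and that the accession is the first whitespace-delimited token after the phrase "Could not find accession". The exact message format is an assumption, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: summarise "Could not find accession" warnings in a log file.
# Assumption: one warning per line, with the accession as the first
# token after the phrase. Adjust the pattern if the real format differs.
summarise_missing_accessions() {
    local log="$1"
    local total unique
    # grep -c prints 0 and exits non-zero when nothing matches.
    total="$(grep -c 'Could not find accession' "$log" || true)"
    unique="$(grep 'Could not find accession' "$log" \
        | sed 's/.*Could not find accession[[:space:]]*//' \
        | awk '{ print $1 }' | sort -u | wc -l | tr -d ' ')"
    printf 'warnings=%s unique_accessions=%s\n' "$total" "$unique"
}

# Hypothetical usage:
# summarise_missing_accessions convertkb.log
```

If the number of unique accessions is far smaller than the number of warnings, the same entries are being reported repeatedly. If the two numbers are close, the KB and DB.lookup files overlap less than expected.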