taxonomy.sh: Could not find accession

martin-steinegger / plass-analysis

Benchmark for PLASS paper

taxonomy.sh: Could not find accession #1

Closed nick-youngblut closed 5 years ago

nick-youngblut commented 5 years ago

When running taxonomy.sh (after making changes to get it fully running), I found that mmseqs convertkb generates millions of "Could not find accession" warnings. Is this expected? I'm guessing that the KB and DB.lookup files do not fully overlap, but millions of warnings are troubling.
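
A rough way to check how large that gap is would be to count the accessions that never appear in DB.lookup. This is only a sketch under assumptions: it assumes DB.lookup is tab-separated with the accession in the second column, and that the KB accessions have already been extracted to a plain text file with one accession per line. The function name and both file arguments are hypothetical.

```shell
# Hypothetical helper: count accessions that are missing from a lookup file.
# $1: lookup file, assumed tab-separated with the accession in column 2
# $2: text file with one accession per line (assumed to be prepared separately)
count_missing_accessions() {
    awk -F'\t' '
        NR == FNR { seen[$2] = 1; next }   # first file: remember every lookup accession
        !($1 in seen) { missing++ }        # second file: count accessions not seen
        END { print missing + 0 }          # print 0 rather than an empty line
    ' "$1" "$2"
}
```

For example, `count_missing_accessions DB.lookup kb_accessions.txt` prints the number of unmatched accessions. The `NR == FNR` idiom assumes the lookup file is not empty.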

martin-steinegger commented 5 years ago

Which MMseqs2 version are you using? @milot-mirdita do you know which version you used for the taxonomy workflow?

My recommendation would be to use the workflow explained here: https://github.com/soedinglab/mmseqs2/wiki#taxonomy-assignment-using-mmseqs-taxonomy

mmseqs createtaxdb "${DB}" tmp
mmseqs taxonomy "${SEQDB}" "${DB}" "${TMPOUT}/taxa_db" "${TMPOUT}/tmp_lca" \
    --start-sens 1 -s 6 --sens-steps 3 --lca-ranks "phylum:superphylum:subkingdom:kingdom:superkingdom"

nick-youngblut commented 5 years ago

I'm using mmseqs2 7.4e23d h21aa3a5_1 from bioconda. Thanks for the suggestion! Your suggested method is much simpler than all of the steps used in the taxonomy.sh file. So you think removeStopCodon and the other steps in the taxonomy.sh workflow are not necessary?

Sorry to bug you about this, but I'm just trying to determine what is the best way to get a taxonomy for my plass-assembled sequences.

martin-steinegger commented 5 years ago

You are not bugging me. Thanks for trying Plass and MMseqs2! I would remove the stop codons if you want to map the reads back to the assemblies. We consider alignment coverage for mapping, and stop codons decrease the mapping rate of reads since the '*' cannot be aligned.
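
If you only need the '*' characters gone from a protein FASTA file, something like the following should be enough. It is a minimal sketch and not the removeStopCodon step from taxonomy.sh; the function name is hypothetical, and it assumes a plain FASTA file in which header lines start with '>'.

```shell
# Hypothetical helper: strip '*' from sequence lines of a protein FASTA file.
# Header lines (starting with '>') are printed unchanged.
strip_stop_codons() {
    awk '
        /^>/ { print; next }          # keep headers exactly as they are
        { gsub(/\*/, ""); print }     # drop every "*" from sequence lines
    ' "$1"
}
```

Usage would look like `strip_stop_codons assembly.faa > assembly.nostop.faa`, where both file names are placeholders.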

Be aware that this taxonomy search can take quite a long time. You can speed it up by decreasing the sensitivity to -s 3.
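
As a sketch, the taxonomy command from earlier in this thread can be wrapped so that the sensitivity is a parameter. The function name and the MMSEQS variable are hypothetical conveniences; SEQDB, DB and TMPOUT are the same placeholders as before, and all other flags are copied unchanged from the command above, so whether they suit your MMseqs2 version is an assumption.

```shell
# MMSEQS is a hypothetical override: set MMSEQS=echo to print the
# arguments (a dry run) instead of starting the search.
MMSEQS="${MMSEQS:-mmseqs}"

# Run the taxonomy search with the maximum sensitivity given as $1 (default 3).
# Expects SEQDB, DB and TMPOUT to be set, as in the earlier command.
run_taxonomy() {
    sens="${1:-3}"
    "$MMSEQS" taxonomy "${SEQDB}" "${DB}" "${TMPOUT}/taxa_db" "${TMPOUT}/tmp_lca" \
        --start-sens 1 -s "$sens" --sens-steps 3 \
        --lca-ranks "phylum:superphylum:subkingdom:kingdom:superkingdom"
}
```

Calling `run_taxonomy 3` runs the faster search, and `run_taxonomy 6` reproduces the original setting.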

nick-youngblut commented 5 years ago

plass and mmseqs are great! I'm really liking how well they scale for large datasets. Thanks for the clarification on the codon removal! I didn't see that in the mmseqs docs, but I probably just missed it.

Do you recommend using uniclust90_2017_10 for the taxonomy (as in the taxonomy.sh workflow) or maybe a different db is optimal for general taxonomic classification of plass-assembled sequences from metagenome samples (sorry if that's in the docs and I missed it too)?

martin-steinegger commented 5 years ago

The MMseqs2 documentation lags behind development. We need to change this.

I think uniclust90 is a great compromise between speed and accuracy. UniProt simply has too much redundancy.