soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

A wrapper for MMseqs2 in qiime2 #235

Open splaisan opened 5 years ago

splaisan commented 5 years ago

This is more of a feature request!

Would someone have the time and expertise to create a Python module similar to the one for vsearch (https://github.com/qiime2/q2-feature-classifier/tree/master/q2_feature_classifier), so that we can classify with multithreading in qiime2?

blast or vsearch runs typically take a day or more for 50k long ONT reads, which is very slow, and I am hoping for the kind of speedup reported in the mmseqs2 paper.

My current qiime2 command, as it shows up in top, is the one below, but I have little idea of what it would translate to with mmseqs2. If I had an equivalent, I might try to hack the vsearch wrapper code, but my Python skills are not that great.

qiime feature-classifier classify-consensus-vsearch --i-query rep-seqs.qza --i-reference-reads /data/biodata/MetONTIIME_DB/rrnDB_operons_sequence.qza --i-reference-taxonomy /data/biodata/MetONTIIME_DB/rrnDB_operons_taxonomy.qza --p-perc-identity 0.77 --p-query-cov 0.8 --p-maxaccepts 1 --p-strand both --p-min-consensus 0.51 --p-unassignable-label Unassigned --p-threads 24 --o-classification taxonomy.qza
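As a rough starting point for such a translation, the sketch below builds an `mmseqs easy-search` argument list from the same parameter values. It is a minimal sketch, not a tested equivalent: the helper name `build_mmseqs_search_cmd`, the FASTA and output file names, and the flag mapping are assumptions to check against `mmseqs easy-search -h` for your MMseqs2 version. It also assumes the `.qza` artifacts have already been exported to FASTA, and it does not reproduce the consensus step (`--p-min-consensus`) or the `Unassigned` label, which would have to be applied to the hit table afterwards.

```python
import shlex
from typing import List

# Assumed mapping from the vsearch wrapper's --p-strand values to the
# MMseqs2 --strand codes (0: reverse, 1: forward, 2: both).
STRAND_CODES = {"plus": 1, "both": 2}


def build_mmseqs_search_cmd(
    query_fasta: str,
    reference_fasta: str,
    result_file: str,
    tmp_dir: str,
    perc_identity: float = 0.77,
    query_cov: float = 0.8,
    maxaccepts: int = 1,
    strand: str = "both",
    threads: int = 24,
) -> List[str]:
    """Hypothetical helper: return an argument list for `mmseqs easy-search`."""
    return [
        "mmseqs", "easy-search",
        query_fasta, reference_fasta, result_file, tmp_dir,
        "--search-type", "3",                 # nucleotide vs. nucleotide search
        "--min-seq-id", str(perc_identity),   # analogue of --p-perc-identity
        "-c", str(query_cov),                 # coverage threshold
        "--cov-mode", "2",                    # coverage measured on the query
        "--max-accept", str(maxaccepts),      # assumed analogue of --p-maxaccepts
        "--strand", str(STRAND_CODES[strand]),
        "--threads", str(threads),
    ]


if __name__ == "__main__":
    # Placeholder file names; nothing is executed, the command is only printed.
    cmd = build_mmseqs_search_cmd(
        "rep-seqs.fasta", "rrnDB_operons_sequence.fasta", "hits.m8", "tmp"
    )
    print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)` once MMseqs2 is installed. The resulting tab-separated hit table would then still need to be joined with the reference taxonomy to produce a classification.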

Thanks for any help on this

PS: I do not dare to double-post on the qiime2 page, as developers often find that offensive.

colinbrislawn commented 4 years ago

How well does MMseqs2 work on 50k long ONT reads?

If this is not a use case for MMseqs2, any other suggestions?

martin-steinegger commented 4 years ago

@colinbrislawn I have tested linclust with ONT reads. It should be possible to cluster them. However, we needed to tweak the parameters used for the banded alignment to account for the high error rate.

How do you want to use MMseqs2?

colinbrislawn commented 4 years ago

> @colinbrislawn I have tested linclust with ONT reads. It should be possible to cluster them. However, we needed to tweak the parameters used for the banded alignment to account for the high error rate.

Awesome!

> How do you want to use MMseqs2?

Existing Qiime 2 plugins offer several options for clustering and classifying short RNA sequences... but no plugins support clustering or classifying long, noisy sequences, or proteins of any kind.

I think an MMseqs2 plugin could bring a ton of functionality to Qiime 2. A method for taxonomic classification of ONT reads would help @splaisan and others.

milot-mirdita commented 4 years ago

We would be happy to assist members of the Qiime community with integrating MMseqs2. We felt it was a bit too much for us to tackle alone.

colinbrislawn commented 4 years ago

Sounds like a plan!

Building a plugin is a pretty big lift as it requires close integration with Qiime 2 semantic types. But at least the docs are good!

I don't think I'm the right person to lead development, but I would be happy to contribute methods to the plugin.