soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com

How do you run linclust on nucleotide sequences? #715

Closed by fluhus 1 year ago

fluhus commented 1 year ago

Hi, thanks for making this toolkit! I'm excited to start using it with my data.

I have a set of viral genomes that I would like to cluster. From the wiki and the paper, I understand that linclust by default runs a process optimized for protein sequences (BLOSUM62, k-mer length, etc.). Can it run on nucleotide sequences? If so, how should I go about it?

milot-mirdita commented 1 year ago

It should just work; I don't think you need to change any parameters. Just call easy-linclust or easy-cluster on your nucleotide input.

MMseqs2 does have issues with long sequences and internally splits them, but for viruses it should work pretty well.
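
For illustration, a minimal sketch of such a call with default parameters. The input file name, output prefix, and temporary directory below are placeholders, not names from this thread. The script only runs the clustering if the mmseqs binary and the input file are actually present.

```shell
#!/bin/sh
# Placeholder names (assumptions); replace with your own paths.
INPUT="viral_genomes.fasta"   # nucleotide FASTA to cluster
PREFIX="clusterRes"           # prefix for the output files
TMP="tmp"                     # scratch directory used by MMseqs2

if command -v mmseqs >/dev/null 2>&1 && [ -f "$INPUT" ]; then
    # Default parameters, as suggested above; no nucleotide-specific flags.
    mmseqs easy-linclust "$INPUT" "$PREFIX" "$TMP"
    # The cluster TSV has two columns: representative ID, member ID.
    head "${PREFIX}_cluster.tsv"
else
    echo "mmseqs or $INPUT not found; skipping" >&2
fi
```

Swapping easy-linclust for easy-cluster in the same call runs the slower, more sensitive cascaded clustering workflow instead.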

fluhus commented 1 year ago

Thank you!

I ran it and noticed that it automatically identified the dataset as nucleotide sequences, so all good.