soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

How to remove duplicated sequences #270

Open SilasK opened 4 years ago

SilasK commented 4 years ago

Expected Behavior

I'm using gene predictions from RefSeq. They unified the gene names, so that the same gene in different genomes has the same name, e.g. WP_012419350.1

I can pass these through linclust, createtsv and result2repseq without errors. The two identical proteins get clustered into the same cluster, and some end up as cluster representatives, so they then appear multiple times in the result2flat output.

I wondered whether MMseqs2 uses the sequence names. What happens if the sequences are different but the names are the same?

milot-mirdita commented 4 years ago

We give each sequence an internal identifier and cluster based on these. The accession coming from FASTA headers is only printed out when formatting plain text results (i.e. with convertalis or createtsv). So it doesn't affect the clustering, but makes downstream processing more difficult.

I would recommend adding a suffix to each accession in the input FASTA, e.g. with awk:

awk '/^>/ { cnt++; $1=$1"_"cnt } { print; }' input.fasta > input_suffix.fasta
mmseqs easy-search input_suffix.fasta targetDB result.m8 tmp

Or, with the current git version of MMseqs2, you can pipe the awk output directly into MMseqs2:

awk '/^>/ { cnt++; $1=$1"_"cnt } { print; }' input.fasta | mmseqs easy-search stdin targetDB result.m8 tmp
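
To see what the renaming does before running it on real data, here is a minimal sketch on a made-up three-record FASTA. The temporary directory, the second accession (WP_000000001.1), the descriptions and the sequences are all hypothetical and only serve the illustration; the awk command is the one from the comment above. It counts duplicated accessions before and after adding the suffix.

```shell
# Hypothetical toy input: two records share the accession WP_012419350.1.
# WP_000000001.1 and all sequences below are invented for this example.
tmpdir=$(mktemp -d)
cat > "$tmpdir/input.fasta" <<'EOF'
>WP_012419350.1 hypothetical protein
MKTAYIAKQR
>WP_012419350.1 hypothetical protein
MKTAYIAKQR
>WP_000000001.1 another protein
MSLLTEVETY
EOF

# Number of distinct accessions that occur more than once before renaming.
dups_before=$(awk '/^>/ { print $1 }' "$tmpdir/input.fasta" | sort | uniq -d | wc -l | tr -d ' ')

# Append a running counter to the first header field (the accession).
awk '/^>/ { cnt++; $1=$1"_"cnt } { print; }' "$tmpdir/input.fasta" > "$tmpdir/input_suffix.fasta"

# Same check after renaming.
dups_after=$(awk '/^>/ { print $1 }' "$tmpdir/input_suffix.fasta" | sort | uniq -d | wc -l | tr -d ' ')

echo "duplicated accessions before: $dups_before, after: $dups_after"
grep '^>' "$tmpdir/input_suffix.fasta"
```

Because the counter increases with every header, each accession becomes unique even when the original names and sequences are identical. Note that assigning to $1 makes awk rebuild the header line with single spaces between fields, so tabs or repeated spaces in the description are collapsed.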