Open SilasK opened 4 years ago
We give each sequence an internal identifier and cluster based on these. The accession coming from FASTA headers is only printed out when formatting plain text results (i.e. with convertalis or createtsv). So it doesn't affect the clustering, but makes downstream processing more difficult.
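To see whether an input file is affected, one could count how often each accession occurs before running MMseqs2. The following is only a minimal sketch: it assumes the accession is the first whitespace-delimited token after `>`, the function name `duplicate_accessions` is hypothetical, and `WP_000000001.1` is a made-up accession used purely for illustration.

```python
from collections import Counter
import io


def duplicate_accessions(fasta_lines):
    """Return {accession: count} for accessions that occur more than once.

    Assumption: the accession is the first whitespace-delimited token
    after '>' on each header line.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()  # skip empty headers
    )
    return {acc: n for acc, n in counts.items() if n > 1}


# Toy input: the same RefSeq-style accession appears twice.
# WP_000000001.1 is a made-up accession for illustration.
example = io.StringIO(
    ">WP_012419350.1 protein A\nMKV\n"
    ">WP_012419350.1 protein A\nMKV\n"
    ">WP_000000001.1\nMKL\n"
)
print(duplicate_accessions(example))
```

For a real file, passing an open file handle (e.g. `open("input.fasta")`) instead of the `StringIO` object should work the same way, since the function only iterates over lines.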
I would recommend adding a suffix to each accession in the input FASTA, e.g. with awk:
awk '/^>/ { cnt++; $1=$1"_"cnt } { print; }' input.fasta > input_suffix.fasta
mmseqs easy-search input_suffix.fasta targetDB result.m8 tmp
Or, with the current git version of MMseqs2, you can pipe the awk output directly into MMseqs2:
awk '/^>/ { cnt++; $1=$1"_"cnt } { print; }' input.fasta | mmseqs easy-search stdin targetDB result.m8 tmp
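If awk is not at hand, the same renaming could be sketched in Python. This is an illustrative equivalent of the awk one-liner, not part of MMseqs2; the function name `add_suffix` is hypothetical. As in awk, the counter is appended to the first whitespace-delimited field of each header line (which includes the leading `>`), and the rest of the header is re-joined with a single space.

```python
def add_suffix(fasta_lines):
    """Yield FASTA lines with a running counter appended to each accession.

    Mirrors the awk one-liner: on header lines, the first field becomes
    '<field>_<n>', where n counts the headers seen so far.
    """
    count = 0
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            count += 1
            parts = line.split(None, 1)   # [first field, rest of header]
            parts[0] = f"{parts[0]}_{count}"
            yield " ".join(parts)
        else:
            yield line                    # sequence lines pass through unchanged


# Toy input with two identically named entries.
records = [">WP_012419350.1 protein A\n", "MKV\n", ">WP_012419350.1\n", "MKV\n"]
for out_line in add_suffix(records):
    print(out_line)
```

Writing the yielded lines to a file such as `input_suffix.fasta` gives an input that can be passed to `mmseqs easy-search` as in the command above.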
Expected Behavior
I'm using gene predictions from RefSeq. They unified the gene names, so that the same gene in different genomes has the same name, e.g. WP_012419350.1
Now I can pass this through linclust, createtsv and result2repseq without problems. The two identical proteins end up in the same cluster, and some become cluster representatives, which then puts them multiple times in the result2flat output. I wondered whether mmseqs uses the sequence names: what happens if the sequences are different but the names are the same?