soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

alignall #378

Open acpguedes opened 3 years ago

acpguedes commented 3 years ago

Hi,

This is more of a question than an issue.

I was playing around with mmseqs alignall to see how it works. I think it may still be in development.

I took a cluster I had and ran alignall, followed by createtsv and result2flat.

# creating links
mmseqs lndb /home/acpguedes/projects/sig_trans/work/SBP_5/20190730/test2/sbps.DB sbp.DB  
mmseqs lndb /home/acpguedes/projects/sig_trans/work/SBP_5/20190730/test2/sbps.CLU sbp.CLU  
# run all-vs-all within clusters
mmseqs alignall sbp.DB sbp.CLU sbp.MSA_ALL --threads 30 -c 0.8 -a 1
# extracting results
mmseqs createtsv sbp.DB sbp.DB sbp.MSA_ALL sbp.MSA_ALL.tsv
mmseqs result2flat sbp.DB sbp.DB sbp.MSA_ALL sbp_MSA_ALL.flat

1- I couldn't find a way to extract the alignments in FASTA format.
2- The TSV has many different alignments for the same pair.
3- The FLAT file has a FASTA header with the representative sequence, followed by the same kind of table I got in the TSV.

It seems that it doesn't run a global alignment such as Needleman-Wunsch. Is there a way to do this all-vs-all with the Needleman-Wunsch algorithm? Also, is there a way to extract the sequences? I know I could process them using the CIGAR string reported in the TSV. And the last question: what is the 3rd column in the TSV?

"query" "target" "?" "score" "identity" "evalue" "qstart" "qend" "qlen" "tstart" "tend" "tlen" "cigaraln"

WP_013559137.1/75-347   WP_013559137.1/75-347   142684  552 1.00    3.074E-178  0   272 273 0   272 273 273M
WP_100884676.1/127-401  WP_100884676.1/127-401  176224  556 1.00    1.444E-179  0   274 275 0   274 275 275M
WP_100884676.1/127-401  WP_100884676.1/127-401  25526   450 0.816   6.976E-143  0   260 275 0   260 261 261M
WP_100884676.1/127-401  WP_100884676.1/127-401  29929   445 0.816   2.233E-141  0   260 275 0   260 261 261M
WP_100884676.1/127-401  WP_100884676.1/127-401  139077  440 0.804   1.839E-139  1   261 275 0   260 261 261M
WP_100884676.1/127-401  WP_100884676.1/127-401  102797  439 0.800   3.454E-139  1   260 275 0   259 260 260M
WP_100884676.1/127-401  WP_100884676.1/127-401  76310   214 0.405   1.851E-61   15  273 275 6   259 260 159M5I95M
WP_100884676.1/127-401  WP_100884676.1/127-401  117250  209 0.412   1.487E-59   12  273 275 3   259 260 162M5I95M
WP_100884676.1/127-401  WP_100884676.1/127-401  131144  212 0.400   1.213E-60   9   268 275 4   258 260 165M5I90M
WP_100884676.1/127-401  WP_100884676.1/127-401  9506    176 0.335   2.798E-48   10  274 275 7   266 267 164M5I96M
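The CIGAR strings in the rows above can be expanded into gapped sequence pairs outside of MMseqs2. The following is a minimal Python sketch, not part of MMseqs2: the helper names (`parse_cigar`, `consumed`, `aligned_pair`) are hypothetical, and it assumes 0-based start positions, that `I` consumes a query residue (gap in the target), and that `D` consumes a target residue (gap in the query). That reading is at least consistent with the table: `159M5I95M` covers 259 query positions (15 to 273) and 254 target positions (6 to 259).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MID])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '159M5I95M' into (length, op) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject anything the pattern did not fully consume.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"unsupported CIGAR: {cigar!r}")
    return ops


def consumed(cigar):
    """Return (query_positions, target_positions) covered by the CIGAR.

    Assumption: 'M' and 'I' advance the query, 'M' and 'D' advance the target.
    """
    ops = parse_cigar(cigar)
    q_len = sum(n for n, op in ops if op in "MI")
    t_len = sum(n for n, op in ops if op in "MD")
    return q_len, t_len


def aligned_pair(query, target, qstart, tstart, cigar):
    """Return the gapped (query, target) strings described by the CIGAR.

    qstart and tstart are assumed to be 0-based, as in the TSV columns.
    """
    q_out, t_out = [], []
    qpos, tpos = qstart, tstart
    for n, op in parse_cigar(cigar):
        if op == "M":  # aligned column: both sequences advance
            q_out.append(query[qpos:qpos + n])
            t_out.append(target[tpos:tpos + n])
            qpos += n
            tpos += n
        elif op == "I":  # extra query residues, gap in target
            q_out.append(query[qpos:qpos + n])
            t_out.append("-" * n)
            qpos += n
        else:  # "D": extra target residues, gap in query
            q_out.append("-" * n)
            t_out.append(target[tpos:tpos + n])
            tpos += n
    return "".join(q_out), "".join(t_out)
```

With the two sequences fetched from the sequence DB by name, `aligned_pair(query_seq, target_seq, qstart, tstart, cigar)` gives a pair of equal-length strings that can be written out as a two-record aligned FASTA.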

Finally, these are very interesting and useful new features, thanks. Greetings

acpguedes commented 3 years ago

Actually, the 3rd column seems to be the index of the second column in the DB:

>WP_013559137.1/75-347
142684  142684  552     1.00    3.074E-178      0       272     273     0       272     273     273M
>WP_100884676.1/127-401
176224  176224  556     1.00    1.444E-179      0       274     275     0       274     275     275M
176224  25526   450     0.816   6.976E-143      0       260     275     0       260     261     261M
176224  29929   445     0.816   2.233E-141      0       260     275     0       260     261     261M
176224  139077  440     0.804   1.839E-139      1       261     275     0       260     261     261M
176224  102797  439     0.800   3.454E-139      1       260     275     0       259     260     260M
176224  76310   214     0.405   1.851E-61       15      273     275     6       259     260     159M5I95M
176224  117250  209     0.412   1.487E-59       12      273     275     3       259     260     162M5I95M
176224  131144  212     0.400   1.213E-60       9       268     275     4       258     260     165M5I90M

milot-mirdita commented 3 years ago

alignall doesn't actually create a real alignment database: its first two columns are 1) query key, 2) target key, instead of just 1) target key. convertalis doesn't really know what to do with it. We haven't needed this functionality yet, so it's kind of unfinished.
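Until the tools handle this layout, rows with both a leading query key and a target key can be read directly from the result2flat output. The following is a rough Python sketch under assumptions: the field names are taken from the column header quoted earlier in this thread, the 12-column layout is inferred from the FLAT excerpt above, and `AlignAllHit` and `parse_flat` are hypothetical names, not MMseqs2 API.

```python
from dataclasses import dataclass


@dataclass
class AlignAllHit:
    query_key: int   # extra leading column written by alignall
    target_key: int
    score: int
    identity: float
    evalue: float
    qstart: int
    qend: int
    qlen: int
    tstart: int
    tend: int
    tlen: int
    cigar: str


def parse_flat(lines):
    """Yield (header, hit) pairs from result2flat-style text.

    Lines starting with '>' set the current header; every other
    non-empty line is expected to hold 12 whitespace-separated fields.
    """
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            continue
        f = line.split()
        if len(f) != 12:
            raise ValueError(f"expected 12 fields, got {len(f)}: {line!r}")
        yield header, AlignAllHit(
            int(f[0]), int(f[1]), int(f[2]), float(f[3]), float(f[4]),
            *map(int, f[5:11]), f[11],
        )
```

Grouping the yielded hits by `(query_key, target_key)` makes it easy to see which member pairs inside a cluster were aligned, which is the part the generic TSV output obscures.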

acpguedes commented 3 years ago

What is the overall purpose of this functionality? Perhaps I can help develop it in my free time.

milot-mirdita commented 3 years ago

Inside one cluster it produces alignment results for all possible pairs.

Ah, I forgot the Needleman-Wunsch part of your question. We don't use global alignments basically anywhere. For proteins everything is Smith-Waterman, and for nucleotides the banded alignment algorithm is a global one, but it is used in a way that computes local alignments.
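To illustrate the global versus local distinction mentioned above, here is a small textbook-style Python sketch. It is not MMseqs2's implementation: the function name `align_score`, the match/mismatch scores, and the linear gap penalty are all illustrative assumptions. The only difference between the two modes is that the local variant clamps cell scores at zero and reports the best cell anywhere in the matrix.

```python
def align_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score under a linear gap penalty.

    local=False: Needleman-Wunsch, both sequences aligned end to end.
    local=True:  Smith-Waterman, best-scoring pair of substrings.
    """
    # First DP row: leading gaps are free in local mode, penalized in global.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i, ca in enumerate(a, start=1):
        cur = [0 if local else i * gap]
        for j, cb in enumerate(b, start=1):
            s = max(
                prev[j - 1] + (match if ca == cb else mismatch),  # (mis)match
                prev[j] + gap,       # gap in b
                cur[j - 1] + gap,    # gap in a
            )
            if local:
                s = max(s, 0)        # a local alignment may restart anywhere
                best = max(best, s)
            cur.append(s)
        prev = cur
    return best if local else prev[-1]
```

For two sequences that share only a short region, for example `align_score("TTTTACGT", "ACGTCCCC", local=True)` versus the same call with `local=False`, the local score reflects just the shared `ACGT` stretch, while the global score is dragged down by the unrelated flanks.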

If you are interested in working on this: you should replace the currently used DB type Parameters::DBTYPE_GENERIC_DB in alignall with a new one. Then, in convertalis, check whether it's a normal alignment database or this new database type, and introduce a special case to deal with the presence of both a query and a target key.

acpguedes commented 3 years ago

Ok, I'll play with it. I'll keep you updated on any progress. Thanks