refresh-bio / FAMSA

Algorithm for ultra-scale multiple sequence alignments (3M protein sequences in 5 minutes and 24 GB of RAM)
GNU General Public License v3.0

Sequence ordering in -dist_export #17

maressyl closed this issue 4 years ago

maressyl commented 4 years ago

Hi,

I find FAMSA very promising; however, I am a bit frustrated with the distance matrix returned by -dist_export in the latest release. How are sequences ordered in the output? Building a tree in R from this matrix, I can easily identify identical sequences, but the rows are clearly not sorted as in the provided FASTA file, since identical sequences are now next to each other.

Is there a way to presort my FASTA file to match the matrix ordering, or to add labels to the distance matrix?

Best regards, Sylvain

PS: Here is the command and input file I use:

famsa-1.3.2-linux-static -gt upgma -dist_export out.dist -gt_export out.newick test.fa out.fa

test.zip

agudys commented 4 years ago

Hello Sylvain,

You have found a really serious bug. Indeed, the matrix rows are in a different order than in the FASTA file. Originally we needed only the distribution of distances, so we did not care about the ordering, and we forgot about it when pushing the changes. I'll fix this urgently. Thanks for reporting, and sorry for the time you lost tracking down the bug.

Regards, Adam

agudys commented 4 years ago

Hi, I added the bugfix to the experimental branch of the repository (1.5.12 release). Now the matrix rows are named after the sequences and are in the same order as in the FASTA file. By the way, the command line for producing the matrix has been simplified in this release: when the -dist_export switch is used, the guide tree and alignment are not produced, and the output file stores the matrix instead. In your case it would be:

famsa-1.5.12-linux-static -dist_export test.fa matrix.out

Please let me know if it works.
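One way to check the fix is to compare the row labels of the exported matrix against the FASTA headers. The sketch below is only an illustration: it assumes the matrix is comma-separated text with the sequence name in the first field of each row, and that the name equals the full FASTA header line without the leading ">". Neither assumption is confirmed in this thread, and the helper names and sample data are hypothetical.

```python
import csv
import io


def fasta_ids(fasta_text):
    """Return sequence names (header lines without '>') in file order."""
    return [line[1:].strip()
            for line in fasta_text.splitlines()
            if line.startswith(">")]


def matrix_labels(matrix_text, delimiter=","):
    """Return the first field of every non-empty row.

    Assumption: the first field is the row label (sequence name).
    Works for both triangular and square layouts under that assumption.
    """
    reader = csv.reader(io.StringIO(matrix_text), delimiter=delimiter)
    return [row[0].strip() for row in reader if row and row[0].strip()]


def first_mismatch(ids, labels):
    """Return the index where the two orders first differ, or None if equal."""
    for i, (a, b) in enumerate(zip(ids, labels)):
        if a != b:
            return i
    if len(ids) != len(labels):
        return min(len(ids), len(labels))
    return None


# Hypothetical sample data standing in for test.fa and matrix.out.
sample_fasta = ">seqA\nMKV\n>seqB\nMKI\n>seqC\nMRV\n"
sample_matrix = "seqA\nseqB,0.1\nseqC,0.2,0.3\n"

print(first_mismatch(fasta_ids(sample_fasta), matrix_labels(sample_matrix)))
```

With real files, the two strings would come from reading test.fa and matrix.out; a result of None means the rows follow the FASTA order, and any integer is the position of the first row that does not.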

Regards, Adam

maressyl commented 4 years ago

Dear Adam,

Thanks a lot for your answer and this quick fix. The problem seems to be solved, at least on the example data I provided. I will continue to play a bit with FAMSA and let you know if I encounter another problem.

Best regards, Sylvain

agudys commented 4 years ago

The bugfix has been incorporated into the master branch.