rvaser / spoa

SIMD partial order alignment tool/library
MIT License

Some questions regarding multiple alignment behavior #18

Closed paoloczi closed 5 years ago

paoloczi commented 5 years ago

This is not a description of an issue found with the repository - just some questions about the multiple alignment functionality:

rvaser commented 5 years ago
paoloczi commented 5 years ago

I am attaching a test program that shows the common prefixes/suffixes not being preserved, and also shows two dramatically different alignments depending on ordering. The program runs two separate global alignments in which the same 5 distinct input strings are used. Each of the 5 strings is used the same number of times in the two alignments, but in different orders. The 5 strings all have a common 10-base prefix and a common 10-base suffix. The prefix/suffix are not preserved in the alignment, and the alignments computed with the two orderings are very different. See comments in the C++ code for more details.
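
In outline, the program does roughly the following for each ordering (a simplified sketch only, assuming the v2-style createAlignmentEngine/createGraph API and placeholder scoring parameters; the attached testSpoa.cpp is the actual code):

    #include <iostream>
    #include <string>
    #include <vector>
    #include "spoa/spoa.hpp"

    // Build a partial order graph from one ordering of the inputs and print the MSA.
    void alignOrdering(const std::vector<std::string>& sequences)
    {
        // Intended to be global alignment (kNW); 5/-4/-8 are placeholder scores.
        auto engine = spoa::createAlignmentEngine(spoa::AlignmentType::kNW, 5, -4, -8);
        auto graph = spoa::createGraph();
        for (const auto& seq : sequences) {
            auto alignment = engine->align(seq, graph);
            graph->add_alignment(alignment, seq);
        }
        std::vector<std::string> msa;
        graph->generate_multiple_sequence_alignment(msa);
        for (const auto& row : msa) {
            std::cout << row << '\n';
        }
    }

The two orderings are simply two calls to this with the same multiset of strings permuted differently.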

I use the library, not the executable. Is the -d option equivalent to Graph::print_dot?

I see now that Graph::add_alignment has the weight as an optional argument. In my use case, many of the input strings occur multiple times, so I could call add_alignment once for each distinct string, specifying as the weight the number of times that string occurs. Do you expect that this will improve performance? Or is the compute cost dominated by the final pass over the graph in generate_multiple_sequence_alignment?
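
Roughly what I have in mind (a sketch only; the sequences and frequencies below are made up, and the engine/graph are set up as in the program above):

    // One add_alignment call per distinct string, weighted by how often it occurs.
    std::vector<std::pair<std::string, uint32_t>> distinct = {
        {"CACATACACACACAGACAC",   17},  // made-up (sequence, frequency) pairs
        {"CACATACACACACACAGACAC",  3},
    };
    for (const auto& [seq, frequency] : distinct) {
        auto alignment = engine->align(seq, graph);
        graph->add_alignment(alignment, seq, frequency);  // weight = frequency
    }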

Based on what I know now, my inclination would be to call add_alignment for each of my distinct sequences, in order of decreasing frequency (without regard to length), specifying the weight as the frequency. By presenting the most frequent sequences first (as in the first ordering used in the test program), my hope is that those sequences get a better chance to determine the final alignment.

testSpoa.cpp.gz

rvaser commented 5 years ago

Yes, option -d invokes Graph::print_dot.

If you know beforehand which sequences are equal (as in this example), you should just add them with weights equal to their frequencies, which will decrease the total execution time significantly (the time needed for consensus and MSA generation is negligible).

I noticed that you are using local alignment (kSW) in your code instead of global alignment (kNW); switching to global alignment should partially resolve your inquiry (see below). I used the latest version (v2.0.3), which produces slightly different alignments than the version you are using (in backtrack, matches/mismatches are checked first).
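
Concretely, the change is just the alignment type passed when constructing the engine, roughly like this (placeholder scores, adjust to your parameters):

    // Local alignment, as in the attached test program:
    // auto engine = spoa::createAlignmentEngine(spoa::AlignmentType::kSW, 5, -4, -8);
    // Global alignment, as your description intends:
    auto engine = spoa::createAlignmentEngine(spoa::AlignmentType::kNW, 5, -4, -8);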

Computed alignment for ordering 0:
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-----ACACACACAGACAC
-------------------CACA-T-----ACACACACAGACAC
-------------------CACA-T-----ACACACACAGACAC
-------------------CACA-T---------ACACAGACAC
-------------------CACA-T---------ACACAGACAC
-------------------CACA-T---------ACACAGACAC
CACATACACA-GACATACACACACTGACACACACACACAGACAC
CACATACACACG-C--ACACAC------------ACACAGACAC
Computed alignment for ordering 1:
CACATACACAGACATACACACACTGACACAC--ACACACAGACAC
------------------CACA-T---------ACACACAGACAC
------------------CACA-T---------ACACACAGACAC
CACATACAC--AC--GCACAC------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT--------------------------ACACACACAGACAC
CACAT------------------------------ACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT--------------------------ACACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT------------------------------ACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT--------------------------ACACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT------------------------------ACACAGACAC
paoloczi commented 5 years ago

Thank you so much for pointing out my error. So I will plan on doing the following:

This was a valuable discussion for me - thank you for your help and attention!

rvaser commented 5 years ago

No problem! :)

rlorigro commented 5 years ago

Hi rvaser,

I just wanted to clarify how the weights are used: is it true that the alignment itself will not be affected by the weights? From what I understand, weights are used for consensus and quality.

rvaser commented 5 years ago

Hello Ryan, yes, the weights are only used to generate a consensus from the graph.

Best regards, Robert

rlorigro commented 5 years ago

Ok thanks for the clarification