rvaser / spoa

SIMD partial order alignment tool/library
MIT License

Some questions regarding multiple alignment behavior #18

Closed paoloczi closed 5 years ago

paoloczi commented 5 years ago

This is not a description of an issue found with the repository - just some questions about the multiple alignment functionality:

rvaser commented 5 years ago
paoloczi commented 5 years ago

I am attaching a test program that shows the common prefixes/suffixes not being preserved, and also shows two dramatically different alignments depending on ordering. The program runs two separate global alignments in which the same 5 distinct input strings are used. Each of the 5 strings is used the same number of times in the two alignments, but in different orders. The 5 strings all have a common 10-base prefix and a common 10-base suffix. The prefix/suffix are not preserved in the alignment, and the alignments computed with the two orderings are very different. See comments in the C++ code for more details.
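
In outline, the program does roughly the following for each ordering (a simplified sketch only, assuming the v2-style createAlignmentEngine/createGraph API and placeholder scoring parameters; the attached testSpoa.cpp is the actual code):

    #include <iostream>
    #include <string>
    #include <vector>
    #include "spoa/spoa.hpp"

    // Build a partial order graph from one ordering of the inputs and print the MSA.
    void alignOrdering(const std::vector<std::string>& sequences)
    {
        // Intended to be global alignment (kNW); 5/-4/-8 are placeholder scores.
        auto engine = spoa::createAlignmentEngine(spoa::AlignmentType::kNW, 5, -4, -8);
        auto graph = spoa::createGraph();
        for (const auto& seq : sequences) {
            auto alignment = engine->align(seq, graph);
            graph->add_alignment(alignment, seq);
        }
        std::vector<std::string> msa;
        graph->generate_multiple_sequence_alignment(msa);
        for (const auto& row : msa) {
            std::cout << row << '\n';
        }
    }

The two orderings are simply two calls to this with the same multiset of strings permuted differently.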

I use the library, not the executable. Is the -d option equivalent to Graph::print_dot?

I see now that Graph::add_alignment has the weight as an optional argument. In my use case, many of the input strings occur multiple times, so I could call add_alignment once for each distinct string, specifying as the weight the number of times that string occurs. Do you expect that this will improve performance? Or is the compute cost dominated by the final pass over the graph in generate_multiple_sequence_alignment?
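
Roughly what I have in mind (a sketch only; the sequences and frequencies below are made up, and the engine/graph are set up as in the program above):

    // One add_alignment call per distinct string, weighted by how often it occurs.
    std::vector<std::pair<std::string, uint32_t>> distinct = {
        {"CACATACACACACAGACAC",   17},  // made-up (sequence, frequency) pairs
        {"CACATACACACACACAGACAC",  3},
    };
    for (const auto& [seq, frequency] : distinct) {
        auto alignment = engine->align(seq, graph);
        graph->add_alignment(alignment, seq, frequency);  // weight = frequency
    }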

Based on what I know now, my inclination would be to call add_alignment for each of my distinct sequences, in order of decreasing frequency (without regard to length), specifying the weight as the frequency. By presenting the most frequent sequences first (as in the first ordering used in the test program), my hope is that those sequences get a better chance to determine the final alignment.

testSpoa.cpp.gz

rvaser commented 5 years ago

Yes, option -d invokes Graph::print_dot.

If you know beforehand which sequences are equal (as in this example), you should just add them with weights equal to their frequencies, which will decrease the total execution time significantly (the time needed for consensus and MSA generation is negligible).

I noticed that you are using local alignment (kSW) in your code instead of global alignment (kNW); switching to global alignment should partially resolve your inquiry (see below). I used the latest version (v2.0.3), which produces slightly different alignments than the version you are using (in backtrack, matches/mismatches are checked first).
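
Concretely, the change is just the alignment type passed when constructing the engine, roughly like this (placeholder scores, adjust to your parameters):

    // Local alignment, as in the attached test program:
    // auto engine = spoa::createAlignmentEngine(spoa::AlignmentType::kSW, 5, -4, -8);
    // Global alignment, as your description intends:
    auto engine = spoa::createAlignmentEngine(spoa::AlignmentType::kNW, 5, -4, -8);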

Computed alignment for ordering 0:
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-------ACACACAGACAC
-------------------CACA-T-----ACACACACAGACAC
-------------------CACA-T-----ACACACACAGACAC
-------------------CACA-T-----ACACACACAGACAC
-------------------CACA-T---------ACACAGACAC
-------------------CACA-T---------ACACAGACAC
-------------------CACA-T---------ACACAGACAC
CACATACACA-GACATACACACACTGACACACACACACAGACAC
CACATACACACG-C--ACACAC------------ACACAGACAC
Computed alignment for ordering 1:
CACATACACAGACATACACACACTGACACAC--ACACACAGACAC
------------------CACA-T---------ACACACAGACAC
------------------CACA-T---------ACACACAGACAC
CACATACAC--AC--GCACAC------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT--------------------------ACACACACAGACAC
CACAT------------------------------ACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT--------------------------ACACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT------------------------------ACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT--------------------------ACACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT----------------------------ACACACAGACAC
CACAT------------------------------ACACAGACAC
paoloczi commented 5 years ago

Thank you so much for pointing out my error. So I will plan on doing the following:

This was a valuable discussion for me - thank you for your help and attention!

rvaser commented 5 years ago

No problem! :)

rlorigro commented 5 years ago

Hi rvaser,

I just wanted to clarify how the weights are used: is it true that the alignment itself will not be affected by the weights? From what I understand, weights are used for consensus and quality.

rvaser commented 5 years ago

Hello Ryan, yes, the weights are only used to generate a consensus from the graph.

Best regards, Robert

rlorigro commented 5 years ago

Ok thanks for the clarification