rvaser / spoa

SIMD partial order alignment tool/library
MIT License
158 stars 32 forks source link

Unoptimal consensus #59

Closed Dmitry-Antipov closed 3 years ago

Dmitry-Antipov commented 3 years ago

Hi, thank you for the useful tool.

I've noticed that sometimes consensus (with global alignment mode) tends to be the longest sequence from the set, even when it is clear that the "right answer" is not. For example, for the test below I receive TTATAGTATATATTATATAATATATAAATATAATATACATTAAT as an answer consensus sequence, regardless of scoring functions (tried default, edit distance, and some others - i.e. -e -1 -g -8 -l 1 -m 10 -n -8 ) or reads order. MSA itself looks OK. Moving to local alignment did not help also.

Do you have some recommendations how to overcome this issue?

Seems that this issue mostly happens when there is an insertion in the beginning of one of the sequences, i.e. in test below there is extra T on first position. Possibly in such case "correct" paths through POA graph are not scored?

>1
TATAGTATATATTATATAATATATAATATAATATACATTAAT
>2
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>3
TATAGTATATATTATATAATATATAAATAAATATACATTAAT
>4
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>5
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>6
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>7
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>8
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>9
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>10
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>11
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>12
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>13
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>14
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>15
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>16
TTATAGTATATATTATATAATATATAAATATAATATACATTAAT
>17
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>18
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
>19
TATAGTATATATTATATAATATATAAATATAATATACATTAAT
rvaser commented 3 years ago

Hello, the problem with the consensus algorithm (heaviest path) is that it can pick up long insertions on both graph ends if there is no branching. In Racon, we check the coverage of each base in the consensus sequence, and trim away bases on both ends if their coverage is low. You can get coverage from MSA if you are using spoa through command line, or if you are using it as a library inside your code you can pass a vector inside https://github.com/rvaser/spoa/blob/master/include/spoa/graph.hpp#L169.

Best regards, Robert

Dmitry-Antipov commented 3 years ago

Great, thank you.