Unexpected starting bases for consensus sequence

rpadmanabhan commented 5 years ago

Hi, Thank you for developing this great library ! I am using spoa as a library to generate a consensus sequence from similar reads albeit with some errors. In the example below why does spoa skip the first two bases, 'G' and 'C' in the consensus generation ? Is it because the heaviest bundle traversal traceback stops at the 254 -G node ? Is it possible to change this behavior ? Sorry if this is an obvious question, I am not proficient at C++ but I did attempt to read the code.

I used these parameters : 2(semi global) 5(match) -4(mismatch) -10(gap open) -8(gap extend) These were the sequences : example_sequences.txt.txt

This is a snippet of the graph output : graph_first_few_bases

Thank You

rvaser commented 5 years ago

Hello, the heaviest bundle traversal picks the best edge locally without checking the total score up to current node. The reason is to fend off long insertions, i.e. if there is a long insert present in only one read it could be picked over a smaller path that is present in all other reads (this is described in: https://doi.org/10.1093/bioinformatics/btg109). You can implement a different type of algorithm directly on the library if you like or print the multiple sequence alignment, parse it and generate a consensus you prefer. If you need any help doing so, please let me know.

Best regards, Robert

rpadmanabhan commented 5 years ago

Thanks for the reply. I was able to parse the MSA strings obtained from graph->generate_multiple_sequence_alignment() to get the consensus I was looking for.

rvaser / spoa

Unexpected starting bases for consensus sequence #21