Could you add the ability to give a negative "offset" to each vertex in the graph?

wtwhite commented 6 years ago

Hi,

Thanks for making available this fast and easy-to-use POA library. I'm using it to find consensus alignments of noisy PacBio reads, which contain frequent random insertion (and deletion) errors, and finding that the resulting consensus sequence often contains very weakly supported bases resulting from these noise insertions, which I would prefer to get rid of -- if I understand correctly, this is because generate_consensus() looks for the heaviest-weight path in the graph, and all sequence bases contribute a positive weight, so a base will only be excluded if it appears to conflict with bases in one or more other sequences. Ironically, the very simplistic consensus caller I was originally using, which simply performs a heuristic MSA and then reports only bases from columns that have a strict majority, currently seems to give me better consensus sequences on 2x-10x coverage data because it does a better job of getting rid of these insertions.

I think that what I want could be accomplished by adding a negative offset to each graph vertex -- this would effectively create a "barrier to entry" that only lets the heaviest-path algorithm consider a base for the consensus sequence if it appears in sufficiently many input sequences. Do you agree that this would be a helpful approach? And would it be possible to implement something like this? I would certainly appreciate it!

rvaser commented 6 years ago

Hello, you can achieve this with spoa as well by calling generate_consensus(std::vector<uint32_t>& dst) which will fill the input vector with base coverages. You can simply ignore those which have less coverage than your desired threshold. Alternatively, you can call generate_multiple_sequence_alignment(...) which will return each row of the MSA as a string and you can call consensus on your own as you described. I am not sure how you are using spoa (binary or library) but either of the approaches described above should help you with your enquiry. Please let me know of the outcome :)

Adding negative base weights might be tricky for the graph connectivity. I think the above approaches are a cleaner solution.

Best regards, Robert

wtwhite commented 6 years ago

Thanks for the quick and helpful response!

I'm using the library. I'll certainly do as you suggest, and I think this will improve things a lot, but if I understand correctly, calling generate_consensus() to build a consensus using the heaviest-path algorithm and then deleting a subset of bases based on coverage may in general give an overall lower-quality path than choosing the consensus to be the heaviest path in a graph which has had its vertices downweighted as I suggested (since a different, heavier path may have been chosen otherwise if the heaviest-path algorithm would have "known" that some of these bases would later be removed). Is that right?

In any case, this is a big improvement, so thanks again :)

rvaser commented 6 years ago

It is really hard to tell if a different path would be generated, because you would assign negative starting weights for each vertex (I assume you are not using base qualities) and the graph traversal always picks the best local edge. The whole path up to a point is considered only if the local edges have equal weights. This local choice can terminate the graph prematurely so a branch completion procedure is invoked which insists on picking the best path up to a point instead of the best local edge.

If you are using base qualities they become weights and vertices would need to have another weight which would denote their coverage (I am not sure how to incorporate coverage threshold withing base qualities, but it might be possible).

If this approach does not improve your situation, you should try with MSA and calling consensus by yourself :)

rvaser / spoa

Could you add the ability to give a negative "offset" to each vertex in the graph? #11