vgteam / GetBlunted

For bluntifying overlapped GFAs
12 stars 0 forks source link

Terminate called when aligning overlaps #35

Closed hugocarmaga closed 2 years ago

hugocarmaga commented 2 years ago

Hi!

So, I'm trying to use your tool in my analysis, but for the graphs I'm using, I get this error with most of them:

[get_blunted : 0.0 s elapsed] Reading GFA...
[get_blunted : 0.0 s elapsed] Computing adjacency components...
[get_blunted : 0.0 s elapsed] Total adjacency components: 4
[get_blunted : 0.0 s elapsed] Computing biclique covers...
[get_blunted : 0.0 s elapsed] Duplicating node termini...
[get_blunted : 0.0 s elapsed] Harmonizing biclique edge orientations...
[get_blunted : 0.0 s elapsed] Aligning overlaps...
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::at: __n (which is 4) >= this->size() (which is 4)
Aborted

I've used graphs generated with both Hifiasm and MBG, and the error persists. To try a diagnose for the problem, I've generated two test case graphs.

The first one was like this:

S       node1   AAAAAACCCAAAAAAAAAAAAAAAAAAAAA
S       node2   TTTTTTGGGGGGGGGGGGGGGGGGGGGGGG
L       node1   -       node2   +       6M

So, with this one, the bluntification was successful and the output was according to what was expected. But then I added a third node and made this graph:

S       node1   AAAAAACCCAAAAAAAAAAAAAAAAAAAAA
S       node2   TTTTTTGGGGGGGGGGGGGGGGGGGGGGGG
S       node3   AAAA
L       node1   -       node2   +       6M
L       node1   -       node3   +       4M
L       node2   -       node3   +       4M

With this one, I got the aforementioned error. Based on this, I think the error might be associated with overlaps the same size or bigger than the nodes. Are you aware of this? Is that what's causing the error? Do you have any solution for this?

Thanks! Hugo

rlorigro commented 2 years ago

Hi,

Thank you for the minimal reproducing example. I will take a look today. The case where the node is fully overlapped is intended to be handled, but the case where the overlap is longer than the node is not. If the overlap is longer than the node then it describes an impossible alignment, and it indicates the GFA is wrong.

The author of Hifiasm chose to use "M" operations arbitrarily to represent the length of the longest read involved in the overlap. So there is no way to know how much overlapping sequence there actually is, and for that reason GetBlunted is not suitable for Hifiasm GFAs unfortunately. I have discussed this with the author and he indicated that he would fix it in v0.14, but that was nearly a year ago, so you may want to open an issue on their github page to bring it up again. Additionally, their overlaps are many kilobases in length, which would make realignment using POA very slow.

hugocarmaga commented 2 years ago

Hi,

Thanks for the answer! Yes, I totally agree with you that large overlaps are not the ideal for this purpose, and of course the overlaps bigger than the nodes are not supposed to be present (I'm adding a preprocessing step in my pipeline to get rid of them if they're present). But my main concern was with the overlaps the same length as nodes, because those shouldn't be discarded from the graph, and therefore they'll be present for the bluntification step.

But if you intend to handle it, I guess that it would solve my problem, so I'll just wait for it.

Thank you, Hugo

hugocarmaga commented 2 years ago

Hi,

I've been looking at this again, and I'm not sure the problem is where we pointed to, the overlapping length being equal or greater than the node length. I've tried running this with a different graph without edges under the mentioned conditions (all the overlaps are smaller than the node lengths), and I still got the same error:

[get_blunted : 0.0 s elapsed] Reading GFA...
[get_blunted : 0.0 s elapsed] Computing adjacency components...
[get_blunted : 0.0 s elapsed] Total adjacency components: 771
[get_blunted : 0.0 s elapsed] Computing biclique covers...
[get_blunted : 0.0 s elapsed] Duplicating node termini...
[get_blunted : 0.0 s elapsed] Harmonizing biclique edge orientations...
[get_blunted : 0.0 s elapsed] Aligning overlaps...
terminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::at: __n (which is 18446744073709551615) >= this->size() (which is 17320)
Aborted

Also, I tried changing the mini example I've showed before, altering the node3 overlaps from "4M" to "5M". With this alteration, the tool was able to create a blunted graph where node3 is not connected to any of the other nodes, after its overlaps were skipped, as pointed by:

[get_blunted : 0.0 s elapsed] Reading GFA...
WARNING: skipping overlap for which sum of cigar operations is > SINK node length: node1->node3

WARNING: skipping overlap for which sum of cigar operations is > SINK node length: node2->node3

WARNING: skipping overlap for which sum of cigar operations is > SOURCE node length: node3->node1

So, it seems that overlaps bigger than one of the involved nodes are just being discarded (as they should) and the node is not included in the bluntified version. Only overlaps the same size as one of the nodes are having the error, but they're not alone, apparently, since this other graph with all the overlaps smaller than the nodes is also getting the same error.

I don't exactly have an idea on what's causing the error, I just wanted to add this information so you could be aware of it, if you weren't already.

Thanks, Hugo

rlorigro commented 2 years ago

Hi Hugo, you are right that the bug is not exactly an issue with the full overlap itself, rather that it was causing an error in the cigar string iterator to be manifested. I recently added a step that checks if the cigar M operations are = or X and that check was iterating the indexes of the sequence incorrectly, so this part of the code was reaching past the end of the node for the small node3.

I have updated that part of the code, but I am working on fixing another one of my regression tests, so I apologize for the delay. I will update the release when it is fixed.

hugocarmaga commented 2 years ago

Thanks Ryan!

rlorigro commented 2 years ago

Hi Hugo, can you please try out the latest release and see how that works for you? On my end the test output looks reasonable. I added an even simpler version of your test case to my set.

hugocarmaga commented 2 years ago

Hi Ryan, Yes, it works now! Thank you very much!

rlorigro commented 2 years ago

No worries, let me know if anything else comes up.

rlorigro commented 2 years ago

In other good news, the author of Hifiasm says the cigar fix should be coming in the next minor release.