vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

vg crashes in GSSW when trying to do a long pinned alignment #3840

Open adamnovak opened 1 year ago

adamnovak commented 1 year ago

Grab the files in problem.tar.gz and run command.sh which does:

vg align -s "$(cat seq.txt)" --pinned --json cutnode.vg

seq.txt is 58122 bp of sequence, and the graph is a bit more graph than that. I'm not entirely sure the sequence is in the right orientation here (I think it's supposed to really have its end base pinned to the reverse strand of node 28436835 in the internal Giraffe operation I am trying to match), but regardless I've managed to trigger the bug:

error:[gssw] Stuck in main matrix!
vg: src/gssw.c:3572: gssw_alignment_trace_back_word: Assertion `0' failed.
Crash report for vg v1.45.0-77-ga26d0a579 "Alpicella"
Stack trace (most recent call last):
#13   Object "", at 0xffffffffffffffff, in 
#12   Object "/public/home/anovak/build/vg/bin/vg", at 0xb7621e, in _start
#11   Object "/usr/lib64/libc-2.17.so", at 0x7f75d30c7554, in __libc_start_main
#10   Object "/public/home/anovak/build/vg/bin/vg", at 0x129dae7, in vg::subcommand::Subcommand::operator()(int, char**) const
    | Source "src/subcommand/subcommand.cpp", line 75, in operator()
    |    74: const int Subcommand::operator()(int argc, char** argv) const {
    | >  75:     return main_function(argc, argv);
    |    76: }
      Source "/public/home/anovak/.local/include/c++/11.2.0/bits/std_function.h", line 560, in operator() [0x129dae7]
        557:       {
        558:    if (_M_empty())
        559:      __throw_bad_function_call();
      > 560:    return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
        561:       }
        562: 
        563: #if __cpp_rtti
#9    Object "/public/home/anovak/build/vg/bin/vg", at 0x12b45ee, in main_align(int, char**)
      Source "src/subcommand/align_main.cpp", line 238, in main_align [0x12b45ee]
        236:         alignment.set_sequence(seq);
        237:         if (pinned_alignment) {
      > 238:             aligner.align_pinned(alignment, dag, pin_left);
        239:         }
        240:         else if (banded_global) {
        241:             aligner.align_global_banded(alignment, dag, 1, true);
#8    Object "/public/home/anovak/build/vg/bin/vg", at 0x15338b9, in vg::Aligner::align_pinned(vg::Alignment&, handlegraph::HandleGraph const&, bool, bool, unsigned short) const
      Source "src/aligner.cpp", line 1408, in align_pinned [0x15338b9]
       1405:         }
       1406:     }
       1407:     else {
      >1408:         align_internal(alignment, nullptr, g, true, pin_left, 1, true);
       1409:     }
       1410: }
#7    Object "/public/home/anovak/build/vg/bin/vg", at 0x1530dc8, in vg::Aligner::align_internal(vg::Alignment&, std::vector<vg::Alignment, std::allocator<vg::Alignment> >*, handlegraph::HandleGraph const&, bool, bool, int, bool) const
      Source "src/aligner.cpp", line 1147, in align_internal [0x1530dc8]
       1144:                 }
       1145:                 
       1146:                 // trace back pinned alignment
      >1147:                 gms = gssw_graph_trace_back_pinned_multi (graph,
       1148:                                                           max_alt_alns,
       1149:                                                           true,
       1150:                                                           align_sequence->c_str(),
#6    Object "/public/home/anovak/build/vg/bin/vg", at 0x19db910, in gssw_graph_trace_back_pinned_multi
      Source "src/gssw.c", line 4819, in gssw_graph_trace_back_pinned_multi [0x19db910]
#5    Object "/public/home/anovak/build/vg/bin/vg", at 0x19d968f, in gssw_graph_trace_back_internal
      Source "src/gssw.c", line 3910, in gssw_graph_trace_back_internal [0x19d968f]
#4    Object "/public/home/anovak/build/vg/bin/vg", at 0x19d8ec0, in gssw_alignment_trace_back_word
      Source "src/gssw.c", line 3572, in gssw_alignment_trace_back_word [0x19d8ec0]
#3    Object "/usr/lib64/libc-2.17.so", at 0x7f75d30d4251, in __assert_fail
#2    Object "/usr/lib64/libc-2.17.so", at 0x7f75d30d41a5, in __assert_fail_base
#1    Object "/usr/lib64/libc-2.17.so", at 0x7f75d30dca77, in abort
#0    Object "/usr/lib64/libc-2.17.so", at 0x7f75d30db387, in raise

My first guess is that GSSW is overflowing because the alignment is about as long as a 16-bit int can count.

adamnovak commented 1 year ago

Yeah, GSSW has 8-bit and 16-bit scoring modes: https://github.com/vgteam/gssw/blame/14b4d43736bb606c3fc97c4724d1959d13550d37/src/gssw.h#L28

I think it just can't deal with sequences this long.

We need logic in Giraffe that will ball out on or fall back from the fallback alignment code path if the sequence is more than... 15 kbp? 30 kbp?

jeizenga commented 1 year ago

We could do something like in the BandedGlobalAligner wrapper in Aligner, where it chooses based on the maximum score. I'm not sure what the fallback would be though, we don't really have an alignment module that can do arbitrary POA with that large of integers.

adamnovak commented 1 year ago

I tried this problem with gfatools ed and I got like a 26000 edit distance in a couple of minutes. I should try the reverse orientation for the sequence to see if that is plausibly fast.