
vg giraffe can crash in the Banded Global Aligner #4177

Open adamnovak opened 7 months ago

adamnovak commented 7 months ago

Konstantinos managed to crash my Long Read Giraffe branch (commit 2c6d15569), and the resulting stack trace didn't include the read name for the thread (maybe we can't get it for a signal?).

━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.51.0-599-g2c6d15569 "Quellenhof"
Caught signal 11 raised at address 0xf8d8d4; tracing with backward-cpp
Stack trace (most recent call last) in thread 2330385:
#17   Object "", at 0xffffffffffffffff, in
#16   Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x22370ff, in __clone3
#15   Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x21905ba, in start_thread
#14   Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x2132e9d, in gomp_thread_start
#13   Object "/private/home/anovak/workspace2/vg/bin/vg", at 0xf5d50c, in unsigned long vg::io::unpaired_for_each_parallel<vg::Alignment>(std::function<bool (vg::Alignment&)>, std::function<void (vg::Alignment&)>, unsigned long) [clone ._omp_fn.0]
#12   Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x212e9d3, in GOMP_task
#11   Object "/private/home/anovak/workspace2/vg/bin/vg", at 0xf5d193, in unsigned long vg::io::unpaired_for_each_parallel<vg::Alignment>(std::function<bool (vg::Alignment&)>, std::function<void (vg::Alignment&)>, unsigned long) [clone ._omp_fn.1]
#10   Object "/private/home/anovak/workspace2/vg/bin/vg", at 0xdae022, in std::_Function_handler<void (vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&)#8}>::_M_invoke(std::_Any_data const&, vg::Alignment&)
#9    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x121cbfd, in vg::MinimizerMapper::map(vg::Alignment&, vg::io::AlignmentEmitter&)
#8    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x124134f, in vg::MinimizerMapper::map_from_chains(vg::Alignment&)
#7    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x124ae3a, in vg::MinimizerMapper::map_from_chains(vg::Alignment&)::{lambda(unsigned long)#8}::operator()(unsigned long) const
#6    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x1246b4d, in vg::MinimizerMapper::find_chain_alignment(vg::Alignment const&, vg::VectorView<vg::algorithms::Anchor> const&, std::vector<unsigned long, std::allocator<unsigned long> > const&) const
#5    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x12464c4, in vg::MinimizerMapper::align_sequence_between(std::tuple<long long, bool, unsigned long> const&, std::tuple<long long, bool, unsigned long> const&, unsigned long, unsigned long, handlegraph::HandleGraph const*, vg::GSSWAligner const*, vg::Alignment&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, unsigned long, std::function<unsigned long (vg::Alignment const&, handlegraph::HandleGraph const&)> const&)
#4    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x12455c3, in vg::MinimizerMapper::with_dagified_local_graph(std::tuple<long long, bool, unsigned long> const&, std::tuple<long long, bool, unsigned long> const&, unsigned long, handlegraph::HandleGraph const&, std::function<void (handlegraph::DeletableHandleGraph&, std::function<std::pair<long long, bool> (handlegraph::handle_t const&)> const&)> const&)
#3    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x124bb66, in vg::MinimizerMapper::align_sequence_between(std::tuple<long long, bool, unsigned long> const&, std::tuple<long long, bool, unsigned long> const&, unsigned long, unsigned long, handlegraph::HandleGraph const*, vg::GSSWAligner const*, vg::Alignment&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*, unsigned long, std::function<unsigned long (vg::Alignment const&, handlegraph::HandleGraph const&)> const&)::{lambda(handlegraph::DeletableHandleGraph&, std::function<std::pair<long long, bool> (handlegraph::handle_t const&)> const&)#1}::operator()(handlegraph::DeletableHandleGraph&, std::function<std::pair<long long, bool> (handlegraph::handle_t const&)> const&) const
#2    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0xf4183d, in vg::Aligner::align_global_banded(vg::Alignment&, handlegraph::HandleGraph const&, int, bool) const
#1    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0xfb2e1b, in vg::BandedGlobalAligner<short>::align(signed char*, signed char*, signed char, signed char)
#0    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0xf8d8d4, in vg::BandedGlobalAligner<short>::BAMatrix::fill_matrix(handlegraph::HandleGraph const&, signed char*, signed char*, signed char, signed char, bool, short)
ERROR: Signal 11 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!
━━━━━━━━━━━━━━━━━━━━

We need to collect the data to reproduce this and ID the offending read, and figure out what is going wrong inside the Banded Global Aligner. I think it is running with some variant of:

vg giraffe -t32 --parameter-preset lr --progress -Z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.gbz -d /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.dist -m /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.withzip.min -z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.zipcodes -f whatever.fastq >whatever.gam
kokyriakidis commented 7 months ago

This is the command to reproduce the issue:

/private/home/anovak/workspace/vg/bin/vg giraffe \
-t32 \
--parameter-preset lr \
--progress \
-Z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.gbz \
-d /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.dist \
-m /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.withzip.min \
-z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.zipcodes \
-f /private/groups/migalab/kkyriaki/rotation/batman/data/HG002_R1041_Duplex/1_3_23_R1041_Duplex_HG002_3_Dorado_v0.4.1_400bps_sup_v4.1.0_simplex.fastq.gz \
> /private/groups/patenlab/kkyriaki/ec_giraffe/3_simplex.gam
adamnovak commented 7 months ago

I haven't been able to reproduce this issue with the debug logic added to dump the offending read names (in commit d35bbfd6cbc40dfcfdb5db46aa8c70905be040b5). This is what I get at the tail end of my log, on emerald.prism.

Mapped 2512113 reads across 32 threads in 56143.2 seconds with 0 additional single-threaded seconds.
Mapping speed: 1.39827 reads per second per thread
Used 1.70112e+06 CPU-seconds (including output).
Achieved 1.47674 reads per CPU-second (including output)
Memory footprint: 160.934 GB
        Command being timed: "vg giraffe --log-reads -t32 --parameter-preset lr --progress -Z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.gbz -d /private/groups/patenlab/anovak/projects/hprc/lr-
giraffe/graphs/hprc-v1.1-mc-chm13.d9.dist -m /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.withzip.min -z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm
13.d9.k31.w50.W.zipcodes -f /private/groups/migalab/kkyriaki/rotation/batman/data/HG002_
R1041_Duplex/1_3_23_R1041_Duplex_HG002_3_Dorado_v0.4.1_400bps_sup_v4.1.0_simplex.fastq.gz"
        User time (seconds): 688262.07
        System time (seconds): 1013041.16
        Percent of CPU this job got: 3023%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 15:37:42
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 168751208
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 677
        Minor (reclaiming a frame) page faults: 53628081121
        Voluntary context switches: 126311
        Involuntary context switches: 24701861
        Swaps: 0
        File system inputs: 64681864
        File system outputs: 432520
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

I did manage to get the program to be killed by the OOM-killer at one point, on crimson.prism which has less memory. In theory the Banded Global Aligner code implicated by your stack trace should handle a failed allocation and abort the program, but maybe we've missed a check and the program can sometimes segfault when memory allocation attempts return nullptr? @kokyriakidis which machine did you run your failing run on? Can you reproduce the problem with my new build of vg, which should log the offending read names when it crashes?
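Purely for illustration, this is the shape of failure being speculated about; the names and sizes here are made up and this is not the actual BandedGlobalAligner code:

#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    // Deliberately impossible request so malloc is guaranteed to fail.
    std::size_t band_cells = 1ULL << 62;
    short* matrix = static_cast<short*>(malloc(band_cells * sizeof(short)));
    // If this check is missing, the write below dereferences nullptr and the
    // process dies with SIGSEGV (signal 11) instead of reporting that the
    // allocation failed.
    if (matrix == nullptr) {
        fprintf(stderr, "error: unable to allocate banded alignment matrix\n");
        exit(1);
    }
    matrix[0] = 0;
    free(matrix);
    return 0;
}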

It could be that the extra synchronization introduced by using my new --log-reads option is somehow stopping the problem from happening. I can try running again without that option and actually saving the result this time, and maybe something different will happen.

kokyriakidis commented 7 months ago

Using the latest binary:

adamnovak commented 7 months ago

Here's another crash log @kokyriakidis sent me, which looks like it comes from a different place. Is this the same file? If so, maybe the same undefined behavior is causing it to happen sometimes and not other times.

━━━━━━━━━━━━━━━━━━━━
Unhandled exception of type St12out_of_range: vector::_M_range_check: __n (which is 18446744073709551583) >= this->size() (which is 11)
Exception context: 8c8e8ca1-46ff-4bea-9f02-f71b6a528847
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.51.0-604-gd35bbfd6c "Quellenhof"
Caught signal 6 raised at address 0x219399c; tracing with backward-cpp
Stack trace (most recent call last) in thread 208337:
#11   Object "", at 0xffffffffffffffff, in
warning[MinimizerMapper::find_chain_alignment]: Refusing to align 31408 bp right tail against 50281091-78 in b0f535c5-cc02-485a-9681-f53d9ec1d231 to avoid overflow
warning[MinimizerMapper::find_chain_alignment]: Refusing to align 33682 bp left tail against 50281227-26 in b0f535c5-cc02-485a-9681-f53d9ec1d231 to avoid overflow
T6: warning[MinimizerMapper::map_from_chains]: No seeds found for e501f52f-594d-4677-b7a4-ea4a72924e30!
warning[MinimizerMapper::find_chain_alignment]: Refusing to align 38508 bp left tail against 11378216-0 in 9b67b001-f4cc-47e1-b385-a82929b7679f to avoid overflow
#10   Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x2238b7f, in __clone3
#9    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x219203a, in start_thread
#8    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x213491d, in gomp_thread_start
#7    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x2137267, in gomp_team_barrier_wait_end
#6    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x212eb6a, in gomp_barrier_handle_tasks
#5    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0xf5ea33, in unsigned long vg::io::unpaired_for_each_parallel<vg::Alignment>(std::function<bool (vg::Alignment&)>, std::function<void (vg::Alignment&)>, unsigned long) [clone ._omp_fn.1]
#4    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x52f0b4, in std::_Function_handler<void (vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&)#8}>::_M_invoke(std::_Any_data const&, vg::Alignment&) [clone .cold]
#3    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x5579a1, in vg::report_exception(std::exception const&) [clone .cold]
#2    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x60438b, in abort
#1    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x2166e55, in raise
#0    Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x219399c, in __pthread_kill
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
━━━━━━━━━━━━━━━━━━━━
Context dump:
    Thread 0: 59a49960-7fb6-49d5-80b6-3aba561346b4
    Thread 1: e70e54eb-139f-453f-92e4-76b798ee4fe8
    Thread 2: ac5075be-5424-43bd-aba9-a242c203ee0c
    Thread 3: 0c64d86d-aac9-4ce9-8d6c-6bae0903e9fe
    Thread 4: 34d15e2f-c3c1-4ea6-8e88-8a3640f47951
    Thread 5: cfa3335d-1ec6-4a05-9ebe-b86858807200
    Thread 6: 4312bd1c-6458-49c7-a3f0-46f7109c6d47
    Thread 7: 88c4020c-a682-4a91-9e8f-35e4e0ff087a
    Thread 8: ecde837f-04c0-4de8-af87-1c307631465d
    Thread 9: fc414e5d-45cb-44f5-9d7e-62ba0aa5e71c
    Thread 10: 1c85b24e-c153-4876-ada3-49c9923d4c0f
    Thread 11: e3f73005-fd7c-4e0e-809f-c985d88d6fa1
    Thread 12: b37a39dd-7ddc-41cf-abd9-78e24a53201f
    Thread 13: fe1c1e8a-ef7d-4567-823d-390fd70f78bf
    Thread 14: bd393798-2a97-46cc-adfd-4fdc248b0cb0
    Thread 15: bcee0e12-116d-4099-b516-1c486aba0747
    Thread 16: 57e250df-25f3-4131-afb6-9e103faa42d7
    Thread 17: e2278f4e-7bfe-4de2-8df5-883a57b4634a
    Thread 18: 9464d855-f691-41ad-81a0-be0b0c75baca
    Thread 19: b285772f-d242-497b-9f0f-003e8b4df957
    Thread 20: 76da7eeb-b1ff-4fef-831a-6aeb4a7aa2d5
    Thread 21: 8abae9d5-db97-456d-9586-cf444ecad4f8
    Thread 22: b4f32bad-35b1-400b-85f9-00f7223f0ab9
    Thread 23: eeed2bff-6e61-477f-8f86-c6dec2f9b218
    Thread 24: 7d461cf8-cfae-4059-a9a8-c0bf7df18295
    Thread 25: 8c8e8ca1-46ff-4bea-9f02-f71b6a528847
    Thread 26: 54ca5a60-fb08-4358-af8f-e994001eafa4
    Thread 27: 7752ee64-7dde-47ca-bb62-09919a1a528d
    Thread 28: 1cfe2d65-3d50-4114-af87-a615c11bc054
    Thread 29: b0f535c5-cc02-485a-9681-f53d9ec1d231
    Thread 30: 5d726df5-92a1-4d30-8425-421cabfc8a43
    Thread 31: 3fdb181e-2b1a-4ccc-b1d7-ea8696200ea5
Found 32 threads with context.
━━━━━━━━━━━━━━━━━━━━
Please include this entire error log in your bug report!
━━━━━━━━━━━━━━━━━━━━
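As an aside on the exception above: 18446744073709551583 is exactly 2^64 - 33, which is what a signed index of -33 becomes when converted to an unsigned 64-bit size_t, so the absurdly large value may just be a small negative index that underflowed. A quick standalone check, assuming a 64-bit size_t:

#include <cstddef>
#include <cstdio>

int main() {
    long long signed_index = -33;
    // Converting -33 to a 64-bit size_t wraps around to 2^64 - 33.
    printf("%zu\n", (std::size_t)signed_index);  // prints 18446744073709551583
    return 0;
}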
adamnovak commented 6 months ago

The second crash apparently happened while mapping /private/groups/migalab/kkyriaki/rotation/batman/data/corrected/3_simplex_EC_noIUPAC.fastq. @jeizenga noted that he can't get any memory errors to happen under valgrind in a block of reads around the offending read.

adamnovak commented 6 months ago

The first crash was observed on phoenix-08.prism according to /private/groups/migalab/kkyriaki/logs/3S_GIRAFFE.30112023.1436904.log. The second crash was observed on phoenix-07.prism according to /private/groups/migalab/kkyriaki/logs/3S_EC_GIRAFFE_500_NOLOG.04122023.1450947.log.

xchang1 commented 6 months ago

I'm not sure if this is the same bug, but giraffe is also crashing when mapping long reads with my branch, https://github.com/xchang1/vg/tree/8434ebe82cc1ec4b858dba8521dc838a0d46e267

I ran:

time vg giraffe --parameter-preset lr --progress --track-provenance -Z ${GRAPH_BASE}.gbz -d ${GRAPH_BASE}.dist -m ${GRAPH_BASE}.min -z ${GRAPH_BASE}.zips -G ${READ_FILE} -t 1 > mapped.gam

This was on mustard with my copy of the hprc_1.1 graph (/private/groups/patenlab/xhchang/graphs/hprc_1.1_d9/hprc-v1.1-mc-chm13.d9) and Adam's simulated r10 reads (/private/groups/patenlab/anovak/projects/hprc/lr-giraffe/reads/sim/r10/HG002/HG002-sim-r10-1m.gam).

Here is the last bit of the stacktrace:

#2    Object "/private/home/xhchang/GitHub/vg/bin/vg", at 0x55970680aca2, in std::_Function_handler<void (vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&)#8}>::_M_invoke(std::_Any_data const&, vg::Alignment&)
    | Source "/usr/include/c++/11/bits/std_function.h", line 290, in __invoke_r<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&>
    |   288:       _M_invoke(const _Any_data& __functor, _ArgTypes&&... __args)
    |   289:       {
    | > 290:    return std::__invoke_r<_Res>(*_Base::_M_get_pointer(__functor),
    |   291:                                 std::forward<_ArgTypes>(__args)...);
    |   292:       }
    | Source "/usr/include/c++/11/bits/invoke.h", line 154, in __invoke_impl<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&>
    |   152:       using __type = typename __result::type;
    |   153:       using __tag = typename __result::__invoke_type;
    | > 154:       std::__invoke_impl<__type>(__tag{}, std::forward<_Callable>(__fn),
    |   155:                             std::forward<_Args>(__args)...);
    |   156:     }
    | Source "/usr/include/c++/11/bits/invoke.h", line 61, in operator()
    |    59:     constexpr _Res
    |    60:     __invoke_impl(__invoke_other, _Fn&& __f, _Args&&... __args)
    | >  61:     { return std::forward<_Fn>(__f)(std::forward<_Args>(__args)...); }
    |    62:
    |    63:   template<typename _Res, typename _MemFun, typename _Tp, typename... _Args>
      Source "src/subcommand/giraffe_main.cpp", line 1639, in _M_invoke [0x55970680aca2]
#1    Object "/private/home/xhchang/GitHub/vg/bin/vg", at 0x559706c8a6ad, in vg::MinimizerMapper::map(vg::Alignment&, vg::io::AlignmentEmitter&)
    | Source "src/minimizer_mapper.cpp", line 579, in map
      Source "src/minimizer_mapper.cpp", line 584, in map [0x559706c8a6ad]
#0    Object "/private/home/xhchang/GitHub/vg/bin/vg", at 0x559706cab1e9, in vg::MinimizerMapper::map_from_chains(vg::Alignment&)
      Source "src/minimizer_mapper_from_chains.cpp", line 205, in map_from_chains [0x559706cab1e9]
ERROR: Signal 11 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!
━━━━━━━━━━━━━━━━━━━━

real    3476m37.344s
user    4167m33.794s
sys     2688m15.632s

This is the line it claims to be crashing on: https://github.com/xchang1/vg/blob/8434ebe82cc1ec4b858dba8521dc838a0d46e267/src/minimizer_mapper_from_chains.cpp#L205
That line should always work, so it seems like a memory bug to me, but I ran this twice multi-threaded and once single-threaded and it always crashed on the same line.

I'm not sure what read it's crashing on, but right before the crash, I got a bunch of the Refusing to align messages for read S187_4633, so it was probably that one or the one right after, S180_17054, which is the last read in the set. I re-ran the last four reads in the file and it ran through without crashing.

adamnovak commented 6 months ago

I was able to replicate @xchang1's crash repeatedly this morning, and tracked down how far into the file I would have to start in order to trigger it, with a convincing binary search on Mustard using 128 threads.

Mon Dec 11 09:19:59 AM PST 2023
Testing to cut 0 to 1000000 at 500000, into trash/searchstep-xian.500000.txt
Problem found starting at 500000
Mon Dec 11 09:36:49 AM PST 2023
Testing to cut 500000 to 1000000 at 750000, into trash/searchstep-xian.750000.txt
Problem found starting at 750000
Mon Dec 11 09:46:55 AM PST 2023
Testing to cut 750000 to 1000000 at 875000, into trash/searchstep-xian.875000.txt
Problem found starting at 875000
Mon Dec 11 10:07:14 AM PST 2023
Testing to cut 875000 to 1000000 at 937500, into trash/searchstep-xian.937500.txt
Problem found starting at 937500
Mon Dec 11 10:34:16 AM PST 2023
Testing to cut 937500 to 1000000 at 968750, into trash/searchstep-xian.968750.txt
Problem found starting at 968750
Mon Dec 11 10:45:38 AM PST 2023
Testing to cut 968750 to 1000000 at 984375, into trash/searchstep-xian.984375.txt
Problem found starting at 984375
Mon Dec 11 11:04:22 AM PST 2023
Testing to cut 984375 to 1000000 at 992187, into trash/searchstep-xian.992187.txt
Problem found starting at 992187
Mon Dec 11 11:15:53 AM PST 2023
Testing to cut 992187 to 1000000 at 996093, into trash/searchstep-xian.996093.txt
Problem found starting at 996093
Mon Dec 11 11:22:13 AM PST 2023
Testing to cut 996093 to 1000000 at 998046, into trash/searchstep-xian.998046.txt
Problem found starting at 998046
Mon Dec 11 11:27:03 AM PST 2023
Testing to cut 998046 to 1000000 at 999023, into trash/searchstep-xian.999023.txt
No problem found starting at 999023
Mon Dec 11 11:34:35 AM PST 2023
Testing to cut 998046 to 999023 at 998534, into trash/searchstep-xian.998534.txt
Problem found starting at 998534
Mon Dec 11 11:38:26 AM PST 2023
Testing to cut 998534 to 999023 at 998778, into trash/searchstep-xian.998778.txt
No problem found starting at 998778
Mon Dec 11 11:48:04 AM PST 2023
Testing to cut 998534 to 998778 at 998656, into trash/searchstep-xian.998656.txt
No problem found starting at 998656
Mon Dec 11 11:56:01 AM PST 2023
Testing to cut 998534 to 998656 at 998595, into trash/searchstep-xian.998595.txt
No problem found starting at 998595
Mon Dec 11 12:05:27 PM PST 2023
Testing to cut 998534 to 998595 at 998564, into trash/searchstep-xian.998564.txt
Problem found starting at 998564
Mon Dec 11 12:09:38 PM PST 2023
Testing to cut 998564 to 998595 at 998579, into trash/searchstep-xian.998579.txt
Problem found starting at 998579
Mon Dec 11 12:14:24 PM PST 2023
Testing to cut 998579 to 998595 at 998587, into trash/searchstep-xian.998587.txt
Problem found starting at 998587
Mon Dec 11 12:18:23 PM PST 2023
Testing to cut 998587 to 998595 at 998591, into trash/searchstep-xian.998591.txt
No problem found starting at 998591
Mon Dec 11 12:27:38 PM PST 2023
Testing to cut 998587 to 998591 at 998589, into trash/searchstep-xian.998589.txt
No problem found starting at 998589
Mon Dec 11 12:37:31 PM PST 2023
Testing to cut 998587 to 998589 at 998588, into trash/searchstep-xian.998588.txt
Problem found starting at 998588
Mon Dec 11 12:41:48 PM PST 2023
Stopping at 998588

But now, as of around Mon Dec 11 02:12:40 PM PST 2023, I haven't been able to replicate any of the last couple of failures, either manually or under the search script. I can run reads 998587 to the end in a loop and have it pass every time.

I also got nothing useful out of the debug version of jemalloc with its assertions to catch double frees and its ability to fill fresh or deallocated memory with garbage.

xchang1 commented 6 months ago

@adamnovak I have no idea if this is the same bug, but I'm also getting inconsistent segfaults running short reads through the long-read code path, and it might be faster to reproduce with the short reads. I'm running the same giraffe command as before, with the same branch, but with these reads /private/groups/patenlab/xhchang/graphs/1000gp/novaseq6000.gam and the .k29.w11 minimizer/zipcode files.

adamnovak commented 6 months ago

Using Valgrind, I found an out of bounds read in the zipcode tree code:

==2021789== Invalid read of size 1
==2021789==    at 0x168B4FB: get_type (zip_code_tree.hpp:149)
==2021789==    by 0x168B4FB: vg::ZipCodeForest::close_chain(vg::ZipCodeForest::forest_growing_state_t&, unsigned long const&, vg::SnarlDistanceIndexClusterer::Seed const&, bool) (zip_code_tree.cpp:90)
==2021789==    by 0x9E7124: void vg::ZipCodeForest::fill_in_forest<vg::MinimizerMapper::Minimizer>(std::vector<vg::SnarlDistanceIndexClusterer::Seed, std::allocator<vg::SnarlDistanceIndexClusterer::Seed> > const&, vg::VectorView<vg::MinimizerMapper::Minimizer> const&, bdsg::SnarlDistanceIndex const&, unsigned long, unsigned long) (zip_code_tree.hpp:1010)
==2021789==    by 0x134AB08: vg::MinimizerMapper::map_from_chains(vg::Alignment&) (minimizer_mapper_from_chains.cpp:176)
==2021789==    by 0x132A19D: map (minimizer_mapper.cpp:584)
==2021789==    by 0x132A19D: vg::MinimizerMapper::map(vg::Alignment&, vg::io::AlignmentEmitter&) (minimizer_mapper.cpp:579)
==2021789==    by 0xEA9086: operator() (giraffe_main.cpp:1656)
==2021789==    by 0xEA9086: __invoke_impl<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&> (invoke.h:61)
==2021789==    by 0xEA9086: __invoke_r<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&> (invoke.h:154)
==2021789==    by 0xEA9086: std::_Function_handler<void (vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&)#8}>::_M_invoke(std::_Any_data const&, vg::Alignment&) (std_function.h:290)
==2021789==    by 0x106A8FE: operator() (std_function.h:590)
==2021789==    by 0x106A8FE: unsigned long vg::io::unpaired_for_each_parallel<vg::Alignment>(std::function<bool (vg::Alignment&)>, std::function<void (vg::Alignment&)>, unsigned long) [clone ._omp_fn.0] (alignment_io.hpp:169)
==2021789==    by 0x4E7AA15: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==2021789==    by 0x106196C: unpaired_for_each_parallel<vg::Alignment> (alignment_io.hpp:124)
==2021789==    by 0x106196C: vg::fastq_unpaired_for_each_parallel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (vg::Alignment&)>, unsigned long) (alignment.cpp:243)
==2021789==    by 0xEB0771: main_giraffe(int, char**)::{lambda()#1}::operator()() const (giraffe_main.cpp:1691)
==2021789==    by 0xF5254C: operator() (std_function.h:590)
==2021789==    by 0xF5254C: operator() (options.cpp:52)
==2021789==    by 0xF5254C: __invoke_impl<void, vg::subcommand::TickChainLink::get_iterator()::<lambda(const std::function<void()>&)>&, const std::function<void()>&> (invoke.h:61)
==2021789==    by 0xF5254C: __invoke_r<void, vg::subcommand::TickChainLink::get_iterator()::<lambda(const std::function<void()>&)>&, const std::function<void()>&> (invoke.h:154)
==2021789==    by 0xF5254C: std::_Function_handler<void (std::function<void ()> const&), vg::subcommand::TickChainLink::get_iterator()::{lambda(std::function<void ()> const&)#1}>::_M_invoke(std::_Any_data const&, std::function<void ()> const&) (std_function.h:290)
==2021789==    by 0xEB5B8E: operator() (std_function.h:590)
==2021789==    by 0xEB5B8E: main_giraffe(int, char**) (giraffe_main.cpp:1282)
==2021789==    by 0xF973AB: operator() (std_function.h:590)
==2021789==    by 0xF973AB: vg::subcommand::Subcommand::operator()(int, char**) const (subcommand.cpp:75)
==2021789==  Address 0x26921e8f0 is 16 bytes before a block of size 128 alloc'd
==2021789==    at 0x4849013: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2021789==    by 0x168E63F: allocate (new_allocator.h:127)
==2021789==    by 0x168E63F: allocate (alloc_traits.h:464)
==2021789==    by 0x168E63F: _M_allocate (stl_vector.h:346)
==2021789==    by 0x168E63F: _M_realloc_insert<vg::ZipCodeTree::tree_item_type_t, long unsigned int const&, bool> (vector.tcc:440)
==2021789==    by 0x168E63F: emplace_back<vg::ZipCodeTree::tree_item_type_t, long unsigned int const&, bool> (vector.tcc:121)
==2021789==    by 0x168E63F: vg::ZipCodeForest::add_child_to_chain(vg::ZipCodeForest::forest_growing_state_t&, unsigned long const&, unsigned long const&, bool, bool) (zip_code_tree.cpp:452)
==2021789==    by 0x9E7737: void vg::ZipCodeForest::fill_in_forest<vg::MinimizerMapper::Minimizer>(std::vector<vg::SnarlDistanceIndexClusterer::Seed, std::allocator<vg::SnarlDistanceIndexClusterer::Seed> > const&, vg::VectorView<vg::MinimizerMapper::Minimizer> const&, bdsg::SnarlDistanceIndex const&, unsigned long, unsigned long) (zip_code_tree.hpp:1165)
==2021789==    by 0x134AB08: vg::MinimizerMapper::map_from_chains(vg::Alignment&) (minimizer_mapper_from_chains.cpp:176)
==2021789==    by 0x132A19D: map (minimizer_mapper.cpp:584)
==2021789==    by 0x132A19D: vg::MinimizerMapper::map(vg::Alignment&, vg::io::AlignmentEmitter&) (minimizer_mapper.cpp:579)
==2021789==    by 0xEA9086: operator() (giraffe_main.cpp:1656)
==2021789==    by 0xEA9086: __invoke_impl<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&> (invoke.h:61)
==2021789==    by 0xEA9086: __invoke_r<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&> (invoke.h:154)
==2021789==    by 0xEA9086: std::_Function_handler<void (vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&)#8}>::_M_invoke(std::_Any_data const&, vg::Alignment&) (std_function.h:290)
==2021789==    by 0x106A8FE: operator() (std_function.h:590)
==2021789==    by 0x106A8FE: unsigned long vg::io::unpaired_for_each_parallel<vg::Alignment>(std::function<bool (vg::Alignment&)>, std::function<void (vg::Alignment&)>, unsigned long) [clone ._omp_fn.0] (alignment_io.hpp:169)
==2021789==    by 0x4E7AA15: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==2021789==    by 0x106196C: unpaired_for_each_parallel<vg::Alignment> (alignment_io.hpp:124)
==2021789==    by 0x106196C: vg::fastq_unpaired_for_each_parallel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (vg::Alignment&)>, unsigned long) (alignment.cpp:243)
==2021789==    by 0xEB0771: main_giraffe(int, char**)::{lambda()#1}::operator()() const (giraffe_main.cpp:1691)
==2021789==    by 0xF5254C: operator() (std_function.h:590)
==2021789==    by 0xF5254C: operator() (options.cpp:52)
==2021789==    by 0xF5254C: __invoke_impl<void, vg::subcommand::TickChainLink::get_iterator()::<lambda(const std::function<void()>&)>&, const std::function<void()>&> (invoke.h:61)
==2021789==    by 0xF5254C: __invoke_r<void, vg::subcommand::TickChainLink::get_iterator()::<lambda(const std::function<void()>&)>&, const std::function<void()>&> (invoke.h:154)
==2021789==    by 0xF5254C: std::_Function_handler<void (std::function<void ()> const&), vg::subcommand::TickChainLink::get_iterator()::{lambda(std::function<void ()> const&)#1}>::_M_invoke(std::_Any_data const&, std::function<void ()>const&) (std_function.h:290)
==2021789==    by 0xEB5B8E: operator() (std_function.h:590)
==2021789==    by 0xEB5B8E: main_giraffe(int, char**) (giraffe_main.cpp:1282)

This is with jemalloc=off so Valgrind can work, and commit 7b11d89622f7ccc227e75c3577e07970f7240d1b.

I also, separately, got a crash under gdb where the allocator (not jemalloc but the normal one, since I had jemalloc=off) detected somebody tampering with the free list metadata that it puts before each block. It looks like it had gone to allocate from a free block, but the allocated block right before it had overwritten the first 8 bytes of the next block header. So someone wrote 8 bytes past the end of their memory allocation. That looked like this in memory:

        0x7ff9c0b10bf0: 0x10    0x76    0x10    0xc0    0xf9    0x7f    0x00    0x00
        0x7ff9c0b10bf8: 0x2e    0x04    0x00    0x00    0x00    0x00    0x00    0x00
        0x7ff9c0b10c00: 0x70    0x09    0xc5    0xc0    0xf9    0x7f    0x00    0x00
        0x7ff9c0b10c08: 0x70    0x09    0xc5    0xc0    0xf9    0x7f    0x00    0x00
        0x7ff9c0b10c10: 0x2c    0x04    0x00    0x00    0x00    0x00    0x00    0x00
        0x7ff9c0b10c18: 0x10    0xeb    0xc4    0xc0    0xf9    0x7f    0x00    0x00
        0x7ff9c0b10c20: 0x10    0xeb    0xc4    0xc0    0xf9    0x7f    0x00    0x00
        0x7ff9c0b10c28: 0x2a    0x04    0x00    0x00    0x00    0x00    0x00    0x00
        0x7ff9c0b10c30: 0x50    0xd7    0xb0    0xc0    0xf9    0x7f    0x00    0x00
        0x7ff9c0b10c38: 0x50    0xd7    0xb0    0xc0    0xf9    0x7f    0x00    0x00
        0x7ff9c0b10c40: 0x28    0x04    0x00    0x00    0x00    0x00    0x00    0x00
        0x7ff9c0b10c48: 0x80    0x81    0x0c    0xc0    0xf9    0x7f    0x00    0x00
        0x7ff9c0b10c50: 0x80    0x81    0x0c    0xc0    0xf9    0x7f    0x00    0x00 <-- p (malloc data structure starts here)
        0x7ff9c0b10c58: 0x91    0x0f    0x00    0x00    0x00    0x00    0x00    0x00
        0x7ff9c0b10c60: 0xa0    0x06    0x00    0xc0    0xf9    0x7f    0x00    0x00
        0x7ff9c0b10c68: 0xb0    0xbf    0x04    0xc0    0xf9    0x7f    0x00    0x00

It looks like someone is writing triples of (number, pointer, same pointer again), and the last one has its final pointer out of bounds of the allocated block of memory.

adamnovak commented 6 months ago

For the valgrind run I used /private/home/anovak/workspace/vg/trash/min-first.fq and one thread, and for running under gdb I used /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/reads/sim/r10/HG002/HG002-sim-r10-1m.fq and 128 threads.

adamnovak commented 6 months ago

I also noticed that if it runs out of memory the Banded Global Aligner might double-free? It frees in its allocation-has-failed code and it also frees when the object is destroyed. But I don't think that's the real problem.
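A minimal sketch of the pattern being described, with made-up names rather than the real BandedGlobalAligner members: the allocation-failure path frees buffers that the destructor will free again unless the pointers are reset.

#include <cstddef>
#include <cstdlib>
#include <stdexcept>

struct BandMatrixSketch {
    short* scores = nullptr;
    short* traceback = nullptr;

    void fill(std::size_t band_cells) {
        scores = static_cast<short*>(malloc(band_cells * sizeof(short)));
        traceback = static_cast<short*>(malloc(band_cells * sizeof(short)));
        if (scores == nullptr || traceback == nullptr) {
            // Allocation-has-failed path: release whatever we did get.
            free(scores);
            free(traceback);
            // Without resetting the pointers here (scores = traceback = nullptr;),
            // the destructor below frees the same addresses a second time.
            throw std::runtime_error("Not enough memory to fill band matrix");
        }
    }

    ~BandMatrixSketch() {
        free(scores);
        free(traceback);
    }
};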

adamnovak commented 6 months ago

OK, in f692a15ad9047efe2db7102ff72e9fa24533dc9d I have a make jemalloc=off asan=on build we can do to use the address sanitizer, which is much faster than valgrind and actually lets us multithread, and which also catches the out of bounds zipcode tree read (which it looks like actually predates the fancy new dagification logic).

I also have a possible fix in that line of commits: just skip the access to the vector it is looking at when that vector is empty. That lets me seemingly reliably run through:

vg giraffe -t128 --parameter-preset lr --progress -Z /private/groups/patenlab/xhchang/graphs/hprc_1.1_d9/hprc-v1.1-mc-chm13.d9.gbz -d /private/groups/patenlab/xhchang/graphs/hprc_1.1_d9/hprc-v1.1-mc-chm13.d9.dist -m /private/groups/patenlab/xhchang/graphs/hprc_1.1_d9/hprc-v1.1-mc-chm13.d9.min.k29.w11 -z /private/groups/patenlab/xhchang/graphs/hprc_1.1_d9/hprc-v1.1-mc-chm13.d9.zips.k29.w11 -G /private/groups/patenlab/xhchang/graphs/1000gp/novaseq6000.gam --discard

And I don't crash or trip the sanitizer.

Next I will try the long reads again.

adamnovak commented 6 months ago

I think the out of bounds read in close_chain does sort of fit the profile of what we're looking for, though: on the rare occasion that the out of bounds read turns up an apparent EDGE type, it then goes and pops something off a vector which might not even exist, which is going to do exciting things at weird places in memory, very rarely.
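As a rough illustration of that shape of bug (assumed names, not the actual ZipCodeForest::close_chain code): reading one slot before the start of an empty vector is undefined behavior, and if the garbage byte it returns happens to decode as EDGE, the code then pops from a vector that may itself be empty.

#include <cstddef>
#include <vector>

enum tree_item_type_t { SEED, EDGE, CHAIN_END };

struct tree_item_t { tree_item_type_t type; std::size_t value; };

void close_chain_sketch(std::vector<tree_item_t>& trees,
                        std::vector<std::size_t>& open_intervals) {
    // Calling trees.back() when trees is empty reads before the start of the
    // allocation (the "invalid read" valgrind reported). The emptiness guard
    // below is the shape of the fix: skip the access entirely.
    if (!trees.empty() && trees.back().type == EDGE) {
        if (!open_intervals.empty()) {
            // Popping an empty vector is also undefined behavior; with a
            // trivially destructible element it typically just underflows
            // size(), which can surface later as an absurdly large index.
            open_intervals.pop_back();
        }
    }
    trees.push_back({CHAIN_END, 0});
}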

adamnovak commented 6 months ago

With a normal build (jemalloc on, sanitizer off) of 8ec899112978a01fbbf454c2da8208909e1b685b I have put something like 300 CPU hours into:

/usr/bin/time -v vg giraffe -t128 --parameter-preset lr --progress \
-Z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.gbz \
-d /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.dist \
-m /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.withzip.min \
-z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.zipcodes \
-f /private/groups/migalab/kkyriaki/rotation/batman/data/corrected/3_simplex_EC_noIUPAC.fastq \
> trash/crash2.gam

On the one hand, that's good because it hasn't crashed yet. On the other hand, that's bad because it seems way slower than it should be. I think b0aa460546beafe9e4babf4629b4c756134c468d, without @xchang1's new non-DAG snarl handling, is probably much faster and should have the same bugfix.

Hopefully it will finish overnight and I can produce a speed number.

adamnovak commented 6 months ago

OK, I managed to run through in 6 hours on the crash 2 read set:

Mapped 2512113 reads across 128 threads in 22562.9 seconds with 0 additional single-threaded seconds.
Mapping speed: 0.869831 reads per second per thread
Used 2.84944e+06 CPU-seconds (including output).
Achieved 0.881616 reads per CPU-second (including output)
Memory footprint: 481.022 GB
        Command being timed: "vg giraffe -t128 --parameter-preset lr --progress -Z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.gbz -d /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.dist -m /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.withzip.min -z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.zipcodes -f /private/groups/migalab/kkyriaki/rotation/batman/data/corrected/3_simplex_EC_noIUPAC.fastq"
        User time (seconds): 827686.63
        System time (seconds): 2021886.16
        Percent of CPU this job got: 12561%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 6:18:04
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 504388028
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 397
        Minor (reclaiming a frame) page faults: 70774938350
        Voluntary context switches: 663949
        Involuntary context switches: 7295548
        Swaps: 0
        File system inputs: 145756024
        File system outputs: 35642192
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

It isn't even super slow in reads-per-second, just in total CPU time.

I think this is fixed; I will kick a binary over to @kokyriakidis to try.