adamnovak opened this issue 7 months ago
This is the command to reproduce the issue:
/private/home/anovak/workspace/vg/bin/vg giraffe \
-t32 \
--parameter-preset lr \
--progress \
-Z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.gbz \
-d /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.dist \
-m /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.withzip.min \
-z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.zipcodes \
-f /private/groups/migalab/kkyriaki/rotation/batman/data/HG002_R1041_Duplex/1_3_23_R1041_Duplex_HG002_3_Dorado_v0.4.1_400bps_sup_v4.1.0_simplex.fastq.gz \
> /private/groups/patenlab/kkyriaki/ec_giraffe/3_simplex.gam
I haven't been able to reproduce this issue with the debug logic added to dump the offending read names (in commit d35bbfd6cbc40dfcfdb5db46aa8c70905be040b5). This is what I get at the tail end of my log, on emerald.prism:
Mapped 2512113 reads across 32 threads in 56143.2 seconds with 0 additional single-threaded seconds.
Mapping speed: 1.39827 reads per second per thread
Used 1.70112e+06 CPU-seconds (including output).
Achieved 1.47674 reads per CPU-second (including output)
Memory footprint: 160.934 GB
Command being timed: "vg giraffe --log-reads -t32 --parameter-preset lr --progress -Z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.gbz -d /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.dist -m /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.withzip.min -z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.zipcodes -f /private/groups/migalab/kkyriaki/rotation/batman/data/HG002_R1041_Duplex/1_3_23_R1041_Duplex_HG002_3_Dorado_v0.4.1_400bps_sup_v4.1.0_simplex.fastq.gz"
User time (seconds): 688262.07
System time (seconds): 1013041.16
Percent of CPU this job got: 3023%
Elapsed (wall clock) time (h:mm:ss or m:ss): 15:37:42
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 168751208
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 677
Minor (reclaiming a frame) page faults: 53628081121
Voluntary context switches: 126311
Involuntary context switches: 24701861
Swaps: 0
File system inputs: 64681864
File system outputs: 432520
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
I did manage to get the program to be killed by the OOM-killer at one point, on crimson.prism, which has less memory. In theory the Banded Global Aligner code implicated by your stack trace should handle a failed allocation and abort the program, but maybe we've missed a check, and the program can sometimes segfault when memory allocation attempts return nullptr?
@kokyriakidis, which machine did you run your failing run on? Can you reproduce the problem with my new build of vg, which should log the offending read names when it crashes?
It could be that the extra synchronization introduced by using my new --log-reads option is somehow stopping the problem from happening. I can try running again without that option and actually saving the result this time, and maybe something different will happen.
Using the latest binary:
Without --log-reads, giraffe crashes again even when assigning 500GB of RAM.
With --log-reads, giraffe DOES NOT crash even when assigning 250GB of RAM.
Here's another crash log @kokyriakidis sent me, which looks to show up in a different place. Is this the same file? If so, maybe the same undefined behavior is causing it to happen sometimes and not other times?
━━━━━━━━━━━━━━━━━━━━
Unhandled exception of type St12out_of_range: vector::_M_range_check: __n (which is 18446744073709551583) >= this->size() (which is 11)
Exception context: 8c8e8ca1-46ff-4bea-9f02-f71b6a528847
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.51.0-604-gd35bbfd6c "Quellenhof"
Caught signal 6 raised at address 0x219399c; tracing with backward-cpp
Stack trace (most recent call last) in thread 208337:
#11 Object "", at 0xffffffffffffffff, in
warning[MinimizerMapper::find_chain_alignment]: Refusing to align 31408 bp right tail against 50281091-78 in b0f535c5-cc02-485a-9681-f53d9ec1d231 to avoid overflow
warning[MinimizerMapper::find_chain_alignment]: Refusing to align 33682 bp left tail against 50281227-26 in b0f535c5-cc02-485a-9681-f53d9ec1d231 to avoid overflow
T6: warning[MinimizerMapper::map_from_chains]: No seeds found for e501f52f-594d-4677-b7a4-ea4a72924e30!
warning[MinimizerMapper::find_chain_alignment]: Refusing to align 38508 bp left tail against 11378216-0 in 9b67b001-f4cc-47e1-b385-a82929b7679f to avoid overflow
#10 Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x2238b7f, in __clone3
#9 Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x219203a, in start_thread
#8 Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x213491d, in gomp_thread_start
#7 Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x2137267, in gomp_team_barrier_wait_end
#6 Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x212eb6a, in gomp_barrier_handle_tasks
#5 Object "/private/home/anovak/workspace2/vg/bin/vg", at 0xf5ea33, in unsigned long vg::io::unpaired_for_each_parallel<vg::Alignment>(std::function<bool (vg::Alignment&)>, std::function<void (vg::Alignment&)>, unsigned long) [clone ._omp_fn.1]
#4 Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x52f0b4, in std::_Function_handler<void (vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&)#8}>::_M_invoke(std::_Any_data const&, vg::Alignment&) [clone .cold]
#3 Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x5579a1, in vg::report_exception(std::exception const&) [clone .cold]
#2 Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x60438b, in abort
#1 Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x2166e55, in raise
#0 Object "/private/home/anovak/workspace2/vg/bin/vg", at 0x219399c, in __pthread_kill
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
━━━━━━━━━━━━━━━━━━━━
Context dump:
Thread 0: 59a49960-7fb6-49d5-80b6-3aba561346b4
Thread 1: e70e54eb-139f-453f-92e4-76b798ee4fe8
Thread 2: ac5075be-5424-43bd-aba9-a242c203ee0c
Thread 3: 0c64d86d-aac9-4ce9-8d6c-6bae0903e9fe
Thread 4: 34d15e2f-c3c1-4ea6-8e88-8a3640f47951
Thread 5: cfa3335d-1ec6-4a05-9ebe-b86858807200
Thread 6: 4312bd1c-6458-49c7-a3f0-46f7109c6d47
Thread 7: 88c4020c-a682-4a91-9e8f-35e4e0ff087a
Thread 8: ecde837f-04c0-4de8-af87-1c307631465d
Thread 9: fc414e5d-45cb-44f5-9d7e-62ba0aa5e71c
Thread 10: 1c85b24e-c153-4876-ada3-49c9923d4c0f
Thread 11: e3f73005-fd7c-4e0e-809f-c985d88d6fa1
Thread 12: b37a39dd-7ddc-41cf-abd9-78e24a53201f
Thread 13: fe1c1e8a-ef7d-4567-823d-390fd70f78bf
Thread 14: bd393798-2a97-46cc-adfd-4fdc248b0cb0
Thread 15: bcee0e12-116d-4099-b516-1c486aba0747
Thread 16: 57e250df-25f3-4131-afb6-9e103faa42d7
Thread 17: e2278f4e-7bfe-4de2-8df5-883a57b4634a
Thread 18: 9464d855-f691-41ad-81a0-be0b0c75baca
Thread 19: b285772f-d242-497b-9f0f-003e8b4df957
Thread 20: 76da7eeb-b1ff-4fef-831a-6aeb4a7aa2d5
Thread 21: 8abae9d5-db97-456d-9586-cf444ecad4f8
Thread 22: b4f32bad-35b1-400b-85f9-00f7223f0ab9
Thread 23: eeed2bff-6e61-477f-8f86-c6dec2f9b218
Thread 24: 7d461cf8-cfae-4059-a9a8-c0bf7df18295
Thread 25: 8c8e8ca1-46ff-4bea-9f02-f71b6a528847
Thread 26: 54ca5a60-fb08-4358-af8f-e994001eafa4
Thread 27: 7752ee64-7dde-47ca-bb62-09919a1a528d
Thread 28: 1cfe2d65-3d50-4114-af87-a615c11bc054
Thread 29: b0f535c5-cc02-485a-9681-f53d9ec1d231
Thread 30: 5d726df5-92a1-4d30-8425-421cabfc8a43
Thread 31: 3fdb181e-2b1a-4ccc-b1d7-ea8696200ea5
Found 32 threads with context.
━━━━━━━━━━━━━━━━━━━━
Please include this entire error log in your bug report!
━━━━━━━━━━━━━━━━━━━━
The second crash is apparently in /private/groups/migalab/kkyriaki/rotation/batman/data/corrected/3_simplex_EC_noIUPAC.fastq. @jeizenga noted he can't get any memory errors to happen in a block of reads around the offending read, under valgrind.
The first crash was observed on phoenix-08.prism, according to /private/groups/migalab/kkyriaki/logs/3S_GIRAFFE.30112023.1436904.log. The second crash was observed on phoenix-07.prism, according to /private/groups/migalab/kkyriaki/logs/3S_EC_GIRAFFE_500_NOLOG.04122023.1450947.log.
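As an aside, the enormous index in the earlier out_of_range message is just a small negative number viewed as unsigned: 18446744073709551583 is 2^64 - 33, i.e. an index of -33 that underflowed before reaching vector::at. A minimal sketch (with a hypothetical probe function; the size of 11 matches the message):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// The crash message reported __n == 18446744073709551583 against a vector of
// size 11. That index is exactly (size_t)(-33): a small negative offset that
// underflowed in unsigned arithmetic before being used as an index.
// vector::at() range-checks and throws, which is why this surfaced as a
// std::out_of_range exception rather than a silent wild read.
std::string probe_at(const std::vector<int>& v, std::size_t i) {
    try {
        v.at(i);
        return "ok";
    } catch (const std::out_of_range&) {
        return "out_of_range";
    }
}
```

So whatever computed the index most likely subtracted a small count (around 33) from something smaller than it, in size_t arithmetic.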
I'm not sure if this is the same bug, but giraffe is also crashing when mapping long reads with my branch, https://github.com/xchang1/vg/tree/8434ebe82cc1ec4b858dba8521dc838a0d46e267
I ran time vg giraffe --parameter-preset lr --progress --track-provenance -Z ${GRAPH_BASE}.gbz -d ${GRAPH_BASE}.dist -m ${GRAPH_BASE}.min -z ${GRAPH_BASE}.zips -G ${READ_FILE} -t 1 > mapped.gam
This was on mustard with my copy of the hprc_1.1 graph (/private/groups/patenlab/xhchang/graphs/hprc_1.1_d9/hprc-v1.1-mc-chm13.d9) and Adam's simulated r10 reads (/private/groups/patenlab/anovak/projects/hprc/lr-giraffe/reads/sim/r10/HG002/HG002-sim-r10-1m.gam).
Here is the last bit of the stacktrace:
#2 Object "/private/home/xhchang/GitHub/vg/bin/vg", at 0x55970680aca2, in std::_Function_handler<void (vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&)#8}>::_M_invoke(std::_Any_data const&, vg::Alignment&)
| Source "/usr/include/c++/11/bits/std_function.h", line 290, in __invoke_r<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&>
| 288: _M_invoke(const _Any_data& __functor, _ArgTypes&&... __args)
| 289: {
| > 290: return std::__invoke_r<_Res>(*_Base::_M_get_pointer(__functor),
| 291: std::forward<_ArgTypes>(__args)...);
| 292: }
| Source "/usr/include/c++/11/bits/invoke.h", line 154, in __invoke_impl<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&>
| 152: using __type = typename __result::type;
| 153: using __tag = typename __result::__invoke_type;
| > 154: std::__invoke_impl<__type>(__tag{}, std::forward<_Callable>(__fn),
| 155: std::forward<_Args>(__args)...);
| 156: }
| Source "/usr/include/c++/11/bits/invoke.h", line 61, in operator()
| 59: constexpr _Res
| 60: __invoke_impl(__invoke_other, _Fn&& __f, _Args&&... __args)
| > 61: { return std::forward<_Fn>(__f)(std::forward<_Args>(__args)...); }
| 62:
| 63: template<typename _Res, typename _MemFun, typename _Tp, typename... _Args>
Source "src/subcommand/giraffe_main.cpp", line 1639, in _M_invoke [0x55970680aca2]
#1 Object "/private/home/xhchang/GitHub/vg/bin/vg", at 0x559706c8a6ad, in vg::MinimizerMapper::map(vg::Alignment&, vg::io::AlignmentEmitter&)
| Source "src/minimizer_mapper.cpp", line 579, in map
Source "src/minimizer_mapper.cpp", line 584, in map [0x559706c8a6ad]
#0 Object "/private/home/xhchang/GitHub/vg/bin/vg", at 0x559706cab1e9, in vg::MinimizerMapper::map_from_chains(vg::Alignment&)
Source "src/minimizer_mapper_from_chains.cpp", line 205, in map_from_chains [0x559706cab1e9]
ERROR: Signal 11 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!
━━━━━━━━━━━━━━━━━━━━
real 3476m37.344s
user 4167m33.794s
sys 2688m15.632s
This is the line it claims to be crashing on: https://github.com/xchang1/vg/blob/8434ebe82cc1ec4b858dba8521dc838a0d46e267/src/minimizer_mapper_from_chains.cpp#L205. This line should always work, so it seems like a memory bug to me, but I ran this twice multi-threaded and once single-threaded and it always crashed on the same line.
I'm not sure what read it's crashing on, but right before the crash I got a bunch of the "Refusing to align" messages for read S187_4633, so it was probably that one or the one right after, S180_17054, which is the last read in the set. I re-ran the last four reads in the file and it ran through without crashing.
I was able to replicate @xchang1's crash repeatedly this morning, and tracked down how far into the file I would have to start in order to trigger it, with a convincing binary search on Mustard using 128 threads.
Mon Dec 11 09:19:59 AM PST 2023
Testing to cut 0 to 1000000 at 500000, into trash/searchstep-xian.500000.txt
Problem found starting at 500000
Mon Dec 11 09:36:49 AM PST 2023
Testing to cut 500000 to 1000000 at 750000, into trash/searchstep-xian.750000.txt
Problem found starting at 750000
Mon Dec 11 09:46:55 AM PST 2023
Testing to cut 750000 to 1000000 at 875000, into trash/searchstep-xian.875000.txt
Problem found starting at 875000
Mon Dec 11 10:07:14 AM PST 2023
Testing to cut 875000 to 1000000 at 937500, into trash/searchstep-xian.937500.txt
Problem found starting at 937500
Mon Dec 11 10:34:16 AM PST 2023
Testing to cut 937500 to 1000000 at 968750, into trash/searchstep-xian.968750.txt
Problem found starting at 968750
Mon Dec 11 10:45:38 AM PST 2023
Testing to cut 968750 to 1000000 at 984375, into trash/searchstep-xian.984375.txt
Problem found starting at 984375
Mon Dec 11 11:04:22 AM PST 2023
Testing to cut 984375 to 1000000 at 992187, into trash/searchstep-xian.992187.txt
Problem found starting at 992187
Mon Dec 11 11:15:53 AM PST 2023
Testing to cut 992187 to 1000000 at 996093, into trash/searchstep-xian.996093.txt
Problem found starting at 996093
Mon Dec 11 11:22:13 AM PST 2023
Testing to cut 996093 to 1000000 at 998046, into trash/searchstep-xian.998046.txt
Problem found starting at 998046
Mon Dec 11 11:27:03 AM PST 2023
Testing to cut 998046 to 1000000 at 999023, into trash/searchstep-xian.999023.txt
No problem found starting at 999023
Mon Dec 11 11:34:35 AM PST 2023
Testing to cut 998046 to 999023 at 998534, into trash/searchstep-xian.998534.txt
Problem found starting at 998534
Mon Dec 11 11:38:26 AM PST 2023
Testing to cut 998534 to 999023 at 998778, into trash/searchstep-xian.998778.txt
No problem found starting at 998778
Mon Dec 11 11:48:04 AM PST 2023
Testing to cut 998534 to 998778 at 998656, into trash/searchstep-xian.998656.txt
No problem found starting at 998656
Mon Dec 11 11:56:01 AM PST 2023
Testing to cut 998534 to 998656 at 998595, into trash/searchstep-xian.998595.txt
No problem found starting at 998595
Mon Dec 11 12:05:27 PM PST 2023
Testing to cut 998534 to 998595 at 998564, into trash/searchstep-xian.998564.txt
Problem found starting at 998564
Mon Dec 11 12:09:38 PM PST 2023
Testing to cut 998564 to 998595 at 998579, into trash/searchstep-xian.998579.txt
Problem found starting at 998579
Mon Dec 11 12:14:24 PM PST 2023
Testing to cut 998579 to 998595 at 998587, into trash/searchstep-xian.998587.txt
Problem found starting at 998587
Mon Dec 11 12:18:23 PM PST 2023
Testing to cut 998587 to 998595 at 998591, into trash/searchstep-xian.998591.txt
No problem found starting at 998591
Mon Dec 11 12:27:38 PM PST 2023
Testing to cut 998587 to 998591 at 998589, into trash/searchstep-xian.998589.txt
No problem found starting at 998589
Mon Dec 11 12:37:31 PM PST 2023
Testing to cut 998587 to 998589 at 998588, into trash/searchstep-xian.998588.txt
Problem found starting at 998588
Mon Dec 11 12:41:48 PM PST 2023
Stopping at 998588
But now, around Mon Dec 11 02:12:40 PM PST 2023, I haven't been able to replicate any of the last couple of failures, either manually or under the search script. I can run reads 998587 to the end in a loop and have it pass every time.
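For reference, the search script's logic is an ordinary predicate bisection over the start offset; a hedged sketch (the hypothetical `fails` callback stands in for actually mapping reads from that offset to the end of the file and checking for a crash):

```cpp
#include <functional>

// Narrow [lo, hi) down to the largest start offset at which the mapping run
// still fails, assuming (as the log above does) that failure is monotone in
// the start offset. Each iteration mirrors one "Testing to cut lo to hi at
// mid" line: "Problem found" keeps the failing half, "No problem found"
// discards it.
long last_failing_offset(long lo, long hi,
                         const std::function<bool(long)>& fails) {
    while (hi - lo > 1) {
        long mid = lo + (hi - lo) / 2;
        if (fails(mid)) {
            lo = mid;   // "Problem found starting at mid"
        } else {
            hi = mid;   // "No problem found starting at mid"
        }
    }
    return lo;          // "Stopping at lo"
}
```

The monotonicity assumption is exactly what a nondeterministic memory bug breaks, which would explain why the bisection converged to 998588 and then the failures stopped reproducing.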
I also got nothing useful out of the debug version of jemalloc with its assertions to catch double frees and its ability to fill fresh or deallocated memory with garbage.
@adamnovak I have no idea if this is the same bug, but I'm also getting inconsistent segfaults running short reads with the long read code path, and it might be faster to run through with the short reads. I'm running the same giraffe command as before with the same branch, but with these reads (/private/groups/patenlab/xhchang/graphs/1000gp/novaseq6000.gam) and the .k29.w11 minimizer/zipcode files.
Using Valgrind, I found an out of bounds read in the zipcode tree code:
==2021789== Invalid read of size 1
==2021789== at 0x168B4FB: get_type (zip_code_tree.hpp:149)
==2021789== by 0x168B4FB: vg::ZipCodeForest::close_chain(vg::ZipCodeForest::forest_growing_state_t&, unsigned long const&, vg::SnarlDistanceIndexClusterer::Seed const&, bool) (zip_code_tree.cpp:90)
==2021789== by 0x9E7124: void vg::ZipCodeForest::fill_in_forest<vg::MinimizerMapper::Minimizer>(std::vector<vg::SnarlDistanceIndexClusterer::Seed, std::allocator<vg::SnarlDistanceIndexClusterer::Seed> > const&, vg::VectorView<vg::MinimizerMapper::Minimizer> const&, bdsg::SnarlDistanceIndex const&, unsigned long, unsigned long) (zip_code_tree.hpp:1010)
==2021789== by 0x134AB08: vg::MinimizerMapper::map_from_chains(vg::Alignment&) (minimizer_mapper_from_chains.cpp:176)
==2021789== by 0x132A19D: map (minimizer_mapper.cpp:584)
==2021789== by 0x132A19D: vg::MinimizerMapper::map(vg::Alignment&, vg::io::AlignmentEmitter&) (minimizer_mapper.cpp:579)
==2021789== by 0xEA9086: operator() (giraffe_main.cpp:1656)
==2021789== by 0xEA9086: __invoke_impl<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&> (invoke.h:61)
==2021789== by 0xEA9086: __invoke_r<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&> (invoke.h:154)
==2021789== by 0xEA9086: std::_Function_handler<void (vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&)#8}>::_M_invoke(std::_Any_data const&, vg::Alignment&) (std_function.h:290)
==2021789== by 0x106A8FE: operator() (std_function.h:590)
==2021789== by 0x106A8FE: unsigned long vg::io::unpaired_for_each_parallel<vg::Alignment>(std::function<bool (vg::Alignment&)>, std::function<void (vg::Alignment&)>, unsigned long) [clone ._omp_fn.0] (alignment_io.hpp:169)
==2021789== by 0x4E7AA15: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==2021789== by 0x106196C: unpaired_for_each_parallel<vg::Alignment> (alignment_io.hpp:124)
==2021789== by 0x106196C: vg::fastq_unpaired_for_each_parallel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (vg::Alignment&)>, unsigned long) (alignment.cpp:243)
==2021789== by 0xEB0771: main_giraffe(int, char**)::{lambda()#1}::operator()() const (giraffe_main.cpp:1691)
==2021789== by 0xF5254C: operator() (std_function.h:590)
==2021789== by 0xF5254C: operator() (options.cpp:52)
==2021789== by 0xF5254C: __invoke_impl<void, vg::subcommand::TickChainLink::get_iterator()::<lambda(const std::function<void()>&)>&, const std::function<void()>&> (invoke.h:61)
==2021789== by 0xF5254C: __invoke_r<void, vg::subcommand::TickChainLink::get_iterator()::<lambda(const std::function<void()>&)>&, const std::function<void()>&> (invoke.h:154)
==2021789== by 0xF5254C: std::_Function_handler<void (std::function<void ()> const&), vg::subcommand::TickChainLink::get_iterator()::{lambda(std::function<void ()> const&)#1}>::_M_invoke(std::_Any_data const&, std::function<void ()> const&) (std_function.h:290)
==2021789== by 0xEB5B8E: operator() (std_function.h:590)
==2021789== by 0xEB5B8E: main_giraffe(int, char**) (giraffe_main.cpp:1282)
==2021789== by 0xF973AB: operator() (std_function.h:590)
==2021789== by 0xF973AB: vg::subcommand::Subcommand::operator()(int, char**) const (subcommand.cpp:75)
==2021789== Address 0x26921e8f0 is 16 bytes before a block of size 128 alloc'd
==2021789== at 0x4849013: operator new(unsigned long) (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2021789== by 0x168E63F: allocate (new_allocator.h:127)
==2021789== by 0x168E63F: allocate (alloc_traits.h:464)
==2021789== by 0x168E63F: _M_allocate (stl_vector.h:346)
==2021789== by 0x168E63F: _M_realloc_insert<vg::ZipCodeTree::tree_item_type_t, long unsigned int const&, bool> (vector.tcc:440)
==2021789== by 0x168E63F: emplace_back<vg::ZipCodeTree::tree_item_type_t, long unsigned int const&, bool> (vector.tcc:121)
==2021789== by 0x168E63F: vg::ZipCodeForest::add_child_to_chain(vg::ZipCodeForest::forest_growing_state_t&, unsigned long const&, unsigned long const&, bool, bool) (zip_code_tree.cpp:452)
==2021789== by 0x9E7737: void vg::ZipCodeForest::fill_in_forest<vg::MinimizerMapper::Minimizer>(std::vector<vg::SnarlDistanceIndexClusterer::Seed, std::allocator<vg::SnarlDistanceIndexClusterer::Seed> > const&, vg::VectorView<vg::MinimizerMapper::Minimizer> const&, bdsg::SnarlDistanceIndex const&, unsigned long, unsigned long) (zip_code_tree.hpp:1165)
==2021789== by 0x134AB08: vg::MinimizerMapper::map_from_chains(vg::Alignment&) (minimizer_mapper_from_chains.cpp:176)
==2021789== by 0x132A19D: map (minimizer_mapper.cpp:584)
==2021789== by 0x132A19D: vg::MinimizerMapper::map(vg::Alignment&, vg::io::AlignmentEmitter&) (minimizer_mapper.cpp:579)
==2021789== by 0xEA9086: operator() (giraffe_main.cpp:1656)
==2021789== by 0xEA9086: __invoke_impl<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&> (invoke.h:61)
==2021789== by 0xEA9086: __invoke_r<void, main_giraffe(int, char**)::<lambda()>::<lambda(vg::Alignment&)>&, vg::Alignment&> (invoke.h:154)
==2021789== by 0xEA9086: std::_Function_handler<void (vg::Alignment&), main_giraffe(int, char**)::{lambda()#1}::operator()() const::{lambda(vg::Alignment&)#8}>::_M_invoke(std::_Any_data const&, vg::Alignment&) (std_function.h:290)
==2021789== by 0x106A8FE: operator() (std_function.h:590)
==2021789== by 0x106A8FE: unsigned long vg::io::unpaired_for_each_parallel<vg::Alignment>(std::function<bool (vg::Alignment&)>, std::function<void (vg::Alignment&)>, unsigned long) [clone ._omp_fn.0] (alignment_io.hpp:169)
==2021789== by 0x4E7AA15: GOMP_parallel (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==2021789== by 0x106196C: unpaired_for_each_parallel<vg::Alignment> (alignment_io.hpp:124)
==2021789== by 0x106196C: vg::fastq_unpaired_for_each_parallel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void (vg::Alignment&)>, unsigned long) (alignment.cpp:243)
==2021789== by 0xEB0771: main_giraffe(int, char**)::{lambda()#1}::operator()() const (giraffe_main.cpp:1691)
==2021789== by 0xF5254C: operator() (std_function.h:590)
==2021789== by 0xF5254C: operator() (options.cpp:52)
==2021789== by 0xF5254C: __invoke_impl<void, vg::subcommand::TickChainLink::get_iterator()::<lambda(const std::function<void()>&)>&, const std::function<void()>&> (invoke.h:61)
==2021789== by 0xF5254C: __invoke_r<void, vg::subcommand::TickChainLink::get_iterator()::<lambda(const std::function<void()>&)>&, const std::function<void()>&> (invoke.h:154)
==2021789== by 0xF5254C: std::_Function_handler<void (std::function<void ()> const&), vg::subcommand::TickChainLink::get_iterator()::{lambda(std::function<void ()> const&)#1}>::_M_invoke(std::_Any_data const&, std::function<void ()>const&) (std_function.h:290)
==2021789== by 0xEB5B8E: operator() (std_function.h:590)
==2021789== by 0xEB5B8E: main_giraffe(int, char**) (giraffe_main.cpp:1282)
This is with jemalloc=off so Valgrind can work, at commit 7b11d89622f7ccc227e75c3577e07970f7240d1b.
I also, separately, got a crash under gdb where the allocator (not jemalloc but the normal one, since I had jemalloc=off) detected somebody tampering with the free-list metadata that it puts before each block. It looks like it had gone to allocate from a free block, but the allocated block right before it had overwritten the first 8 bytes of the next block's header. So someone wrote 8 bytes past the end of their memory allocation. That looked like this in memory:
0x7ff9c0b10bf0: 0x10 0x76 0x10 0xc0 0xf9 0x7f 0x00 0x00
0x7ff9c0b10bf8: 0x2e 0x04 0x00 0x00 0x00 0x00 0x00 0x00
0x7ff9c0b10c00: 0x70 0x09 0xc5 0xc0 0xf9 0x7f 0x00 0x00
0x7ff9c0b10c08: 0x70 0x09 0xc5 0xc0 0xf9 0x7f 0x00 0x00
0x7ff9c0b10c10: 0x2c 0x04 0x00 0x00 0x00 0x00 0x00 0x00
0x7ff9c0b10c18: 0x10 0xeb 0xc4 0xc0 0xf9 0x7f 0x00 0x00
0x7ff9c0b10c20: 0x10 0xeb 0xc4 0xc0 0xf9 0x7f 0x00 0x00
0x7ff9c0b10c28: 0x2a 0x04 0x00 0x00 0x00 0x00 0x00 0x00
0x7ff9c0b10c30: 0x50 0xd7 0xb0 0xc0 0xf9 0x7f 0x00 0x00
0x7ff9c0b10c38: 0x50 0xd7 0xb0 0xc0 0xf9 0x7f 0x00 0x00
0x7ff9c0b10c40: 0x28 0x04 0x00 0x00 0x00 0x00 0x00 0x00
0x7ff9c0b10c48: 0x80 0x81 0x0c 0xc0 0xf9 0x7f 0x00 0x00
0x7ff9c0b10c50: 0x80 0x81 0x0c 0xc0 0xf9 0x7f 0x00 0x00 <-- p (malloc data structure starts here)
0x7ff9c0b10c58: 0x91 0x0f 0x00 0x00 0x00 0x00 0x00 0x00
0x7ff9c0b10c60: 0xa0 0x06 0x00 0xc0 0xf9 0x7f 0x00 0x00
0x7ff9c0b10c68: 0xb0 0xbf 0x04 0xc0 0xf9 0x7f 0x00 0x00
It looks like someone is writing triples of (number, pointer, same pointer again), and the last one has its final pointer out of bounds of the allocated block of memory.
For the valgrind run I used /private/home/anovak/workspace/vg/trash/min-first.fq and one thread, and for running under gdb I used /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/reads/sim/r10/HG002/HG002-sim-r10-1m.fq and 128 threads.
I also noticed that if it runs out of memory, the Banded Global Aligner might double-free: it frees in its allocation-has-failed code, and it also frees when the object is destroyed. But I don't think that's the real problem.
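The suspected double-free pattern, sketched with hypothetical names (not vg's actual classes): freeing in the failure path and again in the destructor is only safe if the pointer is nulled in between, because delete[] on nullptr is a no-op.

```cpp
#include <cstddef>
#include <new>

// Hypothetical stand-in for the aligner's band matrix storage.
struct BandMatrix {
    double* cells = nullptr;

    bool init(std::size_t n) {
        cells = new (std::nothrow) double[n];
        return cells != nullptr;
    }

    // Failure-path cleanup: freeing is fine, but the pointer MUST be nulled,
    // or the destructor below will free the same block a second time.
    void abort_cleanup() {
        delete[] cells;
        cells = nullptr;  // omitting this line is the double-free bug
    }

    ~BandMatrix() { delete[] cells; }  // no-op if cells was nulled
};
```

Holding the band in a std::vector or std::unique_ptr instead would make this whole class of bug impossible, at the cost of losing the explicit failure path.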
OK, in f692a15ad9047efe2db7102ff72e9fa24533dc9d I have a make jemalloc=off asan=on build we can do to use the address sanitizer, which is much faster than valgrind and actually lets us multithread, but still catches the out of bounds zipcode tree read (which, it looks like, actually predates the fancy new dagification logic).
I also have a possible fix in that line of commits: just don't access the vector it is looking at if it is empty. That lets me seemingly reliably run through:
vg giraffe -t128 --parameter-preset lr --progress -Z /private/groups/patenlab/xhchang/graphs/hprc_1.1_d9/hprc-v1.1-mc-chm13.d9.gbz -d /private/groups/patenlab/xhchang/graphs/hprc_1.1_d9/hprc-v1.1-mc-chm13.d9.dist -m /private/groups/patenlab/xhchang/graphs/hprc_1.1_d9/hprc-v1.1-mc-chm13.d9.min.k29.w11 -z /private/groups/patenlab/xhchang/graphs/hprc_1.1_d9/hprc-v1.1-mc-chm13.d9.zips.k29.w11 -G /private/groups/patenlab/xhchang/graphs/1000gp/novaseq6000.gam --discard
And I don't crash or trip the sanitizer.
Next I will try the long reads again.
I think the out of bounds read in the close_chain problem sort of fits the profile for what we're looking for, though: on the rare occasion that the out of bounds read turns up an apparent EDGE type, it then goes and pops something off a vector which might not even exist, which is going to do exciting things at weird places in memory, very rarely.
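A hedged sketch of that kind of guard (hypothetical names standing in for the zip code tree internals, not vg's actual code): never read the last item's type from an empty tree, so garbage bytes can never be misread as an EDGE.

```cpp
#include <vector>

// Hypothetical stand-ins for the zip code tree item types.
enum class tree_item_type_t { SEED, EDGE, CHAIN_START, CHAIN_END };

// Reading the type of the last tree item is only safe when the vector is
// nonempty. Without the guard, back() on an empty vector reads before the
// allocation; if those garbage bytes happen to decode as EDGE, later code
// pops from another vector that may also be empty, corrupting memory at
// far-away addresses, rarely and unpredictably.
bool last_item_is_edge(const std::vector<tree_item_type_t>& tree) {
    if (tree.empty()) {
        return false;  // the fix: never touch tree.back() when empty
    }
    return tree.back() == tree_item_type_t::EDGE;
}
```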
With a normal build (jemalloc on, sanitizer off) of 8ec899112978a01fbbf454c2da8208909e1b685b I have put something like 300 CPU hours into:
/usr/bin/time -v vg giraffe -t128 --parameter-preset lr --progress \
-Z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.gbz \
-d /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.dist \
-m /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.withzip.min \
-z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.zipcodes \
-f /private/groups/migalab/kkyriaki/rotation/batman/data/corrected/3_simplex_EC_noIUPAC.fastq \
> trash/crash2.gam
On the one hand, that's good because it hasn't crashed yet. On the other hand, that's bad because it seems way slower than it should be. I think b0aa460546beafe9e4babf4629b4c756134c468d, without @xchang1's new non-DAG snarl handling, is probably much faster and should have the same bugfix.
Hopefully it will finish overnight and I can produce a speed number.
OK, I managed to run through in 6 hours on the crash 2 read set:
Mapped 2512113 reads across 128 threads in 22562.9 seconds with 0 additional single-threaded seconds.
Mapping speed: 0.869831 reads per second per thread
Used 2.84944e+06 CPU-seconds (including output).
Achieved 0.881616 reads per CPU-second (including output)
Memory footprint: 481.022 GB
Command being timed: "vg giraffe -t128 --parameter-preset lr --progress -Z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.gbz -d /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.dist -m /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.withzip.min -z /private/groups/patenlab/anovak/projects/hprc/lr-giraffe/graphs/hprc-v1.1-mc-chm13.d9.k31.w50.W.zipcodes -f /private/groups/migalab/kkyriaki/rotation/batman/data/corrected/3_simplex_EC_noIUPAC.fastq"
User time (seconds): 827686.63
System time (seconds): 2021886.16
Percent of CPU this job got: 12561%
Elapsed (wall clock) time (h:mm:ss or m:ss): 6:18:04
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 504388028
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 397
Minor (reclaiming a frame) page faults: 70774938350
Voluntary context switches: 663949
Involuntary context switches: 7295548
Swaps: 0
File system inputs: 145756024
File system outputs: 35642192
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
It isn't even super slow in reads-per-second, just in total CPU time.
I think this is fixed; I will kick a binary over to @kokyriakidis to try.
Konstantinos managed to crash my Long Read Giraffe branch (commit 2c6d15569), and the resulting stack trace didn't include the read name for the thread (maybe we can't get it for a signal?).
We need to collect the data to reproduce this and ID the offending read, and figure out what is going wrong inside the Banded Global Aligner. I think it is running with some variant of: