eldariont opened this issue 4 years ago
That (unsurprisingly) works fine here. It looks like it fails when it's trying to make the xg index. I assume you have the ~20 GB of RAM free to do this. Perhaps turning off optimizations (replace -O3 with -O0 in the Makefile, then rebuild everything) or doing make static will help? Also, maybe running through gdb will give you a better stack trace (though I don't think I've seen this happen)?
Thanks, Glenn. I just wanted to try making a static build, but before that I ran the same command as above (just to make sure it still crashes). It crashed again with the same signal 11, but this time the stack trace actually contains useful information (its size is now 5K instead of 880B):
Crash report for vg v1.18.0-117-gfd8372336 "Zungoli"
Stack trace (most recent call last):
#8 Object "", at 0xffffffffffffffff, in
#7 Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0x83eae9, in _start
#6 Object "/lib/libc-2.27.so", at 0x7fe23d75fb5d, in __libc_start_main
#5 Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0x784120, in main
#4 Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xca0767, in vg::subcommand::Subcommand::operator()(int, char**) const
#3 Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xcd64f8, in main_prune(int, char**)
#2 Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xf874a4, in vg::Paths::rebuild_mapping_aux()
#1 Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xf99be9, in std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> >& spp::sparse_hashtable<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> >, vg::mapping_t const*, vg::wang_hash<vg::mapping_t const*, void>, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SelectKey, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SetKey, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::find_or_insert<spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::DefaultValue>(vg::mapping_t const* const&)
#0 Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xf995fb, in spp::sparse_hashtable<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> >, vg::mapping_t const*, vg::wang_hash<vg::mapping_t const*, void>, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SelectKey, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SetKey, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::sparse_hashtable(spp::sparse_hashtable<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> >, vg::mapping_t const*, vg::wang_hash<vg::mapping_t const*, void>, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SelectKey, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SetKey, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::MoveDontCopyT, 
spp::sparse_hashtable<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> >, vg::mapping_t const*, vg::wang_hash<vg::mapping_t const*, void>, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SelectKey, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SetKey, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >&, unsigned long)
I swear I didn't change anything in the command. No idea why the stack trace looks so different now. Oh, and I forgot to mention that I have 1 TB of memory on the machine, so that shouldn't be a problem.
I tried compiling with optimizations turned off. It still crashes:
ERROR: Signal 11 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_7bwBlS/stacktrace.txt
>cat /tmp/vg_crash_7bwBlS/stacktrace.txt
Crash report for vg v1.18.0-141-gb3ed0145a "Zungoli"
Stack trace (most recent call last):
#23 Object "", at 0xffffffffffffffff, in
#22 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1853329, in
#21 Object "/lib/libc-2.27.so", at 0x7ffb3cd7fb5d, in __libc_start_main
#20 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x18536a5, in
#19 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1e70577, in
#18 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1e7135b, in
#17 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1df21c3, in
#16 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1ebab57, in
#15 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x21f5e0f, in
#14 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x21f616f, in
#13 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x21f7e1e, in
#12 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1ebb5fb, in
#11 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1eba1b6, in
#10 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x240d28a, in
#9 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x240d7f3, in
#8 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x22d3c14, in
#7 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x22de8e8, in
#6 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x22e0dd7, in
#5 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x22e5a01, in
#4 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x22ec8d9, in
#3 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x2145ebe, in
#2 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x2149fd3, in
#1 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x214d3e7, in
#0 Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x214f6a4, in
Unfortunately, I cannot make a static build because I don't have all the static libraries. But running the release build downloaded from GitHub produces a different result:
Original graph graph.vg: 7261852 nodes, 9517200 edges
Built a temporary XG index
Removed all paths
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_Igq1Bs/stacktrace.txt
cat /tmp/vg_crash_Igq1Bs/stacktrace.txt
Crash report for vg v1.18.0 "Zungoli"
Stack trace (most recent call last):
#14 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x4d5819, in
#13 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x1a5d498, in
#12 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x4098a2, in
#11 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x9ac137, in
#10 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x96985f, in
#9 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0xba8407, in
#8 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0xba7c95, in
#7 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0xb810c9, in
#6 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x8793f9, in
#5 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x199d568, in
#4 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x199c1f0, in
#3 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x199c1a5, in
#2 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x1a36a34, in
#1 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x1a6dd60, in
#0 Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x10da587, in
I work at the same organization as @eldariont, have access to his data, and have analyzed this a bit.
It's a badly handled malloc failure: the process has exhausted its maximum number of mmap() mappings. The jemalloc submodule is causing severe virtual memory fragmentation. The segfault happens because _alloc_group_array from the sparsepp submodule doesn't check for NULL values returned by the allocator [1]. The default allocator just uses malloc() [2], which, of course, can return NULL. That null pointer is later dereferenced.
The system is not at all out of memory, and the user doesn't hit a memory resource limit. But the process uses an exorbitant number of anonymous mappings, which exhausts the system limit:
strace -p $(pgrep vg)
[...]
mmap(0x7f888d544000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f888d544000
mmap(0x7f888d912000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f888d912000
mmap(0x7f889cedc000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f889cedc000
mmap(0x7f889d056000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f889d056000
mmap(0x7f889d078000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f889d078000
mmap(0x7f889d0b1000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
madvise(0x7f889d0b1000, 8192, MADV_DONTNEED) = 0
mmap(0x7f889d22e000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
madvise(0x7f889d22e000, 8192, MADV_DONTNEED) = 0
mmap(0x7f889d270000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
madvise(0x7f889d270000, 8192, MADV_DONTNEED) = 0
mmap(0x7f889d291000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
madvise(0x7f889d291000, 8192, MADV_DONTNEED) = 0
[....]
buczek@kreios:~/test_vg/vg (master)$ while cat /proc/$(pgrep vg)/maps|wc;do sleep .1;done
[...]
65531 327804 3218655
65532 327809 3218704
65532 327809 3218704
65532 327809 3218704
65532 327809 3218704
65532 327809 3218704
0 0 0
0 0 0
0 0 0
cat: /proc//maps: No such file or directory
0 0 0
cat: /proc//maps: No such file or directory
[....]
buczek@kreios:~/test_vg/vg (master)$ sysctl vm.max_map_count
vm.max_map_count = 65530
After disabling jemalloc in the Makefile, the failing command completed successfully with just 622 maps.
Notes:
[1] https://github.com/greg7mdp/sparsepp/blob/master/sparsepp/spp.h#L1916
[2] https://github.com/greg7mdp/sparsepp/blob/master/sparsepp/spp_utils.h#L412
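The diagnosis above can be reproduced with a short shell check that compares a process's mapping count against the kernel limit. This is a sketch: the process name vg comes from this thread, and the script falls back to inspecting itself if no vg process is running.

```shell
#!/bin/sh
# Compare a process's current number of memory mappings against the
# system-wide limit (vm.max_map_count). Each line in /proc/<pid>/maps is
# one mapping; falls back to this shell's own pid if no vg is running.
pid=$(pgrep -n vg 2>/dev/null || echo $$)
maps=$(wc -l < "/proc/$pid/maps")
limit=$(cat /proc/sys/vm/max_map_count)
echo "pid $pid uses $maps of $limit mappings"
```

When the reported usage approaches the limit, further mmap() calls fail with ENOMEM exactly as in the strace output above.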
Are you perhaps running with the sysctl vm.overcommit_memory set to 2 (i.e. never overcommit memory)? Apparently jemalloc doesn't like it when you do that, and despite that issue having been closed, it still doesn't really work well in that situation. I think when it sees that overcommit is off, it tries to release unused but allocated memory to the OS by changing the permissions on the pages, but this then fragments the memory mapping.
There are some configuration changes we could apply to maybe make it work a bit better.
You are right. We do have vm.overcommit_memory=2, and setting it to 0 also prevents the problem from happening. Max 460 mappings in this case.
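For anyone checking their own system: the policy can be inspected through procfs, and changed with sysctl (the write requires root, so it is only shown as a comment here).

```shell
# 0 = heuristic overcommit (the kernel default), 1 = always overcommit,
# 2 = never overcommit. The fragmentation in this thread only appeared
# under mode 2.
cat /proc/sys/vm/overcommit_memory
# Switching back to the heuristic policy (needs root):
# sysctl -w vm.overcommit_memory=0
```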
Setting MALLOC_CONF="retain:true", as suggested in the referenced thread, does not help. Maybe the version of jemalloc used here doesn't support that yet.
OK, it sounds like this is mostly an upstream problem with jemalloc. I've asked over there if there's a real solution to the problem; they also have some other malloc config settings that we could potentially make vg apply if it detects that memory overcommit is disabled. We could also I think just make it fall back to the system allocator in that case, somehow.
Maybe jemalloc's opt.retain [1] is the option to go for:
If true, retain unused virtual memory for later reuse rather than discarding it by calling munmap(2) or equivalent (see stats.retained for related details). It also makes jemalloc use mmap(2) or equivalent in a more greedy way, mapping larger chunks in one go. This option is disabled by default unless discarding virtual memory is known to trigger platform-specific performance problems, namely 1) for [64-bit] Linux, which has a quirk in its virtual memory allocation algorithm that causes semi-permanent VM map holes under normal jemalloc operation; and 2) for [64-bit] Windows, which disallows split / merged regions with MEM_RELEASE. Although the same issues may present on 32-bit platforms as well, retaining virtual memory for 32-bit Linux and Windows is disabled by default due to the practical possibility of address space exhaustion.
However, the text suggests that this option already defaults to true on 64-bit Linux. So I wonder whether the problem still exists if the latest version of jemalloc is used. Otherwise, it should be fixed there. vg, as the user of the allocator, should not be required to select configuration options based on the configuration of the operating system below it. These implementation details and requirements should be transparent to the user of jemalloc (in my hypothetical perfect small world, that is).
Sorry, retain doesn't help, as retain:true is already the default (at least after I updated the submodule version of jemalloc to its 5.2.1 tag).
Compiling jemalloc with --with-lg-page=16 reduces the number of mappings with our data to ca. 21000, which is below the system limit. But it still doesn't feel safe. Adding MALLOC_CONF=confirm_conf:true,muzzy_decay_ms:0,dirty_decay_ms:30000 does not reduce this number further.
Disabling jemalloc increases the runtime from 627 seconds to 748 seconds in my test setting.
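For anyone repeating these experiments: jemalloc reads its runtime options from the MALLOC_CONF environment variable at process startup, so no rebuild is needed to try them. The vg invocation itself is elided here; any jemalloc-linked binary picks the variable up.

```shell
# confirm_conf:true makes jemalloc print the options it actually applied,
# which is handy for verifying that a given setting is supported by the
# bundled jemalloc version at all.
export MALLOC_CONF="confirm_conf:true,muzzy_decay_ms:0,dirty_decay_ms:30000"
echo "MALLOC_CONF=$MALLOC_CONF"
# ./vg prune ...   (re-run the failing command under these settings)
```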
Hi,
What you were trying to do
I was trying to install and run vg on our institute's Linux workstations. Compiling from source ran smoothly. Then I ran:
What you wanted to happen
I wanted to get a pruned graph.
What actually happened
Pruning crashed with this:
The stack trace contains very little information:
The strange thing is that I compiled and ran the same version of vg on courtyard and got a pruned graph without any problem. So the crash seems to be specific to our Linux system/machine.
Output of uname -a: Linux elcattivo.molgen.mpg.de 4.14.87.mx64.236 #1 SMP Mon Dec 10 09:48:57 CET 2018 x86_64 GNU/Linux
I know that this is not much information, given that this error cannot be reproduced easily. But I'd be happy to follow any advice on how to get a more verbose stack trace or any other step that might help you identify the error.
Cheers, David