vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Signal 11 occurred for vg prune #2417

Open eldariont opened 4 years ago

eldariont commented 4 years ago

Hi,

  1. What you were trying to do: I was trying to install and run vg on our institute's Linux workstations. Compiling from source ran smoothly. Then I ran:

    vg construct -r chr10.fa -v variants.vcf -S -a -f -p > graph.vg
    vg prune -p -r graph.vg > graph.pruned.vg
  2. What you wanted to happen: I wanted to get a pruned graph.

  3. What actually happened: Pruning crashed with this:

    Original graph graph.vg: 7261852 nodes, 9517200 edges
    ERROR: Signal 11 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
    Stack trace path: /tmp/vg_crash_dBTngU/stacktrace.txt

The stacktrace contains very little information:

Crash report for vg v1.18.0-117-gfd8372336 "Zungoli"
Stack trace (most recent call last):
#9    Object "", at 0xffffffffffffffff, in 
#8    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0x83eae9, in 
#7    Object "/lib/libc-2.27.so", at 0x7fb891d10b5d, in __libc_start_main
#6    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0x784120, in 
#5    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xca0767, in 
#4    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xcd64f8, in 
#3    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xf877cb, in 
#2    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xf864ab, in 
#1    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xf97b41, in 
#0    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xf974fe, in

The strange thing is that I compiled and ran the same version of vg on courtyard and got a pruned graph without any problem. So the crash seems to be specific to our Linux system/machine.

Output of uname -a: Linux elcattivo.molgen.mpg.de 4.14.87.mx64.236 #1 SMP Mon Dec 10 09:48:57 CET 2018 x86_64 GNU/Linux

  4. What data and command line to use to make the problem recur, if applicable: chr10.fa variants.vcf

I know that this is not much information, given that this error cannot be reproduced easily. But I'd be happy to follow any advice on how to get a more verbose stack trace, or any other step that might help you identify the error.

Cheers David

glennhickey commented 4 years ago

That (unsurprisingly) works fine here. It looks like it fails when it's trying to make the XG index. I assume you have the ~20G of RAM free to do this. Perhaps turning off optimizations (replace -O3 with -O0 in the Makefile, then rebuild everything) or doing make static will help? Also, maybe running through gdb will give you a better stack (though I don't think I've seen this happen)?

eldariont commented 4 years ago

Thanks, Glenn. I just wanted to try making a static build, but before that I ran the same command as above (just to make sure it still crashes). It crashed again with the same signal 11, but this time the stack trace actually contains useful information (its size is now 5K instead of 880B):

Crash report for vg v1.18.0-117-gfd8372336 "Zungoli"
Stack trace (most recent call last):
#8    Object "", at 0xffffffffffffffff, in 
#7    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0x83eae9, in _start
#6    Object "/lib/libc-2.27.so", at 0x7fe23d75fb5d, in __libc_start_main
#5    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0x784120, in main
#4    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xca0767, in vg::subcommand::Subcommand::operator()(int, char**) const
#3    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xcd64f8, in main_prune(int, char**)
#2    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xf874a4, in vg::Paths::rebuild_mapping_aux()
#1    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xf99be9, in std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> >& spp::sparse_hashtable<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> >, vg::mapping_t const*, vg::wang_hash<vg::mapping_t const*, void>, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SelectKey, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SetKey, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::find_or_insert<spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::DefaultValue>(vg::mapping_t const* const&)
#0    Object "/project/heller_d-data/graph-genomes/bin/vg/bin/vg", at 0xf995fb, in spp::sparse_hashtable<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> >, vg::mapping_t const*, vg::wang_hash<vg::mapping_t const*, void>, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SelectKey, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SetKey, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::sparse_hashtable(spp::sparse_hashtable<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> >, vg::mapping_t const*, vg::wang_hash<vg::mapping_t const*, void>, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SelectKey, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SetKey, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::MoveDontCopyT, 
spp::sparse_hashtable<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> >, vg::mapping_t const*, vg::wang_hash<vg::mapping_t const*, void>, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SelectKey, spp::sparse_hash_map<vg::mapping_t const*, std::pair<std::_List_iterator<vg::mapping_t>, long>, vg::wang_hash<vg::mapping_t const*, void>, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >::SetKey, std::equal_to<vg::mapping_t const*>, spp::libc_allocator<std::pair<vg::mapping_t const* const, std::pair<std::_List_iterator<vg::mapping_t>, long> > > >&, unsigned long)

I swear I didn't change anything in the command. No idea why the stack trace looks so different now. Oh, and I forgot to mention that I have 1TB of memory on the machine, so that shouldn't be a problem.

eldariont commented 4 years ago

I tried compiling with optimizations turned off. It still crashes:

ERROR: Signal 11 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_7bwBlS/stacktrace.txt
>cat /tmp/vg_crash_7bwBlS/stacktrace.txt
Crash report for vg v1.18.0-141-gb3ed0145a "Zungoli"
Stack trace (most recent call last):
#23   Object "", at 0xffffffffffffffff, in
#22   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1853329, in
#21   Object "/lib/libc-2.27.so", at 0x7ffb3cd7fb5d, in __libc_start_main
#20   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x18536a5, in
#19   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1e70577, in
#18   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1e7135b, in
#17   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1df21c3, in
#16   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1ebab57, in
#15   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x21f5e0f, in
#14   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x21f616f, in
#13   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x21f7e1e, in
#12   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1ebb5fb, in
#11   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x1eba1b6, in
#10   Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x240d28a, in
#9    Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x240d7f3, in
#8    Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x22d3c14, in
#7    Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x22de8e8, in
#6    Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x22e0dd7, in
#5    Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x22e5a01, in
#4    Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x22ec8d9, in
#3    Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x2145ebe, in
#2    Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x2149fd3, in
#1    Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x214d3e7, in
#0    Object "/project/heller_d-data/graph-genomes/bin/vg_o/bin/vg", at 0x214f6a4, in

Unfortunately, I cannot make a static build because I don't have all the static libraries. But running the release build downloaded from github produces a different result:

Original graph graph.vg: 7261852 nodes, 9517200 edges
Built a temporary XG index
Removed all paths
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_Igq1Bs/stacktrace.txt
cat /tmp/vg_crash_Igq1Bs/stacktrace.txt
Crash report for vg v1.18.0 "Zungoli"
Stack trace (most recent call last):
#14   Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x4d5819, in
#13   Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x1a5d498, in
#12   Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x4098a2, in
#11   Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x9ac137, in
#10   Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x96985f, in
#9    Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0xba8407, in
#8    Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0xba7c95, in
#7    Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0xb810c9, in
#6    Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x8793f9, in
#5    Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x199d568, in
#4    Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x199c1f0, in
#3    Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x199c1a5, in
#2    Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x1a36a34, in
#1    Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x1a6dd60, in
#0    Object "/project/heller_d-data/graph-genomes/bin/v18/vg", at 0x10da587, in
buczek commented 4 years ago

I work at the same organization as @eldariont, have access to his data and analyzed this a bit.

It's a badly handled malloc() failure, which occurs because the maximum number of mmap() mappings for the process is exhausted. The jemalloc submodule is causing severe virtual memory fragmentation.

The segfault happens because _alloc_group_array from the sparsepp submodule doesn't check for NULL values returned by the allocator [1]. The default allocator just uses malloc() [2], which, of course, can return NULL. This NULL pointer later gets dereferenced.

The system is not at all out of memory and the user doesn't hit a memory resource limit. But there is an exorbitant number of anonymous mappings in use, which exhausts the system limit:

strace -p $(pgrep vg)
[...]
mmap(0x7f888d544000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f888d544000
mmap(0x7f888d912000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f888d912000
mmap(0x7f889cedc000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f889cedc000
mmap(0x7f889d056000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f889d056000
mmap(0x7f889d078000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f889d078000
mmap(0x7f889d0b1000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
madvise(0x7f889d0b1000, 8192, MADV_DONTNEED) = 0
mmap(0x7f889d22e000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
madvise(0x7f889d22e000, 8192, MADV_DONTNEED) = 0
mmap(0x7f889d270000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
madvise(0x7f889d270000, 8192, MADV_DONTNEED) = 0
mmap(0x7f889d291000, 8192, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
madvise(0x7f889d291000, 8192, MADV_DONTNEED) = 0
[....]

buczek@kreios:~/test_vg/vg (master)$ while cat /proc/$(pgrep vg)/maps|wc;do sleep .1;done
[...]
  65531  327804 3218655
  65532  327809 3218704
  65532  327809 3218704
  65532  327809 3218704
  65532  327809 3218704
  65532  327809 3218704
      0       0       0
      0       0       0
      0       0       0
cat: /proc//maps: No such file or directory
      0       0       0
cat: /proc//maps: No such file or directory
[....]

buczek@kreios:~/test_vg/vg (master)$ sysctl vm.max_map_count
vm.max_map_count = 65530

After disabling jemalloc in the Makefile, the failing command completed successfully with just 622 maps.

Notes:

[1] https://github.com/greg7mdp/sparsepp/blob/master/sparsepp/spp.h#L1916 [2] https://github.com/greg7mdp/sparsepp/blob/master/sparsepp/spp_utils.h#L412

adamnovak commented 4 years ago

Are you perhaps running with the sysctl vm.overcommit_memory set to 2 (i.e. never overcommit memory)? Apparently jemalloc doesn't like it when you do that, and despite that issue having been closed, it still doesn't really work well in that situation. I think when it sees that overcommit is off, it tries to release unused but allocated memory to the OS by changing the permissions on the pages, but this then fragments the memory mapping.

There are some configuration changes we could apply to maybe make it work a bit better.

buczek commented 4 years ago

You are right. We do have vm.overcommit_memory=2, and setting it to 0 also prevents the problem from happening. Max 460 mappings in this case.

Setting MALLOC_CONF="retain:true", as suggested in the referenced thread, does not help. Maybe the version of jemalloc used here doesn't support that yet.

adamnovak commented 4 years ago

OK, it sounds like this is mostly an upstream problem with jemalloc. I've asked over there if there's a real solution to the problem; they also have some other malloc config settings that we could potentially make vg apply if it detects that memory overcommit is disabled. We could also, I think, just make it fall back to the system allocator in that case, somehow.

buczek commented 4 years ago

Maybe jemalloc's opt.retain [1] is the option to go for:

If true, retain unused virtual memory for later reuse rather than discarding it by calling munmap(2) or equivalent (see stats.retained for related details). It also makes jemalloc use mmap(2) or equivalent in a more greedy way, mapping larger chunks in one go. This option is disabled by default unless discarding virtual memory is known to trigger platform-specific performance problems, namely 1) for [64-bit] Linux, which has a quirk in its virtual memory allocation algorithm that causes semi-permanent VM map holes under normal jemalloc operation; and 2) for [64-bit] Windows, which disallows split / merged regions with MEM_RELEASE. Although the same issues may present on 32-bit platforms as well, retaining virtual memory for 32-bit Linux and Windows is disabled by default due to the practical possibility of address space exhaustion.

However, the text suggests that this option already defaults to true on 64-bit Linux. So I wonder whether the problem still exists when the latest version of jemalloc is used. Otherwise it should be fixed there. vg, as the user of the allocator, should not be required to select configuration options based on the configuration of the operating system below it. These implementation details and requirements should be transparent to the user of jemalloc (in my hypothetical perfect small world, that is).

[1] http://jemalloc.net/jemalloc.3.html#opt.retain

buczek commented 4 years ago

Sorry, retain doesn't help, as retain:true is already the default (at least after I updated the submodule version of jemalloc to its 5.2.1 tag).

Compiling jemalloc with --with-lg-page=16 reduces the number of mappings with our data to ca. 21000, which is below the system limit. But it still doesn't feel safe.

Adding MALLOC_CONF=confirm_conf:true,muzzy_decay_ms:0,dirty_decay_ms:30000 does not further reduce this number.

Disabling jemalloc increases the runtime from 627 seconds to 748 seconds in my test setting.