gmarcais / Jellyfish

A fast multi-threaded k-mer counter

Memory allocation error due to large hash leaves behind orphaned generator processes #162

Open cerebis opened 4 years ago

cerebis commented 4 years ago

Situation

This occurs when counting k-mers with generators (-g) while requesting a hash size larger than the available memory.

E.g. on a 4GB VM, the following command aborts with SIGABRT after attempting to allocate 8GB for the hash.

jellyfish count -C -m 24 -s 2G -g input.gen -o db.jf

Outcome

Jellyfish aborts with SIGABRT, but the child process tree belonging to the generators is orphaned and subsequently adopted by systemd.

Console error.

terminate called after throwing an instance of 'jellyfish::large_hash::array_base<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 0>, unsigned long, atomic::gcc, jellyfish::large_hash::unbounded_array<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 0>, unsigned long, atomic::gcc, allocators::mmap> >::ErrorAllocation'
  what():  Failed to allocate 8000000000 bytes of memory

For a single-threaded generator, this leaves behind the following processes, where PID 1775 is systemd.

UID         PID   PPID  C STIME TTY          TIME CMD
ubuntu   20020  20010  0 01:14 pts/0    00:00:05 zsh
ubuntu   52084   1775  0 12:32 pts/0    00:00:00 jellyfish count -C -m 24 -s 2G -g input.gen -o db.jf
ubuntu   52085  52084  0 12:32 pts/0    00:00:00 jellyfish count -C -m 24 -s 2G -g input.gen -o db.jf
ubuntu   52153  20020  0 12:34 pts/0    00:00:00 ps -f

Solution

Looking at the code, count_main.cc sets up a sigaction for SIGTERM. Adding an identical sigaction for SIGABRT results in one of these orphans being cleaned up but not both. It looks like someone more conversant with your codebase could improve this to properly clean up the generator_manager.

E.g.

UID         PID   PPID  C STIME TTY          TIME CMD
ubuntu   20020  20010  0 01:14 pts/0    00:00:05 zsh
ubuntu   52893   1775  0 12:58 pts/0    00:00:00 .local/bin/jellyfish count -C -m 24 -s 2G -g input.gen -o db.jf
ubuntu   52899  20020  0 12:58 pts/0    00:00:00 ps -f
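
A minimal sketch of the kind of SIGABRT handler described above, assuming the parent keeps the PID of its generator child in a global. The names g_child_pid, on_abort and install_abort_cleanup are hypothetical and are not taken from count_main.cc:

```cpp
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical global holding the PID of the forked generator child.
// Jellyfish's real bookkeeping may look quite different.
volatile pid_t g_child_pid = -1;

// Signal handler: ask the child to terminate, then let the signal take
// its default action so the parent still dies with SIGABRT.
// Only async-signal-safe calls (kill, signal, raise) are used here.
extern "C" void on_abort(int sig) {
    if (g_child_pid > 0) {
        kill(g_child_pid, SIGTERM);
    }
    signal(sig, SIG_DFL);
    raise(sig);
}

// Install the handler for SIGABRT, remembering which child to signal.
// Returns true if sigaction() succeeded.
bool install_abort_cleanup(pid_t child) {
    g_child_pid = child;
    struct sigaction act;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    act.sa_handler = on_abort;
    return sigaction(SIGABRT, &act, nullptr) == 0;
}
```

A handler like this only signals the one PID it knows about. Anything that child has spawned itself is not reached unless the child forwards the signal; whether that explains the one remaining orphan above is a guess.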

cerebis commented 4 years ago

I found that prctl could be used in Linux to enable clean-up of children when their parent process dies, but I later learned that this system function is not available under OSX.

Out of curiosity I took to OSX to see if I could reproduce the problem there. In testing on a non-virtual Mac with 32GB of physical memory and default swap of 11G, I found that I could not cause the allocation error even with a hash size of 2000G.

Therefore I suppose this error mode is unlikely and more a product of a tiny 4GB virtual machine.
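
To check whether a given machine would hit this error, one could probe a similar allocation directly. The sketch below assumes the hash is backed by an anonymous private mmap, which the allocators::mmap name in the error message suggests but which I have not confirmed in the source. The function name can_map_anonymous is hypothetical:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Try to reserve `bytes` of anonymous memory and release it again.
// Returns true if the kernel accepted the mapping request.
// Success here does not guarantee the pages can all be touched later.
bool can_map_anonymous(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return false;
    }
    munmap(p, bytes);
    return true;
}
```

Calling can_map_anonymous(8000000000) on the 4GB VM and on the Mac would show whether the two systems simply differ in how readily they accept mappings larger than physical memory.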

I have made a fork of the codebase if you wish to see the small change to use prctl. Perhaps there is a means of doing this in OSX, but in my search I could not find any discussions.
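
For readers who do not want to dig through the fork, the prctl approach can be sketched roughly as follows. This is an illustration under assumptions rather than the actual change: die_with_parent and fork_generator_child are hypothetical names, and the code is guarded so that it compiles to a no-op where prctl is unavailable.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef __linux__
#include <sys/prctl.h>
#endif

// Ask the kernel to send `sig` to the calling process when its parent dies.
// Returns true if the request was armed, false on failure or on platforms
// without prctl (e.g. OSX).
bool die_with_parent(pid_t expected_parent, int sig = SIGTERM) {
#ifdef __linux__
    if (prctl(PR_SET_PDEATHSIG, sig) == -1) {
        return false;
    }
    // Close the race: the parent may have died before prctl took effect,
    // in which case we have already been re-parented.
    if (getppid() != expected_parent) {
        raise(sig);
    }
    return true;
#else
    (void)expected_parent;
    (void)sig;
    return false;
#endif
}

// Fork a child that arms the parent-death signal before doing its work.
// Returns the child's PID in the parent (or -1 if fork failed).
pid_t fork_generator_child(void (*child_main)()) {
    pid_t parent = getpid();
    pid_t pid = fork();
    if (pid == 0) {
        die_with_parent(parent);
        child_main();
        _exit(0);
    }
    return pid;
}
```

Two caveats from the prctl man page apply: the setting is cleared in the child of a fork, so each level of the process tree has to arm it separately, and the signal is tied to the parent thread that created the child, which matters in a multi-threaded program.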