jellyfish::cooperative::hash_counter destructor hangs in local scope

rob-p commented 8 years ago

Hi Guillaume,

I'm encountering the following very strange issue. I'm using a hash_counter inside a function to maintain counts for some k-mers and everything is working great. Until the point where the function should return. For some reason, the function will never return, and I've deduced that it is because it is waiting on the hash_counter destructor to finish (which it never does). The process for the program goes into S state, and an strace reveals the following is what's happening wait4(-1,. Do you have any idea what might be going on here? Interestingly, if I make a pointer to the hash_counter, and simply don't delete it (so that the memory escapes the scope of the function and "dangles"), then the program runs to completion (and, presumably, the normal process cleanup reclaims the memory). Any thoughts on what might be happening here are greatly appreciated!

rob-p commented 8 years ago

I also tried calling the done() method of hash_counter before scope exit, but it seemed to have no effect; the function still hangs.

gmarcais commented 8 years ago

Hi Rob,

do you have multiple threads accessing the the hash_counter? Do all of them call the done() method?

rob-p commented 8 years ago

This is all in single-threaded mode. The function looks like this.

using MerMapT = jellyfish::cooperative::hash_counter<my_mer>;

void foo(const std::string& s) {
  size_t hashSize = 100000000;
  double logTextLen = std::log2(s.length());
  uint16_t bitsPerVal = static_cast<uint16_t>(std::ceil(logTextLen));
  MerMapT khash(hashSize, my_mer::k()*2, bitsPerVal, 1);

  // do some stuff with khash

 khash.done();

// dump the hash to file
using JFFileHeader = jellyfish::file_header;
using JFDumper = jellyfish::binary_dumper<MerMapT::array>;

SpecialHeader fh;
fh.update_from_ary(*(khash->ary()));
fh.canonical(false);
fh.format("gus/special"); // Thanks, Guillaume
fh.counter_len(bitsPerVal); // size of counter in bits
fh.fill_standard();

std::ofstream jfos("kmers.jfhash");
fh.write(jfos);
khash->ary()->write(jfos);
jfos.close();
std::cerr << "done\n";
return true;
}

I declare the hash_counter to use 1 thread and the code has only one thread of execution.

gmarcais commented 8 years ago

That is weird. One done() has return, there should not be any waiting or locking done. The destructor only frees the memory (most likely an munmap). Can you gdb into the hang process and get a backtrace, to see where the wait system call is made.

rob-p commented 8 years ago

Sure. Here is the backtrace running under gdb:

Program received signal SIGINT, Interrupt.
__lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135     ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007ffff7bc812e in _L_lock_39 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007ffff7bc8068 in __pthread_cond_destroy (cond=0x7fffffffb9e8) at pthread_cond_destroy.c:35
#3  0x00000000004a14bc in jellyfish::cooperative::hash_counter<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 1>, unsigned long, atomic::gcc, allocators::mmap>::~hash_counter() ()
#4  0x00000000004ca627 in bool buildHash<int>(std::string const&, std::string&, unsigned long, unsigned int, std::vector<int, std::allocator<int> >&) ()
#5  0x00000000004cf411 in void indexTranscriptsSA<jellyfish::whole_sequence_parser<jellyfish::stream_manager<__gnu_cxx::__normal_iterator<std::string const*, std::vector<std::string, std::allocator<std::string> > > > > >(jellyfish::whole_sequence_parser<jellyfish::stream_manager<__gnu_cxx::__normal_iterator<std::string const*, std::vector<std::string, std::allocator<std::string> > > > >*, std::string&, bool, std::mutex&) ()
#6  0x00000000004c261f in rapMapSAIndex(int, char**) ()

gmarcais commented 8 years ago

Out of curiosity, can we see which thread barrier is being used in your code. In the locks_pthread.hpp file, there is something like this:

#if defined(_POSIX_BARRIERS) && (_POSIX_BARRIERS - 20012L) >= 0
class barrier {
  // Define using pthread_barrier_t, which does not exist on all platforms
};
#else
class barrier {
  // Define using our own implementation of semaphore
};
#endif

Can you add some output to the constructor/destructor of those barrier classes, so we know if that is what is locking up the process, and if so, which version of is running.

The pthread_barrier_t version is unlikely to be buggy: it is a very thin wrapper. The other code could have a problem (it is Carl's code :). On Linux the pthread version should be used, the other version was written for the Mac which did not provide barriers (lazy bas^H^H^H). Maybe the detection does not work well with some compilers (or switches/options).

rob-p commented 8 years ago

The destructor that is being called --- and hanging --- is from the non-posix barrier (i.e. the second of the two which is using the custom semaphore implementation).

rob-p commented 8 years ago

Interestingly, this is on Ubuntu with gcc 4.9.1.

gmarcais commented 8 years ago

Why did the POSIX group make barriers optional? Why did the C++ group follow their bad lead and not provide barriers in C++11? It is an important way to sync thread together, easy to get wrong in your own implementation. And optional parts of a specification like POSIX are a pain to deal with.

gmarcais commented 8 years ago

So I'll fix the semaphore barrier code and add detection of pthread_barrier to autoconf (in addition to the _POSIX constants). In the meantime, you can add -D_POSIX_BARRIERS=20012L to your compilation.

rob-p commented 8 years ago

Thanks! ~~My guess is that barrier2 here should be initialized with 1 instead of 0.~~ (nvm: that still doesn't seem to fix it!).

--Rob

gmarcais commented 8 years ago

I am unable to reproduce your problem. I just copied the code of the foo function, and it performs nothing (!) without hanging. Did you run your code through valgrind? Does it run properly?

rob-p commented 8 years ago

When I run through valgrind, I get the following output:

==6271== Syscall param futex(op) contains uninitialised byte(s)
==6271==    at 0x4E45ADD: __lll_lock_wait (lowlevellock.S:135)
==6271==    by 0x4E4312D: _L_lock_39 (pthread_cond_destroy.c:85)
==6271==    by 0x4E43066: pthread_cond_destroy@@GLIBC_2.3.2 (pthread_cond_destroy.c:35)
==6271==    by 0x4A1693: jellyfish::cooperative::hash_counter<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 1>, unsigned long, atomic::gcc, allocators::mmap>::~hash_counter() (in /home/rob/RapMapJF/build/src/rapmap)
==6271==    by 0x4CA9AF: bool buildHash<int>(std::string const&, std::string&, unsigned long, unsigned int, std::vector<int, std::allocator<int> >&) (in /home/rob/RapMapJF/build/src/rapmap)
==6271==    by 0x4CF7D0: void indexTranscriptsSA<jellyfish::whole_sequence_parser<jellyfish::stream_manager<__gnu_cxx::__normal_iterator<std::string const*, std::vector<std::string, std::allocator<std::string> > > > > >(jellyfish::whole_sequence_parser<jellyfish::stream_manager<__gnu_cxx::__normal_iterator<std::string const*, std::vector<std::string, std::allocator<std::string> > > > >*, std::string&, bool, std::mutex&) (in /home/rob/RapMapJF/build/src/rapmap)
==6271==    by 0x4C276E: rapMapSAIndex(int, char**) (in /home/rob/RapMapJF/build/src/rapmap)
==6271==    by 0x484039: main (in /home/rob/RapMapJF/build/src/rapmap)
==6271==

gmarcais commented 8 years ago

What is going here! I tried on Ubuntu 14.04 with 4.9.1 and it runs smoothly under valgrind. I am starting to wonder if the problem is in my code or on your system.

rob-p commented 8 years ago

I'm trying to find a minimal example (i.e. it doesn't seem to have a problem for me either if // do some stuff with khash is a no-op). Perhaps you might be able to recreate the issue if khash is actually used in foo().

gmarcais / Jellyfish

jellyfish::cooperative::hash_counter destructor hangs in local scope #53