Open rob-p opened 8 years ago
I also tried calling the done()
method of hash_counter
before scope exit, but it seemed to have no effect; the function still hangs.
Hi Rob,
do you have multiple threads accessing the the hash_counter
? Do all of them call the done()
method?
This is all in single-threaded mode. The function looks like this.
using MerMapT = jellyfish::cooperative::hash_counter<my_mer>;
void foo(const std::string& s) {
size_t hashSize = 100000000;
double logTextLen = std::log2(s.length());
uint16_t bitsPerVal = static_cast<uint16_t>(std::ceil(logTextLen));
MerMapT khash(hashSize, my_mer::k()*2, bitsPerVal, 1);
// do some stuff with khash
khash.done();
// dump the hash to file
using JFFileHeader = jellyfish::file_header;
using JFDumper = jellyfish::binary_dumper<MerMapT::array>;
SpecialHeader fh;
fh.update_from_ary(*(khash->ary()));
fh.canonical(false);
fh.format("gus/special"); // Thanks, Guillaume
fh.counter_len(bitsPerVal); // size of counter in bits
fh.fill_standard();
std::ofstream jfos("kmers.jfhash");
fh.write(jfos);
khash->ary()->write(jfos);
jfos.close();
std::cerr << "done\n";
return true;
}
I declare the hash_counter to use 1 thread and the code has only one thread of execution.
That is weird. One done()
has return, there should not be any waiting or locking done. The destructor only frees the memory (most likely an munmap
). Can you gdb into the hang process and get a backtrace, to see where the wait
system call is made.
Sure. Here is the backtrace running under gdb:
Program received signal SIGINT, Interrupt.
__lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
135 ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
(gdb) bt
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007ffff7bc812e in _L_lock_39 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007ffff7bc8068 in __pthread_cond_destroy (cond=0x7fffffffb9e8) at pthread_cond_destroy.c:35
#3 0x00000000004a14bc in jellyfish::cooperative::hash_counter<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 1>, unsigned long, atomic::gcc, allocators::mmap>::~hash_counter() ()
#4 0x00000000004ca627 in bool buildHash<int>(std::string const&, std::string&, unsigned long, unsigned int, std::vector<int, std::allocator<int> >&) ()
#5 0x00000000004cf411 in void indexTranscriptsSA<jellyfish::whole_sequence_parser<jellyfish::stream_manager<__gnu_cxx::__normal_iterator<std::string const*, std::vector<std::string, std::allocator<std::string> > > > > >(jellyfish::whole_sequence_parser<jellyfish::stream_manager<__gnu_cxx::__normal_iterator<std::string const*, std::vector<std::string, std::allocator<std::string> > > > >*, std::string&, bool, std::mutex&) ()
#6 0x00000000004c261f in rapMapSAIndex(int, char**) ()
Out of curiosity, can we see which thread barrier is being used in your code. In the locks_pthread.hpp
file, there is something like this:
#if defined(_POSIX_BARRIERS) && (_POSIX_BARRIERS - 20012L) >= 0
class barrier {
// Define using pthread_barrier_t, which does not exist on all platforms
};
#else
class barrier {
// Define using our own implementation of semaphore
};
#endif
Can you add some output to the constructor/destructor of those barrier classes, so we know if that is what is locking up the process, and if so, which version of is running.
The pthread_barrier_t
version is unlikely to be buggy: it is a very thin wrapper. The other code could have a problem (it is Carl's code :). On Linux the pthread version should be used, the other version was written for the Mac which did not provide barriers (lazy bas^H^H^H). Maybe the detection does not work well with some compilers (or switches/options).
The destructor that is being called --- and hanging --- is from the non-posix barrier (i.e. the second of the two which is using the custom semaphore implementation).
Interestingly, this is on Ubuntu with gcc 4.9.1.
Why did the POSIX group make barriers optional? Why did the C++ group follow their bad lead and not provide barriers in C++11? It is an important way to sync thread together, easy to get wrong in your own implementation. And optional parts of a specification like POSIX are a pain to deal with.
So I'll fix the semaphore barrier code and add detection of pthread_barrier
to autoconf (in addition to the _POSIX constants). In the meantime, you can add -D_POSIX_BARRIERS=20012L to your compilation.
Thanks! My guess is that barrier2 here should be initialized with 1 instead of 0. (nvm: that still doesn't seem to fix it!).
--Rob
I am unable to reproduce your problem. I just copied the code of the foo function, and it performs nothing (!) without hanging. Did you run your code through valgrind
? Does it run properly?
When I run through valgrind, I get the following output:
==6271== Syscall param futex(op) contains uninitialised byte(s)
==6271== at 0x4E45ADD: __lll_lock_wait (lowlevellock.S:135)
==6271== by 0x4E4312D: _L_lock_39 (pthread_cond_destroy.c:85)
==6271== by 0x4E43066: pthread_cond_destroy@@GLIBC_2.3.2 (pthread_cond_destroy.c:35)
==6271== by 0x4A1693: jellyfish::cooperative::hash_counter<jellyfish::mer_dna_ns::mer_base_static<unsigned long, 1>, unsigned long, atomic::gcc, allocators::mmap>::~hash_counter() (in /home/rob/RapMapJF/build/src/rapmap)
==6271== by 0x4CA9AF: bool buildHash<int>(std::string const&, std::string&, unsigned long, unsigned int, std::vector<int, std::allocator<int> >&) (in /home/rob/RapMapJF/build/src/rapmap)
==6271== by 0x4CF7D0: void indexTranscriptsSA<jellyfish::whole_sequence_parser<jellyfish::stream_manager<__gnu_cxx::__normal_iterator<std::string const*, std::vector<std::string, std::allocator<std::string> > > > > >(jellyfish::whole_sequence_parser<jellyfish::stream_manager<__gnu_cxx::__normal_iterator<std::string const*, std::vector<std::string, std::allocator<std::string> > > > >*, std::string&, bool, std::mutex&) (in /home/rob/RapMapJF/build/src/rapmap)
==6271== by 0x4C276E: rapMapSAIndex(int, char**) (in /home/rob/RapMapJF/build/src/rapmap)
==6271== by 0x484039: main (in /home/rob/RapMapJF/build/src/rapmap)
==6271==
What is going here! I tried on Ubuntu 14.04 with 4.9.1 and it runs smoothly under valgrind. I am starting to wonder if the problem is in my code or on your system.
I'm trying to find a minimal example (i.e. it doesn't seem to have a problem for me either if // do some stuff with khash
is a no-op). Perhaps you might be able to recreate the issue if khash is actually used in foo()
.
Hi Guillaume,
I'm encountering the following very strange issue. I'm using a hash_counter inside a function to maintain counts for some k-mers and everything is working great. Until the point where the function should return. For some reason, the function will never return, and I've deduced that it is because it is waiting on the hash_counter destructor to finish (which it never does). The process for the program goes into
S
state, and anstrace
reveals the following is what's happeningwait4(-1,
. Do you have any idea what might be going on here? Interestingly, if I make a pointer to the hash_counter, and simply don't delete it (so that the memory escapes the scope of the function and "dangles"), then the program runs to completion (and, presumably, the normal process cleanup reclaims the memory). Any thoughts on what might be happening here are greatly appreciated!