plasma-umass / coz

Coz: Causal Profiling
Other
4.04k stars 160 forks source link

Abort while running experiments #175

Closed gawashburn closed 3 years ago

gawashburn commented 3 years ago

I am running my code in a loop, with COZ_BEGIN and COZ_END around the section of interest. Eventually a particular iteration takes quite some time (maybe ten minutes) and before the next iteration starts coz aborts. I'm using Ubuntu 20.04, and compiled coz from source cloned today. My code has been compiled with g++ 9.3. Depending on the run it may take hundreds or thousands of iterations before this happens.

Aborted!
  0: /home/geoffw/tmp/coz/libcoz/libcoz.so(_ZN8profiler8on_errorEiP9siginfo_tPv+0x77) [0x7f6fe4823f27]
  1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f6fe09a63c0]
  2: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7f6fe07e518b]
  3: /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b) [0x7f6fe07c4859]
  4: /home/geoffw/tmp/coz/libcoz/libcoz.so(+0x17f9a) [0x7f6fe4818f9a]
  5: /home/geoffw/tmp/coz/libcoz/libcoz.so(_ZN8profiler14begin_samplingEP12thread_state+0x2f4) [0x7f6fe4824544]
  6: /home/geoffw/tmp/coz/libcoz/libcoz.so(_ZN8profiler12start_threadEPv+0x8b) [0x7f6fe4824abb]
  7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f6fe099a609]
  8: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f6fe08c1293]
emeryberger commented 3 years ago

It would help a ton if you could provide a minimal reproducible example. You can reach out to me if you can't share code publicly.

emeryberger commented 3 years ago

In the interim, I'd consider refining the progress points if possible -- you aren't going to get many samples if iterations take minutes to run.

gawashburn commented 3 years ago

Unfortunately, I do not yet have a self contained example to share. Right now I'm seeing if I can catch whatever this is in gdb to perhaps narrow this down to some point in the coz source.

I guess I was unclear. The iterations are quite short normally, hundreds of milliseconds, but eventually it hits a point after a few hundred or thousand iterations where a single iteration takes an extremely long time. And after that iteration completes the abort happens.

gawashburn commented 3 years ago

Okay, this is where it appears to be aborting:

0x00007ff12f54218b in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007ff12f521859 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007ff133575f9a in ccutil::logger::~logger (this=0x7fef85ffa10e, __in_chrg=<optimized out>) at ccutil/log.h:40
40          if(_exit) abort();
#3  ccutil::logger::~logger (this=0x7fef85ffa10e, __in_chrg=<optimized out>) at ccutil/log.h:37
37      ~logger() {
#4  0x00007ff133581544 in timer::timer (sig=27, this=0x7fef85ffa110) at ccutil/timer.h:21
21      REQUIRE(timer_create(CLOCK_THREAD_CPUTIME_ID, &ev, &_timer) == 0)
#5  profiler::begin_sampling (this=<optimized out>, state=0x7ff1335d7fa8 <profiler::get_instance()::buf+292456>) at profiler.cpp:355
355   state->process_timer = timer(SampleSignal);
#6  0x00007ff133581abb in profiler::start_thread (p=<optimized out>) at profiler.h:183
183     return *p;
#7  0x00007ff12f6f7609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007ff12f61e293 in clone () from /lib/x86_64-linux-gnu/libc.so.6

I'll recompile and see if I can find out what the value of errno for the timer_create failure. But it seems like that could just be an unrelated side effect of whatever has caused iteration time to explode.

Just as sanity check that I didn't make a mistake when I constructed my benchmark, I reran without coz and it will run for hundreds of thousands of iterations with no problems.

emeryberger commented 3 years ago

Thanks for the update! Hopefully errno will provide some useful info.

gawashburn commented 3 years ago

Sorry about the long delay on this -- various other work projects have taken priority the past few weeks.

So timer_create is reporting EAGAIN. As I suspected, that would appear to mean that the available timers for the process have been exhausted. However, I checked and while the number of threads associated with the process will fluctuate some, it tends to stay around twenty threads. So the actual number of threads isn't exploding. That would seem to leave that perhaps the timer destructor is not begin called, which could maybe happen if the state for a thread that ended wasn't cleaned up?

I am not extremely familiar with your code, but my cursory investigation seems to indicate you'll clean up the per thread state in a wrapper around pthread_exit. Our code uses boost::thread, and I am even less familiar with its implementation. It would appear that it does not use pthread_exit and even states "the use of this function in a POSIX thread result in undefined behavior of any Boost.Thread function". However, it looks like to ensure cleanup, the wrapper you place around a function passed to pthread_create will call always pthread_exit.

So maybe as currently implemented Coz is not compatible with the Boost threading library? If that is the case, that should make concocting a much smaller reproducible example easier. I'll see if I can put something together.

gawashburn commented 3 years ago

I spent a bit of time a couple months ago trying to distill down a simpler example using Boost threads, but was unable to find a a reproduce the issue that way. It is possible we have some code that also interacts with signal handlers that could also be involved here.

Anyway, I am leaving the company, and it is unlikely that anyone else will have the time to invest in debugging this further. So I'll close the issue and hope that in my future projects I'll have a chance to use Coz more profitably.