plasma-umass / coz

Coz: Causal Profiling
Other
4.12k stars 160 forks source link

"[./profiler.h:111] Thread state not found" when trying to profile openalpr #54

Open petterreinholdtsen opened 8 years ago

petterreinholdtsen commented 8 years ago

To test coz-profiler, I decided to try to profile the openalpr command line tool, available from http://www.openalpr.com/ . It is a image processing system written in C++, used for extracting car license plates from images and video.

I modified the code like this to try benchmark how quickly it could process images.

diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 95d4e18..a970c39 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -110,7 +110,7 @@ IF (WIN32)
 ELSE()

        SET (Extra_LIBS
-               pthread
+               pthread dl
        )

 ENDIF()
diff --git a/src/main.cpp b/src/main.cpp
index 190072f..f32d923 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -22,6 +22,7 @@
 #include <iostream>
 #include <iterator>
 #include <algorithm>
+#include <coz.h>

 #include "opencv2/highgui/highgui.hpp"
 #include "opencv2/imgproc/imgproc.hpp"
@@ -360,7 +361,7 @@ bool detectandshow( Alpr* alpr, cv::Mat frame, std::string region, bool writeJso
     }
   }

-
+  COZ_PROGRESS;

   return results.plates.size() > 0;
 }

After building, I ran it with a random MP4 vidoe file I had lying around after generating a simple openalpr.conf file:

coz run --- ./alpr --config ../openalpr.conf ~/tmp/input.mp4

Instead of getting a useful profiling run done, I got the error message "[./profiler.h:111] Thread state not found" instead. Am I using coz the wrong way?

Looking at the code, I suspect the problem is that openalpr is bootstraping its threads the wrong way, but I am unable to pinpoint exactly what is wrong. Also, I suspect coz should handle any code, and thus decided to report the problem as an issue.

ccurtsinger commented 8 years ago

You may be right about coz missing thread creation, but the one other possibility is that openalpr is forking a new process. Could you verify that alpr is running in entirely in a single process?

One other possibility is that something in a module-level constructor is accessing thread states before coz is fully initialized. I will look into this.

petterreinholdtsen commented 8 years ago

[Charlie Curtsinger]

You may be right about coz missing thread creation, but the one other possibility is that openalpr is forking a new process. Could you verify that alpr is running in entirely in a single process?

Not quite sure how to verify it, but running 'ltrace alpr ... 2>&1 | grep fork' did not find any fork()s.

One other possibility is that something in a module-level constructor is accessing thread states before coz is fully initialized. I will look into this.

Great. You can find the openalpr code using <URL: https://github.com/openalpr/openalpr.git >.

Happy hacking Petter Reinholdtsen

petterreinholdtsen commented 8 years ago

I investigated some more, and suspect the issue at hand is that threads are created before main() is called, in shared library initializers. Here is the gdb backtrace for the first call to pthread_create():

(gdb) bt
#0  __pthread_create_2_1 (newthread=0x7fffe6666538, attr=0x0, start_routine=0x7fffe645df50, 
    arg=0x0) at pthread_create.c:461
#1  0x00007fffe645e16f in ?? () from /lib/x86_64-linux-gnu/libusb-1.0.so.0
#2  0x00007fffe645c3c5 in ?? () from /lib/x86_64-linux-gnu/libusb-1.0.so.0
#3  0x00007fffe6454f5f in libusb_init () from /lib/x86_64-linux-gnu/libusb-1.0.so.0
#4  0x00007fffecb3fa4c in ?? () from /usr/lib/x86_64-linux-gnu/libdc1394.so.22
#5  0x00007fffecb2e015 in dc1394_new () from /usr/lib/x86_64-linux-gnu/libdc1394.so.22
#6  0x00007ffff57b51d9 in CvDC1394::CvDC1394() ()
   from /usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4
#7  0x00007ffff579fc40 in ?? () from /usr/lib/x86_64-linux-gnu/libopencv_highgui.so.2.4
#8  0x00007ffff7dea9ba in call_init (l=<optimized out>, argc=argc@entry=2, 
    argv=argv@entry=0x7fffffffdf48, env=env@entry=0x7fffffffdf60) at dl-init.c:78
#9  0x00007ffff7deaaa3 in call_init (env=0x7fffffffdf60, argv=0x7fffffffdf48, argc=2, 
    l=<optimized out>) at dl-init.c:36
#10 _dl_init (main_map=0x7ffff7ffe1a8, argc=2, argv=0x7fffffffdf48, env=0x7fffffffdf60)
    at dl-init.c:126
#11 0x00007ffff7ddd1ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0000000000000002 in ?? ()
#13 0x00007fffffffe289 in ?? ()
#14 0x00007fffffffe297 in ?? ()
#15 0x0000000000000000 in ?? ()
(gdb)

I guess coz need to hook into the program startup code a bit earlier than wrapped_main() is doing it.

petterreinholdtsen commented 8 years ago

I was able to avoid the crash using the following patch:

diff --git a/libcoz/profiler.h b/libcoz/profiler.h
index 515d998..032e2ea 100644
--- a/libcoz/profiler.h
+++ b/libcoz/profiler.h
@@ -108,6 +108,8 @@ public:
     thread_start_arg* new_arg;

     thread_state* state = get_thread_state();
+    if (NULL == state)
+      state = add_thread();
     REQUIRE(state) << "Thread state not found";

     // Allocate a struct to pass as an argument to the new thread

The idea is to not depend on wrapped_main() being called to set up the state pointer. I suspect it is not a safe solution, but just wanted to let you know that it got rid of the crash. Unfortunately I did not get much information in the profile.coz file:

startup time=1468705859216684133
runtime time=223636086700

I suspect this is because I patched the wrong function in openalpr.

petterreinholdtsen commented 8 years ago

Hi. Is there more debugging information I can provide to help you with this?

ccurtsinger commented 8 years ago

I'm not sure there's a good way to fix this issue at the moment. My attempts to initialize the profiler in the replaced __libc_start_main lead to segfaults (probably because static initializers have not yet run), but deferring initialization until after static initializers run means we'll get the same error you're seeing. We could silently ignore missing thread state, but that means the profiler wouldn't be collecting samples for the thread created before main runs.

I'll have to put this issue on hold for a bit.

ccurtsinger commented 8 years ago

Implementing #60 should fix this issue, although this feature is in very early stages.