hughperkins / tf-coriander

OpenCL 1.2 implementation for Tensorflow
Apache License 2.0
791 stars 90 forks

Mac build: `libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument` #11

Closed hughperkins closed 7 years ago

hughperkins commented 7 years ago

So, the wheel build is working, but it doesn't import yet. I imagine it's either an rpath issue, or something to do with the contents of this __init__.py script, https://github.com/hughperkins/tensorflow-cl/blob/tensorflow-cl/tensorflow/python/__init__.py#L48-L86 If someone has a moment to try to figure out what is going on here, and how to fix it, that would be most appreciated :-)

hughperkins commented 7 years ago

There probably is an rpath issue, but that's probably solvable. However, in the meantime, I simply copied all the dylibs into the directory it is searching in, https://travis-ci.org/hughperkins/travis-test/builds/181011250 https://github.com/hughperkins/travis-test/blob/4da7968d49c87cffb874aa02ff620e5cbbdb6a8b/.travis.yml , and get a challenging-looking issue:

$ python3 -c 'from tensorflow.python import _pywrap_tensorflow; print("imported")'
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
/Users/travis/build.sh: line 57:  2944 Abort trap: 6           python3 -c 'from tensorflow.python import _pywrap_tensorflow; print("imported")'

Searching around, I found https://github.com/dmlc/mxnet/issues/309 , which suggests the issue is probably with the code itself, and quite challenging to diagnose. It will probably be easiest for someone with direct access to a Mac to look into, I reckon.

hughperkins commented 7 years ago

(I reckon needs some combination of:

)

zihaolucky commented 7 years ago

Hi @hughperkins, thanks for the great reference on the travis-ci config. But when I try to build tensorboard alone with bazel on travis, it compiles so slowly that the job is terminated by travis, which limits build time to 50 minutes.

I see you compile the whole of tensorflow, so technically the tensorboard part should take much less time. Do you have any ideas? Do you have any special tricks for building tensorflow quickly? Thank you!

hughperkins commented 7 years ago

Update: here is the stack trace for this crash, on Mac Sierra:

* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 2.1
  * frame #0: 0x00007fff9440a765 libc++abi.dylib`__cxa_throw
    frame #1: 0x00007fff943d8445 libc++.1.dylib`std::__1::__throw_system_error(int, char const*) + 77
    frame #2: 0x00000001053b42d7 _pywrap_tensorflow.so`tensorflow::mutex_lock::mutex_lock(tensorflow::mutex&) + 55
    frame #3: 0x0000000107837d4d _pywrap_tensorflow.so`perftools::gputools::mutex_lock::mutex_lock(perftools::gputools::mutex&) + 29
    frame #4: 0x0000000109b8a6e8 _pywrap_tensorflow.so`perftools::gputools::PluginRegistry::Instance() + 24
    frame #5: 0x0000000109b29730 _pywrap_tensorflow.so`perftools::gputools::initialize_clblas() + 16
    frame #6: 0x0000000109b299f9 _pywrap_tensorflow.so`google_init_module_register_clblas() + 9
    frame #7: 0x0000000109b29ce3 _pywrap_tensorflow.so`perftools::gputools::port::Initializer::Initializer(void (*)()) + 19
    frame #8: 0x0000000109b29a1d _pywrap_tensorflow.so`perftools::gputools::port::Initializer::Initializer(void (*)()) + 29
    frame #9: 0x0000000109b2a207 _pywrap_tensorflow.so`__cxx_global_var_init + 23
    frame #10: 0x0000000109b2a219 _pywrap_tensorflow.so`_GLOBAL__sub_I_cl_blas.cc + 9
    frame #11: 0x0000000100019a1b dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 385
    frame #12: 0x0000000100019c1e dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40
    frame #13: 0x00000001000154aa dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 338
    frame #14: 0x0000000100014524 dyld`ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 138
    frame #15: 0x00000001000145b9 dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) + 75
    frame #16: 0x00000001000097cd dyld`dyld::runInitializers(ImageLoader*) + 87
    frame #17: 0x00000001000113ec dyld`dlopen + 556
...

(Edit 2: I got this by doing:

lldb python
break set -E c++
run -c 'import tensorflow'
run -c 'import tensorflow'
y
bt

)

Edit 3: ok, I got as far as adding a cout to line 55 of tensorflow/stream_executor/plugin_registry.cc:

  std::cout << "plugin_registry.cc PluginRegistry::Instance()" << std::endl;
  mutex_lock lock{mu_};

When tensorflow is imported, this prints out just before the crash:

$ python -c 'import tensorflow'
plugin_registry.cc PluginRegistry::Instance()
libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument
Abort trap: 6

=> so something odd is happening with the mutex. A collision with the mutexes used by cuda-on-cl? Some global mutex setting that needs tweaking? Mutexes don't work in Mac tensorflow? (The last option seems unlikely.) Using the wrong standard library?

hughperkins commented 7 years ago

Looks like it might be something along the lines of: the mutex object hasn't been initialized yet by the time the static initializer that tries to lock it runs. As for why this issue doesn't arise in the normal Mac cpu build => open question :-)
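For anyone reading along, here is a minimal sketch of the kind of initialization-order problem hypothesized above, and one common way to sidestep it (construct-on-first-use). It assumes the hypothesis is right. The names `Registry`, `Register`, `Count`, `register_example_plugin` and `example_blas` are made up for illustration; this is not TensorFlow's actual code, and not necessarily the approach taken in any fix for this issue.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a plugin registry (illustration only).
class Registry {
 public:
  // Construct-on-first-use: the instance (and the mutex inside it) is a
  // function-local static, so it is constructed the first time Instance()
  // is called, even if that call comes from a static initializer in
  // another translation unit. A namespace-scope or static-member mutex
  // gives no such guarantee across translation units.
  static Registry& Instance() {
    static Registry instance;
    return instance;
  }

  // Records a plugin name and returns how many are registered so far.
  std::size_t Register(const std::string& name) {
    std::lock_guard<std::mutex> lock(mu_);
    names_.push_back(name);
    return names_.size();
  }

  std::size_t Count() {
    std::lock_guard<std::mutex> lock(mu_);
    return names_.size();
  }

 private:
  Registry() = default;
  std::mutex mu_;
  std::vector<std::string> names_;
};

// Mimics a load-time registration hook: the constructor runs the callback.
struct Initializer {
  explicit Initializer(void (*fn)()) { fn(); }
};

void register_example_plugin() {
  Registry::Instance().Register("example_blas");
}

// Runs during static initialization, i.e. before main() or at dlopen time.
static Initializer example_initializer(register_example_plugin);
```

With this layout, the order in which translation units are initialized no longer matters for the registry, because the mutex is constructed on demand rather than at some unspecified point during load.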

hughperkins commented 7 years ago

Tentatively fixed in 8655b7a Yay :-)

hughperkins commented 7 years ago

I think this issue is fixed now, albeit in an unscalable, but safe/portable-ish, way. => closing

zihaolucky commented 7 years ago

I worked around this issue by hacking the bazel config. Now the compile process finishes within the time limit.

hughperkins commented 7 years ago

@zihaolucky You're saying, you are able to build tensorflow on travis in <= 40 minutes?

zihaolucky commented 7 years ago

@hughperkins Sorry for leaving out that detail: I only build tensorboard.

hughperkins commented 7 years ago

ah, got it.