lukeiwanski / tensorflow

OpenCL support for TensorFlow via SYCL
Apache License 2.0

Segfault in FilterSupportedDevices #205

Closed. mirh closed this issue 6 years ago

mirh commented 6 years ago

After https://github.com/codeplaysoftware/computecpp-sdk/issues/77, I'm glad to announce I'm having yet another crash. But totally AMD-free this time!

Soo, without further ado:

#0  0x00007fca16a8ea12 in tensorflow::(anonymous namespace)::FilterSupportedDevices(std::vector<tensorflow::Device*, std::allocator<tensorflow::Device*> > const&, tensorflow::gtl::InlinedVector<tensorflow::DeviceType, 4> const&) ()
   from /usr/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so

Stack trace of thread 20108:
                #0  0x00007fca16a8ea12 _ZN10tensorflow12_GLOBAL__N_122FilterSupportedDevicesERKSt6vectorIPNS_6DeviceESaIS3_EERKNS_3gtl13InlinedVectorINS_10DeviceTypeELi4EEE (libtensorflow_framework.so)
                #1  0x00007fca16a8d776 _ZN10tensorflow12_GLOBAL__N_115ColocationGraph17GetDevicesForNodeEPNS_4NodeEPPSt6vectorIPNS_6DeviceESaIS6_EE (libtensorflow_framework.so)
                #2  0x00007fca16a8b46d _ZN10tensorflow6Placer3RunEv (libtensorflow_framework.so)
                #3  0x00007fca1cc852f9 _ZN10tensorflow19GraphExecutionState13InitBaseGraphERKNS_17BuildGraphOptionsE (_pywrap_tensorflow_internal.so)
                #4  0x00007fca1cc84ec5 _ZN10tensorflow19GraphExecutionState16MakeForBaseGraphEPNS_8GraphDefERKNS_26GraphExecutionStateOptionsEPSt10unique_ptrIS0_St14default_deleteIS0_EE (_pywrap_tensorflow_internal.so)
                #5  0x00007fca1af90a27 _ZN10tensorflow13DirectSession29MaybeInitializeExecutionStateERKNS_8GraphDefEPb (_pywrap_tensorflow_internal.so)
                #6  0x00007fca1af90c8f _ZN10tensorflow13DirectSession12ExtendLockedERKNS_8GraphDefE (_pywrap_tensorflow_internal.so)
                #7  0x00007fca1af90e03 _ZN10tensorflow13DirectSession6ExtendERKNS_8GraphDefE (_pywrap_tensorflow_internal.so)
                #8  0x00007fca187a4b45 TF_ExtendGraph (_pywrap_tensorflow_internal.so)
                #9  0x00007fca1844d621 _ZL20_wrap_TF_ExtendGraphP7_objectS0_ (_pywrap_tensorflow_internal.so)
                #10 0x00007fca28919ad0 _PyCFunction_FastCallDict (libpython3.6m.so.1.0)
                #11 0x00007fca2893fd1b n/a (libpython3.6m.so.1.0)
                #12 0x00007fca288d3b5a _PyEval_EvalFrameDefault (libpython3.6m.so.1.0)
lukeiwanski commented 6 years ago

Hey @mirh

What is reported in your computecpp_info?

Thanks

mirh commented 6 years ago
ComputeCpp Info (CE 0.5.1)
GLIBC version: 2.26
GLIBCXX: 20160609
This version of libstdc++ is supported.
********************************************************************************
Device Info:
Discovered 2 devices matching:
  platform    : <any>
  device type : <any>
--------------------------------------------------------------------------------
Device 0:
  Device is supported                     : UNTESTED - Untested OS
  CL_DEVICE_NAME                          : Loveland
  CL_DEVICE_VENDOR                        : Advanced Micro Devices, Inc.
  CL_DRIVER_VERSION                       : 1800.11
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU 
--------------------------------------------------------------------------------
Device 1:
  Device is supported                     : UNTESTED - Untested OS
  CL_DEVICE_NAME                          : AMD E-350 Processor
  CL_DEVICE_VENDOR                        : AuthenticAMD
  CL_DRIVER_VERSION                       : 1800.11 (sse2)
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_CPU
DuncanMcBain commented 6 years ago

Hi @mirh, what version of ComputeCpp have you downloaded? I see you are using a pretty recent GCC which might have some incompatibilities with the compiler we use to make the CE builds, particularly if you are using the Ubuntu 14.04 build. Apologies if I've asked you this in a different thread!

mirh commented 6 years ago

The Ubuntu 16.04 one. Though I see Luke is back on a coding spree today... so I guess I might come back and check on this in a week or so.

DuncanMcBain commented 6 years ago

Oh, I was slightly hoping you'd say the 14.04 one, then we might have the solution to the problem! We build the 16.04 package with GCC 5.4. Honestly, I doubt this is the issue (and I've been checking just in case there have been some breaking changes from 5 through to 7), but it might be worth trying just in case, as debugging remotely is quite hard (and we might not be able to test your exact setup any time soon).

mirh commented 6 years ago

Still happens with latest dev/amd_gpu and computecpp 0.6.0

lukeiwanski commented 6 years ago

@mirh have you tried using an older GCC (4.8/5.4) or Clang (3.9), as @DuncanMcBain mentioned above?

mirh commented 6 years ago

I tried 4.9 and the build failed after a few seconds. Not in the mood to experiment further (especially considering... I don't know, the crash looks way more mundane than something compiler-related). Also, no hurry on my side.

DuncanMcBain commented 6 years ago

Sure. Thanks for trying. All I can recommend after this is to attempt to look in a debugger to see if (for example) it is the vector being empty, as Luke suggested. I'm afraid we have otherwise been totally unable to reproduce this crash 😞

mirh commented 6 years ago

Ok so... I spent the best part of the last two days figuring out how to get the equivalent of python3.6-dbg on Arch.

UNFORTUNATELY, with my own Python build I just keep hitting this problem: https://github.com/tensorflow/tensorflow/issues/16836. It should be built from the same exact source the Arch guys used for the release package... YET type has a different size, which makes the program abort.

DuncanMcBain commented 6 years ago

Oh, that's disappointing... Do you need the python-dbg package to be able to build a debug TensorFlow library? Otherwise, I'd try building that (i.e. bazel build -c dbg), then running something like

gdb python sometest.py

and looking for the break in libtensorflow_whatever.so.

mirh commented 6 years ago

GDB's Python support requires a python-debug build (which in turn required rebuilding numpy, TF and freaking gdb itself). Though, it didn't cross my mind to build TensorFlow with the dbg switch.

But the problem is that, for some reason, debug Python builds have sys.getsizeof(type) = 416 (instead of 400). That is not liked by this

DuncanMcBain commented 6 years ago

Ah, I see, that's unfortunate. I have to say, when debugging Python, I've never used the special debug libraries or anything like that, just running python as my GDB executable (it will then get the debug info for the shared libs that have it). I understand if you don't want to go further with this, but we've been unable to reproduce your issue so far :(

mirh commented 6 years ago

just running python as my GDB executable

In that case, the info I get doesn't seem to be much more than what I already posted from the coredump in the OP.

EDIT: could it be that type is larger because it has to hold some additional debug info?

DuncanMcBain commented 6 years ago

Really? Huh! Personally I've used this setup internally here when debugging some bad kernel failures and had no problems. That said, I was making a debug tensorflow wheel, installing that then running (and that wheel can be really really big). I don't remember making any other particularly strange modifications...

mirh commented 6 years ago

Ok, I see. So I'll try to rebuild TF with normal Python, but with bazel build -c opt --config=sycl -c dbg --strip=never

EDIT: this may possibly change with bazelbuild/bazel/issues/3039

DuncanMcBain commented 6 years ago

I think that should work! I was able to have a decent-ish debugging experience using that configuration :+1:

mirh commented 6 years ago

Ehrm... I have been trying to compile and recompile the thing for almost the whole of the last 3 days... But for the love of me, there are just some kernels (e.g. argmax_op.cc and fused_batch_norm_op.cc) that make everything go OOM.

Now, initially I thought that... well, I was the problem: expecting a 3GB laptop to be enough for this stuff. But after adding a shitton of swap and camping in front of top, I noticed that (in those cases) freaking compute++ first sits around 3GB for a bit, then after a short while climbs to 5-6GB... and boom, it skyrockets all the way up to 10.

I guess reporting this (if you manage to reproduce it?) should go on the computecpp repo, but I'm just that tired.

DuncanMcBain commented 6 years ago

I didn't realise you only had 3GB of RAM. Compiling a debug build of TensorFlow with that amount of RAM is going to be very challenging indeed (FWIW, I had 16 here originally, but upgraded as compiles take so much memory). We have been unable to reproduce your issue internally, but I'm operating on the assumption that the vector of devices is empty, which is why the crash is happening. Next week I'll take another look at that code and see if there's a line where we try to look at an element of the vector without checking whether it exists, or similar.
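
To illustrate what I mean, here is a minimal sketch (not the actual TensorFlow source) of the kind of unchecked access I have in mind: if the filtered list comes back empty and the code reads its first element anyway, you get exactly this sort of segfault.

#include <iostream>
#include <string>
#include <vector>

struct Device { std::string name; };

// Sketch of a filter that can legitimately return an empty vector,
// e.g. when no registered device matches the requested device types.
std::vector<Device*> FilterSupportedDevices(const std::vector<Device*>& devices) {
  std::vector<Device*> filtered;
  for (Device* d : devices) {
    if (!d->name.empty())  // stand-in for the real "is this type supported?" check
      filtered.push_back(d);
  }
  return filtered;
}

int main() {
  std::vector<Device*> devices;  // hypothetical: device registration produced nothing
  auto filtered = FilterSupportedDevices(devices);
  if (filtered.empty()) {        // the guard I want to check for in the real code
    std::cerr << "no supported devices found\n";
    return 1;
  }
  std::cout << filtered.front()->name << "\n";  // without the guard, this dereference crashes
  return 0;
}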

mirh commented 6 years ago

Yes, but I mean... I was also sceptical about the whole affair initially. But compiling a release build works just fine; I don't even need to run one job at a time. I can't understand why debug builds would take almost an order of magnitude more memory to complete.

And I think requiring more than 10GB of RAM is pretty bad/troubling even for most "normal" computers today.

DuncanMcBain commented 6 years ago

TensorFlow is huge, and linking takes lots of RAM. I'll admit that it seems excessive, but you can run into similar problems when linking LLVM, for example. If I remember correctly, one of the Python/C++ wrapper libraries in TensorFlow is over 2GB by itself, in debug builds. That will assuredly take much more than 3 GB to link successfully.

I've not looked at the code surrounding your crash yet. I still plan to look at it this week.

mirh commented 6 years ago

Sooo.. Updates. Tried latest experimental/amd_gpu branch with computecpp 0.6.1. Normal build compiles and everything, then returns the usual FilterSupportedDevices segfault when run. Debug build OTOH fails on

contrib/rnn/BUILD:240:1: C++ compilation of rule '//tensorflow/contrib/rnn:python/ops/_gru_ops.so' failed (Exit 1)
In file included from <built-in>:1:
In file included from ./tensorflow/contrib/rnn/kernels/blas_gemm.cc:22:
In file included from ./tensorflow/contrib/rnn/kernels/blas_gemm.h:19:
In file included from ./third_party/eigen3/unsupported/Eigen/CXX11/Tensor:1:
In file included from external/eigen_archive/unsupported/Eigen/CXX11/Tensor:15:
In file included from external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/Core:93:
external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/util/Macros.h:1044:3: error: use of undeclared identifier 'assert'
  eigen_assert(first && message);
  ^
external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/util/Macros.h:608:25: note: expanded from macro 'eigen_assert'
#define eigen_assert(x) eigen_plain_assert(x)
                        ^
external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/util/Macros.h:578:35: note: expanded from macro 'eigen_plain_assert'
    #define eigen_plain_assert(x) assert(x)
                                  ^
1 error generated.
DuncanMcBain commented 6 years ago

Hi @mirh, we'll try to take a look at this this week. @lukeiwanski have we attempted a debug build internally? It looks like this might even come up in plain Eigen builds.

mirh commented 6 years ago

So... I settled for this fix for debug Python builds (not even trying for debug TF builds anymore). This is what I got:

(gdb) py-bt
Traceback (most recent call first):
  0x7ffff089cf58
  <built-in method TF_ExtendGraph of module object at remote 0x7ffff0886a58>
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1392, in _extend_graph
    graph_def.SerializeToString(), status)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1332, in _run_fn
    self._extend_graph()
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "classify_image.py", line 157, in run_inference_on_image
    {'DecodeJpeg/contents:0': image_data})
  File "classify_image.py", line 193, in main
    run_inference_on_image(image)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "classify_image.py", line 227, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)

> Cstack.txt

lukeiwanski commented 6 years ago

@DuncanMcBain / @mirh I will look into that first thing on Monday. As far as I know we cannot reproduce the issue. Pinging @mehdi-goli for the Eigen assert part of the issue.

DuncanMcBain commented 6 years ago

I knew I'd seen this error somewhere! On Friday, I actually ran into the assert() problem in the office. Thanks very much @mirh, we'll continue to investigate!

DuncanMcBain commented 6 years ago

To put it simply, there is a slight problem with building Eigen in certain build configurations. We'll work on a fix for this, and let you know here when it's properly pushed. Thanks for looking!

lukeiwanski commented 6 years ago

@DuncanMcBain ping. Any update on this?

DuncanMcBain commented 6 years ago

Sorry, no. I won't have time to work on this issue.

lukeiwanski commented 6 years ago

Ok, I will take over.

lukeiwanski commented 6 years ago

@mirh there are a couple of problems in this issue. One, mentioned in https://github.com/lukeiwanski/tensorflow/issues/205#issuecomment-372218778, is about the asserts. I believe the guys on our end are looking at fixing it in Eigen; a workaround would be to add #include <cassert> before Eigen is included. I would not recommend chasing that one, though, as you should avoid building debug TF in the first place - the fastbuild target will have most of the debug info you need anyway.
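
For illustration, the workaround looks something like this (a sketch against a plain system Eigen; in the TF tree it would be the third_party/eigen3 headers shown in the error above):

#include <cassert>     // pulled in before Eigen, so eigen_plain_assert(x) expands to a declared assert()
#include <Eigen/Core>  // assumed system Eigen; in TF this would be the failing kernel's Eigen include

int main() {
  Eigen::Matrix2d m = Eigen::Matrix2d::Identity();
  eigen_assert(m.rows() == 2 && "identity should be 2x2");  // compiles now that assert() is declared
  return 0;
}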

I am setting up an environment to match yours: GCC 7.2.1 and Python 3.6.5 - anything else? Can you provide me with the build command that you are using?

Can you confirm that you tried it with GCC 4.8 / 5.4 and / or Clang 3.9?

mirh commented 6 years ago

No, I cannot confirm that I tried GCC 4, 5 or Clang. Just 7.2.1 (and now 7.3.1, but whatever). The build command is bazel build --config=sycl //tensorflow/tools/pip_package:build_pip_package

And I have no problem installing all the debug packages in the world, if it helps track down the problem.

lukeiwanski commented 6 years ago

Can you try with GCC 4 / 5 or Clang 3.9?

mirh commented 6 years ago

I... guess in a day or so. EDIT: goddammit, 1.8 sure has gotten fat with stuff to compile.

lukeiwanski commented 6 years ago

That would be very helpful. And I presume everything compiles fine with bazel build //tensorflow/tools/pip_package:build_pip_package?

mirh commented 6 years ago

Yes

lukeiwanski commented 6 years ago

I am not able to reproduce the issue with GCC 7 and Python 2/3 on Ubuntu 16.04. Next to try is Manjaro.

mirh commented 6 years ago

My educated guess would be that the problem is with whatever shenanigans the GPU/driver is up to, not the compiler or the OS.

lukeiwanski commented 6 years ago

Possibly, but then wouldn't you get a similar problem with computecpp_info? Speaking of the driver... what does clinfo report?

mirh commented 6 years ago

The crash seems to be in whatever builds the "list" in TensorFlow, rather than in anything ComputeCpp-specific.

lukeiwanski commented 6 years ago

What confuses me about this crash is... that list should always have at least the CPU device in it... but yours seems to be empty?

mirh commented 6 years ago

At least in computecpp_info, as you can see from the third comment, both devices are listed. Also, when running any TF test case, the GPU is explicitly reported to be there (./tensorflow/core/common_runtime/sycl/sycl_device.h).
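
(For reference, a minimal standalone check along these lines, assuming ComputeCpp's cl::sycl::device::get_devices(), would show what the SYCL runtime itself enumerates, independently of TensorFlow's device registration:)

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // Ask the SYCL runtime directly which devices it can see.
  auto devices = cl::sycl::device::get_devices(cl::sycl::info::device_type::all);
  std::cout << "SYCL devices found: " << devices.size() << "\n";
  for (const auto& d : devices)
    std::cout << "  " << d.get_info<cl::sycl::info::device::name>() << "\n";
  return 0;
}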

Unfortunately, I can report https://github.com/codeplaysoftware/computecpp-sdk/issues/77 coming up again with TF 1.8 and ComputeCpp 0.7.0... EDIT: I'd also swear that until a week or so ago, configure asked you which GCC you wanted to use. Now, for some reason, even if I remove it altogether I get no feedback about it at all.

lukeiwanski commented 6 years ago

At least in computecpp_info, as you can see from the third comment, both devices are listed.

Yes, that's what I am getting at. If computecpp_info reports them, the runtime should(tm) work just fine. (@DuncanMcBain to confirm?)

EDIT: I'd also swear that until a week or so ago, configure asked you which GCC you wanted to use. Now, for some reason, even if I remove it altogether I get no feedback about it at all.

We moved to a cross-compilation approach that uses the ComputeCpp driver - that way, instead of having to call two compilers (host and device), you call just one. We also no longer need a custom Python script to manage that process; bazel takes care of things.

As for ComputeCpp 0.7 and TF 1.8, as I mentioned earlier - do not use them for now. They are very unstable... and in the worst-case scenario they will break your AMD driver to the point that you need to hard reboot your rig.

Thanks for testing it tho! It's very useful! Do you want a hoodie? ;D

DuncanMcBain commented 6 years ago

Yeah, if it appears in one it should be seen in the other. It's not exactly the same code (i.e. the same source files), but it's incredibly similar.

mirh commented 6 years ago

Actually, I fear my latest problem is a regression in ComputeCpp 0.7.0, because even amd_gpu is giving me the same problem now (and of course, this had to happen the one time I got lazy after a release). I'm switching back to 0.6.1 to see whether that really nails it down.

We moved to a cross-compilation approach that uses the ComputeCpp driver - that way, instead of having to call two compilers (host and device), you call just one. We also no longer need a custom Python script to manage that process; bazel takes care of things.

All nice and dandy, but how would I tell it to use /usr/bin/gcc-4.9 instead of the "default" one?

DuncanMcBain commented 6 years ago

The default compiler is now "compute++", at least for SYCL code - so it's not any GCC. I forgot that had been merged in, after all this time.

Other avenues you could test would be to try installing alternative OpenCL implementations. Intel's CPU implementation is very stable and I trust it. If the vector of devices is still coming up empty while other SYCL projects work... Well, I guess we can think about that if and when it happens.

mirh commented 6 years ago

Hola. Would it make sense for any of these commits (probably 6d09da2, I guess, but I cannot bisect the revisions in between any further due to an undefined symbol error) to have fixed this?

mirh commented 6 years ago

Well, no answer to my little question... Fixed is fixed, so gg.

DuncanMcBain commented 6 years ago

Sorry, I missed this the first time 'round! It's fixed now? I don't know that any of the commits in that range would fix it, and it would be nice to know what fixed it in case it breaks again, but I'm also not going to look a gift horse in the mouth...