Hey @mirh
What is reported in your computecpp_info?
Thanks
ComputeCpp Info (CE 0.5.1)
GLIBC version: 2.26
GLIBCXX: 20160609
This version of libstdc++ is supported.
********************************************************************************
Device Info:
Discovered 2 devices matching:
platform : <any>
device type : <any>
--------------------------------------------------------------------------------
Device 0:
Device is supported : UNTESTED - Untested OS
CL_DEVICE_NAME : Loveland
CL_DEVICE_VENDOR : Advanced Micro Devices, Inc.
CL_DRIVER_VERSION : 1800.11
CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 1:
Device is supported : UNTESTED - Untested OS
CL_DEVICE_NAME : AMD E-350 Processor
CL_DEVICE_VENDOR : AuthenticAMD
CL_DRIVER_VERSION : 1800.11 (sse2)
CL_DEVICE_TYPE : CL_DEVICE_TYPE_CPU
Hi @mirh, what version of ComputeCpp have you downloaded? I see you are using a pretty recent GCC which might have some incompatibilities with the compiler we use to make the CE builds, particularly if you are using the Ubuntu 14.04 build. Apologies if I've asked you this in a different thread!
The Ubuntu 16.04 one. Though I see Luke is back on a coding spree today, so I guess I might come back and check on this in a week or so..
Oh, I was slightly hoping you'd say the 14.04 one, since then we might have the solution to the problem! We build the 16.04 package with GCC 5.4. Honestly I doubt this is the issue (and I've been checking just in case there have been some breaking changes from 5 through to 7), but it might be worth trying, as debugging remotely is quite hard (and we might not be able to test your exact setup any time soon).
Still happens with latest dev/amd_gpu and computecpp 0.6.0
@mirh have you tried using older GCC (4.8/5.4) or Clang (3.9) as @DuncanMcBain mentioned above?
I tried 4.9 and the build failed after a few seconds. I'm not in the mood to experiment further (especially considering.. I don't know, the crashy thingy looks way more mundane than a compiler problem). Also, no hurry on my side.
Sure. Thanks for trying. All I can recommend after this is to attempt to look in a debugger to see if (for example) it is the vector being empty, as Luke suggested. I'm afraid we have otherwise been totally unable to reproduce this crash 😞
Ok so.. I spent the best part of the last two days figuring out how to get the equivalent of python3.6-dbg on Arch.
UNFORTUNATELY, I'm just getting this problem with my own Python build: https://github.com/tensorflow/tensorflow/issues/16836 It should derive from the exact same source the Arch guys used to build the release package.. YET type has different dimensions and makes the program abort.
Oh, that's disappointing... Do you need the python-dbg package to be able to build a debug TensorFlow library? Otherwise, I'd try building that (i.e. bazel build -c dbg), then running something like
gdb python sometest.py
and looking for the break in libtensorflow_whatever.so.
GDB Python support requires a python-debug build (which in turn required rebuilding numpy, TF and freaking gdb itself). Though, it didn't cross my mind to build TensorFlow with the dbg switch.
But the problem is that, for some reason, debug Python builds have sys.getsizeof(type) = 416 (instead of 400). That is not liked by this.
Ah, I see, that's unfortunate. I have to say, when debugging Python, I've never used the special debug libraries or anything like that, just running python as my GDB executable (it will then get the debug info for the shared libs that have it). I understand if you don't want to go further with this, but we've been unable to reproduce your issue so far :(
just running python as my GDB executable
In that case, the info I get doesn't seem to be much more than what I posted from the coredump in the OP.
EDIT: could it be that type is larger because it has to hold some additional debug info?
Really? Huh! Personally I've used this setup internally here when debugging some bad kernel failures and had no problems. That said, I was making a debug TensorFlow wheel, installing that and then running it (and that wheel can be really, really big). I don't remember making any other particularly strange modifications...
Ok, I see.
So I'll try to rebuild TF with normal Python, but with bazel build -c opt --config=sycl -c dbg --strip=never
EDIT: this may possibly change in bazelbuild/bazel/issues/3039
I think that should work! I was able to have a decent-ish debugging experience using that configuration :+1:
Ehrm... I have been trying to compile and recompile the thing for the best part of the last 3 days.. But for the love of me, there are just some kernels (e.g. argmax_op.cc and fused_batch_norm_op.cc) that make everything go OOM.
Now, initially I thought that, well, I was the problem: expecting a 3GB laptop to be enough for this stuff.
But after adding a shitton of swap file and camping in front of top, I noticed (in those cases) freaking compute++ first pretending 3GB is the new normal for a bit. Then after a short while it goes all the way up to 5-6GB... and boom, it skyrockets up to 10.
I guess reporting this (if you manage to reproduce it?) should go on the ComputeCpp repo, but I'm just that tired.
I didn't realise you only had 3GB RAM. Compiling a debug build of TensorFlow with that amount of RAM is going to be very challenging indeed (FWIW, I had 16 here originally, but upgraded as compiles tend to take so much). We have been unable to reproduce your issue internally, but I'm operating on the assumption that the vector of devices is empty, which is why the crash is happening. Next week I'll take another look at that code and see if there's a line where we try to look at an element of the vector without checking whether it exists or similar.
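In the meantime, if you want a sanity check outside TensorFlow, a minimal standalone program along these lines (just a sketch against the standard SYCL 1.2.1 API that ComputeCpp ships, not the TensorFlow code itself) would show whether the runtime enumerates any devices at all:

```cpp
// Hypothetical standalone check, not the TensorFlow source: print every
// device the SYCL runtime can see, so an empty device vector is obvious.
#include <CL/sycl.hpp>
#include <iostream>

int main() {
  auto devices =
      cl::sycl::device::get_devices(cl::sycl::info::device_type::all);
  if (devices.empty()) {
    std::cerr << "No SYCL devices found - an empty vector like this "
                 "would explain the crash\n";
    return 1;
  }
  for (const auto& dev : devices) {
    std::cout << dev.get_info<cl::sycl::info::device::name>() << "\n";
  }
  return 0;
}
```

If that prints both of your devices, the problem is more likely in how TensorFlow filters the list than in the enumeration itself.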
Yes, but I mean.. I was also sceptical about the whole affair initially. But compiling release works just fine; I don't even need to run one job at a time. I can't understand why debug builds would take almost an order of magnitude more memory to complete.
And I think requiring more than 10GB of RAM is pretty bad/troubling even for most "normal" computers today.
TensorFlow is huge, and linking takes lots of RAM. I'll admit that it seems excessive, but you can run into similar problems when linking LLVM, for example. If I remember correctly, one of the Python/C++ wrapper libraries in TensorFlow is over 2GB by itself, in debug builds. That will assuredly take much more than 3 GB to link successfully.
I've not looked at the code surrounding your crash yet. I still plan to look at it this week.
Sooo.. Updates. I tried the latest experimental/amd_gpu branch with ComputeCpp 0.6.1. The normal build compiles and everything, then returns the usual FilterSupportedDevices segfault when run. The debug build OTOH fails on
contrib/rnn/BUILD:240:1: C++ compilation of rule '//tensorflow/contrib/rnn:python/ops/_gru_ops.so' failed (Exit 1)
In file included from <built-in>:1:
In file included from ./tensorflow/contrib/rnn/kernels/blas_gemm.cc:22:
In file included from ./tensorflow/contrib/rnn/kernels/blas_gemm.h:19:
In file included from ./third_party/eigen3/unsupported/Eigen/CXX11/Tensor:1:
In file included from external/eigen_archive/unsupported/Eigen/CXX11/Tensor:15:
In file included from external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/Core:93:
external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/util/Macros.h:1044:3: error: use of undeclared identifier 'assert'
eigen_assert(first && message);
^
external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/util/Macros.h:608:25: note: expanded from macro 'eigen_assert'
#define eigen_assert(x) eigen_plain_assert(x)
^
external/eigen_archive/unsupported/Eigen/CXX11/../../../Eigen/src/Core/util/Macros.h:578:35: note: expanded from macro 'eigen_plain_assert'
#define eigen_plain_assert(x) assert(x)
^
1 error generated.
Hi @mirh, we'll try to take a look at this this week. @lukeiwanski have we attempted a debug build internally? It looks like this might even come up in plain Eigen builds.
So.. I took this fix for debug Python builds at face value (not even trying any more for debug TF builds). This is what I got:
(gdb) py-bt
Traceback (most recent call first):
0x7ffff089cf58
<built-in method TF_ExtendGraph of module object at remote 0x7ffff0886a58>
File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1392, in _extend_graph
graph_def.SerializeToString(), status)
File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1332, in _run_fn
self._extend_graph()
File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
return fn(*args)
File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
options, run_metadata)
File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1137, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "classify_image.py", line 157, in run_inference_on_image
{'DecodeJpeg/contents:0': image_data})
File "classify_image.py", line 193, in main
run_inference_on_image(image)
File "/usr/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "classify_image.py", line 227, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
@DuncanMcBain / @mirh I will look into that first thing on Monday. As far as I know we cannot reproduce the issue. Pinging @mehdi-goli for the Eigen assert part of the issue.
I knew I'd seen this error somewhere! On Friday, I actually ran into the assert() problem in the office. Thanks very much @mirh, we'll continue to investigate!
To put it simply, there is a slight problem with building Eigen in certain build configurations. We'll work on a fix for this, and let you know here when it's properly pushed. Thanks for looking!
@DuncanMcBain ping. Any update on this?
Sorry, no. I won't have time to work on this issue.
Ok, I will take over.
@mirh there are a couple of problems in this issue.
One, mentioned in https://github.com/lukeiwanski/tensorflow/issues/205#issuecomment-372218778, regards the asserts.. I believe the guys on our end are looking at fixing it in Eigen.. a workaround would be to add #include <cassert> before Eigen is included.. I would not recommend going down that route, though, as you should avoid building debug TF in the first place - the fastbuild target will have most of the debug info needed anyway.
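For completeness, the workaround would look roughly like this at the top of the offending translation unit (a sketch only, using the Eigen include path from the error above; as said, fastbuild is the better option):

```cpp
// Sketch of the assert workaround (not recommended over fastbuild): make sure
// assert() is declared before any Eigen header expands eigen_plain_assert(x)
// into assert(x).
#include <cassert>

#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
```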
I am setting up an environment to match yours: GCC 7.2.1 and Python 3.6.5 - anything else? Can you provide me with the build command that you are using?
Can you confirm that you tried it with GCC 4.8 / 5.4 and/or Clang 3.9?
No, I can't confirm that I tried GCC 4, 5 or Clang. Just 7.2.1 (and now 7.3.1, but whatever).
The build is bazel build --config=sycl //tensorflow/tools/pip_package:build_pip_package
And I have no problem installing all the debug packages in the world, if it helps find the problem.
Can you try with GCC 4 / 5 or Clang 3.9?
I.. guess in a day or so. EDIT: goddammit, isn't 1.8 fattened up with stuff to compile.
That would be very helpful
And I presume everything compiles fine with bazel build //tensorflow/tools/pip_package:build_pip_package?
Yes
I am not able to reproduce the issue with GCC 7 and Python 2/3 on Ubuntu 16.04. Next to try is Manjaro.
My educated guess would be that the problem is with whatever shenanigans the GPU/driver is up to, not the compiler or OS.
Possibly, but then wouldn't you get a similar problem with computecpp_info?
Speaking of the driver.. what does clinfo report?
The crash seems to be in whatever builds the "list" in TensorFlow, more than in something about ComputeCpp.
What confuses me about this crash is.. that list should always have at least the CPU device in it.. but yours seems to be empty?
At least in computecpp_info, as you can see from the third comment, both devices are present. Also, when running any TF testcase, the GPU is explicitly stated to be there (./tensorflow/core/common_runtime/sycl/sycl_device.h)
Unfortunately, I can report https://github.com/codeplaysoftware/computecpp-sdk/issues/77 coming up again with TF 1.8 and ComputeCpp 0.7.0...
EDIT: I'd also swear that until a week or so ago, configure asked you which GCC you wanted to use. Now, for some reason, even if I remove it altogether I get no feedback about it at all.
At least in computecpp_info, as you can see from the third comment, both devices are present.
Yes, that's what I am getting at. If computecpp_info reports them, the runtime should(tm) work just fine. ( @DuncanMcBain to confirm? )
EDIT: I'd also swear that until a week or so ago, configure asked you which GCC you wanted to use. Now, for some reason, even if I remove it altogether I get no feedback about it at all.
We moved to a cross-compilation approach that uses the ComputeCpp driver - that way, instead of having to call two compilers (host and device), you call just one. We don't need a custom Python script to manage that process; Bazel takes care of things.
As for ComputeCpp 0.7 and TF 1.8, as I mentioned earlier - do not use that combination for now. It is very unstable.. and in the worst-case scenario it will break your AMD driver to the point that you need to hard-reboot your rig.
Thanks for testing it tho! It's very useful! Do you want a hoodie? ;D
Yeah if it appears in one it should be seen in the other. It's not exactly the same code (i.e. the same source files), but it's incredibly similar.
Actually, I fear my latest problem is a regression in ComputeCpp 0.7.0, because even amd_gpu is giving me the same problem now (and of course, this had to happen the one time I got lazy after a release). I'm just trying to switch to 0.6.1 to see if that really nails it down.
We moved to a cross-compilation approach that uses the ComputeCpp driver - that way, instead of having to call two compilers (host and device), you call just one. We don't need a custom Python script to manage that process; Bazel takes care of things.
All nice and dandy, but how would I tell it to use /usr/bin/gcc-4.9 instead of the "default" one?
The default compiler is now "compute++", at least for SYCL code - so it's not any GCC. I forgot that had been merged in, after all this time.
Other avenues you could test would be to try installing alternative OpenCL implementations. Intel's CPU implementation is very stable and I trust it. If the vector of devices is still coming up empty while other SYCL projects work... Well, I guess we can think about that if and when it happens.
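If you do try another OpenCL implementation, a quick way to exercise it without TensorFlow would be something like this (again just a sketch against the SYCL 1.2.1 API, not code from the TF tree):

```cpp
// Sketch: try to build a queue on the CPU device (e.g. Intel's CPU OpenCL)
// to check that the runtime works at all, independently of TensorFlow.
#include <CL/sycl.hpp>
#include <iostream>

int main() {
  try {
    cl::sycl::queue q{cl::sycl::cpu_selector{}};
    std::cout << "Queue created on: "
              << q.get_device().get_info<cl::sycl::info::device::name>()
              << "\n";
  } catch (const cl::sycl::exception& e) {
    std::cerr << "Could not create a queue: " << e.what() << "\n";
    return 1;
  }
  return 0;
}
```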
Hola. Would it make sense for any of these commits (prolly 6d09da2 I guess, but I cannot bisect the revisions in between any further due to an undefined symbol error) to have fixed this?
Well, no answer to my little question.. Fixed is fixed though, so gg.
Sorry, I missed this the first time 'round! It's fixed now? I don't know that any of the commits in that range would fix it, and it would be nice to know what fixed it in case it breaks again, but I'm also not going to look a gift horse in the mouth...
After https://github.com/codeplaysoftware/computecpp-sdk/issues/77, I'm glad to announce I'm having yet another crash. But a totally AMD-free one this time!
Soo, without further ado: