hughperkins / tf-coriander

OpenCL 1.2 implementation for Tensorflow
Apache License 2.0

Crashed Xorg on AMD GPUPro driver #13

Closed ghost closed 7 years ago

ghost commented 7 years ago

python3 recurrent_network.py crashes, dragging down Xorg with it.

I can't copy/paste the error from Ubuntu's bugchecker for some reason, so I took screenshots of the information, which includes the stack and traceback. The basic error appears to land within the AMD GPU Pro driver code, with it trying to execute at 0x00..., so perhaps a function pointer somewhere is being passed as null instead of pointing to an OpenCL function as intended?

Please enjoy a wall of images..

[Screenshots: cropped_error-0 through cropped_error-15]

hughperkins commented 7 years ago

OK. Unfortunately I don't have the source code for the AMD GPU driver :-P In the absence of that, we need to do bisection to localize the error. In other words, find the simplest program that triggers it.

Did you try the other tests? Do they all run? Is it only recurrent_network that triggers the crash? Have you run the Python tests in https://github.com/hughperkins/cuda-on-cl to completion, for example?
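A minimal sketch of that kind of triage, assuming each test is a standalone script that can be launched as its own process. The helper names `run_isolated` and `triage` are hypothetical and are not part of tf-coriander or cuda-on-cl. Each command runs in a child process, so a crash in one script is recorded rather than ending the whole run:

```python
import subprocess


def run_isolated(cmd, timeout=600):
    """Run one command in a child process and classify how it ended."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child was killed after exceeding the time limit.
        return "timeout"
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed
        # by that signal number (for example 11 for SIGSEGV).
        return "signal %d" % -proc.returncode
    return "exit %d" % proc.returncode


def triage(commands, timeout=600):
    """Map each test name to its outcome; `commands` maps name -> argv list."""
    return {name: run_isolated(cmd, timeout) for name, cmd in commands.items()}
```

For example, `triage({"recurrent_network": ["python3", "recurrent_network.py"]})` returns a dictionary with one outcome string per script. If the crash takes Xorg down, a harness started from a terminal inside the desktop session will most likely die with it, so it is probably safer to run this from a text console or over SSH.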

hughperkins commented 7 years ago

I made some updates to some of the event/driver/context code recently, so it's possible, but far from certain, that they have affected this bug in some way, ideally for the better. It might be worth grabbing the latest version and retrying. I should probably build a wheel to facilitate that. That would be for Ubuntu 16.04, is that right?

hughperkins commented 7 years ago

If you get a moment, would you mind building the latest tensorflow-cl branch and seeing if that helps at all?

hughperkins commented 7 years ago

(Also, can you confirm whether the latest https://github.com/hughperkins/coriander unit tests are or aren't working? There are three sets:

I think you should run all of these and report the results, please.

)

ghost commented 7 years ago

I'll see what I can do! :) Very soon this error might be defunct anyway, as I may be moving from AMDGPU-pro to ROCm as the driver/OpenCL engine on my setup.

Unfortunately right now my system is in a woefully unstable situation due to experimenting with ROCm on partially supported hardware, but I have another boot drive that I can test this on.

hughperkins commented 7 years ago

Ah, no worries. Let's wait till you have the new stable system. No point in fixing bugs on a soon-to-be-obsolete system.

(Sent by email on 28 May 2017 14:13:50 BST, in reply to Cathal Garvey's comment above.)

hughperkins commented 7 years ago

Let's close this for now then. We can reopen it if you have more recent results.

hughperkins commented 7 years ago

(There's a new wheel out now, if you do have a moment to try it, mostly just out of curiosity, I guess.)