Computer freezes when using dedicated GPU

hery commented 9 years ago

Hello! I'm having an issue where my computer freezes completely (except for the mouse) when I run cltorch on a dedicated GPU. It happens with a simple script I use to test my environment.

-- hello.lua
require 'cltorch'

print('Current device: ', cltorch.getDevice())

-- Switch to dedicated GPU. Everything breaks if we uncomment those lines.
-- cltorch.setDevice(2)
-- cltorch.synchronize()
-- cltorch.finish() -- not sure this line is needed
-- print('Current device: ', cltorch.getDevice()) -- this prints out, but then hangs.

-- Things print out properly on the integrated GPU (device(1))
C = torch.ClTensor{{3,2,4},{9,7,5}}
print(C:t())
print(C:transpose(1,2))

The lines that are commented out are the ones that make everything freeze. Any clue what could be going on, or what I can do to get more descriptive logs? I am loading the script as require 'hello' in a th prompt, but it hangs even when running the commands individually.
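One low-tech way to get more descriptive output for a hang like this is to print and flush a marker before and after each call, so the last line on screen names the call that never returned. A minimal sketch, using only the cltorch calls already present in hello.lua; `traced` is a hypothetical helper name, not part of cltorch or torch:

```lua
-- trace.lua
-- `traced` is a hypothetical helper, not part of cltorch or torch.
-- It prints a marker before and after calling fn, flushing each time, so
-- that a hang still leaves the name of the offending call on screen.
local function traced(label, fn, ...)
  io.stdout:write('-> ' .. label .. '\n')
  io.stdout:flush()
  local result = fn(...)  -- only the first return value is kept
  io.stdout:write('<- ' .. label .. '\n')
  io.stdout:flush()
  return result
end

-- Only touch the GPU if cltorch is actually installed.
if pcall(require, 'cltorch') then
  traced('setDevice(2)', cltorch.setDevice, 2)
  traced('synchronize', cltorch.synchronize)
  traced('finish', cltorch.finish)
  print('Current device: ', traced('getDevice', cltorch.getDevice))
end
```

If a `->` line appears without its matching `<-` line, that call is the one that did not return.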

hery commented 9 years ago

Here are the dedicated GPU specs:

th> props = cltorch.getDeviceProperties(2)
th> props
{
  deviceType : "GPU"
  localMemSizeKB : 32
  globalMemSizeMB : 2048
  deviceVersion : "OpenCL 1.2 "
  platformVendor : "Apple"
  deviceName : "AMD Radeon R9 M370X Compute Engine"
  maxComputeUnits : 10
  globalMemCachelineSizeKB : 0
  openClCVersion : "OpenCL C 1.2 "
  maxClockFrequency : 800
  maxMemAllocSizeMB : 512
  maxWorkGroupSize : 256
}
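To see at a glance which device index maps to which GPU, the same properties can be printed for every device. A small sketch, assuming only `cltorch.getDeviceCount()` and `cltorch.getDeviceProperties()` as they are used in this thread; `describeDevices` is a hypothetical helper name:

```lua
-- list_devices.lua
-- `describeDevices` is a hypothetical helper, not part of cltorch.
-- It takes the cltorch module as an argument and returns one summary
-- line per visible OpenCL device.
local function describeDevices(cl)
  local lines = {}
  for i = 1, cl.getDeviceCount() do
    local props = cl.getDeviceProperties(i)
    lines[#lines + 1] = string.format('%d: %s (%s, %s, %s MB global memory)',
      i,
      tostring(props.deviceName),
      tostring(props.deviceType),
      tostring(props.platformVendor),
      tostring(props.globalMemSizeMB))
  end
  return lines
end

-- Only query devices if cltorch is actually installed.
if pcall(require, 'cltorch') then
  for _, line in ipairs(describeDevices(cltorch)) do
    print(line)
  end
end
```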

hughperkins commented 9 years ago

Hi hery, thank you for the bug report. Would you mind running luarocks install cltorch to update to the latest cltorch, and then posting the output of th -l cltorch -e 'cltorch.about()' to confirm the version? I'm not saying that will fix the issue, but since it's easy to do, it's probably good to do that first, just to check.

hughperkins commented 9 years ago

By the way, on my own computer, the following script:

require 'cltorch'

local function eval(expression)
  loadstring('res=' .. expression)()
  print(expression, res)
end

eval('cltorch.getDevice()')

-- Switch to dedicated GPU. Everything breaks if we uncomment those lines.
 cltorch.setDevice(2)
 cltorch.synchronize()
 cltorch.finish() -- not sure this line is needed
 print('Current device: ', cltorch.getDevice()) -- this prints out, but then hangs.

-- Things print out properly on the integrated GPU (device(1))
C = torch.ClTensor{{3,2,4},{9,7,5}}
print(C:t())
print(C:transpose(1,2))

eval('cltorch.getDeviceCount()')
eval('cltorch.getDevice()')

b = torch.ClTensor({3,5,2})
eval('b')
cltorch.setDevice(1)
eval('cltorch.getDevice()')
a = torch.ClTensor({2,4,7})
eval('a')
eval('a:add(2)')

cltorch.setDevice(2)
eval('b:add(2)')

produces the following output:

$ th testhery.lua 
cltorch.getDevice() 1   
Using NVIDIA Corporation platform: NVIDIA CUDA
Using device: GeForce 940M
Current device:     2   
 3  9
 2  7
 4  5
[torch.ClTensor of size 3x2]

 3  9
 2  7
 4  5
[torch.ClTensor of size 3x2]

cltorch.getDeviceCount()    2   
cltorch.getDevice() 2   
b    3
 5
 2
[torch.ClTensor of size 3]

cltorch.getDevice() 1   
Using Intel platform: Intel Gen OCL Driver
Using device: Intel(R) HD Graphics BroadWell U-Processor GT2
a    2
 4
 7
[torch.ClTensor of size 3]

a:add(2)     4
 6
 9
[torch.ClTensor of size 3]

b:add(2)     5
 7
 4
[torch.ClTensor of size 3]

... so I cannot reproduce the problem on my own machine, and this is going to need some digging :-/ That's why I'd like to check the obvious things first :-)

hery commented 9 years ago

Hi Hugh, thanks for taking the time to look into this!

Here's the output of the about() command after updating cltorch.

pandaman$ th -l cltorch -e 'cltorch.about()'
cltorch.  OpenCL backend for Torch
Built from commit 3e6d445
More info, doc: https://github.com/hughperkins/cltorch

hughperkins commented 9 years ago

Ok, and problem is still there?

hery commented 9 years ago

Yes, still there!

hughperkins commented 9 years ago

Hmmm... that's odd... it doesn't do very much at that point, it just creates a command queue and calls clFinish() on it. It's a mystery. I think the next step might be to see whether you can run the EasyCL unit tests on it. There are two challenges for this:

1. The EasyCL unit tests seem not to run on Mac. I'm not sure how much knowledge you have of building shared objects on Mac? You could check the end of https://github.com/hughperkins/cltorch/issues/7 and see if you have knowledge in this area.
2. Currently the unit tests only test the first GPU device, so I need to modify them to let you test the second one, but this is fairly easy to overcome.

Hmmm, whilst I'm writing this, one idea occurs to me: if you go into /etc/OpenCL/vendors, you should see two '.icd' files. If you rename the one for your first GPU to have a '_' suffix, it will no longer be available to OpenCL, so you won't need to change device. If you do this, do things work ok? Meaning: is the problem related to changing devices, or does it have more to do with the device itself?

hery commented 9 years ago

I don't have an /etc/OpenCL directory. (running OS X 10.10)

If running EasyCL unit tests can help, I can look into that. Or I can play with the source and see where it breaks, if that's an option.

hughperkins commented 9 years ago

Hmmm.... ok... since it breaks at such an early stage, we could probably start by just running a really simple 'hello world' type program, like https://developer.apple.com/library/mac/samplecode/OpenCL_Hello_World_Example/Listings/hello_c.html , and then go from there. You'd want to change the clGetDeviceIDs line and the clCreateContext line to get it to choose your second GPU.

However, if you can get the EasyCL tests building/compiling, then that would rock. Actually, we could just disable all the Lua stuff in the EasyCL tests and build it like that. I used to have an option to do that, but I took it away. I might add it back...

hery commented 9 years ago

Looking into the hello world program now. I like how complicated it is to select the offline GPU.

https://developer.apple.com/library/mac/technotes/tn2335/_index.html

hughperkins commented 9 years ago

Ok. You mean, if you're using the GPU to drive your display, you cannot select it for use with OpenCL?

hery commented 9 years ago

I'm not sure, that could be the issue. This article refers to the Mac Pro, but I'm on a Macbook Pro. Let me try to run the cltorch script without an external display.

hery commented 9 years ago

Great, it didn't freeze without the external monitor. This could definitely be the issue. But it didn't work either haha. Here's the output:

pandaman$ th -l hello
Current device:     1
Using Apple platform: Apple
Using device: AMD Radeon R9 M370X Compute Engine
Current device:     2
 0.0000 -2.0000
 0.0000 -2.0000
 0.0000  0.0000
[torch.ClTensor of size 3x2]

Abort trap: 6

Looks like the Tensor printed properly, but not its transpose? Also, I'm using gfxCardStatus to see which GPU is in use, and it doesn't seem to switch to the dedicated GPU when I run the script.

hery commented 9 years ago

Hmm, never mind, the dedicated GPU doesn't seem to work. The tests run fine on the integrated GPU, but they abort when I run them on the dedicated GPU.

th> cltorch.getDevice()
2
                                                                      [0.0001s]
th> cltorch.test()
running tests...
aftter requiring cltorch.unit_storage
Running 1 tests
|  ==> test_basic
Using Apple platform: Apple
Using device: AMD Radeon R9 M370X Compute Engine
_  ==> Done Completed 11 asserts in 1 tests with 0 errors
--------------------------------------------------------------------------------
aftter requiring cltorch.unit_tensor
Running 91 tests
|__________________________________________________________________________________________  ==> outplace_div
left
     1.1765  0.5882 -0.2941
 0.9118  0.3529  1.4412
[torch.FloatTensor of size 2x3]

right
     0.0000e+00  3.6893e+19  0.0000e+00
 3.6893e+19  3.4513e-31  0.0000e+00
[torch.FloatTensor of size 2x3]

*|_________________________________________________________________________________________  ==> test_addcmul
Abort trap: 6

Now when I run the tests after manually switching to the dedicated GPU using gfxCardStatus, it freezes. This correlates with an external monitor being plugged in, because using an external monitor forces the computer to use the dedicated GPU.

So I'm pretty certain that it never runs any code on the dedicated GPU, which is confirmed by both gfxCardStatus and the activity monitor. (see screenshot below, which shows the integrated gpu is in use)
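The outplace_div mismatch in the log above can be narrowed down to a few lines that compare a CPU result against the same operation on the selected device. This is only a sketch: the input values are a guess chosen so that dividing by 3.4 reproduces the "left" tensor, `maxAbsDiff` is a hypothetical helper, and `:cl()`, `:float()` and `:div()` are assumed to behave as described in the cltorch README.

```lua
-- check_div.lua
-- `maxAbsDiff` is a hypothetical helper: the largest element-wise gap
-- between two same-shaped tensors.
local function maxAbsDiff(expected, actual)
  return (expected - actual):abs():max()
end

-- Only touch the GPU if cltorch is actually installed.
if pcall(require, 'cltorch') then
  cltorch.setDevice(2)  -- the dedicated GPU in this thread
  -- Assumed inputs: dividing these by 3.4 gives the "left" values in the log.
  local a = torch.FloatTensor{{4, 2, -1}, {3.1, 1.2, 4.9}}
  local expected = a:clone():div(3.4)      -- computed on the CPU
  local actual = a:cl():div(3.4):float()   -- computed on the device, copied back
  print('max abs difference:', maxAbsDiff(expected, actual))
end
```

A difference close to zero would suggest the device computes correctly. Values like the 3.6893e+19 seen above would suggest the kernel result never reached the output buffer.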

I guess I'll start looking into the EasyCL tests as we discussed, will keep you updated.

hughperkins commented 9 years ago

Ok. Yes, it sounds like it's quite a low-level issue, ie drivers etc. Did you manage to get the helloworld.c running ok on the dedicated gpu? (edit: I mean, the c-program helloworld, rather than the lua-version?)

hery commented 9 years ago

I ran the hello world program on the integrated GPU, but I didn't get it to run on the dedicated GPU yet.

hughperkins commented 9 years ago

Ok. Until the helloworld.c program from the Apple website runs ok on the dedicated GPU, I don't think that cltorch is going to get very far. Seems like some kind of low-level driver/configuration problem, right?

hery commented 9 years ago

I agree, and so does this article, which says the gpu drivers on OS X are broken. No luck here, can't use CUDA either since it's an AMD gpu. I'm going to set up an Arch Linux dual boot on my machine, we may have more luck there.

szagoruyko commented 9 years ago

It's apparently an AMD problem. On an old Mac with an Nvidia GPU I can run cltorch on the dedicated GPU, whether the machine is currently on the integrated or the dedicated GPU, with no problem.

hughperkins commented 9 years ago

@hery, ok, so, do you mind if I close this issue? It seems like it is not related to cltorch itself, right?

hughperkins commented 9 years ago

@szagoruyko Good info. Thanks!

hery commented 9 years ago

@hughperkins Yea go ahead, thanks!

hughperkins / cltorch

Computer freezes when using dedicated GPU #12