torch / cunn

Strange issue with DataParallelTable on 1080 Ti #484

Closed AaronJackson closed 7 years ago

AaronJackson commented 7 years ago

Hi all,

I have been using some code on a machine with four Titan X's without any issues. I recently got another machine with 1080 Ti cards and I just can't figure out what the problem is.

Take for example the following piece of code:

nn = require 'nn'
cunn = require 'cunn'

-- local stacked hourglass network definition
n = require('stackedhourglass')

-- split along dimension 1 (the batch), no flattened parameters, no NCCL
pt = nn.DataParallelTable(1, false, false)
pt:add(n(), {1, 2})  -- replicate the network on GPUs 1 and 2

pt:cuda()

a = torch.FloatTensor(4, 3, 192, 192):cuda()
o = pt:forward(a)
o = pt:forward(a)
o = pt:forward(a)
print(o)

Most of the time it works, but without changing the code it will occasionally fail. This is not a GPU memory issue. Sometimes it hangs while running updateOutput on a SpatialConvolution module, without printing any error.

If I double the batch size, or double the spatial resolution, it will almost certainly fail.

I don't seem to have this issue if I do not use DataParallelTable. It feels like some kind of race condition, or something I don't understand. As I say, I didn't have this issue with the Titan X cards, so I am beginning to wonder whether it is a driver issue.
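
For comparison, the single-GPU equivalent (roughly the following sketch, using the same local stackedhourglass module) runs repeatedly without hanging for me:

nn = require 'nn'
cunn = require 'cunn'

n = require('stackedhourglass')

-- same network, but run directly on a single GPU, no DataParallelTable
net = n():cuda()

a = torch.FloatTensor(4, 3, 192, 192):cuda()
o = net:forward(a)
o = net:forward(a)
o = net:forward(a)
print(o)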

If anyone can offer me any advice it would be very welcome.

Thanks, Aaron.

AaronJackson commented 7 years ago

We had a ulimit set in the global bashrc file. I'm not sure why that was affecting it, but it seems to be working now.