twitter-archive / torch-dataset

An extensible and high performance method of reading, sampling and processing data for Torch
Apache License 2.0

Got stuck in getBatch with larger batch size #21

Open joeyhng opened 8 years ago

joeyhng commented 8 years ago

The following code reproduces the error:

require 'torch'
require 'paths'
local lapp = require 'pl.lapp'
local Dataset = require 'dataset.Dataset'

local opt = lapp[[
Got stuck in torch-dataset with batchSize == 128

(options)
   --batchSize     (default 128)    how many images in a mini-batch?
]]

-- create tmp csv file containing lots of rows
local tmpcsv = paths.tmpname() .. '.csv'
local f = io.open(tmpcsv, 'w')
f:write('filename\n')
for i=1,300 do
  f:write(paths.tmpname() .. '\n')
end
f:close()

dataset = Dataset(tmpcsv)

getBatch, numBatches, reset = dataset.sampledBatcher({
  batchSize = opt.batchSize,
  inputDims = {10, 256},
  verbose = true,
  poolSize = 4,
  get = function(x)
    return torch.FloatTensor(10,256)
  end,
  processor = function(res, processorOpt, input) 
    return true
  end,
})

print('before getBatch')
local batch = getBatch()
print('finish getBatch')

Strangely, the program works when batchSize is 64 but gets stuck in getBatch() when batchSize is 128.

I've run into this in several different projects that use a custom get function and load data with a non-default method such as image.load: batchSize 64 works, but 128 does not.

Any ideas are appreciated. Thanks!

zakattacktwitter commented 8 years ago

Hi,

I'm not sure what you are trying to accomplish with this sample code. Can you provide a high-level explanation of what you want to use Dataset for?

Thanks, Zak

joeyhng commented 8 years ago

In my actual application, I'm usually trying to do something like this:

getBatch, numBatches, reset = dataset.sampledBatcher({
  batchSize = opt.batchSize,
  inputDims = {10, 256},
  verbose = true,
  poolSize = 4,
  get = function(x)
    return torch.load(x) -- or some other loading function like image.load / npy4th.load
  end,
  processor = function(res, processorOpt, input) 
    local x = augment(res) -- some data augmentation function
    input:copy(x)
    return true
  end,
})

That is, I use a custom get function to load the data and do some data augmentation in processor.

I've hit this in several similar setups where the larger batch size gets stuck. Thanks for your help.

zakattacktwitter commented 8 years ago

Try not setting the poolSize option; it's a tricky one to set.
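
For example, your second snippet with poolSize simply left out, so the library falls back to its default:

getBatch, numBatches, reset = dataset.sampledBatcher({
  batchSize = opt.batchSize,
  inputDims = {10, 256},
  verbose = true,
  -- poolSize omitted on purpose; Dataset picks its own default
  get = function(x)
    return torch.load(x)
  end,
  processor = function(res, processorOpt, input)
    local x = augment(res)
    input:copy(x)
    return true
  end,
})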

joeyhng commented 8 years ago

Yes, I find that not setting poolSize avoids the hang, but sometimes after running for a longer time the process gets killed (it just prints "Killed" to stderr), and I haven't figured out why yet. I suspect it is creating too many threads.

Should poolSize be limited by the number of cores on the machine? Are there any guidelines for how to set it?

zakattacktwitter commented 8 years ago

It's not really meant for users to set; I should probably remove it.

The threads are created once at the start and no more are created after that, so it doesn't make sense that the crash is due to too many threads.

The way you are using Dataset, putting torch.load in a custom get function, will create a ton of garbage and definitely won't be speedy.

How is your data laid out? Is it a whole bunch of little files on disk? If you describe your data, I can help you use Dataset to sample it efficiently.

joeyhng commented 8 years ago

I'm processing video data saved on a hard drive mounted on the system. I usually store it in one of two formats:

  1. Extracted frame-level features, usually in npy or t7 format. Each file contains the extracted features of one video as a T x D tensor (rough sketch of my loader below).
  2. Video frames as images. Each video has its own directory containing a number of .jpg files, one per frame. I usually sample and load a few consecutive frames from the directory in the get or processor function.
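
For format 1, my loader looks roughly like this (a rough sketch only; the CSV name, the window of 10 consecutive frames, and the feature dim of 256 are just placeholders):

local Dataset = require 'dataset.Dataset'

-- 'features.csv' lists one .t7 feature file per row (hypothetical file)
local dataset = Dataset('features.csv')

local getBatch, numBatches, reset = dataset.sampledBatcher({
  batchSize = 128,
  inputDims = {10, 256},
  get = function(path)
    -- load the full T x D feature tensor for one video
    return torch.load(path)
  end,
  processor = function(res, processorOpt, input)
    -- copy a random window of 10 consecutive frames into the batch slot
    -- (assumes T >= 10; shorter videos would need padding)
    local start = math.random(1, res:size(1) - 10 + 1)
    input:copy(res:narrow(1, start, 10))
    return true
  end,
})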

Thanks a lot for your help!

zakattacktwitter commented 8 years ago

Hi,

You can now adjust poolSize as much as you want.

The deadlock has been fixed in the IPC (https://github.com/twitter/torch-ipc) package. Just get the latest version of it and you should be good to go.

Thanks, Zak