twitter-archive / torch-dataset

An extensible and high-performance method of reading, sampling and processing data for Torch
Apache License 2.0

Best way to load dataset with large batchsize #27

Closed · mkvb01 closed this issue 8 years ago

mkvb01 commented 8 years ago

Hi,

We would like to use the torch-dataset/distlearn packages for audio data and a speech recognition application. Our entire data set would be split up into parts (chunks) to avoid out-of-memory situations. For instance, each part would be stored in Tensor format, and the loading code may look like this:

-- Define (global) batch size and number of nodes (GPUs):
opt.batchSize = 8192
opt.numNodes = 8

-- Adapt batch size, per node:
opt.batchSize = math.ceil(opt.batchSize / opt.numNodes)
print('Batch size: per node = ' .. opt.batchSize .. ', total = ' .. (opt.batchSize * opt.numNodes))

-- Load the dataset
local trainingDataset = Dataset('/home/test/torch-dataset-master/train.t7', {
   -- Partition dataset so each node sees a subset:
   partition = opt.nodeIndex,
   partitions = opt.numNodes,
})

local getTrainingBatch, numTrainingBatches = trainingDataset.sampledBatcher({
   samplerKind = 'linear',
   batchSize = opt.batchSize, -- 1024
   inputDims = { 600 },
   verbose = true,
   cuda = true,
   processor = function(res, processorOpt, input)
      input:copy(res)
      return true
   end,
})

This results in a large number of worker (CPU) threads being created, and ultimately this error is thrown:

.../torch/install/share/lua/5.1/dataset/Reader.lua:52: ERROR: (/home/.../torch-ipc-master/src/map.c, 107): (11, Resource temporarily unavailable)

We believe the problem is that the large number of threads exhausts the ulimit on the number of processes (2048 on our system):

Dataset.lua, line 72:

   poolSize = opt.poolSize or (numBuffers * opt.batchSize),

Reader.lua, line 43:

   local numWorkers = opt.poolSize or 64

With numBuffers being 2, this sets numWorkers to 2 * 1024 = 2048 per node. With 8 nodes that gives 16384 threads, and we had to raise our ulimit accordingly to make the "Resource temporarily unavailable" error above disappear. We tried to reduce the number of threads by setting the poolSize parameter directly in the sampledBatcher() call, but then the getBatch() function hangs and never returns.
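To spell out the arithmetic (a sketch of the defaulting logic quoted above; numBuffers = 2 is the value we see in Dataset.lua):

-- worker count per node when no poolSize is passed in:
local numBuffers = 2                       -- hard-coded in Dataset.lua
local batchSize = 8192 / 8                 -- 1024 per node after splitting
local poolSize = numBuffers * batchSize    -- 2048 worker threads per node
print(poolSize * 8)                        -- 16384 threads across all 8 nodes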

Could you please let me know what would be the best/recommended way to load our dataset with the torch-dataset package? We are not sure such a high number of worker threads is really necessary. Also, any pointers on how we can reorganize our data set to better fit the dataset package's requirements are welcome.

Thank you!

zakattacktwitter commented 8 years ago

Hi,

Dataset uses a ton of threads under the hood to hide IO when pulling assets in over the network. If your data is already local to your machine, then far fewer threads are needed. Further, on OSX, Apple has artificially limited the number of open file handles (so they can sell OSX Server with an unlimited ulimit), which causes Lua's require system to die.

Can you try setting poolSize to something reasonable, like 256? That should be all you need.

mkvb01 commented 8 years ago

Thanks for the quick reply. I agree, setting a poolSize of 256 should be sufficient.

However, we already tried setting poolSize to a lower value, e.g.:

local getTrainingBatch, numTrainingBatches = trainingDataset.sampledBatcher({
   samplerKind = 'linear',
   batchSize = opt.batchSize, -- 1024
   inputDims = { 600 },
   poolSize = 256,
   verbose = true,
   cuda = true,
   processor = function(res, processorOpt, input)
      input:copy(res)
      return true
   end,
})

but then the getTrainingBatch() function hangs (the status of the luajit processes in top is sleeping, with 0.3% CPU usage) and never returns. Perhaps there is a synchronization issue, or we have configured something incorrectly ...

Any idea how to resolve/debug this problem?

zakattacktwitter commented 8 years ago

Hi,

The deadlock has been fixed in the IPC ( https://github.com/twitter/torch-ipc ) package. Just get the latest version of it and you should be good to go.
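For reference, updating might look something like this (assuming a luarocks-based source install; the exact rockspec path is an assumption and may differ in the repo):

git clone https://github.com/twitter/torch-ipc.git
cd torch-ipc
luarocks make rocks/ipc-scm-1.rockspec   # rockspec name assumed; check the repo's rocks/ directory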

Thanks, Zak

mkvb01 commented 8 years ago

I can confirm that with the updated IPC package we no longer see the problem.
This issue can be closed.

Thanks Zak!