twitter-archive / torch-dataset

An extensible and high performance method of reading, sampling and processing data for Torch
Apache License 2.0
76 stars, 24 forks

How to use SlowFS? #30

Closed chienlinhuang1116 closed 8 years ago

chienlinhuang1116 commented 8 years ago

Hi,

I would like to use SlowFS but don't know how to use it. Originally, we read data using the following script, pointing it at a single data file and loading it into memory.

local dataset = Dataset('/test/hdfs/files/part-r-00000.t7', {partition = 1, partitions = 1})
local getBatch, numBatches = dataset.sampledBatcher({
  samplerKind = 'linear',
  batchSize = 1,
  inputDims = {1},
  processor = function(res, opt, input)
      print(res)
      return true
  end,
})

I am trying to make the files load dynamically as the training sampler progresses through them. The script was changed as follows to match that goal.

local dataset = Dataset('viewfs:///test/hdfs/files', {partition = 1, partitions = 1})
local getBatch, numBatches = dataset.sampledBatcher({
  samplerKind = 'part-linear',
  batchSize = 1,
  inputDims = {1},
  processor = function(res, opt, input)
      print(res)
      return true
  end,
})

The problem is that 'res' should be a 'torch.*Tensor', but it is now a 'string'. Do you have any idea why?

Thank you.

chienlinhuang1116 commented 8 years ago

This is related to

Can we open files by "FloatTensor" instead of "io.popen" when testing "test_StreamedDataset.lua" ? #33

Thank you.

zakattacktwitter commented 8 years ago

Hi,

Can you describe what you are trying to do? There is no default implementation of the SlowFS interface in the GitHub code. Internally at Twitter we have an implementation that uses Hadoop to fetch files.

It doesn't make sense to hook it up to a tensor. Can you elaborate on what you are trying to do?

Thanks, Zak


zakattacktwitter commented 8 years ago

Hi,

SlowFS is meant for a slow, remote file system like Hadoop or S3. If your data is already on your machine's disk, then you do not want to use SlowFS.

I think what you want is more along the lines of the IndexDirectory setup: you have a bunch of files on your local disk and just want to construct an Index/Dataset based on all the files that are present.

The big difference is that IndexDirectory does not currently support files that are chunks containing multiple items. That support would need to be added.

A simple fix would be to store all your data in a directory, with each file holding one tensor that you would like to train on. From that directory you can simply construct a Dataset and use it.
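A minimal sketch of that one-tensor-per-file setup might look like the following. The directory path, file names, and tensor sizes are illustrative, not from the thread, and the processor assumes the raw item arrives as the file's serialized bytes (as described later in this thread):

```lua
local torch = require 'torch'
local Dataset = require 'dataset.Dataset'

-- 1) Write each training item as its own .t7 file in one directory.
--    (Path and count are hypothetical.)
for i = 1, 100 do
   local item = torch.FloatTensor(500):uniform()  -- one ~500-dim feature vector
   torch.save(string.format('/tmp/features/part-%05d.t7', i), item)
end

-- 2) Point the Dataset at the directory so IndexDirectory enumerates the files.
local dataset = Dataset('/tmp/features')
local getBatch, numBatches = dataset.sampledBatcher({
   samplerKind = 'linear',
   batchSize = 16,
   inputDims = {500},
   processor = function(res, opt, input)
      -- res holds the file's contents; decode and copy into the batch slot
      input:copy(torch.deserialize(res))
      return true
   end,
})
```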

Hope this helps.


chienlinhuang1116 commented 8 years ago

Hi,

I have about one thousand hours of speech data. After feature extraction, I have four hundred million floating-point vectors, each of dimension about 500. I would like to use these vectors to train NNs. In the mnist.lua and cifar10.lua examples, I can point at a small data file and process it. Because my data is huge, I need to split it into several thousand chunk files that can be loaded dynamically as the training sampler progresses through them, yielding FloatTensor values.

SlowFS should be a solution, but I do not know how to use it. There is an example at 'test/test_StreamedDataset.lua', but it returns the raw binary content instead of FloatTensor values. By modifying 'test/test_StreamedDataset.lua', 'IndexSlowFS.lua' and 'Reader.lua', the thousands of chunk files can now be loaded dynamically and return FloatTensor values.

Getters.lua

local function getters(url, indexType)
   -- route HTTP URLs to the HTTP getter; everything else to a tensor getter
   if url and string.sub(url, 1, 4) == 'http' then
      return getHTTP
   else
      return getTensor
   end
end

IndexSlowFS.lua

local function makeFileIndex(fileName, url, opt, idx, SlowFS)
   local Cache = require 'dataset.Cache'
   local cache = Cache(opt)
   local slowFS = SlowFS(cache, opt)
   local fileURL = url .. '/' .. fileName
   -- fetch the remote file into the local cache and index it as a single item
   local fpath = slowFS.get(fileURL)
   local offsets = {0}
   return {
      url = url,
      fileName = fileName,
      filePath = fpath,
      itemCount = 1,
      offsets = offsets,
      idx = idx,
   }
end

Reader.lua

-- load the whole file as a Torch object instead of returning its raw bytes
if item.offset ~= nil then
    res[i] = torch.load(item.url)
else
    res[i] = resi
end

Although it works now, this does not seem like a proper solution. Do you have any thoughts on these changes?

I got about a 4x speedup over a single GPU when using 6 GPUs on the small dataset. However, it became much slower when using SlowFS on the big dataset. Is that expected?

Thank you, Chien-Lin

chienlinhuang1116 commented 8 years ago

Thank you Zak :D

It works well using IndexDirectory.lua. However, I needed to modify Reader.lua at line 77 to res[i] = torch.load(item.url) to make it work. The reason I modified line 77 is that, with IndexDirectory.lua, the return value is the raw binary content instead of a Torch FloatTensor.

How can I get the return value to be a tensor, as with IndexTensor.lua, when using IndexDirectory.lua?
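(For reference, one way to get a tensor without patching Reader.lua may be to decode the raw bytes inside the processor. A hedged sketch, assuming `res` holds the torch.save'd bytes of a single FloatTensor and that torch.deserialize can parse them:)

```lua
-- Processor that decodes the raw file bytes into a tensor per item.
processor = function(res, opt, input)
   -- attempt to deserialize the byte string into a Torch object
   local ok, tensor = pcall(torch.deserialize, res)
   if not ok then
      return false  -- skip items that fail to decode
   end
   input:resize(tensor:size()):copy(tensor)
   return true
end
```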

Thank you, Chien-Lin

chienlinhuang1116 commented 8 years ago

Thank you. The discussion and answer can be found in #34.