nicholas-leonard / dp

A deep learning library for streamlining research and development using the Torch7 distribution.

Preprocessing script, and Checkpoint DataSource #116

Open nicholas-leonard opened 9 years ago

nicholas-leonard commented 9 years ago

Quoting a recent discussion concerning pylearn2, Pascal Lamblin (@lamblin) offered some nice solutions to a problem both our libraries are having:

  • Consider Datasets immutable, and only allow read access to the data through an iterator. Current-style preprocessing could be done either by a separate script beforehand, or by a function that returns a different Dataset object. That would help make experiments checkpoint and restart.
  • Have an explicit pipeline of on-the-fly processing on minibatches between the Dataset and the Model. These transformations would not be part of the Theano graph, but would happen on the numeric data. These could be iterators not unlike TransformerIterator, but would not be limited to batch-by-batch processing, and could do things like caching, data augmentation, and data reordering.
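The second idea, an on-the-fly pipeline between dataset and model, can be sketched in plain Lua. This is only an illustration (the function names `pipeline` and `transforms` are invented here, not part of dp or Blocks): a wrapper iterator that applies a chain of per-batch functions lazily, leaving the underlying dataset untouched.

```lua
-- Illustrative sketch: wrap a batch iterator so each batch is passed
-- through a chain of transform functions on the fly. The source data
-- is never mutated; only the batches handed to the model are.
local function pipeline(batchIterator, transforms)
   return function()
      local batch = batchIterator()
      if batch == nil then return nil end  -- iterator exhausted
      for _, transform in ipairs(transforms) do
         batch = transform(batch)
      end
      return batch
   end
end

-- Usage: iterate over toy batches, doubling every value on the fly.
local data = {{1, 2, 3}, {4, 5, 6}}
local i = 0
local iter = function()
   i = i + 1
   return data[i]
end

local doubled = pipeline(iter, {
   function(batch)
      local out = {}
      for j, v in ipairs(batch) do out[j] = v * 2 end
      return out
   end,
})

for batch in doubled do
   print(batch[1], batch[2], batch[3])  -- prints 2 4 6, then 8 10 12
end
```

Because the pipeline stage is just a function over numeric batches, stages like caching or data augmentation can be composed by appending more entries to the `transforms` table.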

While these solutions are offered for pylearn2, they also concern dp. The Preprocess objects currently modify the DataSets in place, so preprocessing has to be redone each time you run an experiment. But you could easily do it once and reuse that checkpoint across experiments. All you would need is a script to create the checkpoint and a means of referring to the resulting files from your experiment.

lamblin commented 9 years ago

Such a solution (credit mostly goes to @dwf, actually) has also been implemented recently in Blocks by @bartvm, if you want to have a look.

bartvm commented 9 years ago

We call the on-the-fly preprocessors data streams, while the datasets themselves are immutable. For a lengthier discussion of how we do checkpointing, you can look here.

nicholas-leonard commented 9 years ago

@bartvm very nice package this Blocks. I am definitely using it as a reference point. Love the doc. Thanks.

nicholas-leonard commented 9 years ago

Found an intermediate/quick-fix solution (for checkpoints): https://github.com/nicholas-leonard/dp/commit/bbeeeab4f7ef15a931cbcc94a3778e839071a6d8

```lua
-- Build the preprocessed datasource once and cache it on disk;
-- subsequent runs load it from checkpointPath instead of re-preprocessing.
datasource = torch.checkpoint(checkpointPath, function()
   return dp.Mnist{input_preprocess=input_preprocess}
end)
```
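The pattern behind such a helper is disk memoization. A rough sketch of how it could work is below; this is an illustration of the idea, not the exact code from the linked commit, and it assumes Torch7's real `torch.save`/`torch.load` serialization and the `paths.filep` file-existence check:

```lua
require 'torch'
require 'paths'

-- Sketch of a disk-memoization helper in the spirit of torch.checkpoint
-- (illustrative; see the linked commit for the actual implementation).
-- If the checkpoint file exists, deserialize and return it; otherwise
-- build the object with the factory function, serialize it, and return it.
local function checkpoint(path, factory)
   if paths.filep(path) then
      -- Reuse the previously built (and preprocessed) object.
      return torch.load(path)
   end
   local obj = factory()   -- e.g. constructs and preprocesses dp.Mnist
   torch.save(path, obj)   -- cache it for subsequent experiments
   return obj
end
```

The first run pays the full preprocessing cost and writes the serialized datasource; every later run deserializes it, which is exactly the "preprocess once, reuse everywhere" workflow discussed above.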