mila-iqia / fuel

A data pipeline framework for machine learning

Save transformed dataset to disk #230

Open fvisin opened 9 years ago

fvisin commented 9 years ago

I know it was discussed before, but I wasn't able to find an open issue on this.

It would be useful to have a module that takes one or more data streams (I hope that's the correct name; I mean whatever comes out of an iterator or a transformer) as input and writes them to a dataset file. This would make it possible to avoid recomputing the same preprocessing for every experiment in cases where it is very time-consuming or, more generally, when computation is more expensive than storage.
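
Just to make the idea concrete, here is a hypothetical call; save_stream does not exist, and the name and target file are made up:

from fuel.datasets import MNIST
from fuel.schemes import SequentialScheme
from fuel.streams import DataStream
from fuel.transformers import Flatten

mnist = MNIST(('train',))
stream = Flatten(DataStream(
    mnist, iteration_scheme=SequentialScheme(mnist.num_examples, 128)))
# Hypothetical: persist everything the stream produces so later
# experiments can load it directly instead of re-running Flatten.
save_stream(stream, 'mnist_flattened.hdf5')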

vdumoulin commented 9 years ago

Agreed! A quick fix for datasets which fit into memory might look like this:

import numpy
from six.moves import cPickle

from fuel.datasets import IndexableDataset

def preprocess_dataset(dataset, transformers, save_path):
    """Preprocess a dataset and save it to disk.

    Parameters
    ----------
    dataset : :class:`fuel.datasets.Dataset`
        Dataset to preprocess.
    transformers : :class:`tuple`
        Sequence of ``(transformer_class, args, kwargs)`` describing
        the preprocessing pipeline.
    save_path : str
        Where to save the preprocessed dataset.
    """
    preprocessing_stream = dataset.apply_default_transformers(transformers)
    # Run through the stream once and concatenate its output batches into
    # one array per source (assumes batches of numpy arrays that fit in memory).
    batches = list(preprocessing_stream.get_epoch_iterator())
    preprocessed_data = dict(
        zip(preprocessing_stream.sources,
            [numpy.concatenate(source_batches)
             for source_batches in zip(*batches)]))
    with open(save_path, 'wb') as f:
        cPickle.dump(preprocessed_data, f)

class PreprocessedDataset(IndexableDataset):
    """Interfaces with datasets preprocessed by :func:`preprocess_dataset`.

    Parameters
    ----------
    path : str
        Path to the preprocessed dataset.
    """
    def __init__(self, path, **kwargs):
        with open(path, 'rb') as f:
            indexables = cPickle.load(f)
        super(PreprocessedDataset, self).__init__(indexables, **kwargs)
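
A hypothetical round trip built on the draft above (the file name and the particular transformer are just for illustration):

from fuel.datasets import MNIST
from fuel.transformers import ScaleAndShift

mnist = MNIST(('train',))
# Rescale pixel values once, save the result, and reload it later
# without redoing the work.
preprocess_dataset(
    mnist,
    ((ScaleAndShift, [1 / 255.0, 0], {'which_sources': ('features',)}),),
    'mnist_scaled.pkl')
scaled_mnist = PreprocessedDataset('mnist_scaled.pkl')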

For large datasets which are stored as an HDF5 file, things might get a little hairier: we'd need to prepare the HDF5 file and sequentially populate it. However, it's entirely feasible as well.
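
A minimal sketch of that direction, assuming the per-example shape and dtype are known up front and that the stream has a single 'features' source (both assumptions, as is the single 'train' split):

import h5py
from fuel.datasets.hdf5 import H5PYDataset

def stream_to_hdf5(stream, save_path, num_examples, example_shape, dtype):
    """Sequentially populate an HDF5 file from a batch stream."""
    with h5py.File(save_path, mode='w') as f:
        features = f.create_dataset(
            'features', (num_examples,) + example_shape, dtype=dtype)
        row = 0
        # Assumes a single source producing batches of numpy arrays.
        for (batch,) in stream.get_epoch_iterator():
            features[row:row + len(batch)] = batch
            row += len(batch)
        # Register one 'train' split covering everything so that
        # H5PYDataset can read the file back.
        f.attrs['split'] = H5PYDataset.create_split_array(
            {'train': {'features': (0, num_examples)}})

The file could then be read back with H5PYDataset(save_path, which_sets=('train',)).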

dwf commented 9 years ago

I feel like we want to be careful about doing too much magic on the save-to-HDF5 front. Automatically inferring the right way to store the preprocessed data is probably hard, and it's fair to expect users to put some work in to benefit from this.

fvisin commented 9 years ago

[Disclaimer: I don't know Fuel very well, and I am very aware that dealing with data is always a pain! I am "brainstorming" to help the discussion, but some or all of my ideas may not make sense or may not be feasible for technical reasons I cannot predict.]

@vdumoulin Thanks for quickly providing a draft solution! I think the best approach would be to save the data to disk in such a way that the usual Dataset objects can be used to read it back. Is there any reason why you proposed a PreprocessedDataset class instead? Was it to avoid the difficulties that might arise when creating an HDF5 file? (It is the standard format for every dataset in Fuel, right?)

@dwf I share your fear that automating the process too much could make it difficult to implement and, most importantly, to maintain. Let's keep this in mind. That said, in most cases the to-be-saved data will retain many of the properties of the original data. It should be possible to infer how to save the transformed data by looking at how the original data was stored on disk and which Transformers have been applied. The Transformer itself could probably provide all the required information (output sources, labels, shapes, ...). Does that make sense?
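
One way to picture it: Fuel streams already expose sources and axis_labels, so a saver could inspect the final stream and derive most of the on-disk layout from those attributes; the missing piece, per-source shapes and dtypes, would have to come from peeking at one batch. The sketch below is only an illustration of that idea, not existing Fuel API:

def infer_layout(stream):
    """Guess how to lay out the transformed data on disk.

    Sketch only: peeks at one batch to recover shapes and dtypes,
    which Transformers do not currently advertise themselves.
    Assumes a batch stream of numpy arrays.
    """
    peek = next(stream.get_epoch_iterator())
    layout = {}
    for source, batch in zip(stream.sources, peek):
        layout[source] = {
            'shape': batch.shape[1:],    # per-example shape
            'dtype': batch.dtype,
            'axis_labels': (stream.axis_labels or {}).get(source),
        }
    return layout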

dwf commented 9 years ago

In theory it would be nice if the chain of Transformers could propagate enough information forward to make this practical, and the only thing you'd need to know from the original dataset would be the total number of entries to expect. In practice, we don't even propagate axis labels in all cases, AFAIK.

Off-hand, I think the sort of thing we'd like to know is:

Then there's all the bookkeeping related to subsets and splits that I haven't totally followed as of late.

rizar commented 9 years ago

Transforming a dataset into another on-disk dataset is going to be very difficult, I think, given the list of things to be handled that @dwf drafted above. On the other hand, we could do slightly better than @vdumoulin's code by saving not a single pickle, but a file containing a pickle for every example. This way large data streams can be saved and then traversed in exactly the same order.

I think this would be quite a desirable feature.
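
A rough sketch of that idea, assuming the stream yields examples as tuples; the read-back side simply unpickles until EOF, which preserves the original order:

from six.moves import cPickle

def save_stream_as_pickles(stream, save_path):
    # One pickle per example, appended to the same file.
    with open(save_path, 'wb') as f:
        for example in stream.get_epoch_iterator():
            cPickle.dump(example, f, protocol=cPickle.HIGHEST_PROTOCOL)

def iterate_pickles(save_path):
    # Yields the examples back in exactly the order they were written.
    with open(save_path, 'rb') as f:
        while True:
            try:
                yield cPickle.load(f)
            except EOFError:
                return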

dwf commented 9 years ago

This might be interesting (although I'd suggest that one might want to serialize batches from a batch stream as well, not just examples).

Better yet, you can save numpy arrays when that's going to be more efficient, and pickles when necessary. numpy.load() will load either.

The dumbest possible thing (that is fast to read but not very fault tolerant) is to just concatenate multiple pickled objects/serialized arrays on disk. Then, you can call numpy.load repeatedly on an open file handle to step through the different serialized objects (this works, I have tried it).
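
For the array case, that looks roughly like this (a sketch: numpy.save appends one serialized array per call, and numpy.load on the open handle reads them back one at a time):

import numpy

def save_batches(stream, save_path):
    # Serialize each source array of each batch, back to back, in one file.
    with open(save_path, 'wb') as f:
        for batch in stream.get_epoch_iterator():
            for source_data in batch:
                numpy.save(f, source_data)

def load_batches(save_path, num_sources):
    # Step through the file with repeated numpy.load calls, yielding
    # batches in the order they were written.
    with open(save_path, 'rb') as f:
        while f.peek(1):
            yield tuple(numpy.load(f) for _ in range(num_sources))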