mila-iqia / fuel

A data pipeline framework for machine learning
MIT License

Introduce `source_shapes` for H5PYDataset #392

Open rubenvereecken opened 7 years ago

rubenvereecken commented 7 years ago

This pull request is meant to initiate discussion and is by no means finished.

I needed to get the dimensions of my data before reading any data from my HDF5 files. There is the `num_examples` attribute, but of course that only covers one dimension. I could not find any straightforward way to get all dimensions, except that `H5PYDataset.source_shapes` seemed to represent what I wanted, although it wasn't really implemented. It may well be that I missed an existing way of getting my sources' dimensions, but in the meantime I've implemented the `source_shapes` attribute to do what I need, albeit not in all possible scenarios.

If you agree that this `source_shapes` attribute is useful, I could look into how to complete the feature, because currently it only works if the user has provided no custom slices (which suits my use case just fine for now).

A small edit to explain why I want these dimensions: I want to specify the shape of a neural network's input layer from the spec that is already present in the HDF5 datasets.
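
As a rough sketch of that use case, assuming a source is stored with the example axis first (for instance `(num_examples, channels, height, width)`), the input layer's shape could be derived by replacing the leading axis with the batch size. The helper name `input_shape_from_source` and the example shape are hypothetical and are not part of Fuel.

```python
def input_shape_from_source(source_shape, batch_size=None):
    """Turn a full source shape into an input-layer shape.

    Assumes the first axis indexes examples; it is replaced by
    `batch_size` (None meaning "any batch size").
    """
    if len(source_shape) < 1:
        raise ValueError("expected at least one axis (the example axis)")
    return (batch_size,) + tuple(source_shape[1:])


# Hypothetical shape of an image source: 50000 examples of 3x32x32.
features_shape = (50000, 3, 32, 32)

print(input_shape_from_source(features_shape))       # (None, 3, 32, 32)
print(input_shape_from_source(features_shape, 128))  # (128, 3, 32, 32)
```

Only the trailing axes matter for the layer definition, so the number of examples never has to be read from the data itself.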

dmitriy-serdyuk commented 7 years ago

This is a useful feature. Do you plan to get one batch from the dataset to compute its shape?

rubenvereecken commented 7 years ago

No. As far as I know, h5py datasets have a `shape` attribute, which should be just fine. I've never used dimension scales, though, nor do I know much about variable-length datasets. Either way, I think variable length only applies to the first dimension? And if you got a batch from the dataset, you wouldn't learn anything about the first dimension anyway.
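
To illustrate, here is a minimal sketch of reading shapes through h5py's `shape` attribute without loading any data. It uses an in-memory file so nothing is written to disk. The helper `read_source_shapes`, the file name, and the source names `features` and `targets` are assumptions made for the example. This is not the implementation in this pull request, and it ignores subsets and custom slices.

```python
import h5py


def read_source_shapes(h5file, sources):
    """Map each source name to the shape of its HDF5 dataset.

    Only dataset metadata is accessed; no array data is read.
    """
    return {name: h5file[name].shape for name in sources}


# In-memory HDF5 file (core driver, no backing store), for illustration only.
with h5py.File("example.hdf5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("features", shape=(100, 3, 32, 32), dtype="uint8")
    f.create_dataset("targets", shape=(100, 1), dtype="uint8")
    shapes = read_source_shapes(f, ["features", "targets"])

print(shapes)  # {'features': (100, 3, 32, 32), 'targets': (100, 1)}
```

Unlike fetching a batch, this also gives the size of the first dimension, since the full dataset shape is stored in the file's metadata.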