huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Loading Data from HDF files #3113

Open FeryET opened 2 years ago

FeryET commented 2 years ago

Is your feature request related to a problem? Please describe. More often than not I come across big HDF datasets, and currently there is no straightforward way to feed them to a dataset.

Describe the solution you'd like: I would love to see a from_h5 method that takes a user-implemented interface describing how items are extracted from the file (for cases where it holds multiple datasets containing arrays, metadata, etc.).

Describe alternatives you've considered: Currently I load HDF files manually with h5py and implement the PyTorch Dataset interface. For small h5 files I load them into a pandas DataFrame and use the from_pandas function in the datasets package, but for big datasets this is not feasible.

Additional context: HDF files are widespread across different domains and are one of the go-to formats for many researchers, scientists, and engineers who work with numerical data. Given that datasets' use cases have outgrown NLP, it would make a lot of sense to focus on things like supporting HDF files.
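
For readers wanting something in the meantime, the h5py workaround described above can be made lazy. The following is only a rough sketch, not an API of the datasets library: it assumes a flat HDF5 file whose top-level datasets all share the same first-dimension length, and the names iter_rows and hdf5_rows are hypothetical helpers invented for this illustration.

```python
from typing import Any, Dict, Iterator, Mapping


def iter_rows(columns: Mapping[str, Any], chunk_size: int = 1024) -> Iterator[Dict[str, Any]]:
    """Yield one dict per row, reading at most `chunk_size` rows at a time.

    `columns` maps a column name to anything that supports len() and slicing,
    such as the datasets inside an open h5py.File, or plain lists.
    Assumption: every column has the same length.
    """
    names = list(columns)
    if not names:
        return
    n_rows = len(columns[names[0]])
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        # Slicing an h5py dataset reads only this chunk from disk.
        chunk = {name: columns[name][start:stop] for name in names}
        for i in range(stop - start):
            yield {name: chunk[name][i] for name in names}


def hdf5_rows(path: str, chunk_size: int = 1024) -> Iterator[Dict[str, Any]]:
    """Row generator over an HDF5 file (hypothetical helper).

    h5py is a third-party package and is imported lazily here.
    """
    import h5py

    with h5py.File(path, "r") as f:
        yield from iter_rows(f, chunk_size)
```

A generator like this could then be handed to Dataset.from_generator, for example Dataset.from_generator(hdf5_rows, gen_kwargs={"path": "data.h5"}), where "data.h5" is a placeholder path. Files with nested groups or datasets of differing lengths would need a custom extraction step, which is what the requested from_h5 interface is meant to cover.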

DiGyt commented 2 years ago

I'm currently working on bringing Ecoset to huggingface datasets and I would second this request...

jacobbieker commented 2 years ago

I would also like this support or something similar. Geospatial datasets come in netCDF, which is derived from HDF5, or in Zarr. I've gotten Zarr stores to work with datasets and streaming, but it takes a while to convert the data to Zarr if it isn't stored that way natively.

VijayKalmath commented 2 years ago

@mariosasko, I would like to contribute to this "good second issue". Is there anything in the works for this issue, or can I go ahead?

mariosasko commented 2 years ago

Hi @VijayKalmath! As far as I know, nobody is working on it, so feel free to take over. Also, before you start, I suggest you comment #self-assign on this issue to assign it to yourself.

VijayKalmath commented 2 years ago

self-assign

zutarich commented 11 months ago

Hey @mariosasko, can you assign this issue to me?

shermansiu commented 8 months ago

So basically, we just need to convert HDF5 files to Parquet?

E.g., like this? https://stackoverflow.com/questions/46157709/converting-hdf5-to-parquet-without-loading-into-memory
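
One way to read that suggestion, as a rough sketch rather than anything the maintainers have endorsed: copy the file chunk by chunk so that only one slice is in memory at a time. The code below is not taken from the linked answer. It assumes a flat HDF5 file whose top-level datasets are one-dimensional and equally long, and the function names chunk_bounds and hdf5_to_parquet are hypothetical.

```python
from typing import Iterator, Tuple


def chunk_bounds(n_rows: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) index pairs that cover range(n_rows) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def hdf5_to_parquet(h5_path: str, parquet_path: str, chunk_size: int = 100_000) -> None:
    """Stream an HDF5 file into a single Parquet file (hypothetical helper).

    Assumption: top-level 1-D datasets of equal length.
    h5py and pyarrow are third-party packages, imported lazily.
    """
    import h5py
    import pyarrow as pa
    import pyarrow.parquet as pq

    with h5py.File(h5_path, "r") as f:
        names = list(f.keys())
        n_rows = f[names[0]].shape[0]
        writer = None
        try:
            for start, stop in chunk_bounds(n_rows, chunk_size):
                # Only this slice of each dataset is read into memory.
                table = pa.table({name: f[name][start:stop] for name in names})
                if writer is None:
                    writer = pq.ParquetWriter(parquet_path, table.schema)
                writer.write_table(table)
        finally:
            if writer is not None:
                writer.close()
```

The resulting file could then be loaded with load_dataset("parquet", data_files=...). Multi-dimensional arrays and nested groups are not handled by this sketch and would need their own mapping to Parquet columns.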