Open FeryET opened 2 years ago
I'm currently working on bringing Ecoset to huggingface datasets and I would second this request...
I would also like this support, or something similar. Geospatial datasets come in NetCDF (which is derived from HDF5) or Zarr. I've gotten Zarr stores to work with `datasets` and streaming, but it takes a while to convert the data to Zarr if it's not stored natively in that format.
@mariosasko, I would like to contribute to this "good second issue". Is there anything in the works for this issue, or can I go ahead?
Hi @VijayKalmath! As far as I know, nobody is working on it, so feel free to take over. Also, before you start, I suggest you comment #self-assign on this issue to assign it to yourself.
Hey @mariosasko, can you assign this issue to me?
So basically, we just need to load HDF5 files into Parquet?
e.g. like this? https://stackoverflow.com/questions/46157709/converting-hdf5-to-parquet-without-loading-into-memory
**Is your feature request related to a problem? Please describe.**
More often than not I come across big HDF5 datasets, and currently there is no straightforward way to feed them to a dataset.

**Describe the solution you'd like**
I would love to see a `from_h5` method that takes an interface, implemented by the user, describing how items are extracted from the file (in case of multiple datasets containing elements like arrays, metadata, etc.).

**Describe alternatives you've considered**
Currently I manually load HDF5 files using `h5py` and implement the PyTorch dataset interface. For small HDF5 files I load them into a pandas DataFrame and use the `from_pandas` function in the `datasets` package, but for big datasets this is not feasible.

**Additional context**
HDF5 files are widespread across domains and are one of the go-tos for many researchers/scientists/engineers who work with numerical data. Given that the use cases of `datasets` have outgrown NLP, it would make a lot of sense to focus on supporting HDF5 files.
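For reference, the manual `h5py` workaround described above can be wrapped in a plain Python generator so rows are read lazily rather than loaded into a DataFrame up front. `h5_rows` and its arguments are hypothetical names for illustration:

```python
import h5py

def h5_rows(h5_path, columns):
    """Yield one example dict per row, reading lazily from the HDF5 file."""
    with h5py.File(h5_path, "r") as f:
        dsets = [f[name] for name in columns]
        n_rows = dsets[0].shape[0]
        for i in range(n_rows):
            # h5py only reads the slices actually touched, so memory stays bounded.
            yield {name: dset[i] for name, dset in zip(columns, dsets)}
```

A generator like this could then be handed to `datasets.Dataset.from_generator` (available in recent `datasets` releases), e.g. `Dataset.from_generator(lambda: h5_rows("data.h5", ["image", "label"]))`, as a stopgap until a dedicated `from_h5` exists.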