[FR] Add support for loading data in HDF5 format

voxel51 / fiftyone

Refine high-quality datasets and visual AI models

https://fiftyone.ai

Apache License 2.0

[FR] Add support for loading data in HDF5 format #526

Open brimoor opened 4 years ago

brimoor commented 4 years ago

This was a request from a recent webinar.

One option for supporting this would be to ingest the data from HDF5 format and write the images and other media to an internal FiftyOne directory as standard individual files at dataset creation time. This is analogous to how we support loading data in TFRecords format, for example.

There is an h5py package on GitHub for working with HDF5-formatted data.

oguz-hanoglu commented 2 years ago

Any progress?

brimoor commented 2 years ago

We haven't had time to add "native" HDF5 support yet.

FiftyOne currently requires access to each individual image via the filepath field of each sample, which must be in an image format that web browsers can display (png, jpg, tiff, etc., possibly with a browser extension installed).

The way to work with HDF5 data currently would be to unpack it using h5py into regular images on disk so you can construct a FiftyOne dataset.

It would be awesome to have a custom importer contributed that would automate this unpacking, similar to how TFRecords can be imported, for example 🤗

oguz-hanoglu commented 2 years ago

Using the library you mentioned, unpacking an HDF5 file looks like this:

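import cv2
import h5py

HDF5_FILE = "data.h5"
with h5py.File(HDF5_FILE, "r") as f:
    # Give each image its own filename so later images don't overwrite earlier ones
    for i, img in enumerate(f["images"]):
        cv2.imwrite(f"{i:06d}.png", img)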

So, would it be useful if we simply

implement a foud.UnlabeledImageDatasetImporter?
take the HDF5 file path and key ("images" in the example) as input?
have the setup method include code similar to the snippet above?

The rest would be very similar to an unlabeled version of this.

zero0kiriyu commented 1 year ago

Hello, I also encountered the same problem. Is there any progress? I have a large amount of image data, but reading it from disk is very slow. It would be great if we could read images from formats such as HDF5 or LMDB.