huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

When loading datasets with HuggingFace datasets.load_dataset (e.g. cifar10), would it be possible to return the dataset without automatic decoding? #236

Closed jychen21 closed 1 year ago

jychen21 commented 1 year ago

Feature request

When loading datasets such as cifar10 with HuggingFace datasets.load_dataset, would it be possible to return the dataset without decoding the images automatically?

Motivation

According to https://github.com/huggingface/optimum-habana/pull/189, the scaling efficiency is about 72.6% for Gaudi2 and 79.4% for Gaudi. We found that the efficiency on Gaudi2 is low because of the data loader, so we intend to implement a data loader (especially for Gaudi2) based on the Habana Media Pipeline to do the decoding, RandomResizedCrop, RandomHorizontalFlip, and Normalize.

As described in the cifar10 dataset data fields, when accessing the image column with dataset[0]["image"], the image file is automatically decoded, and this decoding is executed on the CPU. Could the dataset instead just expose a root path to the image files, so that our self-defined data loader can do the decoding on the HPU?

Your contribution

Implement a Habana media-based data loader

regisss commented 1 year ago

Hi @jychen-habana!

If you do:

import datasets

ds = datasets.load_dataset("cifar10")
ds = ds.cast_column("img", datasets.Image(decode=False))

then images won't be automatically decoded. For instance, ds["train"][0]["img"] looks like:

{'bytes': b'...', 'path': None}

path and bytes can be None, but never both at the same time.
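
As a minimal sketch of how a custom data loader might consume such an undecoded sample, the helper below returns the still-encoded image bytes whichever of the two fields is set. The name read_encoded_image is hypothetical and not part of datasets; it only assumes the {'bytes': ..., 'path': ...} layout shown above.

```python
from pathlib import Path


def read_encoded_image(sample: dict) -> bytes:
    """Return the still-encoded image bytes of an undecoded Image sample.

    Hypothetical helper (not part of `datasets`). It assumes `sample` has
    the {'bytes': ..., 'path': ...} layout, where at least one is set.
    """
    data = sample.get("bytes")
    if data is not None:
        # The encoded file content is embedded directly in the dataset.
        return data
    path = sample.get("path")
    if path is None:
        raise ValueError("sample has neither 'bytes' nor 'path'")
    # Otherwise only a file path is stored, so read the raw file from disk.
    return Path(path).read_bytes()


# Possible usage with the dataset from the snippet above (not executed here):
# raw = read_encoded_image(ds["train"][0]["img"])
```

The returned bytes are not decoded, so they could then be handed to whatever decoder the data loader uses.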

Does it help?

jychen21 commented 1 year ago

Thanks! I will give it a try.

jychen21 commented 1 year ago

Closing, since datasets can set decode to False.