ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Data] `ray.data.from_huggingface` does not work as expected #36296

Open Rohan138 opened 1 year ago

Rohan138 commented 1 year ago

What happened + What you expected to happen

cc: @amogkam

HuggingFace datasets applies postprocessing on top of the underlying data that does not get carried over when we create a Ray dataset using `ray.data.from_huggingface`. This is because the postprocessing isn't part of the underlying pyarrow table; it is defined by the dataset features.

This results in unexpected behavior when migrating HuggingFace batch inference code to Ray, because we instantiate the Ray Dataset from the underlying `pa.Table` object.

A possible hacky workaround (see example) is to first convert the HF dataset to e.g. pandas or numpy, then to a Ray Dataset; we currently only support in-memory HF datasets anyway. Otherwise, we could call the feature decoding (e.g. `dataset.features.decode_example`) ourselves inside `ray.data.from_huggingface`.

This issue likely applies to most image and audio datasets on HF.
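
For reference, here is a minimal sketch of the decoding step itself, calling the `datasets` library's `Features.decode_example` by hand on a raw row to show exactly what gets lost (a sketch for illustration, not the `from_huggingface` code path):

from datasets import load_dataset

dataset = load_dataset("PolyAI/minds14", "en-US", split="train[:10]")

# One undecoded row straight from the underlying pyarrow table...
raw_row = dataset.with_format("arrow")[0].to_pylist()[0]
# ...versus the feature decoding HF normally applies on access.
decoded_row = dataset.features.decode_example(raw_row)

print(sorted(raw_row["audio"]))      # ['bytes', 'path']
print(sorted(decoded_row["audio"]))  # ['array', 'path', 'sampling_rate']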

Example 1: Audio

from datasets import load_dataset

dataset = load_dataset("PolyAI/minds14", "en-US", split="train[:10]")

print(dataset)
# Dataset({
#     features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
#     num_rows: 10
# })
print(dataset['audio'][0])
# {
# 'path': '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
# 'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,  0.        ,  0.        ]),
# 'sampling_rate': 8000
# }
#
# Users expect dataset['audio'] to be a dict with keys [`path`, `array`, `sampling_rate`]
print(dataset.data.schema)
# path: string
# audio: struct<bytes: binary, path: string>
#   child 0, bytes: binary
#   child 1, path: string
# transcription: string
# english_transcription: string
# intent_class: int64
# lang_id: int64
# -- schema metadata --
# huggingface: '{"info": {"features": {"path": {"dtype": "string", "_type":' + 615
#
# Underlying table has dict with keys `bytes`, `path`
hf_pa_ds = dataset.with_format("arrow")
print(hf_pa_ds["audio"][0])
# <pyarrow.StructScalar: [('bytes', None), ('path', '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav')]>
hf_df = dataset.with_format("pandas")
print(hf_df["audio"][0])
# {'path': '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', 'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
#         0.        ,  0.        ]), 'sampling_rate': 8000}

# HuggingFace implements most of these conversions via the dataset features
# (e.g. `dataset.features.decode_example`), which is applied every time you access
# an HF dataset and converts the underlying pyarrow row to the expected output format.

# Possible workaround: convert the HF dataset to pandas, then convert it to a Ray Dataset.
# Currently we only support in-memory HF datasets,
# not memory-mapped or streaming ones, so this would work for now.
import ray.data
ds1 = ray.data.from_huggingface(dataset)
ds2 = ray.data.from_arrow(hf_pa_ds)
ds3 = ray.data.from_pandas(hf_df)

print(ds1.take(limit=1)[0]['audio'])
# {'bytes': None, 'path': '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'}
print(ds2.take(limit=1)[0]['audio'])
# {'bytes': None, 'path': '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav'}
print(ds3.take(limit=1)[0]['audio'])
# {'path': '/mnt/shared_storage/rohan/huggingface/datasets/downloads/extracted/efdc32f0cf0171c560b244bfa7be6c76a7d7e26d8f0434d9122b20d881a479ff/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav', 'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
#        0.        ,  0.        ]), 'sampling_rate': 8000}
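
For completeness, the pandas workaround above can be wrapped in a small helper (a hypothetical `from_huggingface_decoded`, not an existing API; it assumes the dataset fits in memory, since the whole decoded table is materialized as a pandas DataFrame):

import ray.data

def from_huggingface_decoded(hf_dataset):
    # Route through pandas so HF feature decoding runs eagerly
    # before the Ray Dataset is created.
    hf_df = hf_dataset.with_format("pandas")[:]  # fully decoded DataFrame
    return ray.data.from_pandas(hf_df)

ds = from_huggingface_decoded(dataset)
# ds.take(limit=1)[0]['audio'] should now have keys 'path', 'array', 'sampling_rate'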

Example 2: Image

from datasets import load_dataset

dataset = load_dataset("frgfm/imagenette", '160px', split="validation")

print(dataset)
# Dataset({
#     features: ['image', 'label'],
#     num_rows: 3925
# })
print(dataset['image'][0])
# <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=164x160 at 0x7F4AE766E400>
print(dataset.data.schema)
# image: struct<bytes: binary, path: string>
#   child 0, bytes: binary
#   child 1, path: string
# label: int64
# -- schema metadata --
# huggingface: '{"info": {"features": {"image": {"_type": "Image"}, "label"' + 180
#
hf_pa_ds = dataset.with_format("arrow")
print(hf_pa_ds["image"][0])
hf_df = dataset.with_format("pandas")
print(hf_df["image"][0])

import ray.data
ds1 = ray.data.from_huggingface(dataset)
ds2 = ray.data.from_arrow(hf_pa_ds)
ds3 = ray.data.from_pandas(hf_df)

print(ds1.take(limit=1)[0]['image'])
print(ds2.take(limit=1)[0]['image'])
print(ds3.take(limit=1)[0]['image'])
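
As with audio, a quick by-hand decode sketch for the image dataset (again assuming `Features.decode_example`; note that decoded rows carry PIL images, which likely need converting, e.g. to numpy, before going into a Ray Dataset):

import numpy as np

raw_row = dataset.with_format("arrow")[0].to_pylist()[0]  # image is {'bytes', 'path'}
decoded_row = dataset.features.decode_example(raw_row)    # image is a PIL image
print(type(decoded_row["image"]))             # <class 'PIL.JpegImagePlugin.JpegImageFile'>
print(np.asarray(decoded_row["image"]).shape) # e.g. (160, 164, 3)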

Versions / Dependencies

Ray 2.5.0

Reproduction script

See above

Issue Severity

Medium: It is a significant difficulty but I can work around it.

amogkam commented 1 year ago

Can you try again on master?

Rohan138 commented 1 year ago

Sorry, should have clarified: this is on current master. Also, `ds2` in the example does the same thing as master.

Rohan138 commented 1 year ago

Actually, found a better solution:

ds2 = ray.data.from_arrow(hf_pa_ds).map(hf_pa_ds.features.decode_example)

This lazily applies the feature transforms used by HuggingFace to the Ray dataset.
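
A minimal end-to-end sketch of that approach on the audio example above (assuming Ray's row-based `Dataset.map`, which passes each row as a dict and so composes directly with `Features.decode_example`):

import ray.data
from datasets import load_dataset

dataset = load_dataset("PolyAI/minds14", "en-US", split="train[:10]")

# Build the Ray dataset from the raw table, then decode rows lazily as they are mapped.
ds = ray.data.from_huggingface(dataset).map(dataset.features.decode_example)

print(ds.take(limit=1)[0]["audio"].keys())
# expected: dict_keys(['path', 'array', 'sampling_rate'])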