Open Rohan138 opened 1 year ago
Can you try again on master?
Sorry, should have clarified: this is on current master. Also, `ds2` in the example does the same thing on master.
Actually, found a better solution:

```python
ds2 = ray.data.from_arrow(hf_pa_ds).map(hf_pa_ds.features.decode_example)
```

This lazily applies the feature transforms used by HuggingFace to the Ray dataset.
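The one-liner above can be sketched schematically without Ray or `datasets` installed. Everything below is an illustrative stand-in, not the real library APIs: `decode_example` mimics what `hf_pa_ds.features.decode_example` does per row, and the list comprehension plays the role of Ray's `.map`.

```python
def decode_example(example):
    # Stand-in for hf_pa_ds.features.decode_example: turns the raw
    # stored bytes into the decoded value HF users normally see.
    return {"audio": list(example["audio_bytes"]), "label": example["label"]}

# What the underlying pyarrow table actually stores: raw, undecoded rows.
raw_rows = [
    {"audio_bytes": b"\x00\x01", "label": 0},
    {"audio_bytes": b"\x02\x03", "label": 1},
]

# ray.data.from_arrow(hf_pa_ds).map(hf_pa_ds.features.decode_example)
# is, per row, equivalent to:
decoded_rows = [decode_example(r) for r in raw_rows]
print(decoded_rows[0]["audio"])  # [0, 1]
```

Mapping the decode function over the raw table is what recovers the view HuggingFace users see by default.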
What happened + What you expected to happen
cc: @amogkam
HuggingFace `datasets` seems to do postprocessing on top of their datasets which does not get copied over when we create a Ray dataset using `ray.data.from_huggingface`. This is because their postprocessing isn't part of the underlying pyarrow table, but of the dataset `features`. This results in unexpected behavior when trying to migrate HuggingFace batch inference code to Ray, because we instantiate the Ray Dataset using the underlying `pa.Table` object.

A possible hacky workaround (see example) is to instead convert the HF dataset to e.g. `pandas` or `numpy` first, then to a `ray.data.Dataset`; currently we only support in-memory HF datasets anyway. Otherwise we could maybe call `dataset.features.decode_batch` ourselves inside `ray.data.from_huggingface` or something.

This issue likely applies to most image and audio datasets on HF.
Example 1: Audio
Example 2: Image
Versions / Dependencies
Ray 2.5.0
Reproduction script
See above
Issue Severity
Medium: It is a significant difficulty but I can work around it.