huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Attempting to return a rank 3 grayscale image from dataset.map results in extreme slowdown #7134

Open navidmafi opened 2 months ago

navidmafi commented 2 months ago

Describe the bug

Background: Digital images are often represented as a (Height, Width, Channel) tensor, and Hugging Face datasets that contain images follow the same convention. These images are loaded as Pillow Image objects, which offer, for example, the .convert method.

I can convert an image from a (H,W,3) shape to a grayscale (H,W) image without problems. But when I return a (H,W,1) shaped array from a map function, the map never completes, and sometimes the OS kills the process for running out of memory (OOM).

I've tried several ways to expand a (H,W) shaped array to a (H,W,1) array. All of them resulted in extremely long map operations that consume a lot of CPU and RAM.
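For reference, the shape manipulation itself can be sketched in plain NumPy, without datasets or Pillow. This is a minimal illustration, not the code from the report: the array `rgb` is a hypothetical stand-in for a decoded image, and the weights are the ITU-R 601-2 luma coefficients that Pillow documents for convert("L"), used here only as an approximation.

```python
import numpy as np

# Hypothetical stand-in for a decoded RGB image: shape (H, W, 3), dtype uint8.
rgb = np.zeros((4, 6, 3), dtype=np.uint8)
rgb[..., 0] = 255  # pure red

# Approximate grayscale conversion with the ITU-R 601-2 luma weights
# (assumption: close to, but not guaranteed identical to, Pillow's "L" mode).
weights = np.array([0.299, 0.587, 0.114])
gray = np.rint(rgb @ weights).astype(np.uint8)  # shape (H, W)

# Three equivalent ways to add a trailing channel axis: (H, W) -> (H, W, 1).
a = np.expand_dims(gray, axis=-1)
b = gray[..., np.newaxis]
c = gray.reshape(*gray.shape, 1)

print(gray.shape, a.shape, b.shape, c.shape)
```

In NumPy, np.expand_dims returns a view of the same buffer rather than a copy, so the expansion itself is unlikely to be the expensive step. That would point at whatever happens to the (H,W,1) array after the function returns, though this sketch does not confirm that.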

Steps to reproduce the bug

Below is a minimal example that uses two methods to get the desired output. Neither of them works.

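import datasets
import numpy as np
import tensorflow as tf  # only needed for the second method

# Method 1: convert with Pillow, then add a trailing channel axis with NumPy
ds = datasets.load_dataset("project-sloth/captcha-images")
to_gray_pillow = lambda sample: {'image': np.expand_dims(sample['image'].convert("L"), axis=-1)}
ds_gray = ds.map(to_gray_pillow)  # never completes, or runs out of memory

# Method 2: load the images as TensorFlow tensors and convert with tf.image
ds = datasets.load_dataset("project-sloth/captcha-images").with_format("tensorflow")
# Note: tf.image.rgb_to_grayscale already returns (H, W, 1),
# so the extra tf.expand_dims here yields (H, W, 1, 1)
to_gray_tf = lambda sample: {'image': tf.expand_dims(tf.image.rgb_to_grayscale(sample['image']), axis=-1)}
ds_gray = ds.map(to_gray_tf)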

Expected behavior

I expect the map operation to complete and return a new dataset containing grayscale images in a (H,W,1) shape.

Environment info

datasets: 2.21.0
Python: tested with both 3.11 and 3.12
Host OS: Linux