ArrowInvalid: Could not convert <PIL.Image.Image image mode=RGB when adding image to Dataset

NielsRogge commented 2 years ago

Describe the bug

When adding a Pillow image to an existing Dataset on the hub, add_item fails due to the Pillow image not being automatically converted into the Image feature.

Steps to reproduce the bug

from datasets import load_dataset
from PIL import Image

dataset = load_dataset("hf-internal-testing/example-documents")

# load any random Pillow image
image = Image.open("/content/cord_example.png").convert("RGB")

new_image = {'image': image}
dataset['test'] = dataset['test'].add_item(new_image)

Expected results

The image should be automatically cast to the Image feature when using add_item. For now, this can be worked around by using encode_example:

import datasets

feature = datasets.Image(decode=False)
new_image = {'image': feature.encode_example(image)}
dataset['test'] = dataset['test'].add_item(new_image)
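For reference, what encode_example hands back for an in-memory Pillow image appears to be a plain dict holding the encoded bytes and a path, which Arrow can infer a type for. Below is a minimal sketch of building that dict by hand. Note the assumptions: image_to_storage_dict is a hypothetical helper name (it is not part of datasets), PNG is an arbitrary choice of format, and the "bytes"/"path" layout is my reading of how the Image feature stores in-memory images.

```python
import io


def image_to_storage_dict(image, fmt="PNG"):
    """Serialize a Pillow-style image into a {"bytes", "path"} dict.

    Hypothetical helper: it only relies on `image.save(file_obj, format=...)`,
    which Pillow images provide. The dict layout is an assumption about how
    the Image feature stores in-memory images.
    """
    buffer = io.BytesIO()
    image.save(buffer, format=fmt)
    return {"bytes": buffer.getvalue(), "path": None}
```

Passing {'image': image_to_storage_dict(image)} to add_item should then behave like the encode_example workaround, though I have not checked this against every datasets version.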

Actual results

ArrowInvalid: Could not convert <PIL.Image.Image image mode=RGB size=576x864 at 0x7F7CCC4589D0> with type Image: did not recognize Python value type when inferring an Arrow data type

NielsRogge commented 2 years ago

@mariosasko I'm getting a similar issue when creating a Dataset from a Pandas dataframe, like so:

from datasets import Dataset, Features, Image, Value
import pandas as pd
import requests
import PIL.Image  # a bare `import PIL` does not guarantee the Image submodule is loaded

# we need to define the features ourselves
features = Features({
    'a': Value(dtype='int32'),
    'b': Image(),
})

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = PIL.Image.open(requests.get(url, stream=True).raw)

df = pd.DataFrame({"a": [1, 2], 
                   "b": [image, image]})

dataset = Dataset.from_pandas(df, features=features)

results in

ArrowInvalid: ('Could not convert <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x7F7991A15C10> with type JpegImageFile: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column b with type object')

Will the PR linked above also fix that?

mariosasko commented 2 years ago

I would expect this to work, but it doesn't. Shouldn't be too hard to fix tho (in a subsequent PR).

darraghdog commented 2 years ago

Hi @mariosasko, just wanted to check whether there is a PR to follow for this. I was looking to create a demo app using this. If it's not working, I can just use byte-encoded images in the dataset, which are not displayed.

mariosasko commented 2 years ago

Hi @darraghdog! No PR yet, but I plan to fix this before the next release.

stas00 commented 2 years ago

I was just pointed here by @mariosasko. Meanwhile, I found a workaround using encode_example, like so:

from datasets import load_from_disk, Dataset
DATASET_PATH = "/hf/m4-master/data/cm4/cm4-10000-v0.1"
ds1 = load_from_disk(DATASET_PATH)
ds2 = Dataset.from_dict(
    mapping={k: [] for k in ds1[99].keys()},
    features=ds1.features,
)
for i in range(2):
    # could add several representative items here
    row = ds1[99]
    row_encoded = ds2.features.encode_example(row)
    ds2 = ds2.add_item(row_encoded)

stas00 commented 2 years ago

Hmm, interesting. If I create the dataset on the fly:

from datasets import load_from_disk, Dataset
DATASET_PATH = "/hf/m4-master/data/cm4/cm4-10000-v0.1"
ds1 = load_from_disk(DATASET_PATH)
ds2 = Dataset.from_dict(mapping={k: [v]*2 for k, v in ds1[99].items()},
                        features=ds1.features)

it doesn't fail with the error in the OP, as from_dict performs encode_batch.
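To make the difference concrete: encoding a batch conceptually means running each column's values through that column's encoder before anything reaches Arrow. The sketch below is only an illustration of that idea, not the datasets implementation; encode_batch_sketch and the toy encoder are hypothetical names, and the "bytes"/"path" dict is an assumed storage layout.

```python
def encode_batch_sketch(batch, encoders):
    """Apply a per-column encoder to every value of a column-oriented batch.

    Conceptual sketch only. `encoders` maps a column name to a callable;
    columns without an encoder are copied through unchanged.
    """
    return {
        column: [encoders[column](value) for value in values]
        if column in encoders
        else list(values)
        for column, values in batch.items()
    }


# Toy example: "images" values become storage dicts, "texts" are kept as-is.
batch = {"texts": ["a", "b"], "images": [b"\x89PNG...", b"\x89PNG..."]}
encoded = encode_batch_sketch(
    batch, {"images": lambda raw: {"bytes": raw, "path": None}}
)
```

add_item, by contrast, seems to skip this step, which would explain why it needs the item to be encoded by hand first.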

However if I try to use this dataset it fails now with:

Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 524, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/fingerprint.py", line 480, in wrapper
    out = func(self, *args, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2775, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2655, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2347, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "debug_leak2.py", line 235, in split_pack_and_pad
    images.append(image_transform(image.convert("RGB")))
AttributeError: 'dict' object has no attribute 'convert'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "debug_leak2.py", line 418, in <module>
    train_loader, val_loader = get_dataloaders()
  File "debug_leak2.py", line 348, in get_dataloaders
    dataset = dataset.map(mapper, batch_size=32, batched=True, remove_columns=dataset.column_names, num_proc=4)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2500, in map
    transformed_shards[index] = async_result.get()
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
    raise self._value
AttributeError: 'dict' object has no attribute 'convert'

but if I create that same dataset one item at a time as in the previous comment's code snippet it doesn't fail.

The features of this dataset are set to:

{'texts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 
'images': Sequence(feature=Image(decode=True, id=None), length=-1, id=None)}
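Until the decoding is sorted out, one way to make the map function tolerant is to accept both forms. This is a minimal sketch, assuming the undecoded value is a dict with "bytes" and/or "path" keys (which is what the AttributeError above suggests); to_pil is a hypothetical helper name.

```python
import io


def to_pil(value):
    """Return a Pillow image whether `value` is already one or an encoded dict.

    Assumes an undecoded image arrives as a dict with "bytes" and/or "path"
    keys; anything that is not a dict is returned unchanged.
    """
    if not isinstance(value, dict):
        return value  # already decoded (or not an image dict at all)

    # Imported lazily so Pillow is only needed when decoding is required.
    from PIL import Image

    if value.get("bytes") is not None:
        return Image.open(io.BytesIO(value["bytes"]))
    return Image.open(value["path"])
```

In split_pack_and_pad the failing line would then become images.append(image_transform(to_pil(image).convert("RGB"))).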

MaxxTr commented 1 year ago

It looks like the problem still exists. Any news ? Any good workaround ?

Thank you

MaxxTr commented 1 year ago

There is a workaround: create a Python loading script and upload the dataset to the Hugging Face Hub.

Here is an example of how to do that:

https://huggingface.co/datasets/jamescalam/image-text-demo/tree/main

Here are videos with explanations:

https://www.youtube.com/watch?v=lqK4ocAKveE and https://www.youtube.com/watch?v=ODdKC30dT8c

NielsRogge commented 1 year ago

cc @mariosasko gentle ping for a fix :)

chumpblocckami commented 1 year ago

Any update on this? I'm still facing this issue. Any workaround?

umarpreet1 commented 1 year ago

I was facing the same issue. Downgrading datasets from 2.11.0 to 2.4.0 solved the issue.

chumpblocckami commented 1 year ago

I was able to resolve my issue with a quick workaround:

from collections import defaultdict

from datasets import Dataset
from tqdm import tqdm

# `dataloader` is assumed to be an indexable collection of images defined elsewhere
data = defaultdict(list)
for idx in tqdm(range(len(dataloader)), desc="Captioning..."):
    img = dataloader[idx]
    data['image'].append(img)
    data['text'].append(f"img_{idx}")

dataset = Dataset.from_dict(data)
dataset = dataset.filter(lambda example: example['image'] is not None)
dataset = dataset.filter(lambda example: example['text'] is not None)

# replace "path-to-repo" with your own Hub repository id
dataset.push_to_hub("path-to-repo", private=False)

Hope it helps! Happy coding

thinh-huynh-re commented 1 year ago

It works!!

LanceGao97 commented 1 year ago

How does this work? How do I use this script, and where should I paste it?

Ekhao commented 5 months ago

I had a similar issue to @NielsRogge where I was unable to create a dataset from a Pandas DataFrame containing PIL.Images.

I found another workaround that works in this case which involves converting the DataFrame to a python dictionary, and then creating a dataset from said python dictionary.

This is a generic example of my workaround. The example assumes that you have your data in a Pandas DataFrame variable called "dataframe" plus a dictionary of your data's features in a variable called "features".

import datasets

dictionary = dataframe.to_dict(orient='list')
dataset = datasets.Dataset.from_dict(dictionary, features=features)
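The shape that matters here is column-oriented: a mapping from column name to a list of values, which is what to_dict(orient='list') produces. If your data starts out as a list of row dicts rather than a DataFrame, a small helper can build the same shape. This is just a sketch; rows_to_columns is a hypothetical name, the string values stand in for real images, and it assumes every row has the same keys as the first one.

```python
def rows_to_columns(rows):
    """Turn a list of row dicts into a {column: [values...]} mapping.

    Hypothetical helper; assumes every row has the same keys as the first row.
    """
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


rows = [{"a": 1, "b": "img-0"}, {"a": 2, "b": "img-1"}]
columns = rows_to_columns(rows)
# columns == {"a": [1, 2], "b": ["img-0", "img-1"]}
```

The resulting mapping can be passed to datasets.Dataset.from_dict in place of the dictionary variable above.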

NielsRogge commented 5 months ago

cc @mariosasko this issue has been open for 2 years, would be great to resolve it :)

tanyav2 commented 5 months ago

I have the same issue. My current workaround is saving the DataFrame to a CSV and then loading the dataset from the CSV. Would also appreciate a fix :)
