Open NielsRogge opened 2 years ago
@mariosasko I'm getting a similar issue when creating a Dataset from a Pandas dataframe, like so:
from datasets import Dataset, Features, Image, Value
import pandas as pd
import requests
import PIL.Image

# we need to define the features ourselves
features = Features({
    'a': Value(dtype='int32'),
    'b': Image(),
})

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = PIL.Image.open(requests.get(url, stream=True).raw)

df = pd.DataFrame({"a": [1, 2],
                   "b": [image, image]})

dataset = Dataset.from_pandas(df, features=features)
results in
ArrowInvalid: ('Could not convert <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x7F7991A15C10> with type JpegImageFile: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column b with type object')
Will the PR linked above also fix that?
I would expect this to work, but it doesn't. Shouldn't be too hard to fix tho (in a subsequent PR).
Hi @mariosasko just wanted to check in if there is a PR to follow for this. I was looking to create a demo app using this. If it's not working I can just use byte encoded images in the dataset which are not displayed.
Hi @darraghdog! No PR yet, but I plan to fix this before the next release.
I was just pointed here by @mariosasko. Meanwhile, I found a workaround using encode_example, like so:
from datasets import load_from_disk, Dataset

DATASET_PATH = "/hf/m4-master/data/cm4/cm4-10000-v0.1"
ds1 = load_from_disk(DATASET_PATH)
ds2 = Dataset.from_dict(mapping={k: [] for k in ds1[99].keys()},
                        features=ds1.features)

for i in range(2):
    # could add several representative items here
    row = ds1[99]
    row_encoded = ds2.features.encode_example(row)
    ds2 = ds2.add_item(row_encoded)
Hmm, interesting. If I create the dataset on the fly:
from datasets import load_from_disk, Dataset

DATASET_PATH = "/hf/m4-master/data/cm4/cm4-10000-v0.1"
ds1 = load_from_disk(DATASET_PATH)
ds2 = Dataset.from_dict(mapping={k: [v]*2 for k, v in ds1[99].items()},
                        features=ds1.features)

it doesn't fail with the error in the OP, as from_dict performs encode_batch.
However if I try to use this dataset it fails now with:
Traceback (most recent call last):
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 524, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/fingerprint.py", line 480, in wrapper
out = func(self, *args, **kwargs)
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2775, in _map_single
batch = apply_function_on_filtered_inputs(
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2655, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2347, in decorated
result = f(decorated_item, *args, **kwargs)
File "debug_leak2.py", line 235, in split_pack_and_pad
images.append(image_transform(image.convert("RGB")))
AttributeError: 'dict' object has no attribute 'convert'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "debug_leak2.py", line 418, in <module>
train_loader, val_loader = get_dataloaders()
File "debug_leak2.py", line 348, in get_dataloaders
dataset = dataset.map(mapper, batch_size=32, batched=True, remove_columns=dataset.column_names, num_proc=4)
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 2500, in map
transformed_shards[index] = async_result.get()
File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/multiprocess/pool.py", line 771, in get
raise self._value
AttributeError: 'dict' object has no attribute 'convert'
but if I create that same dataset one item at a time as in the previous comment's code snippet it doesn't fail.
The features of this dataset are set to:
{'texts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
'images': Sequence(feature=Image(decode=True, id=None), length=-1, id=None)}
It looks like the problem still exists. Any news? Any good workaround?
Thank you
There is a workaround: create a loader Python script and upload the dataset to Hugging Face.
Here is an example of how to do that:
https://huggingface.co/datasets/jamescalam/image-text-demo/tree/main
and here are videos with explanations:
https://www.youtube.com/watch?v=lqK4ocAKveE and https://www.youtube.com/watch?v=ODdKC30dT8c
cc @mariosasko gentle ping for a fix :)
Any update on this? I'm still facing this issue. Any workaround?
I was facing the same issue. Downgrading datasets from 2.11.0 to 2.4.0 solved the issue.
I was able to resolve my issue with a quick workaround:
from collections import defaultdict

from datasets import Dataset
from tqdm import tqdm

# `dataloader` is assumed to be an indexable collection of images
data = defaultdict(list)
for idx in tqdm(range(len(dataloader)), desc="Captioning..."):
    img = dataloader[idx]
    data['image'].append(img)
    data['text'].append(f"img_{idx}")

dataset = Dataset.from_dict(data)
dataset = dataset.filter(lambda example: example['image'] is not None)
dataset = dataset.filter(lambda example: example['text'] is not None)
dataset.push_to_hub('path-to-repo', private=False)
Hope it helps! Happy coding
It works!!
How did this work? How do I use this script, and where do I paste it?
I had a similar issue to @NielsRogge where I was unable to create a dataset from a Pandas DataFrame containing PIL.Images.
I found another workaround that works in this case which involves converting the DataFrame to a python dictionary, and then creating a dataset from said python dictionary.
This is a generic example of my workaround. The example assumes that you have your data in a Pandas DataFrame variable called "dataframe" plus a dictionary of your data's features in a variable called "features".
import datasets
dictionary = dataframe.to_dict(orient='list')
dataset = datasets.Dataset.from_dict(dictionary, features=features)
cc @mariosasko this issue has been open for 2 years, would be great to resolve it :)
I have the same issue. My current workaround is saving the dataframe to a CSV and then loading the dataset from the CSV. Would also appreciate a fix :)
Describe the bug
When adding a Pillow image to an existing Dataset on the hub, add_item fails due to the Pillow image not being automatically converted into the Image feature.
Steps to reproduce the bug
Expected results
The image should be automatically cast to the Image feature when using add_item. For now, this can be fixed by using encode_example:
Actual results