When I try to filter my custom dataset, the process hangs regardless of the lambda function used. The problem appears to be related to how the Images are handled. The dataset in question is a preprocessed version of https://huggingface.co/datasets/danaaubakirova/patfig; notably, I have converted the data to the Parquet format.
Eventually I press Ctrl+C and obtain this stack trace:
>>> ds_filtered = ds.filter(lambda row: row['cpc_class'] != 'Y')
Filter: 0%| | 0/998 [00:00<?, ? examples/s]
Filter: 0%| | 0/998 [00:35<?, ? examples/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/fingerprint.py", line 482, in wrapper
out = func(dataset, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3714, in filter
indices = self.map(
^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3161, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3552, in _map_single
batch = apply_function_on_filtered_inputs(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3421, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 6478, in get_indices_from_mask_function
num_examples = len(batch[next(iter(batch.keys()))])
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 273, in __getitem__
value = self.format(key)
^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 376, in format
return self.formatter.format_column(self.pa_table.select([key]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 443, in format_column
column = self.python_features_decoder.decode_column(column, pa_table.column_names[0])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 219, in decode_column
return self.features.decode_column(column, column_name) if self.features else column
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/features.py", line 2008, in decode_column
[decode_nested_example(self[column_name], value) if value is not None else None for value in column]
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/features.py", line 2008, in <listcomp>
[decode_nested_example(self[column_name], value) if value is not None else None for value in column]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/features.py", line 1351, in decode_nested_example
return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/image.py", line 188, in decode_example
image.load() # to avoid "Too many open files" errors
^^^^^^^^^^^^
File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/PIL/ImageFile.py", line 293, in load
n, err_code = decoder.decode(b)
^^^^^^^^^^^^^^^^^
KeyboardInterrupt
Warning! Running this has even appeared to crash some computers.
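The last frames of the trace suggest why the choice of lambda makes no difference: the row count is taken from the first column of the batch (`len(batch[next(iter(batch.keys()))])`), and reading that column ends in `image.load()`, so the images are decoded before the lambda is ever called. The following is a toy model of that pattern, not the library's code; `ToyLazyBatch` and `decode_image` are hypothetical stand-ins, and the sample values are made up.

```python
# Toy model only: this is NOT the datasets source code. It mimics the pattern
# visible in the trace, where reading a column decodes every value in it.


class ToyLazyBatch:
    """Dict-like batch that decodes a column the first time it is read."""

    def __init__(self, raw_columns, decoders):
        self._raw = raw_columns    # column name -> list of stored values
        self._decoders = decoders  # column name -> decode function
        self._cache = {}           # columns that were already decoded

    def keys(self):
        return self._raw.keys()

    def __getitem__(self, key):
        if key not in self._cache:
            decode = self._decoders.get(key, lambda value: value)
            self._cache[key] = [decode(value) for value in self._raw[key]]
        return self._cache[key]


decoded = []  # records every simulated image decode


def decode_image(raw):
    """Stand-in for the expensive image decoding step."""
    decoded.append(raw)
    return f"<decoded {raw}>"


batch = ToyLazyBatch(
    {"image": ["img0", "img1", "img2"], "cpc_class": ["A", "Y", "B"]},
    {"image": decode_image},
)

# Same expression as in the trace: the row count comes from the first column,
# which in this toy batch is the image column.
num_examples = len(batch[next(iter(batch.keys()))])

# The predicate only needs cpc_class, but all images were already decoded.
mask = [batch["cpc_class"][i] != "Y" for i in range(num_examples)]
```

In this sketch every image is decoded just to count the rows, which would be consistent with the progress bar staying at 0/998 if the real images are large.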
Expected behavior
The call should return the filtered dataset.
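A possible workaround, not verified against this dataset: `Dataset.filter` accepts an `input_columns` argument, and when it is set the function receives only those columns as positional arguments, so the image column should not need to be decoded. The helper name `filter_on_column` below is hypothetical, and `ds` is assumed to be an already loaded `datasets.Dataset`.

```python
def filter_on_column(ds, column, predicate):
    """Filter `ds`, passing only the values of `column` to `predicate`.

    `ds` is assumed to be a `datasets.Dataset`. With `input_columns` set,
    `predicate` receives the column value itself rather than the row dict.
    """
    return ds.filter(predicate, input_columns=[column])


# Usage with the dataset from this report (assumes `ds` is already loaded):
# ds_filtered = filter_on_column(ds, "cpc_class", lambda cpc_class: cpc_class != "Y")
```

Note that the lambda changes shape with this approach: it takes the `cpc_class` value directly instead of indexing into `row`.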
Environment info
datasets version: 2.20.0
huggingface_hub version: 0.24.0
fsspec version: 2024.5.0