huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.31k stars, 2.7k forks

Filter hangs #7071

Open lucienwalewski opened 4 months ago

lucienwalewski commented 4 months ago

Describe the bug

When I try to filter my custom dataset, the process hangs regardless of the lambda function used. The problem appears to be related to how the image column is handled. The dataset in question is a preprocessed version of https://huggingface.co/datasets/danaaubakirova/patfig; notably, I have converted the data to the Parquet format.

Steps to reproduce the bug

from datasets import load_dataset
ds = load_dataset('lcolonn/patfig', split='test')
ds_filtered = ds.filter(lambda row: row['cpc_class'] != 'Y')

Eventually I press Ctrl+C and get this stack trace:

>>> ds_filtered = ds.filter(lambda row: row['cpc_class'] != 'Y')
Filter:   0%|                                                                | 0/998 [00:00<?, ? examples/s]
Filter:   0%|                                                                | 0/998 [00:35<?, ? examples/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/fingerprint.py", line 482, in wrapper
    out = func(dataset, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3714, in filter
    indices = self.map(
              ^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3161, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3552, in _map_single
    batch = apply_function_on_filtered_inputs(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3421, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 6478, in get_indices_from_mask_function
    num_examples = len(batch[next(iter(batch.keys()))])
                       ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 273, in __getitem__
    value = self.format(key)
            ^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 376, in format
    return self.formatter.format_column(self.pa_table.select([key]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 443, in format_column
    column = self.python_features_decoder.decode_column(column, pa_table.column_names[0])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 219, in decode_column
    return self.features.decode_column(column, column_name) if self.features else column
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/features.py", line 2008, in decode_column
    [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/features.py", line 2008, in <listcomp>
    [decode_nested_example(self[column_name], value) if value is not None else None for value in column]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/features.py", line 1351, in decode_nested_example
    return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/datasets/features/image.py", line 188, in decode_example
    image.load()  # to avoid "Too many open files" errors
    ^^^^^^^^^^^^
  File "/home/l-walewski/miniconda3/envs/patentqa/lib/python3.11/site-packages/PIL/ImageFile.py", line 293, in load
    n, err_code = decoder.decode(b)
                  ^^^^^^^^^^^^^^^^^
KeyboardInterrupt
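
Reading the trace, the time is spent inside `get_indices_from_mask_function`, which takes the length of the first column of the batch; that access goes through the formatter and ends up decoding every image in the batch before the lambda is even called. Below is a minimal pure-Python sketch of that mechanism. `LazyBatch` and `slow_image_decode` are hypothetical stand-ins written for illustration, not the `datasets` implementation, and the column order is an assumption based on the trace.

```python
# Hypothetical stand-in for a lazily formatted batch; NOT the datasets code.
class LazyBatch:
    def __init__(self, raw_columns, decoders):
        self.raw_columns = raw_columns            # column name -> raw values
        self.decoders = decoders                  # column name -> decode function
        self.decode_calls = {name: 0 for name in raw_columns}

    def keys(self):
        return self.raw_columns.keys()

    def __getitem__(self, key):
        # Accessing a column decodes every value in it, as in the trace
        # (format -> format_column -> decode_column -> decode_example).
        values = self.raw_columns[key]
        decode = self.decoders.get(key)
        if decode is None:
            return list(values)
        decoded = []
        for value in values:
            self.decode_calls[key] += 1
            decoded.append(decode(value))
        return decoded


def slow_image_decode(raw):
    # Placeholder for an expensive image.load(); here it only upper-cases.
    return raw.upper()


batch = LazyBatch(
    {"image": ["img0", "img1", "img2"], "cpc_class": ["Y", "A", "B"]},
    {"image": slow_image_decode},
)

# Same expression as the line shown in the traceback.
num_examples = len(batch[next(iter(batch.keys()))])

# Every image was decoded just to count rows; cpc_class was never touched.
print(num_examples, batch.decode_calls)
```

If this is what is happening here, passing `input_columns=['cpc_class']` to `filter` (so that the lambda receives the column value instead of the whole row) might avoid touching the image column, but I have not verified that on this dataset.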

Warning: running this even seems to crash some computers.

Expected behavior

The call should return the filtered dataset.

Environment info