huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.31k stars 2.7k forks

Custom Dataset | Still Raise Error while handling errors in _generate_examples #7061

Open hahmad2008 opened 4 months ago

hahmad2008 commented 4 months ago

Describe the bug

I followed this example to handle errors in a custom dataset. I am writing a dataset script that reads JSONL files, and I need to handle errors and continue reading the remaining files without raising an exception and exiting.

def _generate_examples(self, filepaths):
    errors = []  # exceptions collected from files that failed

    id_ = 0
    for filepath in filepaths:
        try:
            with open(filepath, 'r') as f:
                for line in f:
                    json_obj = json.loads(line)

                    yield id_, json_obj
                    id_ += 1
        except Exception as exc:
            # Any failure abandons the rest of this file and moves on to the next one.
            logger.error(f"error occur at filepath: {filepath}")
            errors.append(exc)

It seems that the logger.error message is printed, but an exception is still raised and the run exits.
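For comparison, here is a minimal sketch of per-line handling, so that one bad line does not discard the rest of its file. It is a standalone function rather than a builder method, and the name `iter_jsonl_records` and the caller-supplied `errors` list are assumptions made for illustration, not part of the `datasets` API.

```python
import json
import logging

logger = logging.getLogger(__name__)


def iter_jsonl_records(filepaths, errors):
    """Yield (id, record) pairs, skipping lines that fail to parse.

    `errors` is a caller-supplied list that collects
    (filepath, line_number_or_None, exception) tuples.
    """
    id_ = 0
    for filepath in filepaths:
        try:
            with open(filepath, "r", encoding="utf-8") as f:
                for line_no, line in enumerate(f, start=1):
                    if not line.strip():
                        continue  # ignore blank lines
                    try:
                        record = json.loads(line)
                    except json.JSONDecodeError as exc:
                        # Only this line is skipped; the file keeps being read.
                        logger.error("bad JSON at %s:%d: %s", filepath, line_no, exc)
                        errors.append((filepath, line_no, exc))
                        continue
                    yield id_, record
                    id_ += 1
        except (OSError, UnicodeDecodeError) as exc:
            # The file could not be opened or decoded; move on to the next one.
            logger.error("cannot read %s: %s", filepath, exc)
            errors.append((filepath, None, exc))
```

Inside a dataset script, the body of `_generate_examples` could delegate to a helper like this with `yield from`. Whether that is enough to avoid the failure shown below depends on every split still producing at least one example.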

Downloading and preparing dataset custom_dataset/default to /home/myuser/.cache/huggingface/datasets/custom_dataset/default-a14cdd566afee0a6/1.0.0/acfcc9fb9c57034b580c4252841
ERROR: datasets_modules.datasets.custom_dataset.acfcc9fb9c57034b580c4252841bb890a5617cbd28678dd4be5e52b81188ad02.custom_dataset: 2024-07-22 10:47:42,167: error occur at filepath: '/home/myuser/ds/corrupted-file.jsonl
Traceback (most recent call last):
  File "/home/myuser/.cache/huggingface/modules/datasets_modules/datasets/custom_dataset/ac..2/custom_dataset.py", line 48, in _generate_examples
    json_obj = json.loads(line)
  File "myenv/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "myenv/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "myenv/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 4 (char 3)
Generating train split: 0 examples [00:06, ? examples/s]>
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "myenv/lib/python3.8/site-packages/datasets/builder.py", line 1637, in _prepare_split_single
    num_examples, num_bytes = writer.finalize()
  File "myenv/lib/python3.8/site-packages/datasets/arrow_writer.py", line 594, in finalize
    raise SchemaInferenceError("Please pass `features` or at least one example when writing data")
datasets.arrow_writer.SchemaInferenceError: Please pass `features` or at least one example when writing data

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "myenv/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "myenv/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1353, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "myenv/lib/python3.8/site-packages/datasets/builder.py", line 1646, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
"""

The above exception was the direct cause of the following exception:

│                                                                                                  │
│ myenv/lib/python3.8/site-packages/datasets/utils/py_utils. │
│ py:1377 in <listcomp>                                                                            │
│                                                                                                  │
│   1374 │   │   │   │   if all(async_result.ready() for async_result in async_results) and queue  │
│   1375 │   │   │   │   │   break                                                                 │
│   1376 │   │   # we get the result in case there's an error to raise                             │
│ ❱ 1377 │   │   [async_result.get() for async_result in async_results]                            │
│   1378                                                                                           │
│                                                                                                  │
│ ╭──────────────────────────────── locals ─────────────────────────────────╮                      │
│ │           .0 = <list_iterator object at 0x7f2cc1f0ce20>                 │                      │
│ │ async_result = <multiprocess.pool.ApplyResult object at 0x7f2cc1f79c10> │                      │
│ ╰─────────────────────────────────────────────────────────────────────────╯                      │
│                                                                                                  │
│ myenv/lib/python3.8/site-packages/multiprocess/pool.py:771 │
│ in get                                                                                           │
│                                                                                                  │
│   768 │   │   if self._success:                                                                  │
│   769 │   │   │   return self._value                                                             │
│   770 │   │   else:                                                                              │
│ ❱ 771 │   │   │   raise self._value                                                              │
│   772 │                                                                                          │
│   773 │   def _set(self, i, obj):                                                                │
│   774 │   │   self._success, self._value = obj                                                   │
│                                                                                                  │
│ ╭────────────────────────────── locals ──────────────────────────────╮                           │
│ │    self = <multiprocess.pool.ApplyResult object at 0x7f2cc1f79c10> │                           │
│ │ timeout = None                                                     │                           │
│ ╰────────────────────────────────────────────────────────────────────╯                           │

DatasetGenerationError: An error occurred while generating the dataset
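Reading the traceback above, the run appears to stop in `writer.finalize()` with `SchemaInferenceError` after "0 examples" were generated, rather than in `json.loads` itself: the exception is caught and logged, but nothing is ever yielded. A small guard can make that condition explicit. This is only a sketch; `require_examples` is a hypothetical helper, not something `datasets` provides.

```python
def require_examples(examples, source="dataset"):
    """Re-yield every item of `examples`; raise ValueError if there were none."""
    count = 0
    for example in examples:
        count += 1
        yield example
    if count == 0:
        # Raised only after the input is exhausted without a single example.
        raise ValueError(f"no valid examples were produced for {source}")
```

Wrapping the example generator this way does not make the run succeed, but it reports "nothing valid was read" directly instead of a schema inference failure.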

Steps to reproduce the bug

Same as above.
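The contents of the corrupted file are not shown in this report. As an assumption, a raw control character inside a JSON string produces the same kind of `JSONDecodeError` as in the log, so a standalone reproduction of the parsing failure (without `datasets`) could look like the sketch below. The helper names are hypothetical.

```python
import json
import os
import tempfile


def write_corrupted_jsonl(directory):
    """Write a one-line JSONL file containing a raw control character."""
    path = os.path.join(directory, "corrupted-file.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        # \x01 inside a JSON string is rejected by the default (strict) decoder.
        f.write('{"a\x01": 1}\n')
    return path


def first_decode_error(path):
    """Return the first json.JSONDecodeError hit while reading `path`, or None."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                return exc
    return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(first_decode_error(write_corrupted_jsonl(tmp)))
```

Pointing the dataset script at a directory that contains only such a file should then exercise the zero-examples path seen in the traceback.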

Expected behavior

The script should handle the error and continue reading the remaining files.

Environment info

Python 3.9