I follow this example to handle errors in custom dataset. I am writing a dataset script which read jsonl files and i need to handle errors and continue reading files without raising exception and exit the execution.
def _generate_examples(self, filepaths):
errors=[]
id_ = 0
for filepath in filepaths:
try:
with open(filepath, 'r') as f:
for line in f:
json_obj = json.loads(line)
yield id_, json_obj
id_ += 1
except Exception as exc:
logger.error(f"error occur at filepath: {filepath}")
errors.append(error)
seems the logger.error is printed but still exception is raised the the run is exit.
Downloading and preparing dataset custom_dataset/default to /home/myuser/.cache/huggingface/datasets/custom_dataset/default-a14cdd566afee0a6/1.0.0/acfcc9fb9c57034b580c4252841
ERROR: datasets_modules.datasets.custom_dataset.acfcc9fb9c57034b580c4252841bb890a5617cbd28678dd4be5e52b81188ad02.custom_dataset: 2024-07-22 10:47:42,167: error occur at filepath: '/home/myuser/ds/corrupted-file.jsonl
Traceback (most recent call last):
File "/home/myuser/.cache/huggingface/modules/datasets_modules/datasets/custom_dataset/ac..2/custom_dataset.py", line 48, in _generate_examples
json_obj = json.loads(line)
File "myenv/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "myenv/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "myenv/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 4 (char 3)
Generating train split: 0 examples [00:06, ? examples/s]>
RemoteTraceback:
"""
Traceback (most recent call last):
File "myenv/lib/python3.8/site-packages/datasets/builder.py", line 1637, in _prepare_split_single
num_examples, num_bytes = writer.finalize()
File "myenv/lib/python3.8/site-packages/datasets/arrow_writer.py", line 594, in finalize
raise SchemaInferenceError("Please pass `features` or at least one example when writing data")
datasets.arrow_writer.SchemaInferenceError: Please pass `features` or at least one example when writing data
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "myenv/lib/python3.8/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "myenv/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 1353, in
_write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "myenv/lib/python3.8/site-packages/datasets/builder.py", line 1646, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
"""
The above exception was the direct cause of the following exception:
โ โ
โ myenv/lib/python3.8/site-packages/datasets/utils/py_utils. โ
โ py:1377 in <listcomp> โ
โ โ
โ 1374 โ โ โ โ if all(async_result.ready() for async_result in async_results) and queue โ
โ 1375 โ โ โ โ โ break โ
โ 1376 โ โ # we get the result in case there's an error to raise โ
โ โฑ 1377 โ โ [async_result.get() for async_result in async_results] โ
โ 1378 โ
โ โ
โ โญโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ locals โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ โ
โ โ .0 = <list_iterator object at 0x7f2cc1f0ce20> โ โ
โ โ async_result = <multiprocess.pool.ApplyResult object at 0x7f2cc1f79c10> โ โ
โ โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ โ
โ โ
โ myenv/lib/python3.8/site-packages/multiprocess/pool.py:771 โ
โ in get โ
โ โ
โ 768 โ โ if self._success: โ
โ 769 โ โ โ return self._value โ
โ 770 โ โ else: โ
โ โฑ 771 โ โ โ raise self._value โ
โ 772 โ โ
โ 773 โ def _set(self, i, obj): โ
โ 774 โ โ self._success, self._value = obj โ
โ โ
โ โญโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ locals โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฎ โ
โ โ self = <multiprocess.pool.ApplyResult object at 0x7f2cc1f79c10> โ โ
โ โ timeout = None โ โ
โ โฐโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฏ โ
DatasetGenerationError: An error occurred while generating the dataset
Steps to reproduce the bug
same as above
Expected behavior
should handle error and continue reading remaining files
Describe the bug
I follow this example to handle errors in custom dataset. I am writing a dataset script which read jsonl files and i need to handle errors and continue reading files without raising exception and exit the execution.
seems the logger.error is printed but still exception is raised the the run is exit.
Steps to reproduce the bug
same as above
Expected behavior
should handle error and continue reading remaining files
Environment info
python 3.9