modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0

Why does this often happen: One of the subprocesses has abruptly died during map operation? #430

Closed strongcc closed 1 month ago

strongcc commented 1 month ago

Before Asking

Search before asking

Question

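Hello,

  1. I hit this error while using Data-Juicer (dj): RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing. Even when I run in single-process mode, the job still dies, and no warning is printed.
  2. I wanted to use use_checkpoint, thinking that if the result of each op could be saved, I could rerun the job a few times until it finishes and make do with that. But this feature does not work well either; it also fails.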
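
The idea in point 2, saving finished work so that repeated runs eventually complete, can be sketched independently of Data-Juicer's own use_checkpoint. This is a minimal standard-library illustration, not Data-Juicer's implementation: `process_resumable`, `iter_chunks`, the shard naming, and the toy `normalize_whitespace` op are all hypothetical names chosen for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def iter_chunks(rows: Iterable[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most chunk_size rows."""
    chunk: List[dict] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def process_resumable(rows: Iterable[dict], op: Callable[[dict], dict],
                      out_dir: Path, chunk_size: int) -> int:
    """Apply op to every row, writing one JSONL shard per chunk.

    Shards that already exist are skipped, so a rerun after a crash only
    redoes the unfinished chunks. Returns the number of chunks processed
    during this call.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    done = 0
    for idx, chunk in enumerate(iter_chunks(rows, chunk_size)):
        shard = out_dir / f"shard-{idx:05d}.jsonl"
        if shard.exists():
            continue  # finished in an earlier run
        tmp = shard.with_suffix(".tmp")
        with tmp.open("w", encoding="utf-8") as f:
            for row in chunk:
                f.write(json.dumps(op(row), ensure_ascii=False) + "\n")
        tmp.replace(shard)  # rename last, so a shard is either complete or absent
        done += 1
    return done


def normalize_whitespace(row: dict) -> dict:
    """Toy stand-in for a mapper op (hypothetical): collapse whitespace runs."""
    return {**row, "text": " ".join(row["text"].split())}


if __name__ == "__main__":
    sample = [{"text": "a  b"}, {"text": "c\t d"}, {"text": "e"}]
    with tempfile.TemporaryDirectory() as d:
        print(process_resumable(sample, normalize_whitespace, Path(d), 2))
        print(process_resumable(sample, normalize_whitespace, Path(d), 2))
```

Writing to a temporary file and renaming it at the end means an interrupted run never leaves a half-written shard that a later run would mistake for finished work. Smaller chunks lose less work per crash at the cost of more files.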

My configuration:

export_shard_size: 0
export_in_parallel: false
np: 10 # number of subprocesses to process your dataset
open_tracer: true
text_keys: 'text'

use_checkpoint: true
op_fusion: false
cache_compress: 'gzip'

process:

My data size: 8 million rows, 5.2 GB

Error message:

2024-09-14 06:45:05 | ERROR | data_juicer.core.data:195 - An error occurred during Op [whitespace_normalization_mapper].
Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process
    dataset = op.run(dataset, exporter=exporter, tracer=tracer)
  File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run
    new_dataset = dataset.map(
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
2024-09-14 06:45:05 | INFO | data_juicer.core.data:200 - Writing checkpoint of dataset processed by last op...

Saving the dataset (0/40 shards): 0%| | 0/3427644 [00:00<?, ? examples/s]
Saving the dataset (0/40 shards): 0%| | 1000/3427644 [36:42<2096:14:49, 2.20s/ examples]
Saving the dataset (0/40 shards): 0%| | 1000/3427644 [36:43<2097:19:51, 2.20s/ examples]
2024-09-14 07:23:31 | ERROR | main:33 - An error has been caught in function '', process 'MainProcess' (23318), thread 'MainThread' (140652129257280):
Traceback (most recent call last):

File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process dataset = op.run(dataset, exporter=exporter, tracer=tracer) │ │ │ │ └ <data_juicer.core.tracer.Tracer object at 0x7fe90ff72110> │ │ │ └ <data_juicer.core.exporter.Exporter object at 0x7fe90ff73220> │ │ └ Dataset({ │ │ features: ['text', 'djstats__'], │ │ num_rows: 3427644 │ │ }) │ └ <function Mapper.run at 0x7feabc2cbd00> └ <data_juicer.ops.mapper.whitespace_normalization_mapper.WhitespaceNormalizationMapper object at 0x7fe90fb80520>

File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run new_dataset = dataset.map( │ └ <function NestedDataset.map at 0x7fe9102265f0> └ Dataset({ features: ['text', 'djstats__'], num_rows: 3427644 })

File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map new_ds = NestedDataset(super().map(*args, **kargs)) │ │ └ {'num_proc': 40, 'with_rank': False, 'desc': 'whitespace_normalization_mapper_process', 'batched': True, 'batch_size': 1, 'ne... │ └ [<function WhitespaceNormalizationMapper.process at 0x7fe90f9ad900>] └ <class 'data_juicer.core.data.NestedDataset'>

File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) │ │ │ │ └ {'num_proc': 40, 'with_rank': False, 'desc': 'whitespace_normalization_mapper_process', 'batched': True, 'batch_size': 1, 'ne... │ │ │ └ (<function WhitespaceNormalizationMapper.process at 0x7fe90f9ad900>,) │ │ └ Dataset({ │ │ features: ['text', 'djstats__'], │ │ num_rows: 3427644 │ │ }) │ └ <function Dataset.map at 0x7feabc54d1b0> └ typing.Union

File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) │ │ │ │ └ {'num_proc': 40, 'with_rank': False, 'desc': 'whitespace_normalization_mapper_process', 'batched': True, 'batch_size': 1, 'ne... │ │ │ └ (<function WhitespaceNormalizationMapper.process at 0x7fe90f9ad900>,) │ │ └ Dataset({ │ │ features: ['text', 'djstats__'], │ │ num_rows: 3427644 │ │ }) │ └ <function Dataset.map at 0x7feabc54d120> └ typing.Union

File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map for rank, done, content in iflatmap_unordered( │ └ <function iflatmap_unordered at 0x7feac62ba8c0> └ 39

File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered raise RuntimeError(

RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 197, in process exit(1)

File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/_sitebuiltins.py", line 26, in __call__ raise SystemExit(code) └ 1

SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/home/work/wangsicong/miniconda3/envs/data_juicer/bin/dj-process", line 33, in sys.exit(load_entry_point('py-data-juicer', 'console_scripts', 'dj-process')()) │ │ └ <function importlib_load_entry_point at 0x7fec1ff5bd90> │ └ └ <module 'sys' (built-in)>

File "/home/work/wangsicong/code/data-juicer/tools/process_data.py", line 15, in main executor.run() │ └ <function Executor.run at 0x7fe910227ac0> └ <data_juicer.core.executor.Executor object at 0x7fe90fb80670>

File "/home/work/wangsicong/code/data-juicer/data_juicer/core/executor.py", line 164, in run dataset = dataset.process(ops, │ │ └ [<data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x7fe910036f20>, <data_juicer.ops.mapper.wh... │ └ <function NestedDataset.process at 0x7fe910226560> └ Dataset({ features: ['text'], num_rows: 8537246 })

File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 203, in process checkpointer.save_ckpt(dataset) │ │ └ Dataset({ │ │ features: ['text', 'djstats__'], │ │ num_rows: 3427644 │ │ }) │ └ <function CheckpointManager.save_ckpt at 0x7fe9102272e0> └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0>

File "/home/work/wangsicong/code/data-juicer/data_juicer/utils/ckpt_utils.py", line 124, in save_ckpt ds.save_to_disk(self.ckpt_ds_dir, num_proc=self.num_proc) │ │ │ │ │ └ 40 │ │ │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0> │ │ │ └ '/home/work/wangsicong/data/ckpt/latest' │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0> │ └ <function Dataset.save_to_disk at 0x7feabc543400> └ Dataset({ features: ['text', 'djstats__'], num_rows: 3427644 })

File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1523, in save_to_disk for job_id, done, content in iflatmap_unordered( │ │ │ └ <function iflatmap_unordered at 0x7feac62ba8c0> │ │ └ 1000 │ └ False └ 0

File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered raise RuntimeError(

RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process
    dataset = op.run(dataset, exporter=exporter, tracer=tracer)
  File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run
    new_dataset = dataset.map(
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 197, in process
    exit(1)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: 1

Additional

No response

drcege commented 1 month ago

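This problem is caused by Hugging Face's dataset.map. Please check whether the machine is running short of resources, and try reducing num_proc.

Also, I noticed that your config sets np=10, but the error message shows 40. Please confirm whether you are using an older version of the code; updating to the latest version is recommended to resolve this.

https://github.com/hiyouga/LLaMA-Factory/issues/662
https://github.com/huggingface/datasets/issues/6787
https://discuss.huggingface.co/t/map-multiprocessing-issue/4085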
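
The advice to check machine resources and lower num_proc can be turned into a rough rule of thumb. The sketch below is an assumption, not part of Data-Juicer or Hugging Face datasets: `suggest_num_proc` is a hypothetical helper, the 20% reserve is an arbitrary default, and the per-worker memory figure has to be measured on your own workload (for example by watching one worker in `top`).

```python
import os
from typing import Optional


def suggest_num_proc(available_bytes: int,
                     per_worker_bytes: int,
                     cpu_count: Optional[int] = None,
                     reserve_fraction: float = 0.2) -> int:
    """Return a worker count bounded by both CPU cores and memory.

    available_bytes:  memory currently free for the job
    per_worker_bytes: measured peak memory of a single worker (an estimate)
    reserve_fraction: share of memory kept back for the parent process and OS
    """
    if per_worker_bytes <= 0:
        raise ValueError("per_worker_bytes must be positive")
    if not 0.0 <= reserve_fraction < 1.0:
        raise ValueError("reserve_fraction must be in [0, 1)")
    cpus = cpu_count or os.cpu_count() or 1
    usable = int(available_bytes * (1.0 - reserve_fraction))
    by_memory = usable // per_worker_bytes
    # Never go below one worker, never above the core count.
    return max(1, min(cpus, by_memory))


if __name__ == "__main__":
    GiB = 1024 ** 3
    # 40 cores, 32 GiB free, about 4 GiB per worker: memory is the limit.
    print(suggest_num_proc(32 * GiB, 4 * GiB, cpu_count=40))
```

If the result is far below the value in the config, memory rather than core count is the likely bottleneck, and reducing np is worth trying before anything else.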

strongcc commented 1 month ago

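Thanks. I first used np=40, then suspected that was too large and changed it to 10, which did not work either.

I'll keep reducing it and try again. Thanks.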

strongcc commented 1 month ago

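> This problem is caused by Hugging Face's dataset.map. Please check whether the machine is running short of resources, and try reducing num_proc.
>
> Also, I noticed that your config sets np=10, but the error message shows 40. Please confirm whether you are using an older version of the code; updating to the latest version is recommended to resolve this.
>
> hiyouga/LLaMA-Factory#662 huggingface/datasets#6787 https://discuss.huggingface.co/t/map-multiprocessing-issue/4085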

Hello. I switched to np=4 (already very small). The first op succeeded, but the second op failed again. Do you have any other suggestions?

language_id_score_filter_process (num_proc=4): 100%|#########9| 75974136/75976181 [11:56<00:00, 40247.35 examples/s]
language_id_score_filter_process (num_proc=4): 100%|##########| 75976181/75976181 [11:58<00:00, 105736.02 examples/s]
2024-09-15 17:01:38 | INFO | data_juicer.core.data:192 - OP [language_id_score_filter] Done in 20398.867s. Left 30544553 samples.

whitespace_normalization_mapper_process (num_proc=4): 0%| | 0/30544553 [00:00<?, ? examples/s]
whitespace_normalization_mapper_process (num_proc=4): 0%| | 0/30544553 [06:58<?, ? examples/s]
2024-09-15 17:31:26 | ERROR | data_juicer.core.data:195 - An error occurred during Op [whitespace_normalization_mapper].
Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process
    dataset = op.run(dataset, exporter=exporter, tracer=tracer)
  File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run
    new_dataset = dataset.map(
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
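
The traceback above comes from the parent process; the worker that died leaves no Python traceback of its own. One possible explanation, which this thread does not confirm, is that the operating system killed the worker with a signal (for example, the Linux out-of-memory killer sends SIGKILL). A minimal sketch, POSIX only and using only the standard library, of how such a death looks different from an ordinary exception; `describe_exit` and `run_child` are hypothetical helper names.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description.

    On POSIX, subprocess reports death by signal N as return code -N.
    """
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        return f"killed by signal {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_child(code: str) -> subprocess.CompletedProcess:
    """Run a Python snippet in a fresh interpreter and capture its output."""
    return subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True)


if __name__ == "__main__":
    # A child that raises prints a traceback to stderr before exiting.
    raised = run_child("raise ValueError('boom')")
    print(describe_exit(raised.returncode), "| stderr has traceback:",
          "Traceback" in raised.stderr)

    # A child that is SIGKILLed gets no chance to print anything.
    killed = run_child("import os, signal; os.kill(os.getpid(), signal.SIGKILL)")
    print(describe_exit(killed.returncode), "| stderr has traceback:",
          "Traceback" in killed.stderr)
```

If a killed worker is suspected, the kernel log (`dmesg` on Linux) is the usual place to look for an out-of-memory kill record, since the Python side has nothing to report.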

github-actions[bot] commented 1 month ago

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.

github-actions[bot] commented 1 month ago

Close this stale issue.