Closed strongcc closed 1 month ago
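This problem is caused by Hugging Face's dataset.map. Please check whether the machine is running out of resources, and try reducing num_proc.
Also, your configuration sets np=10, but the error message shows 40. Please confirm whether you are running an old version of the code; updating to the latest version is recommended to resolve this.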
https://github.com/hiyouga/LLaMA-Factory/issues/662 https://github.com/huggingface/datasets/issues/6787 https://discuss.huggingface.co/t/map-multiprocessing-issue/4085
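The mismatch noted above (np=10 in the config, 40 in the error output) can be checked mechanically. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the worker count appears in the log either as `num_proc=N` (progress bars) or as `'num_proc': N` (the kwargs dump in the traceback), as it does in this thread.

```python
import re
from typing import Set


def find_num_proc_values(log_text: str) -> Set[int]:
    """Collect every worker count written as num_proc=N or 'num_proc': N."""
    pattern = re.compile(r"['\"]?num_proc['\"]?\s*[=:]\s*(\d+)")
    return {int(match.group(1)) for match in pattern.finditer(log_text)}


def mismatched_num_proc(configured_np: int, log_text: str) -> Set[int]:
    """Return worker counts seen in the log that differ from the configured np."""
    return find_num_proc_values(log_text) - {configured_np}


# Two fragments shaped like the ones in this thread (shortened for illustration).
sample_log = (
    "whitespace_normalization_mapper_process (num_proc=4): 0%\n"
    "{'num_proc': 40, 'with_rank': False}\n"
)
print(mismatched_num_proc(10, sample_log))
```

A non-empty result suggests the run did not pick up the config that was edited, for example because an older checkout or a cached config was used.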
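Thanks. I started with np=40, then guessed it might be too large and changed it to 10, but that did not work either.
I'll keep reducing it and try again. Thanks.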
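Rather than shrinking np by trial and error, one option is to derive a starting value from the memory that is actually free. This is a rough sketch, not something Data-Juicer provides: both function names are hypothetical, the per-worker memory figure has to be measured or guessed by the user, and `available_memory_bytes` relies on `os.sysconf` names that exist on Linux but not on every platform.

```python
import os


def suggest_num_proc(available_bytes: int, per_worker_bytes: int, cpu_count: int) -> int:
    """Largest worker count that fits both the memory budget and the CPU count (minimum 1)."""
    if per_worker_bytes <= 0:
        raise ValueError("per_worker_bytes must be positive")
    by_memory = available_bytes // per_worker_bytes
    return max(1, min(cpu_count, by_memory))


def available_memory_bytes() -> int:
    """Free physical memory via POSIX sysconf (assumes a Linux host)."""
    return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
```

For example, `suggest_num_proc(available_memory_bytes(), 2 * 1024**3, os.cpu_count() or 1)` caps the worker count at one per 2 GiB of free memory; the 2 GiB figure is a placeholder, not a measured value for any op in this thread.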
Hello. I switched to np=4 (already very small). The first op succeeded, but the second op then failed. Do you have any other suggestions?
language_id_score_filter_process (num_proc=4): 100%|#########9| 75974136/75976181 [11:56<00:00, 40247.35 examples/s]
language_id_score_filter_process (num_proc=4): 100%|##########| 75976181/75976181 [11:58<00:00, 105736.02 examples/s]
2024-09-15 17:01:38 | INFO | data_juicer.core.data:192 - OP [language_id_score_filter] Done in 20398.867s. Left 30544553 samples.
whitespace_normalization_mapper_process (num_proc=4): 0%| | 0/30544553 [00:00<?, ? examples/s]
whitespace_normalization_mapper_process (num_proc=4): 0%| | 0/30544553 [06:58<?, ? examples/s]
2024-09-15 17:31:26 | ERROR | data_juicer.core.data:195 - An error occurred during Op [whitespace_normalization_mapper].
Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process
    dataset = op.run(dataset, exporter=exporter, tracer=tracer)
  File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run
    new_dataset = dataset.map(
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
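The RuntimeError above only reports that a worker process disappeared; it does not say why. On Linux, one frequent reason is that the process was killed by a signal, for instance by the kernel when memory runs out, although nothing in this log confirms that here. The sketch below shows how such a death looks from the parent side, using only the standard library. It assumes a POSIX system, and the child that kills itself is a stand-in for an externally killed worker, not a reproduction of the Data-Juicer failure.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Translate a child's return code into a readable cause (POSIX: negative means signal)."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        return f"killed by signal {signal.Signals(-returncode).name}"
    return f"exited with error code {returncode}"


def run_child_that_gets_killed() -> int:
    """Start a child Python process that sends SIGKILL to itself and return its exit code."""
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
    return subprocess.run([sys.executable, "-c", child_code]).returncode


print(describe_exit(run_child_that_gets_killed()))
```

A worker that ends this way raises no Python exception of its own, which is why the parent can only report an abrupt death. Checking the kernel log for out-of-memory kills around the failure time is one way to test this hypothesis.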
This issue is marked as stale because there has been no activity for 21 days. Remove stale label or add new comments or this issue will be closed in 3 day.
Close this stale issue.
Before Asking
[X] I have read the README carefully.
[X] I have pulled the latest code of main branch to run again and the problem still existed.
Search before asking
Question
Hello,
My configuration:
export_shard_size: 0
export_in_parallel: false
np: 10  # number of subprocess to process your dataset
open_tracer: true
text_keys: 'text'
use_checkpoint: true
op_fusion: false
cache_compress: 'gzip'
process:
  - language_id_score_filter:
      lang: [en]
      min_score: 0.8
  - whitespace_normalization_mapper:
My data size: 8 million rows, 5.2 GB
Error message:
2024-09-14 06:45:05 | ERROR | data_juicer.core.data:195 - An error occurred during Op [whitespace_normalization_mapper].
Traceback (most recent call last):
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process
    dataset = op.run(dataset, exporter=exporter, tracer=tracer)
  File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run
    new_dataset = dataset.map(
  File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map
    new_ds = NestedDataset(super().map(*args, **kargs))
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map
    for rank, done, content in iflatmap_unordered(
  File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered
    raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
2024-09-14 06:45:05 | INFO | data_juicer.core.data:200 - Writing checkpoint of dataset processed by last op...
Saving the dataset (0/40 shards): 0%| | 0/3427644 [00:00<?, ? examples/s]
Saving the dataset (0/40 shards): 0%| | 1000/3427644 [36:42<2096:14:49, 2.20s/ examples]
Saving the dataset (0/40 shards): 0%| | 1000/3427644 [36:43<2097:19:51, 2.20s/ examples]
2024-09-14 07:23:31 | ERROR | __main__:33 - An error has been caught in function '<module>', process 'MainProcess' (23318), thread 'MainThread' (140652129257280):
Traceback (most recent call last):
File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 187, in process dataset = op.run(dataset, exporter=exporter, tracer=tracer) │ │ │ │ └ <data_juicer.core.tracer.Tracer object at 0x7fe90ff72110> │ │ │ └ <data_juicer.core.exporter.Exporter object at 0x7fe90ff73220> │ │ └ Dataset({ │ │ features: ['text', 'djstats__'], │ │ num_rows: 3427644 │ │ }) │ └ <function Mapper.run at 0x7feabc2cbd00> └ <data_juicer.ops.mapper.whitespace_normalization_mapper.WhitespaceNormalizationMapper object at 0x7fe90fb80520>
File "/home/work/wangsicong/code/data-juicer/data_juicer/ops/base_op.py", line 240, in run new_dataset = dataset.map( │ └ <function NestedDataset.map at 0x7fe9102265f0> └ Dataset({ features: ['text', 'djstats__'], num_rows: 3427644 })
File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 248, in map new_ds = NestedDataset(super().map(*args, **kargs)) │ │ └ {'num_proc': 40, 'with_rank': False, 'desc': 'whitespace_normalization_mapper_process', 'batched': True, 'batch_size': 1, 'ne... │ └ [<function WhitespaceNormalizationMapper.process at 0x7fe90f9ad900>] └ <class 'data_juicer.core.data.NestedDataset'>
File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 593, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) │ │ │ │ └ {'num_proc': 40, 'with_rank': False, 'desc': 'whitespace_normalization_mapper_process', 'batched': True, 'batch_size': 1, 'ne... │ │ │ └ (<function WhitespaceNormalizationMapper.process at 0x7fe90f9ad900>,) │ │ └ Dataset({ │ │ features: ['text', 'djstats__'], │ │ num_rows: 3427644 │ │ }) │ └ <function Dataset.map at 0x7feabc54d1b0> └ typing.Union
File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 558, in wrapper out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) │ │ │ │ └ {'num_proc': 40, 'with_rank': False, 'desc': 'whitespace_normalization_mapper_process', 'batched': True, 'batch_size': 1, 'ne... │ │ │ └ (<function WhitespaceNormalizationMapper.process at 0x7fe90f9ad900>,) │ │ └ Dataset({ │ │ features: ['text', 'djstats__'], │ │ num_rows: 3427644 │ │ }) │ └ <function Dataset.map at 0x7feabc54d120> └ typing.Union
File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3197, in map for rank, done, content in iflatmap_unordered( │ └ <function iflatmap_unordered at 0x7feac62ba8c0> └ 39
File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 197, in process exit(1)
File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/_sitebuiltins.py", line 26, in __call__ raise SystemExit(code) └ 1
SystemExit: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/work/wangsicong/code/data-juicer/tools/process_data.py", line 15, in main executor.run() │ └ <function Executor.run at 0x7fe910227ac0> └ <data_juicer.core.executor.Executor object at 0x7fe90fb80670>
File "/home/work/wangsicong/code/data-juicer/data_juicer/core/executor.py", line 164, in run dataset = dataset.process(ops, │ │ └ [<data_juicer.ops.filter.language_id_score_filter.LanguageIDScoreFilter object at 0x7fe910036f20>, <data_juicer.ops.mapper.wh... │ └ <function NestedDataset.process at 0x7fe910226560> └ Dataset({ features: ['text'], num_rows: 8537246 })
File "/home/work/wangsicong/code/data-juicer/data_juicer/core/data.py", line 203, in process checkpointer.save_ckpt(dataset) │ │ └ Dataset({ │ │ features: ['text', 'djstats__'], │ │ num_rows: 3427644 │ │ }) │ └ <function CheckpointManager.save_ckpt at 0x7fe9102272e0> └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0>
File "/home/work/wangsicong/code/data-juicer/data_juicer/utils/ckpt_utils.py", line 124, in save_ckpt ds.save_to_disk(self.ckpt_ds_dir, num_proc=self.num_proc) │ │ │ │ │ └ 40 │ │ │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0> │ │ │ └ '/home/work/wangsicong/data/ckpt/latest' │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7fe90fb4cfd0> │ └ <function Dataset.save_to_disk at 0x7feabc543400> └ Dataset({ features: ['text', 'djstats__'], num_rows: 3427644 })
File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1523, in save_to_disk for job_id, done, content in iflatmap_unordered( │ │ │ └ <function iflatmap_unordered at 0x7feac62ba8c0> │ │ └ 1000 │ └ False └ 0
File "/home/work/wangsicong/miniconda3/envs/data_juicer/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 656, in iflatmap_unordered raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
Additional
No response