modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0
2.5k stars 154 forks

[Bug]: video_split_by_duration_mapper RuntimeError: can't start new thread #362

Closed kike-0304 closed 1 week ago

kike-0304 commented 1 month ago

Before Reporting

Search before reporting

OS

Ubuntu

Installation Method

pip

Data-Juicer Version

v0.1.2

Python Version

3.10

Describe the bug

The video_split_by_duration_mapper operator runs for a while and then fails with RuntimeError: can't start new thread.

To Reproduce

Run the video_split_by_duration_mapper operator. After running for a while it raises RuntimeError: can't start new thread, even though the machine has ample CPU resources.

Configs

process:

Logs

2024-07-23 20:55:27 | INFO | data_juicer.core.executor:54 - Setting up data formatter...
2024-07-23 20:55:27 | INFO | data_juicer.core.executor:76 - Preparing exporter...
2024-07-23 20:55:27 | INFO | data_juicer.core.executor:153 - Loading dataset from data formatter...
2024-07-23 20:55:27 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-07-23 20:55:27 | INFO | data_juicer.format.formatter:200 - There are 50000 sample(s) in the original dataset.
Filter (num_proc=4): 100%|##########| 50000/50000 [00:20<00:00, 2433.82 examples/s]
2024-07-23 20:55:48 | INFO | data_juicer.format.formatter:214 - 50000 samples left after filtering empty text.
2024-07-23 20:55:48 | INFO | data_juicer.format.formatter:237 - Converting relative paths in the dataset to their absolute version. (Based on the directory of input dataset file)
Map (num_proc=4): 100%|##########| 50000/50000 [00:18<00:00, 2691.31 examples/s]
2024-07-23 20:56:07 | INFO | data_juicer.format.mixture_formatter:137 - sampled 50000 from 50000
2024-07-23 20:56:07 | INFO | data_juicer.format.mixture_formatter:143 - There are 50000 in final dataset
2024-07-23 20:56:07 | INFO | data_juicer.core.executor:159 - Preparing process operators...
2024-07-23 20:56:07 | INFO | data_juicer.core.executor:166 - Processing data...
video_split_by_duration_mapper_process (num_proc=4): 3%|3 | 1711/50000 [42:31<28:22:59, 2.12s/ examples]
moov atom not found
video_split_by_duration_mapper_process (num_proc=4): 11%|# | 5489/50000 [5:39:39<56:01:58, 4.53s/ examples]
Exception in thread Thread-1 (accepter):
Traceback (most recent call last):
  File "/root/anaconda3/envs/sora/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/envs/sora/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/sora/lib/python3.10/site-packages/multiprocess/managers.py", line 193, in accepter
    t.start()
  File "/root/anaconda3/envs/sora/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

Screenshots

Screenshot 2024-07-24 091204

Additional

No response

drcege commented 1 month ago

This is usually caused by a system resource limit. You can try reducing np, or setting export MP_START_METHOD=spawn.

github-actions[bot] commented 1 week ago

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

github-actions[bot] commented 1 week ago

Close this stale issue.