Closed kike-0304 closed 1 week ago
This problem is usually caused by a system resource limit. You can try reducing np, or setting export MP_START_METHOD=spawn
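Both suggestions can be applied together when launching the job. The following is a minimal sketch, assuming the job is started with the `dj-process` command and that the `np` setting can be overridden on the command line as `--np`. The helper name `build_launch` and the path `config.yaml` are hypothetical, and the effect of `MP_START_METHOD` is taken from the reply above rather than verified here.

```python
import os
import subprocess  # only needed if the run line at the bottom is uncommented


def build_launch(config_path, np=2, start_method="spawn"):
    """Return (cmd, env) for a run with fewer workers and an explicit start method.

    Assumptions: `dj-process` is on PATH and accepts `--np`; the variable
    MP_START_METHOD is read by the job as described in the reply above.
    """
    env = dict(os.environ)                 # copy, so the parent shell is untouched
    env["MP_START_METHOD"] = start_method  # same effect as `export MP_START_METHOD=spawn`
    cmd = ["dj-process", "--config", config_path, "--np", str(np)]
    return cmd, env


cmd, env = build_launch("config.yaml", np=2)
# Uncomment to actually start the job:
# subprocess.run(cmd, env=env, check=True)
```

Halving `np` from the 4 shown in the logs is only a starting point; the right value depends on which limit is being hit.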
This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.
Close this stale issue.
Before Reporting
[X] I have pulled the latest code of the main branch and run it again, and the bug still exists.
[X] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)
Search before reporting
OS
Ubuntu
Installation Method
pip
Data-Juicer Version
v0.1.2
Python Version
3.10
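Describe the bug
The video_split_by_duration_mapper operator fails with RuntimeError: can't start new thread after running for a while.
To Reproduce
Run the video_split_by_duration_mapper operator. After it has been running for a while it raises RuntimeError: can't start new thread, even though the machine has ample CPU resources.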
Configs
process:
  - video_split_by_duration_mapper:
      split_duration: 10
      min_last_split_duration: 3
      keep_original_sample: false
Logs
2024-07-23 20:55:27 | INFO | data_juicer.core.executor:54 - Setting up data formatter...
2024-07-23 20:55:27 | INFO | data_juicer.core.executor:76 - Preparing exporter...
2024-07-23 20:55:27 | INFO | data_juicer.core.executor:153 - Loading dataset from data formatter...
2024-07-23 20:55:27 | INFO | data_juicer.format.formatter:185 - Unifying the input dataset formats...
2024-07-23 20:55:27 | INFO | data_juicer.format.formatter:200 - There are 50000 sample(s) in the original dataset.
Filter (num_proc=4): 100%|##########| 50000/50000 [00:20<00:00, 2433.82 examples/s]
2024-07-23 20:55:48 | INFO | data_juicer.format.formatter:214 - 50000 samples left after filtering empty text.
2024-07-23 20:55:48 | INFO | data_juicer.format.formatter:237 - Converting relative paths in the dataset to their absolute version. (Based on the directory of input dataset file)
Map (num_proc=4): 100%|##########| 50000/50000 [00:18<00:00, 2691.31 examples/s]
2024-07-23 20:56:07 | INFO | data_juicer.format.mixture_formatter:137 - sampled 50000 from 50000
2024-07-23 20:56:07 | INFO | data_juicer.format.mixture_formatter:143 - There are 50000 in final dataset
2024-07-23 20:56:07 | INFO | data_juicer.core.executor:159 - Preparing process operators...
2024-07-23 20:56:07 | INFO | data_juicer.core.executor:166 - Processing data...
video_split_by_duration_mapper_process (num_proc=4): 3%|3 | 1711/50000 [42:31<28:22:59, 2.12s/ examples]
moov atom not found
video_split_by_duration_mapper_process (num_proc=4): 11%|# | 5489/50000 [5:39:39<56:01:58, 4.53s/ examples]
Exception in thread Thread-1 (accepter):
Traceback (most recent call last):
  File "/root/anaconda3/envs/sora/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/root/anaconda3/envs/sora/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/root/anaconda3/envs/sora/lib/python3.10/site-packages/multiprocess/managers.py", line 193, in accepter
    t.start()
  File "/root/anaconda3/envs/sora/lib/python3.10/threading.py", line 935, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
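The traceback places the failure in the `accepter` thread of `multiprocess/managers.py`, at its `t.start()` call, several hours into the run. One way to check whether threads are accumulating over that time is to sample the `Threads:` field of `/proc/<pid>/status` for the worker processes. The following is a Linux-only sketch; the helper name `os_thread_count` is hypothetical, and the demo at the bottom only inspects the current process.

```python
import os
import threading


def os_thread_count(pid=None):
    """Return the number of OS threads of a process, read from /proc (Linux only)."""
    pid = os.getpid() if pid is None else pid
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return None  # field not found


# Demo on the current process: the count grows while an extra thread is alive.
before = os_thread_count()
stop = threading.Event()
worker = threading.Thread(target=stop.wait)
worker.start()
during = os_thread_count()
stop.set()
worker.join()
print(f"threads before: {before}, with one extra thread: {during}")
```

Calling `os_thread_count(pid)` periodically for each worker PID and logging the result would show whether the count climbs steadily toward a limit or stays flat until the failure.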
Screenshots
Additional
No response