modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0

[Feat] Automatically Handle `BrokenPipeError` Caused by Limited Memory #377

Open yxdyc opened 1 month ago

yxdyc commented 1 month ago

Search before continuing

Description

The hardware resources (especially memory) available at runtime vary with the data recipe and the dataset. When resources are limited, this can lead to `BrokenPipeError` in Data-Juicer's multiprocessing mode.

Ideally, Data-Juicer would automatically track the available resources and help users split the dataset into smaller subsets that can then be re-processed in batch mode.
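To make the idea concrete, here is a minimal sketch of how a batch size could be derived from a memory budget and then used to split a dataset. It is not Data-Juicer code: `plan_batch_size`, `split_into_batches`, the per-sample size estimate, and the `safety_fraction` default are all hypothetical, and the available memory is passed in as a plain number rather than measured.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def plan_batch_size(est_bytes_per_sample: int,
                    available_bytes: int,
                    num_workers: int = 1,
                    safety_fraction: float = 0.5) -> int:
    """Hypothetical helper: pick how many samples each worker handles at once.

    Only `safety_fraction` of the available memory is budgeted, and that
    budget is shared evenly across `num_workers` processes.
    """
    if est_bytes_per_sample <= 0 or num_workers <= 0:
        raise ValueError("sample size and worker count must be positive")
    budget_per_worker = available_bytes * safety_fraction / num_workers
    # Always process at least one sample, even if the estimate exceeds the budget.
    return max(1, int(budget_per_worker // est_bytes_per_sample))


def split_into_batches(samples: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `samples` with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


if __name__ == "__main__":
    size = plan_batch_size(est_bytes_per_sample=1_000, available_bytes=8_000,
                           num_workers=2)
    for batch in split_into_batches(list(range(5)), size):
        print(batch)
```

In practice the available memory would have to be queried from the system at runtime, and the per-sample estimate would depend on the operators in the recipe; both are left as inputs here.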

Use case

No response

Additional

No response

Are you willing to submit a PR for this feature?

github-actions[bot] commented 1 month ago

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

github-actions[bot] commented 4 weeks ago

Close this stale issue.

yxdyc commented 3 weeks ago

Update: We are currently enhancing the Data-Juicer processing engine to automatically divide data into smaller batches. This improvement aims to boost efficiency and significantly reduce the likelihood of encountering out-of-memory (OOM) issues.
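For readers who need a workaround in the meantime, one possible fallback is to retry a failed batch at half the size. This is only a rough sketch under stated assumptions, not how the Data-Juicer engine does it: `process_with_backoff` and the `process_batch` callable are hypothetical names, and it assumes the memory failure surfaces in the calling process as `BrokenPipeError` or `MemoryError`.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_with_backoff(samples: Sequence[T],
                         process_batch: Callable[[Sequence[T]], List[R]],
                         initial_batch_size: int,
                         min_batch_size: int = 1) -> List[R]:
    """Hypothetical helper: process `samples` in batches, halving on failure.

    `process_batch` stands in for whatever runs one batch (for example, a
    multiprocessing job). A failed batch is retried from the same position
    with a smaller batch size; already finished batches are not redone.
    """
    if initial_batch_size < 1 or min_batch_size < 1:
        raise ValueError("batch sizes must be at least 1")
    results: List[R] = []
    batch_size = initial_batch_size
    start = 0
    while start < len(samples):
        batch = samples[start:start + batch_size]
        try:
            results.extend(process_batch(batch))
        except (BrokenPipeError, MemoryError):
            if batch_size <= min_batch_size:
                raise  # cannot shrink further; surface the error to the caller
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # retry the same position with the smaller batch
        start += len(batch)
    return results
```

The smaller batch size is kept for the rest of the run rather than grown back, which trades some throughput for fewer repeated failures.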