modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷 Higher-quality, richer, and more "digestible" data for large models!
Apache License 2.0
2.63k stars 166 forks

Automatically split input dataset in ray mode #415

Closed pan-x-c closed 1 week ago

pan-x-c commented 1 month ago

Description

Split the dataset files into smaller pieces and process them in separate batches, so that no single batch exceeds Ray's memory limit.
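
To illustrate the idea only (this is not the code from this PR), here is a minimal sketch that assumes the input is a JSONL file. The names `split_jsonl`, `batched`, and `process_batch`, and the `max_lines` and `batch_size` parameters, are hypothetical and chosen for this example. It uses only the Python standard library, and the processing step is a placeholder that counts samples.

```python
import json
from pathlib import Path
from typing import Iterator, List


def split_jsonl(src: Path, out_dir: Path, max_lines: int) -> List[Path]:
    """Split one JSONL file into pieces of at most `max_lines` lines each."""
    if max_lines <= 0:
        raise ValueError("max_lines must be positive")
    out_dir.mkdir(parents=True, exist_ok=True)
    pieces: List[Path] = []
    out = None
    with src.open("r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            # Start a new piece every `max_lines` lines.
            if i % max_lines == 0:
                if out is not None:
                    out.close()
                piece = out_dir / f"{src.stem}_{len(pieces):05d}.jsonl"
                pieces.append(piece)
                out = piece.open("w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return pieces


def batched(paths: List[Path], batch_size: int) -> Iterator[List[Path]]:
    """Yield consecutive groups of at most `batch_size` paths."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


def process_batch(paths: List[Path]) -> int:
    """Placeholder for the real processing: parse and count the samples."""
    count = 0
    for path in paths:
        with path.open("r", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    json.loads(line)
                    count += 1
    return count
```

Only one batch of pieces is loaded at a time, so peak memory depends on `max_lines * batch_size` rather than on the size of the whole dataset. In a Ray setup, each batch of paths could for example be passed to `ray.data.read_json`; whether this PR does so is not stated here.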

github-actions[bot] commented 1 week ago

This PR is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this PR will be closed in 3 days.

github-actions[bot] commented 1 week ago

Close this stale PR.