modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0
2.63k stars 166 forks

Refine batch op branch #435

Closed BeachWang closed 1 week ago

BeachWang commented 1 week ago
  1. Change the default op batch size from 1 to 1000.
  2. Change list(map()) to map() for filter OPs, and keep the original code for mapper OPs.
  3. Make sure that the dataset is a NestedDataset instance in the run function. NOTE: This is not guaranteed when the process function of a Deduplicator or Selector OP is called directly!
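
As a rough illustration of items 1 and 2, the sketch below splits samples into batches of 1000 and contrasts an eager list(map()) with a lazy map() when computing keep/drop flags for a filter. It is a standalone example, not Data-Juicer code: `iter_batches`, `keep_flags_eager`, `keep_flags_lazy`, and `is_long_enough` are hypothetical names chosen here, and the samples are plain dicts rather than a NestedDataset.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Sample = Dict[str, str]


def iter_batches(samples: Iterable[Sample],
                 batch_size: int = 1000) -> Iterator[List[Sample]]:
    """Yield consecutive batches of at most `batch_size` samples (hypothetical helper)."""
    it = iter(samples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def keep_flags_eager(batch: List[Sample],
                     predicate: Callable[[Sample], bool]) -> List[bool]:
    """list(map()): every flag is computed and stored immediately."""
    return list(map(predicate, batch))


def keep_flags_lazy(batch: List[Sample],
                    predicate: Callable[[Sample], bool]) -> Iterator[bool]:
    """map(): flags are computed one at a time, only when the consumer iterates."""
    return map(predicate, batch)


def is_long_enough(sample: Sample) -> bool:
    """Toy filter rule (an assumption for this example): keep texts of 5+ characters."""
    return len(sample["text"]) >= 5


# 2500 toy samples whose text lengths are 0, 1, ..., 2499.
samples = [{"text": "x" * n} for n in range(2500)]
batches = list(iter_batches(samples, batch_size=1000))
batch_sizes = [len(b) for b in batches]

# The lazy iterator is consumed exactly once while filtering the first batch.
lazy_flags = keep_flags_lazy(batches[0], is_long_enough)
kept = [s for s, keep in zip(batches[0], lazy_flags) if keep]
```

With a lazy map(), no intermediate list of flags is built per batch, which matters more once each batch holds 1000 samples instead of 1. The trade-off is that the iterator can be consumed only once, so a caller that needs the flags twice should still materialize them.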