modelscope / data-juicer

A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0
2.63k stars, 166 forks

Guidance for OP with multiple data fields to be processed #411

Closed yxdyc closed 1 week ago

yxdyc commented 1 month ago

Search before continuing

Description

Currently, users may be unsure how a single OP can support multiple data fields, for example when developing an OP that processes both text_key="question" and text_key="answer".
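To make the multi-field case concrete, here is a minimal sketch of one possible shape for such an OP. It does not import Data-Juicer and does not claim to match its base classes: the class name `MultiFieldWhitespaceMapper`, the `text_keys` parameter, and the `process` signature are all hypothetical, chosen only to illustrate applying one transformation to several string fields of a sample.

```python
from typing import Dict, Iterable


class MultiFieldWhitespaceMapper:
    """Hypothetical mapper that collapses whitespace in several text fields."""

    def __init__(self, text_keys: Iterable[str] = ("question", "answer")):
        # `text_keys` is an assumed parameter name, not a confirmed Data-Juicer API.
        self.text_keys = list(text_keys)

    def process(self, sample: Dict[str, object]) -> Dict[str, object]:
        # Apply the same transformation to every configured field.
        for key in self.text_keys:
            value = sample.get(key)
            # Only touch fields that are present and hold a plain string.
            if isinstance(value, str):
                sample[key] = " ".join(value.split())
        return sample


sample = {
    "question": "What  is\tData-Juicer? ",
    "answer": " A data   processing system.",
    "id": 7,
}
cleaned = MultiFieldWhitespaceMapper().process(sample)
```

Looping over a list of keys keeps the per-field logic in one place. An alternative would be to run the same single-key OP twice with different text_key values.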

In addition, we need to add guidance on the type of text-related keys: each value must be a str rather than a list or dict, for the sake of efficiency and coding convenience (this is an implicit assumption in all text-related OPs).
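The str-only assumption could be checked before samples reach a text-related OP. The sketch below is illustrative only: `find_non_str_fields` and `flatten_text_field` are hypothetical helper names, not Data-Juicer functions, and joining list items with a newline is just one possible way to normalize a non-string field.

```python
from typing import Dict, Iterable, List


def find_non_str_fields(sample: Dict[str, object], text_keys: Iterable[str]) -> List[str]:
    """Return the text keys whose value is present but is not a str."""
    return [k for k in text_keys if k in sample and not isinstance(sample[k], str)]


def flatten_text_field(value: object, sep: str = "\n") -> str:
    """Normalize a field to str: join list/tuple items, stringify anything else."""
    if isinstance(value, str):
        return value
    if isinstance(value, (list, tuple)):
        return sep.join(str(item) for item in value)
    return str(value)


sample = {"question": "What is an OP?", "answer": ["An operator.", "It processes samples."]}
bad_keys = find_non_str_fields(sample, ["question", "answer"])
for key in bad_keys:
    # Convert offending fields in place so downstream OPs see plain strings.
    sample[key] = flatten_text_field(sample[key])
```

Whether to reject such samples or convert them is a design decision; the conversion here is shown only to make the constraint concrete.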

Use case

Related issue: https://github.com/modelscope/data-juicer/issues/380

Additional information

No response

Are you willing to submit a PR for this feature?

github-actions[bot] commented 1 week ago

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

github-actions[bot] commented 1 week ago

Close this stale issue.