A one-stop data processing system to make data higher-quality, juicier, and more digestible for (multimodal) LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0
[Feat] Support `dj_batched_group_ops` that allows for the configuration and application of multiple operators in smaller, manageable batches #413
[X] I have searched the Data-Juicer issues and found no similar feature requests.
Description
Currently, Data-Juicer's recipes and default executor only support applying a sequence of operators, such as OP1 and OP2, to the entire dataset in a linear fashion:
dataset.process([OP1, OP2])
However, for finer-grained control and better resource management, particularly in scenarios that require batch-wise sequential processing, the following approach is proposed:
for data_batch in dataset.batch_iterator(batch_size):
    data_batch.process([OP1, OP2])
This approach applies the operators to smaller, manageable batches, which can improve efficiency, reduce the memory footprint, and simplify the implementation.
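As a rough sketch of the batch-wise loop above, the following self-contained Python example uses a plain list of dicts in place of a real dataset. The names `batch_iterator`, `process`, `process_in_batches`, `strip_text`, and `drop_empty` are hypothetical stand-ins for illustration, not Data-Juicer APIs.

```python
from typing import Callable, Dict, Iterator, List

Sample = Dict[str, str]
Op = Callable[[List[Sample]], List[Sample]]


def batch_iterator(samples: List[Sample], batch_size: int) -> Iterator[List[Sample]]:
    """Yield consecutive slices of at most `batch_size` samples (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def process(batch: List[Sample], ops: List[Op]) -> List[Sample]:
    """Apply each operator in order to the given batch."""
    for op in ops:
        batch = op(batch)
    return batch


# Two toy per-sample operators standing in for OP1 and OP2.
def strip_text(batch: List[Sample]) -> List[Sample]:
    return [{**s, "text": s["text"].strip()} for s in batch]


def drop_empty(batch: List[Sample]) -> List[Sample]:
    return [s for s in batch if s["text"]]


def process_in_batches(samples: List[Sample], ops: List[Op], batch_size: int) -> List[Sample]:
    """Run the whole operator group on one batch before moving to the next."""
    result: List[Sample] = []
    for batch in batch_iterator(samples, batch_size):
        result.extend(process(batch, ops))
    return result


samples = [{"text": " a "}, {"text": "   "}, {"text": "b"}, {"text": ""}, {"text": " c"}]
ops = [strip_text, drop_empty]

batched = process_in_batches(samples, ops, batch_size=2)
whole = process(samples, ops)
```

For operators that act on each sample independently, as in this toy example, the batched and whole-dataset runs give the same result. Operators that need a global view of the data (deduplication, for instance) would not be equivalent under this scheme.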
To integrate this feature into the cfg.yaml configuration file, a special token such as `dj_batched_group_ops` could be introduced. This token would let users specify batch-processing parameters directly in the configuration, as illustrated below:
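The exact schema for this token is not specified in the text here. As a purely hypothetical sketch, the Python example below shows one way an executor might interpret such an entry once the configuration has been loaded into a dict. The `batch_size` and `ops` keys, the operator names, and the `OP_REGISTRY`, `run_entry`, and `run_recipe` helpers are all assumptions made for illustration, not part of Data-Juicer.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]


# Hypothetical toy operators; real operators would come from the library.
def lowercase_mapper(batch: List[Sample]) -> List[Sample]:
    return [{**s, "text": s["text"].lower()} for s in batch]


def min_length_filter(batch: List[Sample], min_len: int = 1) -> List[Sample]:
    return [s for s in batch if len(s["text"]) >= min_len]


OP_REGISTRY: Dict[str, Callable[..., List[Sample]]] = {
    "lowercase_mapper": lowercase_mapper,
    "min_length_filter": min_length_filter,
}

# Assumed shape of the loaded config: a list of single-key mappings,
# where the special token groups operators and carries a batch size.
config = {
    "process": [
        {
            "dj_batched_group_ops": {
                "batch_size": 2,
                "ops": [
                    {"lowercase_mapper": {}},
                    {"min_length_filter": {"min_len": 3}},
                ],
            }
        },
    ]
}


def run_entry(samples: List[Sample], entry: Dict[str, Any]) -> List[Sample]:
    """Run one config entry: either a plain operator or a batched group."""
    (name, args), = entry.items()  # each entry holds exactly one key
    args = args or {}
    if name == "dj_batched_group_ops":
        size = args["batch_size"]
        out: List[Sample] = []
        for start in range(0, len(samples), size):
            batch = samples[start:start + size]
            for op_entry in args["ops"]:
                batch = run_entry(batch, op_entry)
            out.extend(batch)
        return out
    return OP_REGISTRY[name](samples, **args)


def run_recipe(samples: List[Sample], cfg: Dict[str, Any]) -> List[Sample]:
    """Apply every entry of the `process` list in order."""
    for entry in cfg["process"]:
        samples = run_entry(samples, entry)
    return samples


data = [{"text": "Hello"}, {"text": "Hi"}, {"text": "DATA"}]
result = run_recipe(data, config)
```

Treating the token as just another entry in the operator list would let batched groups and ordinary whole-dataset operators be mixed in a single recipe, though other layouts are equally plausible.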
This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.
Use case
No response
Additional
No response
Are you willing to submit a PR for this feature?