modelscope / data-juicer

Making data higher-quality, juicier, and more digestible for any large models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0
2.99k stars 178 forks

[Bug]: KeyError: 'resource' #440

Open luckystar1992 opened 1 month ago

luckystar1992 commented 1 month ago

Before Reporting

Search before reporting

OS

MacOS15.0

Installation Method

source

Data-Juicer Version

0.2.0

Python Version

3.10

Describe the bug

With the latest code on the main branch, running an operator produces the following error:

Traceback (most recent call last):
  File "/Users/zyc/code/data-juicer/data_juicer/core/data.py", line 199, in process
    dataset, resource_util_per_op = Monitor.monitor_func(
                                    ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zyc/code/data-juicer/data_juicer/core/monitor.py", line 210, in monitor_func
    resource_util_dict['resource'] = mdict['resource']
                                     ~~~~~^^^^^^^^^^^^
  File "<string>", line 2, in __getitem__
  File "/Users/zyc/miniconda3/lib/python3.11/multiprocessing/managers.py", line 837, in _callmethod
    raise convert_to_error(kind, result)
KeyError: 'resource'

Earlier versions of the code did not raise this error. I tried several operators and they all fail the same way.
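The traceback shows that `mdict` is a `multiprocessing` manager dict proxy and that the `'resource'` key was missing when it was read. The following is a minimal sketch of that failure mode and a defensive read, not the actual Data-Juicer code: `read_resource` and `writer` are hypothetical names, and the assumption that the key is written by a separate process that may exit before writing it is only a guess at the cause.

```python
import multiprocessing as mp


def read_resource(mdict, key="resource"):
    """Return mdict[key], or an empty list if the key was never written.

    Works on a plain dict or on a multiprocessing manager dict proxy,
    which re-raises KeyError in the calling process.
    """
    try:
        return mdict[key]
    except KeyError:
        return []


def writer(mdict, succeed):
    # Hypothetical stand-in for a monitoring subprocess: it only writes
    # the 'resource' key when it finishes successfully.
    if succeed:
        mdict["resource"] = [{"cpu_util": 0.5}]


if __name__ == "__main__":
    with mp.Manager() as manager:
        mdict = manager.dict()
        # Simulate a subprocess that exits without writing the key.
        proc = mp.Process(target=writer, args=(mdict, False))
        proc.start()
        proc.join()
        # A direct mdict["resource"] here would raise KeyError: 'resource'.
        print(read_resource(mdict))
```

A fallback like this only hides the symptom. If the writing process is failing, its exit code and logs would still need to be checked to find the underlying cause.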

To Reproduce

python process_data.py --config ../configs/demo/process_demo.yaml

The config file is as follows:

# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: '/Users/zyc/code/data-juicer/demos/data/demo-dataset-chatml.jsonl'
np: 1  # number of subprocess to process your dataset
text_keys: ["messages"]
export_path: '/Users/zyc/code/data-juicer/outputs/demo-process/demo-processed_chatml.jsonl'
use_cache: false  # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: null  # cache dir for Hugging Face datasets. In default, it's the same as the environment variable HF_DATASETS_CACHE, whose default value is usually "~/.cache/huggingface/datasets". If this argument is set to a valid path by users, it will override the default cache dir
use_checkpoint: false  # whether to use the checkpoint management to save the latest version of dataset to work dir when processing. Rerun the same config will reload the checkpoint and skip ops before it. Cache will be disabled when using checkpoint. If args of ops before the checkpoint are changed, all ops will be rerun from the beginning.
temp_dir: null  # the path to the temp directory to store intermediate caches when cache is disabled, these cache files will be removed on-the-fly. In default, it's None, so the temp dir will be specified by system. NOTICE: you should be caution when setting this argument because it might cause unexpected program behaviors when this path is set to an unsafe directory.
open_tracer: true  # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: []  # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10  # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: false  # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: null  # the compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.

# for distributed processing
executor_type: default  # type of executor, support "default" or "ray" for now.
ray_address: auto  # the address of the Ray cluster.

# only for data analysis
save_stats_in_one_file: false  # whether to store all stats result into one file

# process schedule: a list of several process operators with their arguments
process:
  # Mapper ops. Most of these ops need no arguments.

Configs

No response

Logs

No response

Screenshots

![Uploading 截屏2024-09-29 下午5.22.36.png…]()

Additional

No response

HYLcool commented 1 month ago

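Hi @luckystar1992, thanks for using Data-Juicer and for your feedback!

We could not reproduce the problem on our side. Please pull the latest version of the code and try again. If you still run into a similar problem, feel free to continue the discussion with us.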

SnoopyXI commented 17 hours ago

Hi @luckystar1992, did you manage to resolve this? I am running into the same problem.