luckystar1992 commented 1 month ago

Before Reporting

[X] I have pulled the latest code of main branch to run again and the bug still existed.
[X] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template)

Search before reporting

[X] I have searched the Data-Juicer issues and found no similar bugs.

OS

macOS 15.0

Installation Method

source

Data-Juicer Version

0.2.0

Python Version

3.10

Describe the bug

With the latest code on the main branch, running an operator produces the following error:

Traceback (most recent call last):
  File "/Users/zyc/code/data-juicer/data_juicer/core/data.py", line 199, in process
    dataset, resource_util_per_op = Monitor.monitor_func(
                                    ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zyc/code/data-juicer/data_juicer/core/monitor.py", line 210, in monitor_func
    resource_util_dict['resource'] = mdict['resource']
                                     ~~~~~^^^^^^^^^^^^
  File "<string>", line 2, in __getitem__
  File "/Users/zyc/miniconda3/lib/python3.11/multiprocessing/managers.py", line 837, in _callmethod
    raise convert_to_error(kind, result)
KeyError: 'resource'

Earlier versions of the code did not raise this error. I tried several operators and they all fail the same way.
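For anyone trying to narrow this down: the traceback shows the `KeyError` coming from a `multiprocessing` manager dict proxy (`mdict['resource']`). One way this can happen in general is that the process expected to write the key exits before doing so. The following is a minimal sketch of that failure mode, not Data-Juicer's actual code; `sampler`, `read_resource`, `run`, and the stored values are hypothetical names chosen for illustration.

```python
import multiprocessing as mp


def sampler(mdict, fail):
    """Hypothetical stand-in for a monitoring worker (not Data-Juicer code)."""
    if fail:
        return  # exits without ever writing the 'resource' key
    mdict["resource"] = [{"cpu_percent": 12.5}]  # made-up sample value


def read_resource(mdict):
    """Read 'resource' defensively; works with a plain dict or a manager dict proxy."""
    return mdict.get("resource", [])


def run(fail):
    """Start the worker, then read the key back from the shared dict."""
    with mp.Manager() as manager:
        mdict = manager.dict()
        worker = mp.Process(target=sampler, args=(mdict, fail))
        worker.start()
        worker.join()
        try:
            # Same access pattern as the line shown in the traceback.
            return mdict["resource"]
        except KeyError:
            # The proxy re-raises the server-side KeyError in the parent process.
            return read_resource(mdict)


if __name__ == "__main__":
    print("worker wrote the key:", run(fail=False))
    print("worker never wrote the key:", run(fail=True))
```

If the real cause is similar, the underlying question is why the monitoring side never stored `'resource'` on this machine. A defensive `.get` would only hide the symptom.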

To Reproduce

python process_data.py --config ../configs/demo/process_demo.yaml

The config file is as follows:

`# Process config example for dataset
# global parameters
project_name: 'demo-process'
dataset_path: '/Users/zyc/code/data-juicer/demos/data/demo-dataset-chatml.jsonl'
np: 1  # number of subprocess to process your dataset
text_keys: ["messages"]
export_path: '/Users/zyc/code/data-juicer/outputs/demo-process/demo-processed_chatml.jsonl'
use_cache: false  # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: null  # cache dir for Hugging Face datasets. In default, it's the same as the environment variable HF_DATASETS_CACHE, whose default value is usually "~/.cache/huggingface/datasets". If this argument is set to a valid path by users, it will override the default cache dir
use_checkpoint: false  # whether to use the checkpoint management to save the latest version of dataset to work dir when processing. Rerun the same config will reload the checkpoint and skip ops before it. Cache will be disabled when using checkpoint. If args of ops before the checkpoint are changed, all ops will be rerun from the beginning.
temp_dir: null  # the path to the temp directory to store intermediate caches when cache is disabled, these cache files will be removed on-the-fly. In default, it's None, so the temp dir will be specified by system. NOTICE: you should be caution when setting this argument because it might cause unexpected program behaviors when this path is set to an unsafe directory.
open_tracer: true  # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: []  # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10  # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: false  # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: null  # the compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.
# for distributed processing
executor_type: default  # type of executor, support "default" or "ray" for now.
ray_address: auto  # the address of the Ray cluster.
# only for data analysis
save_stats_in_one_file: false  # whether to store all stats result into one file
# process schedule: a list of several process operators with their arguments
process:
  # Mapper ops. Most of these ops need no arguments.
  - generate_instruction_mapper:  # filter text with total token number out of specific range
      hf_model: '/Users/zyc/data/models/qwen/Qwen2-1___5B-Instruct'  # model name on huggingface to generate instruction.
      seed_file: '/Users/zyc/code/data-juicer/demos/data/demo-dataset-chatml.jsonl'  # Seed file as instruction samples to generate new instructions, chatml format.
      instruct_num: 3  # the number of generated samples.
      similarity_threshold: 0.7  # the similarity score threshold between the generated samples and the seed samples. Range from 0 to 1. Samples with similarity score less than this threshold will be kept.
      prompt_template: null  # Prompt template for generate samples. Please make sure the template contains "{augmented_data}", which corresponds to the augmented samples.
      qa_pair_template: null  # Prompt template for generate question and answer pair description. Please make sure the template contains two "{}" to format question and answer. Default: '【问题】\n{}\n【回答】\n{}\n'.
      example_template: null  # Prompt template for generate examples. Please make sure the template contains "{qa_pairs}", which corresponds to the question and answer pair description generated by param qa_pair_template.
      qa_extraction_pattern: null  # Regular expression pattern for parsing question and answer from model response.
      enable_vllm: false  # Whether to use vllm for inference acceleration.
      tensor_parallel_size: null  # It is only valid when enable_vllm is True. The number of GPUs to use for distributed execution with tensor parallelism.
      max_model_len: null  # It is only valid when enable_vllm is True. Model context length. If unspecified, will be automatically derived from the model config.
      max_num_seqs: 256  # It is only valid when enable_vllm is True. Maximum number of sequences to be processed in a single iteration.
      sampling_params: { "max_length": 1024 }`

Configs

No response

Logs

No response

Screenshots

![Uploading 截屏2024-09-29 下午5.22.36.png…]()

Additional

No response

HYLcool commented 1 month ago

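Hi @luckystar1992, thank you for using the project and for your feedback!

We were unable to reproduce the problem on our side. Please pull the latest version of the code and try again. If you still run into a similar problem, feel free to continue the discussion with us.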

SnoopyXI commented 17 hours ago

@luckystar1992 Hi, did you manage to solve this problem? I am running into the same issue!

modelscope / data-juicer

[Bug]: KeyError: 'resource' #440
