Before Reporting 报告之前
[X] I have pulled the latest code of main branch to run again and the bug still existed. 我已经拉取了主分支上最新的代码,重新运行之后,问题仍不能解决。
[X] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you can ask a question using the Question template) 我已经仔细阅读了 README 上的操作指引,并且在安装过程中没有错误发生。(否则,我们建议您使用Question模板向我们进行提问)
Search before reporting 先搜索,再报告
[X] I have searched the Data-Juicer issues and found no similar bugs. 我已经在 issue列表 中搜索但是没有发现类似的bug报告。
OS 系统
MacOS15.0
Installation Method 安装方式
source
Data-Juicer Version Data-Juicer版本
0.2.0
Python Version Python版本
3.10
Describe the bug 描述这个bug
Running an operator with the latest code on the main branch produces the following error:
```
Traceback (most recent call last):
File "/Users/zyc/code/data-juicer/data_juicer/core/data.py", line 199, in process
dataset, resource_util_per_op = Monitor.monitor_func(
^^^^^^^^^^^^^^^^^^^^^
File "/Users/zyc/code/data-juicer/data_juicer/core/monitor.py", line 210, in monitor_func
resource_util_dict['resource'] = mdict['resource']
~~~~~^^^^^^^^^^^^
File "<string>", line 2, in __getitem__
File "/Users/zyc/miniconda3/lib/python3.11/multiprocessing/managers.py", line 837, in _callmethod
raise convert_to_error(kind, result)
KeyError: 'resource'
```

Earlier versions of the code did not raise this error; I tried several operators and they all fail in the same way.
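For what it's worth, the failure shape is reproducible with plain multiprocessing: the parent reads a key from a Manager dict that a child was supposed to write, and if the child dies first, the read raises the same `KeyError` through the manager proxy. Below is my own minimal sketch of that mechanism, not Data-Juicer code; `monitor_worker` is a hypothetical stand-in for whatever `Monitor.monitor_func` spawns:

```python
import multiprocessing as mp


def monitor_worker(mdict):
    # Stand-in for the real monitoring subprocess: if anything raises
    # before the write below, the key is never created.
    raise RuntimeError('monitor died before writing its result')
    mdict['resource'] = []  # never reached


if __name__ == '__main__':
    with mp.Manager() as manager:
        mdict = manager.dict()
        p = mp.Process(target=monitor_worker, args=(mdict,))
        p.start()
        p.join()
        # The parent assumes the key exists, just like monitor.py line 210.
        print(mdict['resource'])  # raises KeyError: 'resource'
```

Since Python 3.8 the default start method on macOS is "spawn", so a child can fail during import or unpickling in ways that never occur with "fork" on Linux, which might be why this shows up on macOS specifically.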
To Reproduce 如何复现
python process_data.py --config ../configs/demo/process_demo.yaml
The config file is as follows:

```yaml
# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: '/Users/zyc/code/data-juicer/demos/data/demo-dataset-chatml.jsonl'
np: 1 # number of subprocess to process your dataset
text_keys: ["messages"]
export_path: '/Users/zyc/code/data-juicer/outputs/demo-process/demo-processed_chatml.jsonl'
use_cache: false # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: null # cache dir for Hugging Face datasets. By default, it's the same as the environment variable HF_DATASETS_CACHE, whose default value is usually "~/.cache/huggingface/datasets". If this argument is set to a valid path by users, it will override the default cache dir
use_checkpoint: false # whether to use the checkpoint management to save the latest version of dataset to work dir when processing. Rerunning the same config will reload the checkpoint and skip the ops before it. Cache will be disabled when using checkpoint. If args of ops before the checkpoint are changed, all ops will be rerun from the beginning.
temp_dir: null # the path to the temp directory to store intermediate caches when cache is disabled; these cache files will be removed on-the-fly. By default, it's None, so the temp dir will be specified by the system. NOTICE: you should be cautious when setting this argument because it might cause unexpected program behaviors when this path is set to an unsafe directory.
open_tracer: true # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: [] # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10 # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: false # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: null # the compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.
# for distributed processing
executor_type: default # type of executor, support "default" or "ray" for now.
ray_address: auto # the address of the Ray cluster.
# only for data analysis
save_stats_in_one_file: false # whether to store all stats result into one file
# process schedule: a list of several process operators with their arguments
process:
  # Mapper ops. Most of these ops need no arguments.
  - generate_instruction_mapper: # generate new instruction samples from the seed instruction samples
      hf_model: '/Users/zyc/data/models/qwen/Qwen2-1___5B-Instruct' # model name on huggingface to generate instructions.
      seed_file: '/Users/zyc/code/data-juicer/demos/data/demo-dataset-chatml.jsonl' # seed file of instruction samples used to generate new instructions, chatml format.
      instruct_num: 3 # the number of generated samples.
      similarity_threshold: 0.7 # the similarity score threshold between the generated samples and the seed samples. Range from 0 to 1. Samples with a similarity score less than this threshold will be kept.
      prompt_template: null # prompt template for generating samples. Please make sure the template contains "{augmented_data}", which corresponds to the augmented samples.
      qa_pair_template: null # prompt template for generating the question-answer pair description. Please make sure the template contains two "{}" to format question and answer. Default: '【问题】\n{}\n【回答】\n{}\n'.
      example_template: null # prompt template for generating examples. Please make sure the template contains "{qa_pairs}", which corresponds to the question-answer pair description generated by param qa_pair_template.
      qa_extraction_pattern: null # regular expression pattern for parsing question and answer from the model response.
      enable_vllm: false # whether to use vllm for inference acceleration.
      tensor_parallel_size: null # only valid when enable_vllm is True. The number of GPUs to use for distributed execution with tensor parallelism.
      max_model_len: null # only valid when enable_vllm is True. Model context length. If unspecified, will be automatically derived from the model config.
      max_num_seqs: 256 # only valid when enable_vllm is True. Maximum number of sequences to be processed in a single iteration.
      sampling_params: { "max_length": 1024 }
```
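In case it helps triage: reading the shared dict defensively would surface the child's real error instead of a bare `KeyError`. A minimal sketch of that idea, assuming only what the traceback shows about `monitor_func` (the helper name `read_monitor_result` is mine, not Data-Juicer's):

```python
def read_monitor_result(mdict):
    """Read the monitor subprocess's result from a shared Manager dict.

    mdict is a multiprocessing.Manager().dict() shared with the monitor
    subprocess; its DictProxy supports .get(), so a missing key can be
    turned into an actionable error instead of a bare KeyError.
    """
    resource = mdict.get('resource')
    if resource is None:
        raise RuntimeError(
            "monitor subprocess exited without writing 'resource'; "
            'check its stderr for the underlying exception')
    return resource
```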
Configs 配置信息
No response
Logs 报错日志
No response
Screenshots 截图
Screenshot: 截屏2024-09-29 下午5.22.36.png (upload did not complete)
Additional 额外信息
No response