modelscope / data-juicer

Making data higher-quality, juicier, and more digestible for foundation models! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
Apache License 2.0
3.03k stars, 181 forks

AttributeError: 'FusedFilter' object has no attribute '_name' #495

Open xunmenglt opened 5 days ago

xunmenglt commented 5 days ago

The configuration file is as follows:

project_name: 'code'
dataset_path: 'processed_starcode.jsonl'  # path to your dataset directory or file
export_path: 'dataset.jsonl'
text_keys: 'text'
export_in_parallel: false  # whether to export the result dataset in parallel to a single file, which usually takes less time. It only works when export_shard_size is 0, and its default number of processes is the same as the argument np. Notice: If it's True, sometimes exporting in parallel might require much more time due to the IO blocking, especially for very large datasets. When this happens, False is a better choice, although it takes more time.
np: 40  # number of subprocess to process your dataset
text_keys: 'text'  # the key name of field where the sample texts to be processed, e.g., text, instruction, output, ...
# Note: currently, we support specify only ONE key for each op, for cases requiring multiple keys, users can specify the op multiple times. We will only use the first key of text_keys when you set multiple keys.
suffixes: []  # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: false  # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: /opt/data/private/liuteng/dataset/dj_cache  # cache dir for Hugging Face datasets. In default, it's the same as the environment variable HF_DATASETS_CACHE, whose default value is usually "~/.cache/huggingface/datasets". If this argument is set to a valid path by users, it will override the default cache dir
use_checkpoint: false  # whether to use the checkpoint management to save the latest version of dataset to work dir when processing. Rerun the same config will reload the checkpoint and skip ops before it. Cache will be disabled when using checkpoint. If args of ops before the checkpoint are changed, all ops will be rerun from the beginning.
temp_dir: /opt/data/private/liuteng/dataset/dj_cache
open_tracer: true  # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: []  # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10  # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: true  # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: zstd  # the compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.
save_stats_in_one_file: true  # whether to store all stats result into one file

process:

Error:

AttributeError: 'FusedFilter' object has no attribute '_name'


HYLcool commented 4 days ago

Hi @xunmenglt , thanks for your report!

Sorry for the trouble. We found this issue earlier and have already fixed it in PR #464, which is now merged into the main branch. Please pull the latest code from the main branch and try again.
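For readers unfamiliar with this kind of error, here is a minimal Python sketch of one common way an "object has no attribute '_name'" failure can arise: a wrapper class skips the base-class initializer that would have set the attribute. The class names (`SketchFilter`, `SketchFusedFilter`, `PatchedSketchFusedFilter`) and the helper `describe` are hypothetical and are not taken from Data-Juicer. The thread does not say what the actual root cause was or what PR #464 changed, so this only illustrates the general pattern.

```python
class SketchFilter:
    """Hypothetical base op: stores its name in a private attribute."""

    def __init__(self, name):
        self._name = name


class SketchFusedFilter(SketchFilter):
    """Hypothetical fused wrapper that never initializes the base class."""

    def __init__(self, fused_filters):
        # Assumed bug for illustration: super().__init__() is not called,
        # so self._name is never created on this instance.
        self.fused_filters = fused_filters


class PatchedSketchFusedFilter(SketchFilter):
    """Same wrapper, but it sets a name through the base initializer."""

    def __init__(self, fused_filters):
        # Build a combined name from the wrapped ops (an illustrative choice).
        super().__init__("+".join(f._name for f in fused_filters))
        self.fused_filters = fused_filters


def describe(op):
    """Return the op name, or the error text if the attribute is missing."""
    try:
        return op._name
    except AttributeError as err:
        return f"AttributeError: {err}"


if __name__ == "__main__":
    ops = [SketchFilter("a"), SketchFilter("b")]
    print(describe(SketchFusedFilter(ops)))         # reports the missing attribute
    print(describe(PatchedSketchFusedFilter(ops)))  # a+b
```

In this sketch the failure only shows up when the fused wrapper is used. That fits the report, where `op_fusion: true` is set and the error names `FusedFilter`, but the link to the real cause is an assumption.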