xunmenglt opened this issue 5 days ago
Hi @xunmenglt, thanks for your report!
Sorry for this problem. We found this issue before and have already fixed it in PR #464, which has now been merged into the main branch. Please pull the latest code from the main branch and try again.
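If it helps, here is a minimal sketch of picking up the fix, assuming data-juicer was installed from a local clone in editable mode (the clone path and install command are assumptions about your setup):

```bash
cd data-juicer            # local clone of the repository (path is an assumption)
git checkout main
git pull origin main      # fetch the branch that contains the merged fix
pip install -v -e .       # reinstall the editable package so the updated code takes effect
```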
The config file is as follows:
```yaml
project_name: 'code'
dataset_path: 'processed_starcode.jsonl'  # path to your dataset directory or file
export_path: 'dataset.jsonl'
export_in_parallel: false  # whether to export the result dataset in parallel to a single file, which usually takes less time. It only works when export_shard_size is 0, and its default number of processes is the same as the argument np. Notice: If it's True, sometimes exporting in parallel might require much more time due to the IO blocking, especially for very large datasets. When this happens, False is a better choice, although it takes more time.
np: 40  # number of subprocesses to process your dataset
text_keys: 'text'  # the key name of the field where the sample texts to be processed are stored, e.g., 'text', 'instruction', 'output', ... Note: currently, we support specifying only ONE key for each op; for cases requiring multiple keys, users can specify the op multiple times. We will only use the first key of text_keys when you set multiple keys.
suffixes: []  # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: false  # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: /opt/data/private/liuteng/dataset/dj_cache  # cache dir for Hugging Face datasets. By default, it's the same as the environment variable HF_DATASETS_CACHE, whose default value is usually "~/.cache/huggingface/datasets". If this argument is set to a valid path by users, it will override the default cache dir
use_checkpoint: false  # whether to use the checkpoint management to save the latest version of the dataset to the work dir when processing. Rerunning the same config will reload the checkpoint and skip ops before it. Cache will be disabled when using checkpoint. If args of ops before the checkpoint are changed, all ops will be rerun from the beginning.
temp_dir: /opt/data/private/liuteng/dataset/dj_cache
open_tracer: true  # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: []  # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10  # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: true  # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: zstd  # the compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.
save_stats_in_one_file: true  # whether to store all stats results into one file

process:
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - punctuation_normalization_mapper:
  - whitespace_normalization_mapper:
  - clean_copyright_mapper:

  - alphanumeric_filter:  # 18766
      tokenization: false
      min_ratio: 0.2  # < 3sigma (0.3791)
      max_ratio: 0.9163  # 3sigma
  - alphanumeric_filter:  # 146432
      tokenization: true
      min_ratio: 0.546  # 3sigma
      max_ratio: 3.65  # 3sigma
  - average_line_length_filter:  # for code
      min_len: 10  # > 3sigma (0) -- 48790
      max_len: 150  # < 3sigma (15603) -- 233275
  - character_repetition_filter:
      max_ratio: 0.36  # 3sigma -- 346875
  - maximum_line_length_filter:  # for code
      max_len: 1000  # remove 256670 samples
  - text_length_filter:
      max_len: 96714  # 3sigma -- 190006
  - words_num_filter:
      min_num: 20  # remove 1504958 samples
      max_num: 6640  # 3sigma -- remove 179847 samples
  - word_repetition_filter:
      rep_len: 10
      max_ratio: 0.357  # 3sigma -- 598462
  - document_simhash_deduplicator:
      tokenization: space
      window_size: 6
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 6
      hamming_distance: 4
```
The error is:

```
AttributeError: 'FusedFilter' object has no attribute '_name'
```
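If updating to the patched main branch is not immediately possible, a likely stopgap (an assumption based on the error coming from the fused-filter code path, not a confirmed fix) is to turn off operator fusion in the config so that no FusedFilter object is created:

```yaml
op_fusion: false  # assumption: skips the FusedFilter path that raises the AttributeError, at the cost of the fusion speedup
```

The filters then run individually, so the processed results should be the same, only without the shared intermediate-variable optimization.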