xunmenglt opened this issue 5 days ago
Hi @xunmenglt, thanks for your report!
Sorry for this problem. We found this issue before and have already fixed it in PR #464, which has now been merged into the main branch. Please pull the latest code from the main branch and try again.
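If it helps, here is a minimal sketch of picking up the fix, assuming data-juicer was installed from a local clone in editable mode (the clone path and install command are assumptions about your setup):

```bash
cd data-juicer            # local clone of the repository (path is an assumption)
git checkout main
git pull origin main      # fetch the branch that contains the merged fix
pip install -v -e .       # reinstall the editable package so the updated code takes effect
```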
The config file is as follows:
```yaml
project_name: 'code'
dataset_path: 'processed_starcode.jsonl'  # path to your dataset directory or file
export_path: 'dataset.jsonl'
export_in_parallel: false  # whether to export the result dataset in parallel to a single file, which usually takes less time. It only works when export_shard_size is 0, and its default number of processes is the same as the argument np. Notice: If it's True, sometimes exporting in parallel might require much more time due to the IO blocking, especially for very large datasets. When this happens, False is a better choice, although it takes more time.
np: 40  # number of subprocesses to process your dataset
text_keys: 'text'  # the key name of the field where the sample texts to be processed are stored, e.g., 'text', 'instruction', 'output', ... Note: currently, we support specifying only ONE key for each op; for cases requiring multiple keys, users can specify the op multiple times. We will only use the first key of text_keys when you set multiple keys.
suffixes: []  # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: false  # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
ds_cache_dir: /opt/data/private/liuteng/dataset/dj_cache  # cache dir for Hugging Face datasets. By default, it's the same as the environment variable HF_DATASETS_CACHE, whose default value is usually "~/.cache/huggingface/datasets". If this argument is set to a valid path by users, it will override the default cache dir
use_checkpoint: false  # whether to use the checkpoint management to save the latest version of the dataset to the work dir when processing. Rerunning the same config will reload the checkpoint and skip ops before it. Cache will be disabled when using checkpoint. If args of ops before the checkpoint are changed, all ops will be rerun from the beginning.
temp_dir: /opt/data/private/liuteng/dataset/dj_cache
open_tracer: true  # whether to open the tracer to trace the changes during process. It might take more time when opening tracer
op_list_to_trace: []  # only ops in this list will be traced by tracer. If it's empty, all ops will be traced. Only available when tracer is opened.
trace_num: 10  # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: true  # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: zstd  # the compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.
save_stats_in_one_file: true  # whether to store all stats results into one file

process:
  - clean_email_mapper:
  - clean_links_mapper:
  - fix_unicode_mapper:
  - punctuation_normalization_mapper:
  - whitespace_normalization_mapper:
  - clean_copyright_mapper:

  - alphanumeric_filter:  # 18766
      tokenization: false
      min_ratio: 0.2  # < 3sigma (0.3791)
      max_ratio: 0.9163  # 3sigma
  - alphanumeric_filter:  # 146432
      tokenization: true
      min_ratio: 0.546  # 3sigma
      max_ratio: 3.65  # 3sigma
  - average_line_length_filter:  # for code
      min_len: 10  # > 3sigma (0) -- 48790
      max_len: 150  # < 3sigma (15603) -- 233275
  - character_repetition_filter:
      max_ratio: 0.36  # 3sigma -- 346875
  - maximum_line_length_filter:  # for code
      max_len: 1000  # remove 256670 samples
  - text_length_filter:
      max_len: 96714  # 3sigma -- 190006
  - words_num_filter:
      min_num: 20  # remove 1504958 samples
      max_num: 6640  # 3sigma -- remove 179847 samples
  - word_repetition_filter:
      rep_len: 10
      max_ratio: 0.357  # 3sigma -- 598462
  - document_simhash_deduplicator:
      tokenization: space
      window_size: 6
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 6
      hamming_distance: 4
```
The error is:

```
AttributeError: 'FusedFilter' object has no attribute '_name'
```
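If updating to the patched main branch is not immediately possible, a likely stopgap (an assumption based on the error coming from the fused-filter code path, not a confirmed fix) is to turn off operator fusion in the config so that no FusedFilter object is created:

```yaml
op_fusion: false  # assumption: skips the FusedFilter path that raises the AttributeError, at the cost of the fusion speedup
```

The filters then run individually, so the processed results should be the same, only without the shared intermediate-variable optimization.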