Closed minxiansheng closed 3 months ago
Aren't the input_ids supposed to be concatenated into one continuous sequence? Why do mine look like this after processing, with a line break after each id — is this correct or wrong? {'input_ids': [104922, 71137, 115453, 198, 16, 220, 90476, 119, 46448, 198, 16, 13, 16, 84238, 244, 43316, 100466, 198, 16, 13, 17, 84238, 244, 43316, 104282, 198, 16, 13, 18, 220, 105255, 101121, 198,
This was produced by:

```python
tokenized_datasets = raw_datasets.map(  # tokenize first
    tokenize_wo_pad_function,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
    desc="Running tokenizer on dataset",
)
lm_datasets = tokenized_datasets.map(  # then group
    group_text_function,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
    desc=f"Grouping texts in chunks of {block_size}",
)
```
That's just how it's displayed; the input_ids are still a single concatenated list.
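The body of `group_text_function` isn't shown in the issue, but the two-step `map` above matches the standard pattern from HF's `run_clm.py`: tokenize every example, concatenate all ids into one stream, then re-split into `block_size` chunks. A minimal sketch of that grouping logic (the `block_size` value and the batch data here are illustrative, not from the issue):

```python
from itertools import chain

block_size = 8  # tiny value for demonstration; real runs typically use 1024+

def group_text_function(examples):
    """Concatenate each field across the batch, then chop into block_size chunks."""
    # Flatten every column (input_ids, attention_mask, ...) into one long list.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the trailing remainder that doesn't fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Causal LM training uses the inputs themselves as labels.
    result["labels"] = result["input_ids"].copy()
    return result

# Two short examples (11 tokens total) become one block of 8; the last 3 are dropped.
batch = {"input_ids": [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11]]}
out = group_text_function(batch)
print(out["input_ids"])  # [[1, 2, 3, 4, 5, 6, 7, 8]]
```

Note that a token id such as 198 in the printed output is simply the id the tokenizer assigns to a newline character in the original text — the list itself is one flat, concatenated sequence.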