shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical LLMs, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0

Incremental pretraining: is this input_ids format wrong? Please take a look #389

Closed minxiansheng closed 3 months ago

minxiansheng commented 3 months ago

Shouldn't the input_ids be concatenated together? Why does mine look like this after processing, with a newline after each group of ids? Is this right or wrong?

```
{'input_ids': [104922, 71137, 115453, 198, 16, 220, 90476, 119, 46448, 198, 16, 13, 16, 84238, 244, 43316, 100466, 198, 16, 13, 17, 84238, 244, 43316, 104282, 198, 16, 13, 18, 220, 105255, 101121, 198,
```

It was produced by:

```python
tokenized_datasets = raw_datasets.map(  # tokenize first
    tokenize_wo_pad_function,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    remove_columns=column_names,
    load_from_cache_file=not data_args.overwrite_cache,
    desc="Running tokenizer on dataset",
)
lm_datasets = tokenized_datasets.map(  # then group
    group_text_function,
    batched=True,
    num_proc=data_args.preprocessing_num_workers,
    load_from_cache_file=not data_args.overwrite_cache,
    desc=f"Grouping texts in chunks of {block_size}",
)
```
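For context, the grouping step in this kind of pipeline concatenates every tokenized sequence in a batch and re-splits the result into fixed-size blocks, so the ids really are contiguous; the recurring 198 is most likely just the tokenizer's id for the newline character `'\n'` that appears in the raw text. A minimal sketch of such a grouping function, modeled on Hugging Face's `group_texts` from `run_clm.py` (the tiny `block_size` and toy ids here are illustrative, not the repo's actual values):

```python
from itertools import chain

block_size = 8  # real runs use something like 1024 or 2048

def group_text_function(examples):
    # Concatenate all sequences in the batch into one long stream,
    # then cut it into fixed-size blocks, dropping the tail remainder.
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = len(concatenated["input_ids"])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM pretraining, labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

batch = {"input_ids": [[104922, 71137, 115453, 198],
                       [16, 220, 90476, 119, 46448, 198]]}
grouped = group_text_function(batch)
print(grouped["input_ids"])  # one contiguous 8-id block; the two inputs were joined
```

The 198 tokens survive inside the blocks because they are ordinary vocabulary items, not separators inserted by the grouping.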

shibing624 commented 3 months ago

That's just how it's displayed.
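In other words, the line breaks the questioner saw are an artifact of how the list is printed or viewed, not separators stored inside `input_ids`. A quick sanity check, assuming the grouped ids are a plain Python list of ints:

```python
# A flat list of ints contains no newlines in its textual form;
# any wrapping you see comes from the console or viewer, not the data.
ids = [104922, 71137, 115453, 198, 16, 220, 90476, 119, 46448, 198]
assert "\n" not in repr(ids)
print(repr(ids))  # a single line of comma-separated ids
```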