shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains medical large language models, implementing continued pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0
2.94k stars 452 forks

Running `python supervised_finetuning.py` raises KeyError: 'conversation' #294

Closed ospreyclaw closed 5 months ago

ospreyclaw commented 6 months ago

```shell
python supervised_finetuning.py \
    --model_type llama \
    --model_name_or_path ./merged-pt \
    --train_file_dir ./data/finetune \
    --validation_file_dir ./data/finetune \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --fp16 \
    --max_train_samples 1000 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --output_dir outputs-sft-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True
```

```
Running tokenizer on train dataset:   0%|          | 0/1000 [00:00<?, ? examples/s]
Traceback (most recent call last):
  File "/MedicalGPT/supervised_finetuning.py", line 1383, in <module>
    main()
  File "/MedicalGPT/supervised_finetuning.py", line 1094, in main
    train_dataset = train_dataset.shuffle().map(
  File "/anaconda3/envs/medicalgpt/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/anaconda3/envs/medicalgpt/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/anaconda3/envs/medicalgpt/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3093, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/anaconda3/envs/medicalgpt/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3470, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/anaconda3/envs/medicalgpt/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3349, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/MedicalGPT/supervised_finetuning.py", line 1043, in preprocess_function
    for dialog in get_dialog(examples):
  File "/MedicalGPT/supervised_finetuning.py", line 1020, in get_dialog
    for i, source in enumerate(examples['conversation']):
  File "/anaconda3/envs/medicalgpt/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 270, in __getitem__
    value = self.data[key]
KeyError: 'conversation'
```

How can this be fixed?
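The KeyError means the records loaded from `--train_file_dir` have no top-level `conversation` field, which `get_dialog()` iterates over. A quick diagnostic sketch (not part of the repo) for inspecting which keys your JSONL training files actually contain:

```python
import glob
import json


def jsonl_keys(path):
    """Return the sorted top-level keys of the first record in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return sorted(json.loads(f.readline()).keys())


# Print the keys of each training file to confirm whether
# a 'conversation' field is present.
for path in glob.glob("./data/finetune/*.jsonl"):
    print(path, "->", jsonl_keys(path))
```

If the printed keys are something like `['input', 'instruction', 'output']` instead of `['conversation']`, the data needs to be converted before running SFT.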

LanShanPi commented 6 months ago

me too!!!

shibing624 commented 6 months ago

https://github.com/shibing624/MedicalGPT/blob/main/convert_dataset.py
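The linked `convert_dataset.py` is the authoritative converter: it rewrites common instruction-tuning data into the conversation layout that `supervised_finetuning.py` indexes as `examples['conversation']`. As a rough illustration only, a minimal sketch of such a conversion, assuming alpaca-style records with `instruction`/`input`/`output` fields (the per-turn `human`/`assistant` keys below are assumptions; check the repo script for the exact layout):

```python
import json


def to_conversation(record):
    """Wrap one alpaca-style record (assumed keys: instruction, input,
    output) in a single-turn 'conversation' list, the top-level key
    whose absence causes the KeyError above."""
    human = record["instruction"]
    if record.get("input"):
        human += "\n" + record["input"]
    return {"conversation": [{"human": human, "assistant": record["output"]}]}


if __name__ == "__main__":
    sample = {"instruction": "Hi", "input": "", "output": "Hello"}
    print(json.dumps(to_conversation(sample), ensure_ascii=False))
```

In practice, run the repo's `convert_dataset.py` on your dataset and point `--train_file_dir` at its output rather than rolling your own conversion.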

LanShanPi commented 6 months ago

Solved, thanks!