shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains a medical large language model, implementing continued pre-training (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.
Apache License 2.0
3.24k stars · 492 forks

Error when running inference with the model after SFT, could someone please take a look? #267

Closed SoYuCry closed 10 months ago

SoYuCry commented 10 months ago

Describe the bug

Please provide a clear and concise description of what the bug is. If applicable, add screenshots to help explain your problem, especially for visualization related problems.

The arguments I used for SFT:

```shell
set CUDA_VISIBLE_DEVICES=1 && /lustre/home/acct-phyyjl/phyyjl-xzhr/.conda/envs/LLM/bin/python supervised_finetuning.py \
    --model_type llama \
    --model_name_or_path /lustre/home/acct-phyyjl/phyyjl-xzhr/Desktop/models_hf_LLAMA/7B-chat \
    --train_file_dir ./data/finetune/train \
    --validation_file_dir ./data/finetune/test \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --fp16 \
    --max_train_samples -1 \
    --max_eval_samples -1 \
    --num_train_epochs 3 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 500 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 300 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 4 \
    --output_dir /lustre/home/acct-phyyjl/phyyjl-xzhr/Desktop/models_hf_LLAMA/7B-chat-SFT \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache
```

After merging the LoRA weights, running inference with inference.py throws an error.

[screenshot: error traceback from inference.py]
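For context, the merge step folds the LoRA adapter produced by SFT back into the base model so inference.py can load a standalone checkpoint. A minimal sketch of that flow with the Hugging Face peft API (the paths are placeholders, and the repo's own merge script may differ in its exact arguments):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_path = "models_hf_LLAMA/7B-chat"           # placeholder: base Llama weights
lora_path = "models_hf_LLAMA/7B-chat-SFT"       # placeholder: LoRA adapter from SFT
out_path = "models_hf_LLAMA/7B-chat-SFT-merged" # placeholder: merged checkpoint

# Load the base model in bfloat16 to match the checkpoint's native precision.
base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_path)

# Attach the LoRA adapter, then fold its weights into the base model.
model = PeftModel.from_pretrained(base, lora_path)
model = model.merge_and_unload()

# Save a standalone checkpoint for inference.
model.save_pretrained(out_path)
tokenizer.save_pretrained(out_path)
```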

(PS: with the SFT model trained on top of the Pretrain (PT) checkpoint, I did not get this error, and the loss was normal too.)

At the same time, the loss dropped to 0 during training: [screenshot: training log showing loss = 0]
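For reference, since the command above mixes --fp16 with --torch_dtype bfloat16, a quick sanity check of the dtype a checkpoint actually loads in looks like this (the path is a placeholder, point it at whichever checkpoint is being debugged):

```python
from transformers import AutoModelForCausalLM

# Placeholder path; torch_dtype="auto" loads the weights in their saved precision.
model = AutoModelForCausalLM.from_pretrained(
    "models_hf_LLAMA/7B-chat", torch_dtype="auto"
)
print(next(model.parameters()).dtype)  # e.g. torch.bfloat16 or torch.float16
```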

shibing624 commented 10 months ago

Train with bf16.
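If I read that right, the fix is to pass --bf16 instead of --fp16 so the mixed-precision setting matches the bfloat16 base weights. A minimal sketch of what that amounts to at the Hugging Face TrainingArguments level (not the repo's actual code, just the standard flags the script appears to forward; other arguments stay as in the original command):

```python
from transformers import TrainingArguments

# bf16=True replaces fp16=True so mixed-precision training matches the
# bfloat16 weights; the remaining arguments mirror the original command.
args = TrainingArguments(
    output_dir="7B-chat-SFT",        # placeholder output directory
    bf16=True,                       # was: fp16=True
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_ratio=0.05,
    weight_decay=0.05,
    gradient_checkpointing=True,
)
```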