shibing624 / MedicalGPT

MedicalGPT: Training Your Own Medical GPT Model with ChatGPT Training Pipeline. Trains a medical large language model, implementing incremental pretraining (PT), supervised fine-tuning (SFT), RLHF, DPO, and ORPO.

Inference fails after full-parameter SFT #331

Closed: nuoma closed this issue 4 months ago

nuoma commented 4 months ago

After full-parameter SFT based on the yi-6B model, the inference output is empty. transformers version is 4.37.2.

Error message:
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at *path* and are newly initialized:
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
LlamaTokenizerFast(name_or_path='/data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240208_yi6B_tuluv2', vocab_size=64000, model_max_length=4096, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<|startoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<unk>', 'pad_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
        0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        64000: AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        64001: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
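In Transformers, this "newly initialized" warning generally means from_pretrained() could not find all of the expected weights in the directory and fell back to random initialization for the missing tensors, which would also explain the empty generations. A minimal sketch to confirm what actually landed in the output directory (paths are copied from this issue; the check itself is my own, not part of MedicalGPT):

```python
import os

# Output directory from the training command below.
ckpt_dir = "/data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240208_yi6B_tuluv2"

# Transformers looks for consolidated weights (pytorch_model.bin / model.safetensors)
# or a sharded-weights index (*.index.json) in this directory.
weight_files = sorted(
    f for f in os.listdir(ckpt_dir)
    if f.endswith((".bin", ".safetensors", ".index.json"))
)
print("weight files:", weight_files or "NONE")

# If this prints NONE (or only an incomplete set of shards), the fine-tuned
# weights were never fully written to disk, which matches the warning above.
```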
Training command:
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 ../supervised_finetuning.py \
    --model_type auto \
    --model_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
    --tokenizer_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
    --train_file_dir ../data/finetune/tuluv2/ \
    --per_device_train_batch_size 2 \
    --do_train \
    --max_train_samples -1 \
    --num_train_epochs 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --bf16 \
    --use_peft False \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy epoch \
    --save_total_limit 5 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 8 \
    --output_dir ../outputs/20240208_yi6B_tuluv2 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --torch_dtype bfloat16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache \
    --model_max_length 4096 \
    --deepspeed ../deepspeed_zero_stage2_config_no16.json \
    --template_name yi
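One thing worth checking, as an assumption on my side rather than something confirmed in this thread: the run uses DeepSpeed (ZeRO stage 2). If a checkpoint directory contains only the DeepSpeed engine checkpoint (a global_step*/ folder of *_states.pt shards) and no consolidated pytorch_model.bin / model.safetensors, DeepSpeed's zero_to_fp32 utilities can rebuild full fp32 weights from it. A hedged sketch, with a placeholder checkpoint path to fill in:

```python
# Sketch, assuming a DeepSpeed ZeRO checkpoint (global_step*/ folder) exists
# inside one of the Trainer checkpoint directories of this run.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

trainer_ckpt = "../outputs/20240208_yi6B_tuluv2/checkpoint-XXX"  # placeholder, not a path from this issue

state_dict = get_fp32_state_dict_from_zero_checkpoint(trainer_ckpt)
torch.save(state_dict, f"{trainer_ckpt}/pytorch_model.bin")
print("saved", len(state_dict), "tensors")
```

DeepSpeed also drops a standalone zero_to_fp32.py script into such checkpoint folders that does the same conversion from the command line.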
Inference command:
CUDA_VISIBLE_DEVICES=0 python inference.py --model_type auto --base_model /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240208_yi6B_tuluv2 --tokenizer_path /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240208_yi6B_tuluv2 --template_name yi --interactive --gpus 0

[screenshot: interactive inference returns an empty response]
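As a further sanity check (my own suggestion, not something from this thread), loading the fine-tuned directory next to the base Yi-6B weights and diffing a single parameter tensor shows whether the output directory holds trained weights, an untouched copy of the base model, or freshly initialized values:

```python
import torch
from transformers import AutoModelForCausalLM

base_dir = "/data/llm/models/Pretrained/yi-6B/01ai/Yi-6B"
sft_dir = "/data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240208_yi6B_tuluv2"

# Load on CPU; this is only a weight comparison, not generation.
base = AutoModelForCausalLM.from_pretrained(base_dir, torch_dtype=torch.bfloat16)
sft = AutoModelForCausalLM.from_pretrained(sft_dir, torch_dtype=torch.bfloat16)

# Pick a layer weight that is unaffected by any embedding resize.
name = "model.layers.0.self_attn.q_proj.weight"
diff = (dict(base.named_parameters())[name].float()
        - dict(sft.named_parameters())[name].float()).abs().max().item()
print(f"max |diff| for {name}: {diff}")
# ~0 means the SFT directory still holds the base weights; a tensor listed in
# the "newly initialized" warning would instead contain random values.
```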

nuoma commented 4 months ago

Let me take a look myself first. Found the problem: the weights were not saved.

xd-Nanan commented 3 months ago

> Let me take a look myself first. Found the problem: the weights were not saved.

Hello, after full-parameter SFT I also found that the weights were not saved. Have you managed to solve this?

Yian320 commented 3 months ago

> Let me take a look myself first. Found the problem: the weights were not saved.
>
> Hello, after full-parameter SFT I also found that the weights were not saved. Have you managed to solve this?

Hello, has this problem been solved?