lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

Fine tuning the flan-t5-large model gives garbage output #1983


fsleeman commented 1 year ago

I was able to run the fine-tuning script for the flan-t5-large model on a V100 and save the results without issues. Training was done with the example dummy conversation file (see the format sketch after the command) using this command:

    # launch line was missing from the paste; presumably FastChat's T5 trainer,
    # run via torchrun on the single V100:
    torchrun --nproc_per_node=1 fastchat/train/train_flant5.py \
    --model_name_or_path google/flan-t5-large \
    --data_path data/dummy_conversation.json \
    --bf16 False \
    --output_dir ./checkpoints_flant5_large_dummy \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap T5Block \
    --tf32 False \
    --model_max_length 2048 \
    --report_to none  \
    --preprocessed_path FastChat/large_dummy.json \
    --gradient_checkpointing True
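
For reference, entries in data/dummy_conversation.json follow FastChat's conversation schema, roughly like this (abridged sketch; the values here are illustrative):

    [
      {
        "id": "identity_0",
        "conversations": [
          {"from": "human", "value": "Who are you?"},
          {"from": "gpt", "value": "I am a language model trained by researchers."}
        ]
      }
    ]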

Then I loaded the new model with:

    python3 -m fastchat.serve.cli --model-path checkpoints_flant5_large_dummy/

and got this when I tried to interact:

    Human: what is your name?
    Assistant: Yes,mètres I’mbling amédias languageuniversal APIlungul languagechemical modelwählt....

I was able to verify reasonable responses when using the original google/flan-t5-large model, so the environment is likely OK. I am probably running this incorrectly but cannot find any further documentation; there are some answered questions about Vicuna, but not much for Flan-T5. Does anyone know what might be wrong? Either way, it might be good to have a little more documentation about the end-to-end fine-tuning process for the Flan-T5 models.
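
A quick way to narrow this down (a sketch, not something from the thread; the checkpoint path is the one above and the prompt is illustrative) would be to load the checkpoint directly with Hugging Face transformers, bypassing fastchat.serve.cli:

    # Sketch: load the fine-tuned checkpoint directly with transformers to see
    # whether the weights themselves produce garbage or the serving path does.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    path = "checkpoints_flant5_large_dummy/"
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForSeq2SeqLM.from_pretrained(path)

    inputs = tokenizer("what is your name?", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

If this also produces gibberish, the checkpoint itself is the problem (e.g., under-training) rather than the CLI serving path.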

DachengLi1 commented 1 year ago

Looks fine to me. Give it 3 epochs?
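
That would mean rerunning the training command above with the epoch flag raised, presumably:

    --num_train_epochs 3 \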