lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

Fine tuning the flan-t5-large model gives garbage output #1983


fsleeman commented 1 year ago

I was able to run the fine-tuning script for the flan-t5-large model on a V100 and save the results without issues. Training was done with the example dummy conversation file (see the format sketch after the command) using this command:

    # launch line was missing from the paste; presumably FastChat's T5 trainer,
    # run via torchrun on the single V100:
    torchrun --nproc_per_node=1 fastchat/train/train_flant5.py \
    --model_name_or_path google/flan-t5-large \
    --data_path data/dummy_conversation.json \
    --bf16 False \
    --output_dir ./checkpoints_flant5_large_dummy \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 300 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap T5Block \
    --tf32 False \
    --model_max_length 2048 \
    --report_to none  \
    --preprocessed_path FastChat/large_dummy.json \
    --gradient_checkpointing True
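
For reference, entries in data/dummy_conversation.json follow FastChat's conversation schema, roughly like this (abridged sketch; the values here are illustrative):

    [
      {
        "id": "identity_0",
        "conversations": [
          {"from": "human", "value": "Who are you?"},
          {"from": "gpt", "value": "I am a language model trained by researchers."}
        ]
      }
    ]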

Then I loaded the new model with:

    python3 -m fastchat.serve.cli --model-path checkpoints_flant5_large_dummy/

and got this when I tried to interact:

    Human: what is your name?
    Assistant: Yes,mètres I’mbling amédias languageuniversal APIlungul languagechemical modelwählt....

I was able to verify reasonable responses when using the original google/flan-t5-large model, so the environment is likely OK. I am probably running this incorrectly but cannot find any further documentation; there are some answered questions about Vicuna, but not much for Flan-T5. Does anyone know what might be wrong? Either way, it might be good to have a little more documentation about the end-to-end fine-tuning process for the Flan-T5 models.
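
A quick way to narrow this down (a sketch, not something from the thread; the checkpoint path is the one above and the prompt is illustrative) would be to load the checkpoint directly with Hugging Face transformers, bypassing fastchat.serve.cli:

    # Sketch: load the fine-tuned checkpoint directly with transformers to see
    # whether the weights themselves produce garbage or the serving path does.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    path = "checkpoints_flant5_large_dummy/"
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForSeq2SeqLM.from_pretrained(path)

    inputs = tokenizer("what is your name?", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

If this also produces gibberish, the checkpoint itself is the problem (e.g., under-training) rather than the CLI serving path.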

DachengLi1 commented 1 year ago

Looks fine to me. Give it 3 epochs?
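
That would mean rerunning the training command above with the epoch flag raised, presumably:

    --num_train_epochs 3 \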