meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default and custom datasets for applications such as summarization and Q&A, plus a number of inference solutions such as HF TGI and vLLM for local or cloud deployment, and demo apps showcasing Meta Llama for WhatsApp & Messenger.

Llama 3.2-11B-vision fully fine-tuned model file question #727

Open · Kidand opened this issue 1 month ago

Kidand commented 1 month ago

LoRA fine-tuning worked without problems, but the following issue came up during full fine-tuning.

I use the following script for full fine-tuning:

#!/bin/bash
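
# Full fine-tuning (no PEFT) of Llama 3.2 11B Vision with FSDP
# on a single node with 4 GPUs.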

NNODES=1
NPROC_PER_NODE=4
LR=1e-5
NUM_EPOCHS=1
BATCH_SIZE_TRAINING=2
MODEL_NAME="/xxx/models--meta-llama--Llama-3.2-11B-Vision-Instruct/snapshots/075e8feb24b6a50981f6fdc161622f741a8760b1"
DIST_CHECKPOINT_ROOT_FOLDER="./finetuned_model"
DIST_CHECKPOINT_FOLDER="fine-tuned"
DATASET="custom_dataset"
CUSTOM_DATASET_TEST_SPLIT="test"
CUSTOM_DATASET_FILE="recipes/quickstart/finetuning/datasets/xxx_dataset.py"
RUN_VALIDATION=True
BATCHING_STRATEGY="padding"
OUTPUT_DIR="finetune/output"

torchrun --master_port 12412 \
         --nnodes $NNODES \
         --nproc_per_node $NPROC_PER_NODE \
         recipes/quickstart/finetuning/finetuning.py \
         --enable_fsdp \
         --lr $LR \
         --num_epochs $NUM_EPOCHS \
         --batch_size_training $BATCH_SIZE_TRAINING \
         --model_name $MODEL_NAME \
         --dist_checkpoint_root_folder $DIST_CHECKPOINT_ROOT_FOLDER \
         --dist_checkpoint_folder $DIST_CHECKPOINT_FOLDER \
         --use_fast_kernels \
         --dataset $DATASET \
         --custom_dataset.test_split $CUSTOM_DATASET_TEST_SPLIT \
         --custom_dataset.file $CUSTOM_DATASET_FILE \
         --run_validation $RUN_VALIDATION \
         --batching_strategy $BATCHING_STRATEGY \
         --output_dir $OUTPUT_DIR

The model was not saved to the finetune/output folder I specified; instead, the weight files look like the following, which prevents me from running inference:

ls
__0_0.distcp  __1_0.distcp  __2_0.distcp  __3_0.distcp  train_params.yaml

How can I save the weights of a fully fine-tuned model to a specified path so that they follow the standard transformers structure?
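
For context, the __N_0.distcp files are sharded checkpoints written by torch.distributed.checkpoint, one shard per rank. On PyTorch 2.2+ they can be consolidated offline into a single torch.save file. A minimal sketch; the checkpoint folder name here is an example, and the trainer's actual output folder may differ:

from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Folder containing __0_0.distcp ... __3_0.distcp (adjust to your run).
dcp_dir = "finetuned_model/fine-tuned-<model_name>"
# Writes a single consolidated file loadable with plain torch.load.
dcp_to_torch_save(dcp_dir, "consolidated.pth")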

wukaixingxp commented 1 month ago

Hi! We are adding a model conversion script in this PR; you can try it as a temporary solution. Thanks!
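
Until then, a consolidated state dict can be loaded back into the Hugging Face model class and re-saved in the standard transformers layout. A minimal sketch, assuming the consolidated.pth produced above; the "model"-key unwrapping and exact key names are assumptions about how the trainer nests and prefixes the state dict:

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

MODEL_NAME = "/xxx/models--meta-llama--Llama-3.2-11B-Vision-Instruct/snapshots/075e8feb24b6a50981f6fdc161622f741a8760b1"

state = torch.load("consolidated.pth", map_location="cpu", weights_only=True)
state = state.get("model", state)  # unwrap if the trainer nested it (assumption)

model = MllamaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.load_state_dict(state)  # key names may need light remapping depending on FSDP wrapping
model.save_pretrained("finetune/output")  # standard transformers structure
AutoProcessor.from_pretrained(MODEL_NAME).save_pretrained("finetune/output")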

wukaixingxp commented 4 weeks ago

The PR has been merged. Please try it and let me know if you have more questions.
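
For reference, llama-recipes has shipped an FSDP-to-HF converter before (checkpoint_converter_fsdp_hf.py). If the merged script follows the same pattern, the invocation would look roughly like this; the script location and flag names are assumptions to be checked against the merged PR:

python recipes/quickstart/inference/local_inference/checkpoint_converter_fsdp_hf.py \
    --fsdp_checkpoint_path finetuned_model/fine-tuned-<model_name> \
    --consolidated_model_path finetune/output \
    --HF_model_path_or_name "$MODEL_NAME"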