liucheny opened this issue 3 months ago
@thedaffodil can you explain a bit more?
Do you mean you cloned the original LLaVA repo and ran the training specified in the link you provided, but swapped in the LLaVA-Med weights?
@thedaffodil Do you mean modifying the weights and directly fine-tuning the model?
@thedaffodil Is the environment LLaVA-Med or LLaVA?
name: llava
channels:
My yaml file starts like the above. I use the LLaVA repo with the LLaVA-Med weights.
I am trying to fine-tune LLaVA, initialized with the LLaVA-Med weights, on my own task.
So far I tried running llava/train/train_mem.py with these parameters:

--deepspeed ./scripts/zero3.json \
--model_name_or_path microsoft/llava-med-v1.5-mistral-7b \
--data_path ./playground/data/llava_v1_5_mix665k.json \
--image_folder ./playground/data \
--vision_tower openai/clip-vit-large-patch14-336 \
--mm_vision_select_layer -2 \
--mm_use_im_start_end True \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 False \
--output_dir ./checkpoints/llava-v1.5-13b \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 False \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
But I notice that the model is loaded with the LlavaLlama class instead of a Mistral one, and I can't figure out where to change this.
Any ideas? And more generally, where can I find more info on how to fine-tune LLaVA-Med?
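For reference, a quick way to see which architecture the checkpoint itself declares; the local path below is just an example of where the downloaded HF snapshot might live, so adjust it to your setup:

# Quick sanity check of what the downloaded checkpoint declares.
# The path is an example; point it at wherever the HF snapshot was downloaded.
import json

with open("./llava-med-v1.5-mistral-7b/config.json") as f:
    cfg = json.load(f)

print(cfg.get("model_type"))     # should report the Mistral variant, e.g. "llava_mistral"
print(cfg.get("architectures"))  # e.g. ["LlavaMistralForCausalLM"], not the Llama class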
I used this command:
deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path ./llava-med-v1.5-mistral-7b \
    --version v1 \
    --data_path ./dataSlake/train.json \
    --image_folder ./dataSlake/imgs \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-version1 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --lazy_preprocess True \
    --report_to wandb
After that, I merged the LoRA output with the base model to get the final weights using the script linked below: https://github.com/haotian-liu/LLaVA/blob/main/scripts/merge_lora_weights.py
Then I could use the fine-tuned model for evaluation.
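In case it is useful, the merge step boils down to roughly this; it is a sketch mirroring scripts/merge_lora_weights.py, where model_path and model_base are the folders from my training command above and save_path is just an example name:

# Sketch of the LoRA merge step (mirrors scripts/merge_lora_weights.py in the LLaVA repo).
# model_path / model_base match the training command above; save_path is an example name.
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

model_path = "./checkpoints/llava-version1"         # LoRA fine-tuning output
model_base = "./llava-med-v1.5-mistral-7b"          # base LLaVA-Med weights
save_path = "./checkpoints/llava-version1-merged"   # example name for the merged weights

model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, model_base, model_name, device_map="cpu"
)

model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

Running the script directly with --model-path, --model-base and --save-model-path should do the same thing.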
You can ask further questions via my email if you need help.
@thedaffodil I downloaded the model from https://huggingface.co/microsoft/llava-med-v1.5-mistral-7b/tree/main and fine-tuned it with your script. The output shows:
You are using a model of type llava_mistral to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda')
And torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
llava/train/train_mem.py FAILED
Failures:
[1]:
  time      : 2024-08-20_07:47:25
  host      : 9c813e5131ac
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 1044)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1044
[2]:
  time      : 2024-08-20_07:47:25
  host      : 9c813e5131ac
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 1045)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1045
[3]:
  time      : 2024-08-20_07:47:25
  host      : 9c813e5131ac
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 1046)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1046
Root Cause (first observed failure):
[0]:
  time      : 2024-08-20_07:47:25
  host      : 9c813e5131ac
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 1043)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 1043
While you are fine-tuning, your output model folder name should contain "finetune", not "llava".
While you are merging, your output folder name should contain "llava".
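Just to make the naming advice concrete, something like the check below; the directory names are made up, pick your own:

# Illustrative check of the folder-naming advice above; the directory names are made up.
def check_output_dirs(finetune_dir: str, merged_dir: str) -> None:
    finetune_name = finetune_dir.rstrip("/").split("/")[-1].lower()
    merged_name = merged_dir.rstrip("/").split("/")[-1].lower()

    # fine-tuning output: keep "finetune" in the name and leave "llava" out
    assert "finetune" in finetune_name and "llava" not in finetune_name

    # merged output: include "llava" in the name
    assert "llava" in merged_name

check_output_dirs("./checkpoints/med-finetune-lora", "./checkpoints/llava-med-merged")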
I fine-tuned the model with the script here: https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_task_lora.sh