microsoft / LLaVA-Med

Large Language-and-Vision Assistant for Biomedicine, built towards multimodal GPT-4 level capabilities.

Can't we train and fine-tune the LLaVA-Med model? #87

Open liucheny opened 1 month ago

thedaffodil commented 1 month ago

I fine-tuned the model with the script here: https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune_task_lora.sh
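In case it helps, a minimal sketch of the flags that need to change from the stock upstream script; the data paths and output dir are placeholders, not values from this thread, and the remaining flags can stay as shipped:

```bash
# Hypothetical adaptation of scripts/v1_5/finetune_task_lora.sh for LLaVA-Med:
# swap the base model for the LLaVA-Med v1.5 checkpoint and point the data
# flags at your own task. Other hyperparameters are left at the script's defaults.
deepspeed llava/train/train_mem.py \
    --lora_enable True \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path microsoft/llava-med-v1.5-mistral-7b \
    --version v1 \
    --data_path ./path/to/your_task_train.json \
    --image_folder ./path/to/your_images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --output_dir ./checkpoints/llava-med-task-lora
```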

GonyRosenman commented 1 month ago

@thedaffodil Can you explain a bit more?

Do you mean you cloned the original LLaVA repo and ran the training specified in the link you provided, but changed the weights to the LLaVA-Med weights?

liucheny commented 4 weeks ago

@thedaffodil So you just swapped in the weights and fine-tuned directly?

liucheny commented 4 weeks ago

@thedaffodil Is the environment LLaVA-Med's or LLaVA's?

thedaffodil commented 4 weeks ago

```yaml
name: llava
channels:
```

My yaml file is like the above. I use the LLaVA repo with the LLaVA-Med weights.
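For reference, a sketch of what such an environment file might look like; the package list is an assumption based on the standard LLaVA setup (Python 3.10 plus an editable install of the repo), not thedaffodil's actual file:

```yaml
# Hypothetical conda environment for the LLaVA repo; the version pin is an
# illustrative assumption, check the repo's pyproject.toml for the
# authoritative dependencies.
name: llava
channels:
  - defaults
dependencies:
  - python=3.10
  - pip
  - pip:
      # editable install of a local clone of haotian-liu/LLaVA
      - -e .
```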

GonyRosenman commented 3 weeks ago

I am trying to fine-tune LLaVA, initialized with the LLaVA-Med weights, on my own task.

So far I have tried running llava/train/train_mem.py with these parameters:

```
--deepspeed ./scripts/zero3.json
--model_name_or_path microsoft/llava-med-v1.5-mistral-7b
--data_path ./playground/data/llava_v1_5_mix665k.json
--image_folder ./playground/data
--vision_tower openai/clip-vit-large-patch14-336
--mm_vision_select_layer -2
--mm_use_im_start_end True
--mm_use_im_patch_token False
--image_aspect_ratio pad
--group_by_modality_length True
--bf16 False
--output_dir ./checkpoints/llava-v1.5-13b
--num_train_epochs 1
--per_device_train_batch_size 16
--per_device_eval_batch_size 4
--gradient_accumulation_steps 1
--evaluation_strategy "no"
--save_strategy "steps"
--save_steps 50000
--save_total_limit 1
--learning_rate 2e-5
--weight_decay 0.
--warmup_ratio 0.03
--lr_scheduler_type "cosine"
--logging_steps 1
--tf32 False
--model_max_length 2048
--gradient_checkpointing True
--dataloader_num_workers 4
--lazy_preprocess True
--report_to wandb
```

But I notice that the model is loaded with the LlavaLlama class instead of a Mistral one, and I can't figure out where to change this.

Any ideas? And more generally, where can I find more info on how to fine-tune LLaVA-Med?
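One thing worth checking (my own guess, not confirmed in this thread): the upstream LLaVA repo also ships a Mistral variant of the model class. A rough sketch of loading it explicitly; the module path and class name are from my reading of the haotian-liu/LLaVA repo and should be verified against your checkout:

```python
# Sketch only: explicitly load the Mistral-based LLaVA class instead of the
# Llama one. Assumes the upstream repo layout, where
# llava/model/language_model/llava_mistral.py defines LlavaMistralForCausalLM.
import torch
from llava.model.language_model.llava_mistral import LlavaMistralForCausalLM

model = LlavaMistralForCausalLM.from_pretrained(
    "microsoft/llava-med-v1.5-mistral-7b",
    torch_dtype=torch.bfloat16,
)
```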

thedaffodil commented 3 weeks ago

I use this command:

```bash
#!/bin/bash

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path ./llava-med-v1.5-mistral-7b \
    --version v1 \
    --data_path ./dataSlake/train.json \
    --image_folder ./dataSlake/imgs \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-version1 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 2 \
    --lazy_preprocess True \
    --report_to wandb
```

After that, I merged the output model and the base model to get the final weights, using the script in the link below: https://github.com/haotian-liu/LLaVA/blob/main/scripts/merge_lora_weights.py
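For what it's worth, a sketch of how that merge script is typically invoked; the paths are placeholders matching the command above, and the flag names are my recollection of the script's argparse options, so double-check them against the file:

```bash
# Hypothetical invocation of scripts/merge_lora_weights.py: merge the LoRA
# adapter (--model-path) into the base checkpoint (--model-base) and write
# standalone weights to --save-model-path.
python scripts/merge_lora_weights.py \
    --model-path ./checkpoints/llava-version1 \
    --model-base ./llava-med-v1.5-mistral-7b \
    --save-model-path ./checkpoints/llava-version1-merged
```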

Then I could use the fine-tuned model for evaluation.

You can ask further questions via my email if you need help.

liucheny commented 3 weeks ago

@thedaffodil I downloaded the model from https://huggingface.co/microsoft/llava-med-v1.5-mistral-7b/tree/main and fine-tuned it with your script. The run prints:

```
You are using a model of type llava_mistral to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
```

and then fails with torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

```
llava/train/train_mem.py FAILED

Failures:
[1]:
  time       : 2024-08-20_07:47:25
  host       : 9c813e5131ac
  rank       : 1 (local_rank: 1)
  exitcode   : -7 (pid: 1044)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 1044
[2]:
  time       : 2024-08-20_07:47:25
  host       : 9c813e5131ac
  rank       : 2 (local_rank: 2)
  exitcode   : -7 (pid: 1045)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 1045
[3]:
  time       : 2024-08-20_07:47:25
  host       : 9c813e5131ac
  rank       : 3 (local_rank: 3)
  exitcode   : -7 (pid: 1046)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 1046

Root Cause (first observed failure):
[0]:
  time       : 2024-08-20_07:47:25
  host       : 9c813e5131ac
  rank       : 0 (local_rank: 0)
  exitcode   : -7 (pid: 1043)
  error_file : <N/A>
  traceback  : Signal 7 (SIGBUS) received by PID 1043
```
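A guess about the SIGBUS, not something confirmed in this thread: exit code -7 on every rank inside a container (the host name looks like a Docker container ID) is a common symptom of /dev/shm being too small for the PyTorch dataloader workers and NCCL buffers. If that is the setup here, enlarging shared memory when starting the container is the usual mitigation:

```bash
# Assumption: training runs inside Docker with the default 64 MB /dev/shm.
# Give the container more shared memory (or share the host IPC namespace)
# so worker processes stop hitting SIGBUS on shm allocations.
docker run --gpus all --shm-size=16g ...   # or: --ipc=host
```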

thedaffodil commented 2 weeks ago

While you are fine-tuning, your output model folder name should contain "finetune", not "llava".

While you are merging, your output folder name should contain "llava".
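A possible explanation for why the folder names matter (my paraphrase of the upstream loader, worth verifying against llava/model/builder.py in your checkout): load_pretrained_model dispatches on substrings of the checkpoint folder name, roughly like this:

```python
# Paraphrased sketch (not verbatim) of the name-based dispatch in LLaVA's
# load_pretrained_model(), to illustrate why a merged checkpoint folder
# must contain "llava" while intermediate folders should not.
def sketch_load(model_path: str, model_base: str | None) -> str:
    # the last path component acts as the model name
    model_name = model_path.rstrip("/").split("/")[-1]
    if "llava" in model_name.lower():
        if "lora" in model_name.lower() and model_base is not None:
            return "load base weights, then apply and merge the LoRA adapter"
        return "load a full, already-merged LLaVA checkpoint"
    # no "llava" in the name: treated as a plain language model,
    # so the vision tower is never attached
    return "plain language-model path"
```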