haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Reproducibility: Which files go where? #342

Open chigkim opened 1 year ago

chigkim commented 1 year ago

I'd love to try to reproduce the model from pretraining to finetuning. It's awesome that there are training and finetuning scripts. However, the dataset comes in so many parts that I'm not sure where to put what. After cloning, should I put LLaVA-Pretrain, LLaVA-CC3M-Pretrain-595K, and LLaVA-Instruct-150K inside LLaVA/? What about the images.zip files inside LLaVA-Pretrain and LLaVA-CC3M-Pretrain-595K? Where should I extract them? The zip files don't seem to have their own root folders.

Here is what I have gathered so far:

LLaVA
    LLaVA-Pretrain
        blip_laion_cc_sbu_558k.json
        blip_laion_cc_sbu_558k_meta.json
        images.zip
    LLaVA-CC3M-Pretrain-595K
        chat.json
        images.zip
        metadata.json
    LLaVA-Instruct-150K
        complex_reasoning_77k.json
        conversation_58k.json
        detail_23k.json
        llava_instruct_80k.json
        llava_instruct_150k.json

I downloaded http://images.cocodataset.org/zips/train2017.zip. Is any additional dataset required? If so, where can I download it? Thank you so much.
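
For reference, this is how I am extracting things for now. It is just my guess, since the zips have no root folders; please correct me if the locations are wrong:

    cd LLaVA
    # the LLaVA zips have no root folder, so give each one its own images/ directory
    unzip LLaVA-Pretrain/images.zip -d LLaVA-Pretrain/images
    unzip LLaVA-CC3M-Pretrain-595K/images.zip -d LLaVA-CC3M-Pretrain-595K/images
    # COCO train2017.zip already contains a train2017/ folder
    unzip train2017.zip -d coco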

chigkim commented 1 year ago

Actually, I just realized that I'm supposed to edit pretrain.sh and finetune.sh inside the scripts folder.

I'd appreciate it if someone could help me edit them and point me to the right things.

  1. For both pretrain.sh and finetune.sh: --deepspeed /path/to/deepspeed.json. Where can I download deepspeed.json?
  2. For pretrain.sh: --image_folder /path/to/images. Which one do I extract and use, LLaVA-CC3M-Pretrain-595K/images.zip or LLaVA-Pretrain/images.zip?
  3. For pretrain.sh: --data_path /path/to/pretrain_data.json. Which one should I use: LLaVA-CC3M-Pretrain-595K/chat.json, LLaVA-CC3M-Pretrain-595K/metadata.json, LLaVA-Pretrain/blip_laion_cc_sbu_558k.json, or LLaVA-Pretrain/blip_laion_cc_sbu_558k_meta.json?
  4. For both pretrain.sh and finetune.sh: --model_max_length 2048. Do I use 4096 for Llama-2?

Thanks so much!

chigkim commented 1 year ago

Going to close this and open a discussion.

haotian-liu commented 1 year ago

Sorry for the confusion. I will update the docs to make it clearer later this week.

I am re-opening this; please feel free to discuss anything in the README that is ambiguous or unclear to you :)

harrytea commented 1 year ago

Going to close this and open a discussion.

I have the same question

chigkim commented 1 year ago

This is what I found so far. @haotian-liu Please correct me if I'm wrong.

--deepspeed

There seem to be some DeepSpeed configuration files in the scripts folder:

  • zero2.json
  • zero3.json
  • zero3_offload.json

I don't know what the differences are, but I spotted one of the issues specifying one of those files in the --deepspeed flag.

--data_path

--image_folder

--pretrain_mm_mlp_adapter

Either point to the file you got after pretraining, or get it as follows: choose one of the models that you're going to finetune from https://huggingface.co/liuhaotian, then download mm_projector.bin from its Files tab. For example: llava-336px-pretrain-llama-2-13b-chat/mm_projector.bin
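
In case it helps, this is how I grabbed the example file above. I am assuming the usual Hugging Face resolve/main download URL and a checkpoints/ folder of my own choosing, so adjust as needed:

    # hypothetical destination folder; the repo/file names are just the example mentioned above
    mkdir -p checkpoints/llava-336px-pretrain-llama-2-13b-chat
    wget https://huggingface.co/liuhaotian/llava-336px-pretrain-llama-2-13b-chat/resolve/main/mm_projector.bin \
        -O checkpoints/llava-336px-pretrain-llama-2-13b-chat/mm_projector.bin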

--model_max_length

I haven't figured out --model_max_length for Llama-2. Llama-2 has a 4096 context length, so you probably put 4096?
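
Putting the above together, this is roughly how I imagine the relevant finetune.sh flags fit together. It is only my guess, not the official recipe; the data_path / image_folder / output_dir values are placeholder paths I made up, and the zero2-vs-zero3 and 4096-vs-2048 choices are the open questions above:

    # Guesses: zero2.json vs zero3.json, the base model name, 4096 for Llama-2,
    # and every ./... path below is a placeholder of mine, not confirmed.
    deepspeed llava/train/train_mem.py \
        --deepspeed ./scripts/zero2.json \
        --model_name_or_path meta-llama/Llama-2-13b-chat-hf \
        --pretrain_mm_mlp_adapter ./checkpoints/llava-336px-pretrain-llama-2-13b-chat/mm_projector.bin \
        --data_path ./LLaVA-Instruct-150K/llava_instruct_150k.json \
        --image_folder ./coco/train2017 \
        --model_max_length 4096 \
        --output_dir ./checkpoints/llava-llama-2-13b-finetune
    # ...plus the remaining flags that already ship in scripts/finetune.sh (vision tower, batch sizes, etc.)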

sohaibsoussi commented 7 months ago

Hi, I encountered a problem when running the shell script below for fine-tuning. It always tells me that the 'llava' module is not found, even though I tried installing it with both conda and pip:

The shell script:

    #!/bin/bash

    deepspeed /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/LLaVA/llava/train/train_mem.py \
        --deepspeed /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/LLaVA/scripts/zero2.json \
        --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
        --bits 4 \
        --model_name_or_path /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/llava-v1.5-7b \
        --version llava_llama_2 \
        --data_path /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/ok_vqa_dataset/train \
        --validation_data_path /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/ok_vqa_dataset/validation \
        --image_folder /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/ok_vqa_dataset/images \
        --vision_tower openai/clip-vit-large-patch14-336 \
        --mm_projector_type mlp2x_gelu \
        --mm_vision_select_layer -2 \
        --mm_use_im_start_end False \
        --mm_use_im_patch_token False \
        --image_aspect_ratio pad \
        --group_by_modality_length True \
        --bf16 True \
        --output_dir /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/checkpoints/ok_vqa_finetuning \
        --num_train_epochs 500 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_step 64 \
        --evaluation_strategy "epoch" \
        --save_strategy "steps" \
        --save_steps 50000 \
        --save_total_limit 1 \
        --learning_rate 2e-4 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --tf32 True \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --dataloader_num_workers 4 \
        --lazy_preprocess True \
        --report_to wandb

Log after running the shell script:

    deepspeed /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/LLaVA/llava/train/train_mem.py
    [2024-02-20 23:14:05,751] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-20 23:14:07,655] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
    [2024-02-20 23:14:07,711] [INFO] [runner.py:568:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/LLaVA/llava/train/train_mem.py
    [2024-02-20 23:14:10,006] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [2024-02-20 23:14:10,383] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
    [2024-02-20 23:14:10,383] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
    [2024-02-20 23:14:10,383] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
    [2024-02-20 23:14:10,383] [INFO] [launch.py:163:main] dist_world_size=1
    [2024-02-20 23:14:10,383] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
    [2024-02-20 23:14:10,384] [INFO] [launch.py:253:main] process 1377826 spawned with command: ['/usr/bin/python3', '-u', '/home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/LLaVA/llava/train/train_mem.py', '--local_rank=0']
    Traceback (most recent call last):
      File "/home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/LLaVA/llava/train/train_mem.py", line 1, in <module>
        from llava.train.train import train
    ModuleNotFoundError: No module named 'llava'
    [2024-02-20 23:14:11,386] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 1377826
    [2024-02-20 23:14:11,386] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python3', '-u', '/home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/LLaVA/llava/train/train_mem.py', '--local_rank=0'] exits with return code = 1
    --deepspeed /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/LLaVA/scripts/zero2.json
    train_ok_vqa.sh: 4: --deepspeed: not found
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5
    train_ok_vqa.sh: 5: --lora_enable: not found
    --bits 4 --model_name_or_path /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/llava-v1.5-7b --version llava_llama_2
    train_ok_vqa.sh: 9: --bits: not found
    --data_path /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/ok_vqa_dataset/train --validation_data_path /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/ok_vqa_dataset/validation --image_folder /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/ok_vqa_dataset/images --vision_tower openai/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/checkpoints/ok_vqa_finetuning
    train_ok_vqa.sh: 12: --data_path: not found
    --num_train_epochs 500 --per_device_train_batch_size 2 --per_device_eval_batch_size 2 --gradient_accumulation_step 64 --evaluation_strategy epoch --save_strategy steps --save_steps 50000 --save_total_limit 1 --learning_rate 2e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to wandb
    train_ok_vqa.sh: 24: --num_train_epochs: not found
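
Update: two things I noticed after posting. The "--deepspeed: not found", "--lora_enable: not found", etc. lines look like the shell running each group of flags as its own command, i.e. the script was missing trailing backslashes between lines (added in the version above). And the "No module named 'llava'" error probably just means the llava package is not importable in the environment deepspeed launches; installing the cloned repo in editable mode (as in the repo's install instructions) or putting it on PYTHONPATH should make it importable. A sketch of what I plan to try, using the paths from my setup above:

    # install the cloned LLaVA repo so that `import llava` works
    cd /home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/LLaVA
    pip install -e .
    # or, as a quick workaround, expose the repo to Python for this shell only
    export PYTHONPATH=/home/sohaib/LPRI_projects/Image-txt--txt/LLAVA/LLaVA:$PYTHONPATH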

NanAlbert commented 5 months ago

This is what I found so far. @haotian-liu Please correct me if I'm wrong.

--deepspeed

There seem to be some DeepSpeed configuration files in the scripts folder:

  • zero2.json
  • zero3.json
  • zero3_offload.json

I don't know what the differences are, but I spotted one of the issues specifying one of those files in the --deepspeed flag.

--data_path

--image_folder

--pretrain_mm_mlp_adapter

Either point to the file you got after pretraining, or get it as follows: choose one of the models that you're going to finetune from https://huggingface.co/liuhaotian, then download mm_projector.bin from its Files tab. For example: llava-336px-pretrain-llama-2-13b-chat/mm_projector.bin

--model_max_length

I haven't figured out --model_max_length for Llama-2. Llama-2 has a 4096 context length, so you probably put 4096?

Thank you for the informative summary. Also, I would like to ask why LLaVA-CC3M-Pretrain-595K/images.zip is only 6 GB and does not contain 595K images. Do you know the reason for this?