mbzuai-oryx / GeoChat

[CVPR 2024 🔥] GeoChat, the first grounded Large Vision Language Model for Remote Sensing
https://mbzuai-oryx.github.io/GeoChat

【GPU Memory】 #6

Closed Luo-Z13 closed 8 months ago

Luo-Z13 commented 11 months ago

Hello, I'm wondering about the minimum GPU memory required for training. Could you provide some information on this?

KjAeRsTuIsK commented 11 months ago

Hi @Luo-Z13, thank you for your interest. We trained the model on 4 A100 40 GB GPUs. You can train on a single A100 80 GB, or on a single A100 40 GB by using the quantised models in 4-bit or 8-bit.
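
For example, a minimal single-GPU launch could look roughly like the sketch below. It assumes the GeoChat training script keeps LLaVA's --bits and --lora_enable arguments (not verified against this repo), and the paths, batch size, and accumulation steps are only illustrative placeholders, not a tested configuration:

# Hypothetical 4-bit LoRA run on a single GPU; all paths are placeholders.
# --bits is assumed to be inherited from LLaVA's training arguments.
python geochat/train/train_mem.py \
    --bits 4 \
    --lora_enable True \
    --model_name_or_path ./checkpoints/llava-v1.5-7b \
    --version v1 \
    --data_path ./data/GeoChat_Instruct.json \
    --image_folder ./data/final_images_llava \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --image_aspect_ratio pad \
    --bf16 True \
    --output_dir ./checkpoints/geochat-7b-qlora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-4 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True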

vvuonghn commented 10 months ago

How long did your model training take?

KjAeRsTuIsK commented 8 months ago

Hi @vvuonghn, we fine-tuned the model for around 10 hours on the complete dataset, and then fine-tuned it for a further 4-5 hours on the grounding part of the dataset. Please let me know if you have any further queries.

Amazingren commented 3 months ago

> Hi @vvuonghn, we fine-tuned the model for around 10 hours on the complete dataset, and then fine-tuned it for a further 4-5 hours on the grounding part of the dataset. Please let me know if you have any further queries.

Hi @KjAeRsTuIsK,

Thanks for your nice work.

May I ask how to fine-tune the model on the grounding part of the dataset?

I already fine-tuned it with this:

################## VICUNA ##################
PROMPT_VERSION=v1
MODEL_VERSION="vicuna-v1.5-7b"
gpu_ids=0,1,2,3
################## VICUNA ##################

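# LoRA fine-tuning on the complete GeoChat_Instruct.json instruction set.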
deepspeed --master_port=$((RANDOM + 10000)) --include localhost:$gpu_ids geochat/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --lora_enable True \
    --model_name_or_path /data/.../geochat/llava-v1.5-7b \
    --version $PROMPT_VERSION \
    --data_path /data/.../geochat/GeoChat_Instruct.json \
    --image_folder /data/.../geochat/final_images_llava  \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --pretrain_mm_mlp_adapter /data/.../geochat/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --bf16 True \
    --output_dir /data/.../geochat/outckpts/geochat_reproduce \
    --num_train_epochs 1 \
    --per_device_train_batch_size 18 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "epoch" \
    --save_steps 7000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --dataloader_num_workers 16 \
    --report_to wandb

What should I do next for fine-tuning it on the grounding part of the datasets?
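
My current guess is that the grounding stage re-uses the same training script and simply points --data_path at the grounding subset, starting from the checkpoint produced above. A rough sketch of what I have in mind is below; GeoChat_Grounding.json and the merged stage-1 checkpoint path are only placeholder names I made up, since I could not find them documented:

# Hypothetical second-stage command: same script, grounding subset only.
# GeoChat_Grounding.json and the stage-1 checkpoint path are placeholders.
deepspeed --master_port=$((RANDOM + 10000)) --include localhost:$gpu_ids geochat/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --lora_enable True \
    --model_name_or_path /data/.../geochat/outckpts/geochat_reproduce_merged \
    --version $PROMPT_VERSION \
    --data_path /data/.../geochat/GeoChat_Grounding.json \
    --image_folder /data/.../geochat/final_images_llava \
    --output_dir /data/.../geochat/outckpts/geochat_grounding

with the remaining arguments (vision tower, projector, learning rate, etc.) kept as in the first-stage command. Is that the right way to do it, or does the grounding stage need a different script or data format?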

I am not so familiar with fine-tuning LLaVA. Could you give me more detailed instructions when you have time?

Best regards