pumetu opened this issue 1 week ago
Hello,
I've been trying to train Qwen2 0.5B with TinyCLIP using the repository, but I'm running into CUDA OOM errors at the dense-to-dense distillation step. I'm running on 4x 80GB A100s; do I need the 16 A100s mentioned in the paper to finetune?
Here is the script I am running:
```bash
module load GCC
module load CUDA/12.2.2
module load Anaconda3
source activate /envs/llava-mod
nvcc --version
nvidia-smi

export WANDB_PROJECT=multimodal

# Deepspeed config
DEEPSPEED_CONFIG='./llavamod/config/dpconfig/zero2_offload.json'

# Dataset
JSON_FILE=(
    '/datasets/multimodal/train_json/sharegpt4v_1246k.json'
    '/datasets/multimodal/train_json/lrv_tune_331k.json'
    '/datasets/multimodal/train_json/lvis_tune_220k_.json'
    'datasets/multimodal/train_json/la_tune_256k.json'
    '/datasets/multimodal/train_json/svit_tune_157k.json'
    '/datasets/multimodal/allava_laion/ALLaVA-Caption-LAION-4V.json'
)
JSON_FILE_PATHS="${JSON_FILE[@]}"
IMAGE_FOLDER='/datasets/multimodal'

# Teacher
REF_MLLM='/models/Qwen2-7B'

# Student
POLICY_LLM='/models/Qwen2-0.5B'
POLICY_MLP_ADAPTOR='/models/pretrain/qwen2-0.5B-tinyclip-pretrain-table/mm_projector.bin'

# Vision encoder
VISION_ENCODER='/models/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M'

# KD config
POLICY_MODEL_TYPE='dense'
REF_MODEL_TYPE='dense'
LOSS_TYPE='only_kd'        # kd_lm | only_kd
DISTILL_ALL_TOKENS=False   # False: only response, True: multimodal instruction + response

# MoE config
MOE_LOSS_ENABLE=False
MOE_ENABLE=False
MOE_FINETUNE=False
MOE_MODE="sparse"
NUM_EXPERTS=4
TOP_K_EXPERTS=2
USE_RESIDUAL=False
ROUTER_AUX_LOSS_COEF=0.01
CAPACITY_FACTOR=1.5

# Output dir
OUTPUT_DIR='models/finetune/qwen2-0.5B-tinyclip-mimicdd-qwen2-7B'

HF_DATASETS_OFFLINE=1 TRANSFORMERS_OFFLINE=1 deepspeed llavamod/train/align_train.py \
    --deepspeed ${DEEPSPEED_CONFIG} \
    --ref_model_name_or_path ${REF_MLLM} \
    --policy_model_name_or_path ${POLICY_LLM} --policy_pretrain_mm_mlp_adapter ${POLICY_MLP_ADAPTOR} \
    --policy_model_type ${POLICY_MODEL_TYPE} --ref_model_type ${REF_MODEL_TYPE} --loss_type ${LOSS_TYPE} \
    --moe_loss_enable ${MOE_LOSS_ENABLE} --moe_enable ${MOE_ENABLE} --moe_finetune ${MOE_FINETUNE} \
    --num_experts ${NUM_EXPERTS} --top_k_experts ${TOP_K_EXPERTS} --capacity_factor ${CAPACITY_FACTOR} \
    --moe_mode ${MOE_MODE} --use_residual ${USE_RESIDUAL} --router_aux_loss_coef ${ROUTER_AUX_LOSS_COEF} \
    --train_modules mlp.gate_proj mlp.up_proj mlp.down_proj wg \
    --distill_all_tokens ${DISTILL_ALL_TOKENS} \
    --version qwen \
    --data_path ${JSON_FILE_PATHS} \
    --image_folder ${IMAGE_FOLDER} \
    --image_tower ${VISION_ENCODER} \
    --image_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --bf16 True \
    --output_dir ${OUTPUT_DIR} \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --attn_implementation sdpa \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 50 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing False \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
```
Yes, with the current setup you need more GPUs, because the teacher model is deployed online during training. Alternatively, we suggest pre-extracting the teacher's outputs for the training data ahead of time; then the teacher model no longer needs to be deployed during distillation. A rough sketch of that idea is below.
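For anyone hitting the same limit, here is a minimal sketch of what pre-extracting the teacher's outputs could look like. This is not a script from the repository: the use of transformers' `AutoModelForCausalLM`, the plain-text placeholder samples, and the `teacher_logits.pt` save format are all assumptions. The real pipeline would need the repo's own multimodal model class and image preprocessing, and in practice you would probably store only top-k logits per token to keep the dump size manageable.

```python
# Hypothetical offline pre-extraction of teacher logits for distillation.
# Assumptions: the teacher loads with transformers' AutoModelForCausalLM,
# the samples are already-formatted text prompts, and full-vocab logits are
# saved as fp16 tensors (for real datasets, keep only top-k per token).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_PATH = "/models/Qwen2-7B"   # teacher checkpoint from the script above
OUTPUT_FILE = "teacher_logits.pt"   # hypothetical save location

tokenizer = AutoTokenizer.from_pretrained(TEACHER_PATH)
teacher = AutoModelForCausalLM.from_pretrained(
    TEACHER_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)
teacher.eval()

# Placeholder training samples; in practice these come from the JSON files above.
samples = ["example training prompt 1", "example training prompt 2"]

records = []
with torch.no_grad():
    for text in samples:
        inputs = tokenizer(text, return_tensors="pt").to(teacher.device)
        logits = teacher(**inputs).logits            # [1, seq_len, vocab_size]
        # Move to CPU in half precision to keep the dump small.
        records.append({"text": text, "logits": logits.squeeze(0).to(torch.float16).cpu()})

torch.save(records, OUTPUT_FILE)
```

The student's training loop can then read these cached logits for the KD loss instead of running a forward pass through the 7B teacher, which frees most of the memory currently consumed on each GPU.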