Question of training utilizing A6000

xogud3373 commented 10 months ago

Hello, first of all, I would like to express my deep gratitude for your excellent research.

I'm currently training on eight A6000 GPUs, but I get the errors below.

Is there a way to resolve this issue, either by not using FlashAttention or by modifying another part of the code?

I ran the following training command:

torchrun --nproc_per_node 8 --master_port 29001 video_chatgpt/train/train_mem.py \
          --model_name_or_path ./LLaVA-Lightning-7B-v1-1 \
          --version v1 \
          --data_path video_chatgpt_training.json \
          --video_folder st_outputs1 \
          --tune_mm_mlp_adapter True \
          --mm_use_vid_start_end \
          --bf16 True \
          --output_dir ./Video-ChatGPT_7B-1.1_Checkpoints \
          --num_train_epochs 3 \
          --per_device_train_batch_size 4 \
          --per_device_eval_batch_size 4 \
          --gradient_accumulation_steps 8 \
          --evaluation_strategy "no" \
          --save_strategy "steps" \
          --save_steps 3000 \
          --save_total_limit 3 \
          --learning_rate 2e-5 \
          --weight_decay 0. \
          --warmup_ratio 0.03 \
          --lr_scheduler_type "cosine" \
          --logging_steps 100 \
          --tf32 True \
          --model_max_length 2048 \
          --gradient_checkpointing True \
          --lazy_preprocess True
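For reference, with the Hugging Face Trainer semantics this script builds on, each optimizer step in the command above sees `nproc_per_node × per_device_train_batch_size × gradient_accumulation_steps` = 8 × 4 × 8 = 256 samples. Training without the FlashAttention patch falls back to the standard LLaMA attention, which materializes the full attention matrix and so needs more memory; if that causes out-of-memory errors, one option is to lower the per-device batch size and raise accumulation so the effective batch size stays at 256. The helper names below are hypothetical, a sketch of the arithmetic only:

```python
def effective_batch_size(num_gpus, per_device_batch, grad_accum_steps):
    """Samples contributing to one optimizer step (data parallel + accumulation)."""
    return num_gpus * per_device_batch * grad_accum_steps


def grad_accum_for(target_batch, num_gpus, per_device_batch):
    """Accumulation steps that keep `target_batch`; raises if it doesn't divide evenly."""
    per_step = num_gpus * per_device_batch
    if target_batch % per_step != 0:
        raise ValueError(f"{target_batch} is not divisible by {per_step}")
    return target_batch // per_step


# Values from the torchrun command: 8 GPUs, batch 4 per GPU, 8 accumulation steps.
target = effective_batch_size(8, 4, 8)
# Hypothetical adjustment: halve the per-device batch to save memory.
print(target, grad_accum_for(target, num_gpus=8, per_device_batch=2))
```

With a per-device batch of 2, this gives `--gradient_accumulation_steps 16` for the same effective batch size of 256.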

leesungjae7469 commented 10 months ago

When I tried to train this model, I couldn't train it on A6000 GPUs either.

CallmeBOKE commented 10 months ago

Same issue here.

JakePark-Kor commented 10 months ago

I ran into the same issue. If anyone has found a solution, please share :)

xogud3373 commented 10 months ago

I removed the 'replace_llama_attn_with_flash_attn()' call from 'video_chatgpt/train/train_mem.py', and training then proceeded. Could removing this call cause any performance issues?

Abyss-J commented 10 months ago

I'm using A40 GPUs and hit the same issue. How should I solve this problem?

mmaaz60 commented 5 months ago

Hi @EveryOne,

FlashAttention only works on A100 or H100. If you want to train on any other GPU, commenting out the line at https://github.com/mbzuai-oryx/Video-ChatGPT/blob/f27bf8c29b77efcc2ca07e398e92aa1de09f5063/video_chatgpt/train/train_mem.py#L4 (the replace_llama_attn_with_flash_attn() call) should work. Thanks and good luck!

Please let me know if you have any questions.
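Instead of deleting the line, the patch could be applied only on the GPUs named above. This is a minimal sketch, assuming that "A100 or H100" corresponds to CUDA compute capabilities 8.0 and 9.0 (A6000 and A40 report 8.6). The helpers `should_use_flash_attn`, `current_capability` and `maybe_patch_attention` are hypothetical, and the import path is assumed to match the one used in train_mem.py, so check it against your checkout:

```python
# Hypothetical helpers (not part of Video-ChatGPT): apply the FlashAttention
# monkey patch only on GPUs this thread reports as supported.
# Assumption: A100 = compute capability (8, 0), H100 = (9, 0).
SUPPORTED_CAPABILITIES = {(8, 0), (9, 0)}


def should_use_flash_attn(capability):
    """Return True if a (major, minor) compute capability is A100/H100-class."""
    return capability is not None and tuple(capability) in SUPPORTED_CAPABILITIES


def current_capability():
    """Return (major, minor) for CUDA device 0, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


def maybe_patch_attention():
    """Intended to run at the top of train_mem.py, before the training code is imported.

    Returns True if the patch was applied, False if default attention is kept.
    """
    if not should_use_flash_attn(current_capability()):
        return False
    # Import path assumed to match the one in video_chatgpt/train/train_mem.py.
    from video_chatgpt.train.llama_flash_attn_monkey_patch import (
        replace_llama_attn_with_flash_attn,
    )

    replace_llama_attn_with_flash_attn()
    return True
```

In train_mem.py, the unconditional replace_llama_attn_with_flash_attn() call would then become maybe_patch_attention(). When the patch is skipped, training uses the standard Hugging Face LLaMA attention; FlashAttention computes exact attention, so results should match, but the fallback is typically slower and uses more memory.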

mbzuai-oryx/Video-ChatGPT, issue #71