huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

OOM for fine-tuning vlm llava-next-110B with QLoRA on 8 A100 GPUs #33379

Closed: Neo9061 closed this issue 1 month ago

Neo9061 commented 2 months ago

System Info

  1. Below are my dependency versions:
flash_attn==2.6.3
numpy==1.24.4
Pillow==10.4.0
Requests==2.32.3
transformers==4.44.2
accelerate==0.34.0
peft==0.12.0
datasets==2.21.0
wandb==0.17.8
evaluate==0.4.2
sacrebleu==2.4.3
rouge_score==0.1.2
huggingface-hub==0.24.6
trl==0.10.1
  2. I have 8 A100 GPUs.
  3. I used QLoRA fine-tuning. To reproduce my result, see the code below.
  4. The model I fine-tuned is llava-next-110b-hf.

Can anyone explain if this is expected?

Who can help?

No response

Reproduction

  1. Git clone the Hugging Face repo and run the vsft_llava script with QLoRA.
  2. Below is my command (a rough sketch of what these flags correspond to follows the command). The hardware I used is 8 A100 GPUs, which should be sufficient for QLoRA fine-tuning of a 110B VLM. Can anyone explain if this is expected?
python entry_vsft_llava.py \
    --dataset_name HuggingFaceH4/llava-instruct-mix-vsft \
    --model_name_or_path ../checkpoints/llava-next-110b-hf \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --output_dir /opt/ml/model/fine-tuned-results/sft-llava-1.5-7b-hf \
    --bf16 \
    --torch_dtype bfloat16 \
    --gradient_checkpointing \
    --use_peft \
    --load_in_4bit \
    --use_bnb_nested_quant \
    --dataloader_num_workers 32 \
    --max_seq_length 128 \
    --lora_target_modules=all-linear
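
For reference, the flags above roughly map to a 4-bit NF4 setup with nested quantization plus LoRA on all linear layers. A minimal sketch of that configuration (the actual vsft_llava script may wire this up differently; the flash-attention setting is only an assumption based on the installed flash_attn package):

    import torch
    from transformers import BitsAndBytesConfig, LlavaNextForConditionalGeneration
    from peft import LoraConfig, get_peft_model

    # 4-bit NF4 quantization with nested (double) quantization,
    # matching --load_in_4bit and --use_bnb_nested_quant.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = LlavaNextForConditionalGeneration.from_pretrained(
        "../checkpoints/llava-next-110b-hf",
        quantization_config=bnb_config,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # assumption: flash_attn is installed
    )

    # LoRA adapters on every linear layer, matching --use_peft and
    # --lora_target_modules=all-linear.
    lora_config = LoraConfig(target_modules="all-linear", task_type="CAUSAL_LM")
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()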

Expected behavior

No OOM.

LysandreJik commented 2 months ago

cc @zucchini-nlp @BenjaminBossan

BenjaminBossan commented 2 months ago

AFAICT, there is nothing in the script that would result in distributed learning. @Neo9061 can you observe that the model is sharded to all GPUs, or is it loading to a single GPU? When do you get the OOM error, before training starts or during training?

Neo9061 commented 2 months ago

AFAICT, there is nothing in the script that would result in distributed learning. @Neo9061 can you observe that the model is sharded to all GPUs, or is it loading to a single GPU? When do you get the OOM error, before training starts or during training?

Thanks, Benjamin, for the reply. For QLoRA, I believe the model is loaded on a single GPU first and then broadcast to the other GPUs? That is what I heard from your team before, when working on Llama 405B QLoRA.

The OOM happens when training starts, during the forward pass. I had nvidia-smi monitoring on the side: all the GPUs are utilized, and I observed the memory consumption on the first GPU increase until it went OOM.
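
A small diagnostic along these lines confirms which GPU fills up first (a sketch, not taken from the actual run; call it right before and after the first forward pass):

    import torch

    def log_gpu_memory(tag: str) -> None:
        # Print current and peak allocated memory per visible GPU.
        for i in range(torch.cuda.device_count()):
            alloc = torch.cuda.memory_allocated(i) / 2**30
            peak = torch.cuda.max_memory_allocated(i) / 2**30
            print(f"[{tag}] cuda:{i} allocated={alloc:.1f} GiB, peak={peak:.1f} GiB")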

BenjaminBossan commented 2 months ago

Thanks, Benjamin, for the reply. For QLoRA, I believe the model is loaded on a single GPU first and then broadcast to the other GPUs? That is what I heard from your team before, when working on Llama 405B QLoRA.

I'm not an expert on this topic, but I would be surprised if that is so. If it were loaded on a single GPU first, it would mean that you can only ever train a model in distributed style if it can fit on a single GPU, which would preclude most larger models.

The OOM happens when training starts, during the forward pass. I had nvidia-smi monitoring on the side: all the GPUs are utilized, and I observed the memory consumption on the first GPU increase until it went OOM.

I think what's happening here is that the SFTTrainer uses PyTorch Data Parallel under the hood (not to be confused with Distributed Data Parallel). Data Parallel means that the full model is copied to each device and then data is split between devices. Therefore, there is no memory saving compared to just having a single device.

You should investigate using FSDP or DeepSpeed to enable model parallel training, which should prevent OOM while utilizing all GPUs. A good starting point for that is accelerate.
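
As a rough sketch (not a complete recipe; QLoRA plus FSDP needs some extra care, e.g. around how the quantized weights are stored), the Trainer can be asked to shard the model through its fsdp argument, and the script then has to be started with a distributed launcher such as accelerate launch or torchrun so that one process per GPU is created:

    from transformers import TrainingArguments

    # Sketch: request FSDP full sharding with automatic transformer-layer
    # wrapping. Without a distributed launcher this has no effect.
    training_args = TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=1,
        gradient_checkpointing=True,
        bf16=True,
        fsdp="full_shard auto_wrap",
    )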

zucchini-nlp commented 2 months ago

For what it's worth, accelerate doesn't shard VLMs equally (when the 'auto' device map is used) and one of the GPUs most of the time gets a bit more memory than the others. I didn't have time to dig into that yet, so I don't know what the reason might be.
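
If GPU 0 is the one that overflows, one possible workaround is to cap how much of the model 'auto' may place on it via max_memory (a sketch; the limits below are arbitrary placeholders for 80GB A100s):

    from transformers import LlavaNextForConditionalGeneration

    # Leave headroom on GPU 0, which also has to hold activations and
    # any buffers placed on the default device.
    max_memory = {0: "40GiB", **{i: "75GiB" for i in range(1, 8)}}

    model = LlavaNextForConditionalGeneration.from_pretrained(
        "../checkpoints/llava-next-110b-hf",
        device_map="auto",
        max_memory=max_memory,
    )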

Neo9061 commented 2 months ago

I think what's happening here is that the SFTTrainer uses PyTorch Data Parallel under the hood (not to be confused with Distributed Data Parallel). Data Parallel means that the full model is copied to each device and then data is split between devices. Therefore, there is no memory saving compared to just having a single device.

accelerate doesn't shard VLMs equally (when the 'auto' device map is used) and one of the GPUs most of the time gets a bit more memory than the others

@BenjaminBossan @zucchini-nlp Does that mean SFTTrainer or TRL is not as fully developed for VLMs as for LMs? For an LM, using accelerate has no problem training a similarly sized model on the same instance. Who is the POC that can give a definitive answer on this topic? I am looking to use HF code to fine-tune a VLM.

BenjaminBossan commented 2 months ago

Does that mean SFTTrainer or TRL is not as fully developed for VLMs as for LMs?

AFAICT, this does not really have anything to do with LM vs VLM. Different models use different amounts of memory, even if the parameter count is approximately the same (e.g. due to the size of hidden states). It can easily happen that one model barely fits in memory and the next model does not, even if the models have the same size.

Again, did you look into model parallel (or tensor parallel) training? I think that's the only way you'll be able to train a 110B parameter model. Data parallel training won't solve your problem.

For an LM, using accelerate has no problem training a similarly sized model on the same instance.

How are you using accelerate?

Neo9061 commented 2 months ago

did you look into model parallel (or tensor parallel) training?

Do you have any reference scripts for distributed training of VLMs?

How are you using accelerate?

I believe accelerate is naturally integrated into SFTTrainer / TRL? If I don't install accelerate, it gives me an error when triggering training.

zucchini-nlp commented 2 months ago

this does not really have anything to do with LM vs VLM

Yes, the thing I mentioned is a 1 GiB difference at most.

BenjaminBossan commented 2 months ago

I believe accelerate is naturally integrated into SFTTrainer / TRL? If I don't install accelerate, it gives me an error when triggering training.

Okay, I just wanted to ensure that you're not doing anything extra. In that case, I'm pretty sure that Data Parallel is being used, but I'm not an expert in Trainer or SFTTrainer.
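
One quick way to check which mode the Trainer will use, given how the script is launched (a small sketch):

    from transformers import TrainingArguments
    from transformers.training_args import ParallelMode

    # With a plain `python script.py` on a multi-GPU node this typically
    # prints NOT_DISTRIBUTED with n_gpu == 8, i.e. torch.nn.DataParallel
    # (a full model replica per GPU). Under `accelerate launch` or
    # `torchrun` it prints DISTRIBUTED with n_gpu == 1 per process.
    args = TrainingArguments(output_dir="tmp")
    print(args.parallel_mode, args.n_gpu)
    print(args.parallel_mode == ParallelMode.NOT_DISTRIBUTED)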

Do you have any reference scripts for distributed training of VLMs?

This is not specific to VLM, but it should still apply:

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.