Closed: Neo9061 closed this issue 1 month ago.
cc @zucchini-nlp @BenjaminBossan
AFAICT, there is nothing in the script that would result in distributed learning. @Neo9061 can you observe that the model is sharded to all GPUs, or is it loading to a single GPU? When do you get the OOM error, before training starts or during training?
Thanks Benjamin for the reply. For QLoRA, I believe the model is loaded on a single GPU first and then broadcast to the other GPUs? That is what I heard from your team when working on Llama 405B QLoRA.
The OOM happens when training starts, in the forward pass. I have nvidia-smi monitoring on the side: all the GPUs are utilized, and then I see the memory consumption on the first GPU start to increase until it goes OOM.
I'm not an expert on this topic, but I would be surprised if that is so. If it were loaded on a single GPU first, it would mean that you can only ever train a model in distributed style if it can fit on a single GPU, which would preclude most larger models.
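For reference, when a quantized model is loaded with `device_map="auto"`, accelerate spreads the weights over all visible GPUs rather than loading everything onto one GPU and broadcasting. A minimal sketch (the model id is a placeholder, and `AutoModelForCausalLM` stands in for whatever class the actual VLM uses):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit (QLoRA-style) quantization settings
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" lets accelerate shard the quantized weights across
# all visible GPUs instead of placing the whole model on GPU 0.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-110b-vlm",   # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Shows which module ended up on which device -- a quick way to verify
# whether the model is actually sharded across GPUs.
print(model.hf_device_map)
```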
I think what's happening here is that the SFTTrainer uses PyTorch Data Parallel under the hood (not to be confused with Distributed Data Parallel). Data Parallel means that the full model is copied to each device and then the data is split between devices. Therefore, there is no memory saving compared to just having a single device.
You should investigate using FSDP or DeepSpeed to enable model parallel training, which should prevent OOM while utilizing all GPUs. A good starting point for that is accelerate.
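As an illustration only (option names can differ between transformers/accelerate versions, and the wrap class below is a placeholder), FSDP can be enabled directly through `TrainingArguments`, which the Trainer/SFTTrainer consume; the script then has to be started with a distributed launcher:

```python
from transformers import TrainingArguments

# Sketch of an FSDP setup via TrainingArguments. Launch the script with a
# distributed launcher, e.g.:
#   accelerate launch train.py
#   torchrun --nproc_per_node=8 train.py
training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,
    bf16=True,
    fsdp="full_shard auto_wrap",  # shard params, grads and optimizer state
    fsdp_config={
        # placeholder: use the decoder layer class of the actual model
        "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
    },
)
```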
For what it's worth, accelerate doesn't shard VLMs equally (when the 'auto' device map is used), and one of the GPUs usually gets a bit more memory than the others. I didn't have time to dig into that yet, so I don't know what the reason might be.
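If it helps with debugging the uneven split, the "auto" device map can be nudged with `max_memory` so that GPU 0 (which also has to hold activations and other buffers) receives fewer of the weights; the model id and the numbers below are made up:

```python
from transformers import AutoModelForCausalLM

# Cap how much of the weights the "auto" device map may place on each GPU.
# Leaving extra headroom on GPU 0 reduces the imbalance described above.
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-vlm",   # placeholder model id
    device_map="auto",
    max_memory={0: "30GiB", 1: "38GiB", 2: "38GiB", 3: "38GiB"},
)
```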
@BenjaminBossan @zucchini-nlp Does that mean the SFTTrainer or TRL is not as fully developed for VLMs as for LMs? For LMs, training a similarly sized model with accelerate on the same instance is no problem. Who is the POC who can give a definitive answer on this topic? I am looking to use HF code to fine-tune a VLM.
AFAICT, this does not really have anything to do with LM vs VLM. Different models use different amounts of memory, even if the parameter count is approximately the same (e.g. due to the size of hidden states). It can easily happen that one model barely fits in memory and the next model does not, even if the models have the same size.
Again, did you look into model parallel (or tensor parallel) training? I think that's the only way you'll be able to train a 110B parameter model. Data parallel training won't solve your problem.
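For example, DeepSpeed ZeRO-3 shards parameters, gradients and optimizer state across GPUs and can be passed straight to the Trainer; this is only a sketch with placeholder values:

```python
from transformers import TrainingArguments

# Minimal ZeRO-3 config; "auto" values are filled in from TrainingArguments
# by the Hugging Face DeepSpeed integration.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu"},      # optional CPU offload
        "offload_optimizer": {"device": "cpu"},  # optional CPU offload
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    bf16=True,
    deepspeed=ds_config,  # also accepts a path to a JSON config file
)
# Launch with a distributed launcher, e.g.: accelerate launch train.py
```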
> For LMs, training a similarly sized model with accelerate on the same instance is no problem.
How are you using accelerate?
> did you look into model parallel (or tensor parallel) training?
Do you have any reference scripts for distributed training of VLMs?
> How are you using accelerate?
I believe accelerate is natively integrated into SFTTrainer / TRL? If I don't install accelerate, I get an error when triggering training.
> this does not really have anything to do with LM vs VLM.

Yes, the thing I mentioned is a 1 GiB difference at most.
> I believe accelerate is natively integrated into SFTTrainer / TRL? If I don't install accelerate, I get an error when triggering training.
Okay, I just wanted to ensure that you're not doing anything extra. In that case, I'm pretty sure that Data Parallel is being used, but I'm not an expert in Trainer or SFTTrainer.
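For background (as far as I can tell, this is what the Trainer falls back to when it sees several GPUs but no distributed launcher): `torch.nn.DataParallel` keeps a full replica of the model on every device and only splits the batch. A toy illustration (needs at least two GPUs):

```python
import torch
import torch.nn as nn

# Toy module standing in for the real network.
model = nn.Linear(1024, 1024).cuda()

# DataParallel replicates the full model on every listed GPU, so per-GPU
# memory for the weights is NOT reduced -- unlike FSDP/DeepSpeed ZeRO,
# which shard the parameters across devices.
dp_model = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(8, 1024, device="cuda")
out = dp_model(x)  # the batch of 8 is split into chunks of 4 per GPU
print(out.shape)   # torch.Size([8, 1024])
```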
> Do you have any reference scripts for distributed training of VLMs?
This is not specific to VLM, but it should still apply:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Can anyone explain if this is expected?
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
No OOM.