microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[HELP] Zero-3 on partial model to fix the input/output constant constraint #6642

Open BoyeGuillaume opened 6 days ago

BoyeGuillaume commented 6 days ago

Hello,

We have a multimodal model that is composed of multiple small "embedding" models followed by a large LLM. Because of the scale of the training, we need a multi-node setup, and we would like to use ZeRO-3 to reduce the memory footprint of the optimizer state.

Because of this, the inputs/outputs of the model may vary in size (and the same can be said for the model architecture). This prohibits us from using ZeRO-3 altogether. Do you know if there is a way to start applying ZeRO-3 in the middle of the model (i.e. at the boundary between the LLM and the embedding models)? Note that we still need gradients to be backpropagated to those embedders, so we cannot simply treat the embeddings as the input.

I know of the deepspeed.utils.set_z3_leaf_modules method introduced in #4966; however, it doesn't fit our use case.

Do you have any ideas or suggestions on how we could achieve this (if possible)?

Thanks for your time and help!

tjruwase commented 2 days ago

@BoyeGuillaume, thanks for your question. I think getting more details would be helpful to understand your specific need.

Yes, I think that using a ZeRO-3_LLM in the middle of your model should work to forward the loss and propagate gradients back, as below:

input -> SLM -> embed -> ZeRO-3_LLM -> loss
SLM <- grad <- ZeRO-3_LLM <- grad
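
If it helps, here is a rough sketch of what I mean (untested; `build_embedders`, `build_llm`, `dataloader`, and `ds_zero3_config.json` are placeholders for your own code, it assumes an HF-style LLM that accepts `inputs_embeds`/`labels`, and in a multi-node run the embedder gradients would still need to be synchronized across ranks yourself, e.g. via DDP or a manual all-reduce):

```python
import torch
import deepspeed

# Placeholders for your own model-construction code.
embedders = build_embedders()   # dict of small per-modality encoders, plain PyTorch
llm = build_llm()               # the large LLM

# ZeRO-3 is applied only to the LLM; the embedders keep a regular optimizer.
llm_engine, llm_optimizer, _, _ = deepspeed.initialize(
    model=llm,
    model_parameters=llm.parameters(),
    config="ds_zero3_config.json",
)
embedder_optimizer = torch.optim.AdamW(
    [p for e in embedders.values() for p in e.parameters()], lr=1e-4
)

for batch in dataloader:
    # The embeddings stay in the autograd graph, so gradients can flow
    # back across the engine boundary into the embedders.
    embeds = torch.cat(
        [embedders[name](x) for name, x in batch["inputs"].items()], dim=1
    )
    loss = llm_engine(inputs_embeds=embeds, labels=batch["labels"]).loss

    llm_engine.backward(loss)        # backprops through the LLM *and* the embedders
    llm_engine.step()                # updates the ZeRO-3 partitioned LLM parameters
    embedder_optimizer.step()        # updates the (replicated) embedder parameters
    embedder_optimizer.zero_grad()
```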

Can you please try that and share any issues?

In case you are unaware, HF multimodal IDEFICS-80B was trained with ZeRO-3. The following links might be useful.

  1. https://x.com/StasBekman/status/1694004904761987249
  2. https://huggingface.co/HuggingFaceM4/idefics2-8b/discussions/30

> I know of the deepspeed.utils.set_z3_leaf_modules method introduced in #4966; however, it doesn't fit our use case.

You are correct that deepspeed.utils.set_z3_leaf_modules is irrelevant for this case.

BoyeGuillaume commented 2 days ago

Thank you for your help, I'll check whether this fixes the issue.

BoyeGuillaume commented 2 days ago

I may be wrong, but it seems that in the case of IDEFICS-80B the image projection and the text embeddings are considered inputs to the network (by that I mean that you cannot train the SLM, as there is no gradient past the embedding).

BoyeGuillaume commented 2 days ago

My question is whether it would be possible to apply ZeRO-3 optimization to only a portion of the "model" (i.e. the LLM part) that is always the same.

tjruwase commented 2 days ago

Thanks for the clarification of your scenario. Yes, ZeRO-3 can be applied to only the LLM portion of the model. In our RLHF work, the actor, critic, reward, and reference models are configured with different ZeRO-* optimizations, as in this example script.
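
This is not the actual DeepSpeed-Chat code, but the shape of it is roughly as follows (illustrative config values only; `actor_model` and `critic_model` stand in for your own models):

```python
import deepspeed

def make_config(stage):
    # Minimal illustrative config; the real RLHF configs carry many more knobs.
    return {
        "train_micro_batch_size_per_gpu": 4,
        "bf16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
        "zero_optimization": {"stage": stage},
    }

# Each model gets its own engine, and each engine can use a different ZeRO stage.
actor_engine, *_ = deepspeed.initialize(
    model=actor_model, model_parameters=actor_model.parameters(), config=make_config(3)
)
critic_engine, *_ = deepspeed.initialize(
    model=critic_model, model_parameters=critic_model.parameters(), config=make_config(1)
)
```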

I am curious whether the above RLHF example matches your scenario. Can you share your scenario code or pseudo-code so we can discuss more concretely?

BoyeGuillaume commented 2 days ago

Thanks, I'll check this out.

Concerning our architecture, our entire model fits within a single PyTorch nn.Module that consists of the LLM and all of the embedding models for the different modalities. We then use the Hugging Face Trainer (a modified version, as we have additional masking to do) and launch the entire pipeline with PyTorch (for distributed training). The Hugging Face Trainer takes the DeepSpeed configuration directly, roughly as in the sketch below.
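
Heavily simplified sketch of the setup (the class, dataset, and config-file names are placeholders, not our real code):

```python
import torch
import torch.nn as nn
from transformers import Trainer, TrainingArguments

class MultimodalModel(nn.Module):
    """The LLM plus all modality embedders, wrapped in a single module (simplified)."""

    def __init__(self, llm, embedders: nn.ModuleDict):
        super().__init__()
        self.llm = llm
        self.embedders = embedders

    def forward(self, inputs, labels=None):
        # Embed each modality, concatenate along the sequence dimension,
        # and feed the result to the LLM.
        embeds = torch.cat([self.embedders[m](x) for m, x in inputs.items()], dim=1)
        return self.llm(inputs_embeds=embeds, labels=labels)

model = MultimodalModel(llm, embedders)  # llm / embedders built elsewhere

args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",  # the DeepSpeed config is handed straight to the Trainer
    bf16=True,
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```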

BoyeGuillaume commented 2 days ago

It seems to be doable; however, we will probably end up getting rid of the Hugging Face Trainer (as it seems to be doing a lot of dirty things in the background 🙃)

Thanks for the help

tjruwase commented 1 day ago

> Concerning our architecture, our entire model fits within a single PyTorch nn.Module that consists of the LLM and all of the embedding models for the different modalities.

In that case, another option could be to use the stage3_param_persistence_threshold configuration to restrict ZeRO-3 partitioning to only the large parameters of the model: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
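
For example, something along these lines in the ZeRO section of your config (the threshold value is illustrative only; tune it so the small embedders fall below it while the large LLM weights stay above it):

```python
# Illustrative ZeRO-3 config fragment (the same keys go in the JSON config file).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Parameters with fewer elements than this threshold are kept persistent
        # (not partitioned and gathered on the fly), so the small embedders are
        # effectively left alone while the large LLM weights are sharded.
        "stage3_param_persistence_threshold": 1_000_000,  # example value only
    }
}
```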

You can examine the following line in your log to observe the effectiveness of this approach: https://github.com/microsoft/DeepSpeed/blob/6e6563d3c8d7527713cc48d4a3adce51f22e83a2/deepspeed/runtime/zero/parameter_offload.py#L253-L255