microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[HELP] Zero-3 on partial model to fix the input/output constant constraint #6642

Open BoyeGuillaume opened 6 days ago

BoyeGuillaume commented 6 days ago

Hello,

We have a multimodal model that is composed of multiple small "embedding" models followed by a large LLM. Because of the scale of the training, we need a multi-node setup, and we would like to use ZeRO-3 to reduce the memory footprint of the optimizer state.

Because of this, the inputs/outputs of the model may vary in size (and the same can be said for the model architecture). This prohibits us from using ZeRO-3 altogether. Do you know if there is a way to start applying ZeRO-3 in the middle of the model (i.e. at the boundary between the LLM and the embedding models)? Note that we still need gradients to be backpropagated to those embedders, so we cannot simply treat the embeddings as the input.

I know of the deepspeed.utils.set_z3_leaf_modules method introduced in #4966; however, it doesn't fit our use case.

Do you have any ideas or suggestions on how we could achieve this (if possible)?

Thanks for your time and help!

tjruwase commented 2 days ago

@BoyeGuillaume, thanks for your question. I think getting more details would be helpful to understand your specific need.

Yes, I think that using a ZeRO-3_LLM in the middle of your model should work to forward the loss and propagate gradients back, as below:

input -> SLM -> embed -> ZeRO-3_LLM -> loss
SLM <- grad <- ZeRO-3_LLM <- grad
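
If it helps, here is a rough sketch of what I mean (untested; `build_embedders`, `build_llm`, `dataloader`, and `ds_zero3_config.json` are placeholders for your own code, it assumes an HF-style LLM that accepts `inputs_embeds`/`labels`, and in a multi-node run the embedder gradients would still need to be synchronized across ranks yourself, e.g. via DDP or a manual all-reduce):

```python
import torch
import deepspeed

# Placeholders for your own model-construction code.
embedders = build_embedders()   # dict of small per-modality encoders, plain PyTorch
llm = build_llm()               # the large LLM

# ZeRO-3 is applied only to the LLM; the embedders keep a regular optimizer.
llm_engine, llm_optimizer, _, _ = deepspeed.initialize(
    model=llm,
    model_parameters=llm.parameters(),
    config="ds_zero3_config.json",
)
embedder_optimizer = torch.optim.AdamW(
    [p for e in embedders.values() for p in e.parameters()], lr=1e-4
)

for batch in dataloader:
    # The embeddings stay in the autograd graph, so gradients can flow
    # back across the engine boundary into the embedders.
    embeds = torch.cat(
        [embedders[name](x) for name, x in batch["inputs"].items()], dim=1
    )
    loss = llm_engine(inputs_embeds=embeds, labels=batch["labels"]).loss

    llm_engine.backward(loss)        # backprops through the LLM *and* the embedders
    llm_engine.step()                # updates the ZeRO-3 partitioned LLM parameters
    embedder_optimizer.step()        # updates the (replicated) embedder parameters
    embedder_optimizer.zero_grad()
```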

Can you please try that and share any issues?

In case you are unaware, HF multimodal IDEFICS-80B was trained with ZeRO-3. The following links might be useful.

  1. https://x.com/StasBekman/status/1694004904761987249
  2. https://huggingface.co/HuggingFaceM4/idefics2-8b/discussions/30

> I know of the deepspeed.utils.set_z3_leaf_modules method introduced in #4966; however, it doesn't fit our use case.

You are correct that deepspeed.utils.set_z3_leaf_modules is irrelevant for this case.

BoyeGuillaume commented 2 days ago

Thank you for your help, I'll check whether this fixes the issue.

BoyeGuillaume commented 2 days ago

I may be wrong, but it seems that in the case of IDEFICS-80B the image projection and the text embeddings are considered inputs to the network (by that I mean that you cannot train the SLM, as there is no gradient past the embedding).

BoyeGuillaume commented 2 days ago

My question is whether it would be possible to apply ZeRO-3 optimization to only a portion of the "model" (i.e. the LLM part) that is always the same.

tjruwase commented 2 days ago

Thanks for the clarification of your scenario. Yes, ZeRO-3 can be applied to only the LLM portion of the model. In our RLHF work, the actor, critic, reward, and reference models are configured with different ZeRO-* optimizations, as in this example script.
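
This is not the actual DeepSpeed-Chat code, but the shape of it is roughly as follows (illustrative config values only; `actor_model` and `critic_model` stand in for your own models):

```python
import deepspeed

def make_config(stage):
    # Minimal illustrative config; the real RLHF configs carry many more knobs.
    return {
        "train_micro_batch_size_per_gpu": 4,
        "bf16": {"enabled": True},
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
        "zero_optimization": {"stage": stage},
    }

# Each model gets its own engine, and each engine can use a different ZeRO stage.
actor_engine, *_ = deepspeed.initialize(
    model=actor_model, model_parameters=actor_model.parameters(), config=make_config(3)
)
critic_engine, *_ = deepspeed.initialize(
    model=critic_model, model_parameters=critic_model.parameters(), config=make_config(1)
)
```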

I am curious whether the above RLHF example matches your scenario. Can you share your scenario code or pseudo-code so we can discuss more concretely?

BoyeGuillaume commented 2 days ago

Thanks, I'll check this out.

Concerning our architecture, our entire model fits within a single PyTorch nn.Module that consists of the LLM and all of the embedding models for the different modalities. We then use the Hugging Face Trainer (a modified version, as we have additional masking to do) and launch the entire pipeline with PyTorch (for distributed training). The Hugging Face Trainer takes the DeepSpeed configuration directly, roughly as in the sketch below.
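
Heavily simplified sketch of the setup (the class, dataset, and config-file names are placeholders, not our real code):

```python
import torch
import torch.nn as nn
from transformers import Trainer, TrainingArguments

class MultimodalModel(nn.Module):
    """The LLM plus all modality embedders, wrapped in a single module (simplified)."""

    def __init__(self, llm, embedders: nn.ModuleDict):
        super().__init__()
        self.llm = llm
        self.embedders = embedders

    def forward(self, inputs, labels=None):
        # Embed each modality, concatenate along the sequence dimension,
        # and feed the result to the LLM.
        embeds = torch.cat([self.embedders[m](x) for m, x in inputs.items()], dim=1)
        return self.llm(inputs_embeds=embeds, labels=labels)

model = MultimodalModel(llm, embedders)  # llm / embedders built elsewhere

args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",  # the DeepSpeed config is handed straight to the Trainer
    bf16=True,
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```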

BoyeGuillaume commented 2 days ago

It seems to be doable; however, we will probably end up getting rid of the Hugging Face Trainer (as it seems to be doing a lot of dirty things in the background 🙃)

Thanks for the help

tjruwase commented 1 day ago

> Concerning our architecture, our entire model fits within a single PyTorch nn.Module that consists of the LLM and all of the embedding models for the different modalities.

In that case, another option could be to use the stage3_param_persistence_threshold configuration to restrict ZeRO-3 partitioning to only the large parameters of the model: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
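
For example, something along these lines in the ZeRO section of your config (the threshold value is illustrative only; tune it so the small embedders fall below it while the large LLM weights stay above it):

```python
# Illustrative ZeRO-3 config fragment (the same keys go in the JSON config file).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Parameters with fewer elements than this threshold are kept persistent
        # (not partitioned and gathered on the fly), so the small embedders are
        # effectively left alone while the large LLM weights are sharded.
        "stage3_param_persistence_threshold": 1_000_000,  # example value only
    }
}
```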

You can examine the following line in your log to observe the effectiveness of this approach: https://github.com/microsoft/DeepSpeed/blob/6e6563d3c8d7527713cc48d4a3adce51f22e83a2/deepspeed/runtime/zero/parameter_offload.py#L253-L255