huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

[RFC] Supporting multiple models with DeepSpeed #2496

Closed: pacman100 closed this issue 2 months ago

pacman100 commented 8 months ago

System Info

PyTorch 2.2.1
DeepSpeed 0.13.4


Abstract

We are considering supporting multiple models with DeepSpeed when using Accelerate. Throughout this RFC, we use the terms model and DeepSpeed engine interchangeably.

Motivation and Background

Currently, Accelerate's DeepSpeed integration supports only a single model. This limits use cases such as RLHF, GANs, and knowledge distillation, which involve multiple models. There is also interest in this feature per the feature requests below:

  1. https://huggingface.slack.com/archives/C06CEE9C1M4/p1706821695816299
  2. Passing multiple models with DeepSpeed will fail · Issue #253 · huggingface/accelerate
  3. Passing multiple models with DeepSpeed will fail · Issue #1388 · huggingface/accelerate

The reasons for restricting support to a single model are given below:

  1. The user can only provide a single DeepSpeed config plugin/DeepSpeed config file corresponding to a single model. Ideally, the user would have different DeepSpeed configs for different models.
  2. DeepSpeed needs to keep track of the model, its optimizer and scheduler. Therefore, we currently have only one global DeepSpeed engine wrapper to control the backward and optimizer/scheduler step.

Proposal

The aim is to solve the two challenges above. This would need:

  1. Support for multiple DeepSpeed configurations. I believe the questionnaire with the minimal DeepSpeed plugin shouldn't change and should continue to target a single model; this would act as the default config used by all the models.
  2. Flexibility to pass a different DeepSpeed config as part of the prepare method. For example, given 4 models in an RLHF scenario, I should be able to do the below:
# rlhf
...
model_1 = actor_model()
model_2 = critic_model()
model_3 = reference_model()
model_4 = reward_model()

optimizer_1 = torch.optim.AdamW(model_1.parameters(), lr=lr_1)
optimizer_2 = torch.optim.AdamW(model_2.parameters(), lr=lr_2)
scheduler_1 = get_scheduler("cosine_with_warmup", optimizer_1, warmup_steps=w_1, total_steps=n_1)
scheduler_2 = get_scheduler("cosine_with_warmup", optimizer_2, warmup_steps=w_2, total_steps=n_2)

model_1, optimizer_1, scheduler_1 = accelerator.prepare(model_1, optimizer_1, scheduler_1)  # uses the default DeepSpeed config passed via Accelerate config

model_2, optimizer_2, scheduler_2 = accelerator.prepare(model_2, optimizer_2, scheduler_2, deepspeed_config="path_or_dict_to_deepspeed_config_json")

model_3 = accelerator.prepare(model_3, deepspeed_config="path_or_dict_to_deepspeed_config_json")

model_4 = accelerator.prepare(model_4, deepspeed_config="path_or_dict_to_deepspeed_config_json")

for batch in train_dataloader:
    prompts = batch["prompts"]
    generations = model_1.generate(prompts)  # outputs prompts + answers
    log_probs = model_1(generations)
    ref_log_probs = model_3(generations)
    reward_scores = model_4(generations)
    values = model_2(generations)
    for ppo_step in range(ppo_steps):
        old_rewards = compute_rewards(prompts, log_probs, ref_log_probs, reward_scores)  # reward minus KL divergence penalty
        batch = {"input_ids": generations, "attention_mask": attention_mask}  # attention_mask built from the generations
        advantages, returns = get_advantages_and_returns(values, old_rewards)
        new_log_probs = model_1(**batch, use_cache=False).logits
        model_1_loss = compute_actor_loss(new_log_probs, log_probs, advantages)
        accelerator.backward(model_1_loss) # challenge - need to know which deepspeed engine to use
        optimizer_1.step()
        scheduler_1.step()
        optimizer_1.zero_grad()

        new_value = model_2(**batch)
        model_2_loss = critic_loss_fn(new_value, values, returns)
        accelerator.backward(model_2_loss) # challenge - need to know which deepspeed engine to use
        optimizer_2.step()
        scheduler_2.step()
        optimizer_2.zero_grad()
...

Challenges for which the user would need to do extra work:

  1. The issue arises when accelerator.backward(model_1_loss) or accelerator.backward(model_2_loss) is called. Behind the scenes, self.deepspeed_engine_wrapped.backward(loss, **kwargs) is currently called, because only one DeepSpeed engine is supported. With multiple DeepSpeed engines, how do we know which engine's backward to call? Should a kwarg such as accelerator.backward(model_1_loss, model=model_1) be passed, with an internal mapping between each model and its respective DeepSpeed engine? Passing such a kwarg, however, deviates from Accelerate's minimal API.
  2. How do we handle zero_init if, for example, 2 models use ZeRO-3 while the remaining 2 use ZeRO-2? If the default DeepSpeed config passed by the user is ZeRO-3 with zero_init=True, the user is then tasked with disabling it when loading the models that use ZeRO-2, via the zero3_init_context_manager(enabled=False) context manager (see the sketch below).
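
For illustration, a minimal sketch of this workaround, assuming the zero3_init_context_manager exposed on Accelerate's DeepSpeedPlugin (the exact keyword argument name may differ across versions):

from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator()  # default config: ZeRO-3 with zero_init enabled
plugin = accelerator.state.deepspeed_plugin

# ZeRO-3 models load normally; zero.Init shards them during from_pretrained.
actor = AutoModelForCausalLM.from_pretrained("gpt2")

# Models intended for ZeRO-2 must opt out of zero.Init while loading.
with plugin.zero3_init_context_manager(enable=False):
    reference = AutoModelForCausalLM.from_pretrained("gpt2")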

Compatibility

This feature needs to be backward compatible with both Accelerate and Trainer. The Trainer API will not change.

Alternatives Considered

  1. At present, if only a single model needs to be trained while the remaining models are used only for inference and are small enough to fit in GPU memory, the user can simply avoid passing them to the accelerator.prepare() method.
  2. They can use the DeepSpeed API directly to create DeepSpeed engines for the remaining models, as is done in the TRL library with the DPO algorithm, where ZeRO Stage 3 is used for the frozen reference model so that it is sharded across GPUs (see the sketch after this list).
  3. Creating a super-model encapsulating the different models in a single class.
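
For alternative 2, a rough sketch of creating a DeepSpeed engine by hand for a frozen, inference-only model; the config values are illustrative assumptions rather than a prescribed setup:

import deepspeed

# ref_model: any frozen torch.nn.Module, e.g. loaded via from_pretrained.
# Illustrative minimal config: ZeRO-3 shards the frozen model across GPUs
# instead of replicating it on each one.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
}

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler);
# for an inference-only model we keep just the engine.
ref_engine, *_ = deepspeed.initialize(model=ref_model, config=ds_config)
ref_engine.eval()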

Dependencies

  1. PyTorch
  2. DeepSpeed

Expected behavior

Enabling use cases involving multiple models with Accelerate's DeepSpeed integration.

pacman100 commented 8 months ago

Hello @stas00, @tjruwase, @muellerzr and @BenjaminBossan,

I would be interested in knowing your thoughts.

tjruwase commented 8 months ago

@pacman100, thanks for asking. DeepSpeed has provided support for multiple models since our DeepSpeed-Chat release in April 2023.

The DeepSpeed-Chat implementation is available here.

A good entry point for this support is DeepSpeedRLHFEngine.

We would be excited to collaborate on integrating this into accelerate.

stas00 commented 8 months ago

This is exciting; thank you for finding time to work on this important need, Sourab!

1. I think this one is trivial: stash the engine into the model once you have created it.

# inside: model_1, optimizer_1, scheduler_1 = accelerator.prepare(model_1, optimizer_1, scheduler_1)
deepspeed_engine = ...
model_1.deepspeed_engine = deepspeed_engine

Now each engine is tied to its model, and you can operate on it from each model.

If you make this same change to the previously existing functionality, it'd still work for the single-engine case.

I suppose the only concern here is a circular reference, which might need manual untangling when the accelerator is destroyed. This is of no concern for normal functionality, since destruction usually implies the end of the program, but it could impact tests; we don't want memory leaks.
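
A hypothetical sketch of how Accelerator.backward could then dispatch on the stashed reference; the model= kwarg here is the one floated earlier in this RFC, not an existing Accelerate API:

# Hypothetical dispatch inside Accelerator.backward, assuming each prepared
# model carries the .deepspeed_engine attribute stashed above.
def backward(self, loss, model=None, **kwargs):
    if model is not None and hasattr(model, "deepspeed_engine"):
        # Multi-engine case: use the engine tied to this specific model.
        model.deepspeed_engine.backward(loss, **kwargs)
    else:
        # Legacy single-engine case: preserve existing behavior.
        self.deepspeed_engine_wrapped.backward(loss, **kwargs)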

2. Yeah, this is a tricky one. The problem is that when I designed the original hack that lets from_pretrained know whether to activate zero.Init, it was a different world; I hadn't imagined that there might be more than one DeepSpeed engine.

OK, so perhaps one approach is to completely redesign how transformers interacts with external engines. But let's first study the updated needs. Besides DeepSpeed ZeRO, do you know if FSDP plans to implement zero.Init? Are there any other frameworks that need to tap into the model-instantiation moment in from_pretrained? And if there are none at the moment, should we prepare for a future when others will?

RobertLuo1 commented 7 months ago

Hope to see an update on this!

vivym commented 7 months ago

Hope to see an update on this!

npuichigo commented 7 months ago

@stas00 any update on this?

ShuaibinLi commented 7 months ago

Hope to see an update on this!

HsuWanTing commented 5 months ago

Is there any update on this?

SalomonKisters commented 5 months ago

Any updates?

muellerzr commented 2 months ago

actively working on this now!

muellerzr commented 2 months ago

We have a path forward for doing this. Here's the basic plan for the early, experimental API we intend to ship as part of accelerate 1.0.0.

General API

The idea here is: if you intend to use the same DeepSpeed configuration across all models, and all models need to step at the same time, then you can just create one accelerator as before and pass everything to .prepare(). As a result, calling accelerator.backward() will call backward for every model that was prepared.

_ = accelerator.prepare(...)  # prepare all models (and optimizers) together
accelerator.backward(loss)    # calls backward for every prepared engine

However, if you need to operate on models independently, then as part of the DeepSpeedPlugin you can give names to each model, which we will tag by reference, such that:

plugin = DeepSpeedPlugin(model_to_reference={"teacher": model1, "student": model2})

With this, during the call to accelerator.backward you can specify which model's backward should be used:

accelerator.backward(loss, ds_model_ref_name="teacher")
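
Put together, a hypothetical knowledge-distillation loop under this proposed API; model_to_reference and ds_model_ref_name come from the proposal above, while distillation_loss and the model/optimizer/dataloader objects are assumed to be defined elsewhere:

import torch
from accelerate import Accelerator, DeepSpeedPlugin

plugin = DeepSpeedPlugin(model_to_reference={"teacher": teacher, "student": student})
accelerator = Accelerator(deepspeed_plugin=plugin)
teacher, student, optimizer = accelerator.prepare(teacher, student, optimizer)

for batch in dataloader:
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits  # teacher stays frozen
    student_logits = student(**batch).logits
    loss = distillation_loss(student_logits, teacher_logits)  # hypothetical helper
    accelerator.backward(loss, ds_model_ref_name="student")   # step only the student's engine
    optimizer.step()
    optimizer.zero_grad()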

When using different configurations

There can be scenarios where this is not intended, such as when using a frozen reference model for DPO. For these cases, we intend to have users create a second DeepSpeed plugin that can then be enabled or disabled (the first plugin passed in the list is the enabled one by default).

E.g.:

accelerator = Accelerator(deepspeed_plugins=[plugin1, plugin2])

From here, you can do:

plugin1.enable()

This will set up any environment variables needed (such as triggering or un-triggering zero3 init if the configuration doesn't use it), and disabling a plugin that is not the first plugin will automatically re-enable the first plugin.

(The enabled plugin is also the one used by accelerator.prepare.)

This will also be aliased as:

with plugin1:
   ...
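
For example, a sketch of the two-plugin flow for DPO with a frozen reference model; the config paths are placeholders, the policy/ref_model objects are assumed to exist, and the enable/context-manager semantics follow the proposal above, so details may shift before release:

from accelerate import Accelerator, DeepSpeedPlugin

# Illustrative: the trained policy uses ZeRO-3, the frozen reference ZeRO-2.
plugin1 = DeepSpeedPlugin(hf_ds_config="policy_zero3.json")
plugin2 = DeepSpeedPlugin(hf_ds_config="reference_zero2.json")
accelerator = Accelerator(deepspeed_plugins=[plugin1, plugin2])

# plugin1 is enabled by default, so the policy model is prepared under it.
policy, optimizer = accelerator.prepare(policy, optimizer)

# Switch to plugin2 just for the reference model; exiting the context
# re-enables plugin1 per the proposal.
with plugin2:
    ref_model = accelerator.prepare(ref_model)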

Feedback

If there are aspects of multiple-model DeepSpeed support you think we are missing, or something about the API is confusing, do not hesitate to give us feedback here. It's a very early API that took us quite a while to settle on, and we're more than open to hearing if it won't fulfill certain users' needs.