microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Issue with training MoE with LoRA in the DeepSpeed library #4646

Open Freddie-wei opened 11 months ago

Freddie-wei commented 11 months ago

Describe the bug
I have encountered several issues while attempting to combine the MoE technique with LoRA fine-tuning on the LLaMA 2 model using DeepSpeed. I am using DeepSpeed ZeRO stage 2, since stage 3 does not support MoE.
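For reference, a minimal sketch of the ZeRO stage 2 configuration described above; the concrete values (batch size, precision) are illustrative placeholders, not taken from the report:

```python
# Minimal sketch of a DeepSpeed ZeRO stage 2 config as described above.
# All concrete values are illustrative placeholders, not the reporter's actual settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # placeholder
    "gradient_accumulation_steps": 1,      # placeholder
    "bf16": {"enabled": True},             # placeholder precision choice
    "zero_optimization": {
        "stage": 2,  # ZeRO stage 2; stage 3 does not support MoE, per the report
    },
}
```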

The problems arise when I pass the model parameters to the optimizer and then initialize DeepSpeed with that optimizer. Initially, I received the errors "all params in moe group must be moe params" and "'Parameter' object has no attribute 'group_name'". To resolve these, I added `allreduce` and `group_name` attributes to all parameters, following the implementation logic of `is_moe_param()` and `split_params_into_different_moe_groups_for_optimizer()` in the MoE module.
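For context, here is a rough paraphrase of the check being imitated; this is a sketch of the logic in `deepspeed.moe.utils.is_moe_param`, not a verbatim copy of the library code:

```python
# Rough paraphrase (not verbatim) of deepspeed.moe.utils.is_moe_param:
# a parameter is treated as an expert (MoE) parameter when it carries an
# `allreduce` attribute set to False, which the MoE layer attaches to expert weights.
def is_moe_param_sketch(param) -> bool:
    return hasattr(param, "allreduce") and not param.allreduce
```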

However, I am now encountering the error "AssertionError: expert data parallel group is not initialized". Please find a screenshot of the specific error below. I would appreciate help in resolving this problem, or guidance on how to approach it. Thank you.

To Reproduce
Steps to reproduce the behavior:

  1. Build the MoE model like this:

```python
for layer_num in range(model.config.num_hidden_layers - 5, model.config.num_hidden_layers):
    model.model.layers[layer_num].mlp = MoE(
        hidden_size=model.config.hidden_size,
        expert=model.model.layers[layer_num].mlp,
        num_experts=1,
        ep_size=ep_size,
        use_residual=False,
        k=1,
        min_capacity=min_capacity,
        noisy_gate_policy=noisy_gate_policy,
    )
```
  2. LoRA target modules: `["up_proj", "down_proj", "gate_proj"]`
  3. Parameter grouping:

```python
def get_optimizer_grouped_parameters(model, weight_decay,
                                      no_dacay_name_list=["bias", "LayerNorm.weight"]):
    params_moe = []
    params_non_moe = []
    for layer_nums in range(27, 32):
        params_moe = [
            p for n, p in model.base_model.model.model.layers[layer_nums].named_parameters()
            if (not any(nd in n for nd in no_dacay_name_list) and p.requires_grad)
        ]
    for n, p in model.named_parameters():
        if not any(nd in n for nd in no_dacay_name_list) and p.requires_grad:
            if not any(p.shape == moe_p.shape and torch.equal(p.data, moe_p.data)
                       for moe_p in params_moe):
                params_non_moe.append(p)
    for parameter in params_moe:
        setattr(parameter, "allreduce", False)
        setattr(parameter, "group_name", "parameters_moe")
    for parameter in params_non_moe:
        setattr(parameter, "allreduce", True)
        setattr(parameter, "group_name", "parameters_non_moe")
    optimizer_grouped_parameters = [
        {
            "params": params_non_moe,
            "weight_decay": weight_decay,
            "name": "parameters_no",
        },
        {
            "params": params_moe,
            "weight_decay": weight_decay,
            "name": "parameters_moe",
        },
    ]
    return split_params_into_different_moe_groups_for_optimizer(optimizer_grouped_parameters)
```

  4. Initialize the engine (see also the condensed sketch after this list):

```python
model_engine, optimizer, training_dataloader, lr_scheduler = initialize(
    config=ds_config,
    model=model,
    optimizer=optimizer,
    model_parameters=model.parameters(),
    lr_scheduler=lr_scheduler,
    dist_init_required=True,
)
```
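For readability, here is a condensed ordering sketch of the steps above. It is not the reporter's full script: `build_llama2_with_lora` and `wrap_last_layers_with_moe` are hypothetical stand-ins for steps 1 and 2, and the optimizer settings are placeholders. The assumption that distributed setup happens before the MoE layers are built is inferred from the error message, not confirmed by the report.

```python
# Condensed ordering sketch of the repro steps above; not the reporter's full script.
# `build_llama2_with_lora` and `wrap_last_layers_with_moe` are hypothetical helpers
# standing in for steps 1-2 of the list; `ep_size` and `ds_config` are assumed defined.
import torch
import deepspeed

deepspeed.init_distributed()                        # assumption: distributed is set up first

model = build_llama2_with_lora()                    # hypothetical: LLaMA 2 + LoRA on up/down/gate_proj
wrap_last_layers_with_moe(model, ep_size=ep_size)   # hypothetical wrapper around step 1's MoE loop

grouped = get_optimizer_grouped_parameters(model, weight_decay=0.0)  # step 3 above
optimizer = torch.optim.AdamW(grouped, lr=1e-4)                      # placeholder optimizer/lr

# Step 4: hand the MoE-aware parameter groups to the DeepSpeed engine.
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    config=ds_config,
    model=model,
    optimizer=optimizer,
    model_parameters=model.parameters(),
    dist_init_required=True,
)
```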

Expected behavior
Training initializes and runs without errors.

Screenshots
(screenshot of the AssertionError traceback)

marsggbo commented 10 months ago

any update for this issue? @Freddie-wei

marsggbo commented 10 months ago

I think I have found the reason. You may refer to https://github.com/microsoft/Megatron-DeepSpeed/issues/164#issuecomment-1827714843

mmderakhshani commented 7 months ago

@Freddie-wei, did you find a solution to this issue?

YunxinLi commented 7 months ago

> @Freddie-wei, did you find a solution to this issue?

Hello, did you find a solution to this problem?