microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Deepspeed Zero 3 Inference InFlight Params with new HuggingFace Mixtral Model #4808

Open ryandeng1 opened 9 months ago

ryandeng1 commented 9 months ago

Describe the bug

I tried running DeepSpeed ZeRO-3 inference on the new Hugging Face Mixtral model and got the following error:

      [2023-12-13 04:12:18,837] [WARNING] [parameter_offload.py:86:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
      Invalidate trace cache @ step 14: expected module 19, but got module 34
      Traceback (most recent call last):
        File "/home/ubuntu/mixtral_hf/deepspeed_zero.py", line 36, in <module>
          outputs = model.generate(inputs, max_new_tokens=20)
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
          return func(*args, **kwargs)
        File "/home/ubuntu/mixtral_hf/transformers/src/transformers/generation/utils.py", line 1731, in generate
          return self.greedy_search(
        File "/home/ubuntu/mixtral_hf/transformers/src/transformers/generation/utils.py", line 2592, in greedy_search
          outputs = self(
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1581, in _call_impl
          hook_result = hook(self, args, result)
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
          ret_val = func(*args, **kwargs)
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 350, in _end_of_forward_hook
          self.get_param_coordinator(training=False).reset_step()
        File "/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 203, in reset_step
          raise RuntimeError(f"still have inflight params "
          RuntimeError: still have inflight params [{'id': 9, 'status': 'AVAILABLE', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 11, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 10, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 15, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 17, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 16, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 21, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 23, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 22, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (4096, 14336), 'ds_shape': (4096, 14336), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}, {'id': 27, 'status': 'INFLIGHT', 'numel': 58720256, 'ds_numel': 58720256, 'shape': (14336, 4096), 'ds_shape': (14336, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([58720256])}]

To Reproduce

Simple inference script to reproduce:

  import torch
  import deepspeed
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from transformers.integrations import HfDeepSpeedConfig

  model_id = "mistralai/Mixtral-8x7B-v0.1"
  ds_config = {
      "bf16": {
          "enabled": True,
      },
      "zero_optimization": {
          "stage": 3,
          "offload_param": {
              "device": "cpu",
          },
      },
      "train_micro_batch_size_per_gpu": 1,
  }

  # Must be created (and kept alive) before from_pretrained so the weights are
  # loaded ZeRO-3 partitioned instead of fully materialized on each rank.
  hfdsc = HfDeepSpeedConfig(ds_config)

  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
  model.eval()

  ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
  ds_engine.module.eval()
  model = ds_engine.module

  inputs = tokenizer.encode("DeepSpeed is", return_tensors="pt").to("cuda")
  outputs = model.generate(inputs, max_new_tokens=20)
  output_str = tokenizer.decode(outputs[0])

ds_report output:

    DeepSpeed C++/CUDA extension op report
    --------------------------------------------------
    NOTE: Ops not installed will be just-in-time (JIT) compiled at
          runtime if needed. Op compatibility means that your system
          meet the required dependencies to JIT install the op.
    --------------------------------------------------
    JIT compiled ops requires ninja
    ninja .................. [OKAY]
    --------------------------------------------------
    op name ................ installed .. compatible
    --------------------------------------------------
    async_io ............... [NO] ....... [OKAY]
    fused_adam ............. [NO] ....... [OKAY]
    cpu_adam ............... [NO] ....... [OKAY]
    cpu_adagrad ............ [NO] ....... [OKAY]
    cpu_lion ............... [NO] ....... [OKAY]
     [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
    evoformer_attn ......... [NO] ....... [NO]
    fused_lamb ............. [NO] ....... [OKAY]
    fused_lion ............. [NO] ....... [OKAY]
    inference_core_ops ..... [NO] ....... [OKAY]
    cutlass_ops ............ [NO] ....... [OKAY]
    quantizer .............. [NO] ....... [OKAY]
    ragged_device_ops ...... [NO] ....... [OKAY]
    ragged_ops ............. [NO] ....... [OKAY]
    random_ltd ............. [NO] ....... [OKAY]
     [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
     [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
    sparse_attn ............ [NO] ....... [NO]
    spatial_inference ...... [NO] ....... [OKAY]
    transformer ............ [NO] ....... [OKAY]
    stochastic_transformer . [NO] ....... [OKAY]
    transformer_inference .. [NO] ....... [OKAY]
    --------------------------------------------------
    DeepSpeed general environment info:
    torch install path ............... ['/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/torch']
    torch version .................... 2.1.1
    deepspeed install path ........... ['/home/ubuntu/anaconda3/envs/mixtral/lib/python3.10/site-packages/deepspeed']
    deepspeed info ................... 0.12.4, unknown, unknown
    torch cuda version ............... 12.1
    torch hip version ................ None
    nvcc version ..................... 12.1
    deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
    shared memory (/dev/shm) size .... 124.52 GB


Chenzongchao commented 9 months ago

same question

ikergarcia1996 commented 9 months ago

Same problem

Yuhuajoe commented 9 months ago

same problem

LZHgrla commented 9 months ago

Same problem. It's similar to https://github.com/microsoft/DeepSpeed/issues/4094.

  1. I modified num_experts_per_tok to 8 in the config.json of mixtral-8x7b and everything works well, so I think the problem is caused by the offloading of unused parameters.
  2. I tried "stage3_prefetch_bucket_size": 0 as described in https://github.com/microsoft/DeepSpeed/issues/4094. After that it no longer reports an error, but it prints the following warnings (a config sketch follows the log below):
    [2023-12-19 19:41:54,820] [WARNING] [parameter_offload.py:86:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cache_utils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
    Invalidate trace cache @ step 14: expected module 19, but got module 34
    Invalidate trace cache @ step 19: expected module 44, but got module 34
    Invalidate trace cache @ step 56: expected module 133, but got module 123
    Invalidate trace cache @ step 14: expected module 14, but got module 24
    Invalidate trace cache @ step 14: expected module 34, but got module 14
    Invalidate trace cache @ step 14: expected module 29, but got module 14
    Invalidate trace cache @ step 14: expected module 14, but got module 29
    Invalidate trace cache @ step 14: expected module 14, but got module 29
    Invalidate trace cache @ step 14: expected module 29, but got module 19
    Invalidate trace cache @ step 61: expected module 148, but got module 133
    <s> DeepSpeed is a deep learning optimization library that makes distributed deep learning fast and efficient. DeepSpeed is designed to be
ryandeng1 commented 9 months ago

Changing that parameter fundamentally changes the model, right? By default, it should only route to 2 experts per token.

LZHgrla commented 9 months ago


Yes

mynewstart commented 9 months ago

Will that impact the performance?

LZHgrla commented 9 months ago

Of course :)

mynewstart commented 9 months ago

If I understand correctly, will the inference speed slow down, and will the model's quality deteriorate?


tuyaao commented 8 months ago

same question

jingwangsg commented 8 months ago

(quoting LZHgrla's two workarounds above)

The first solution may degrade model quality, as many have suspected. The second solution works, but inference runs extremely slowly, with a stream of warnings saying "Invalidate trace cache @ step 14: expected module 14, but got module xxx".

tjruwase commented 8 months ago

Guys, thanks for the great debugging and collaboration here in understanding this problem. The fundamental issue is that ZeRO-3 caches the parameter trace so it can prefetch parameters and reduce all-gather latency. Unfortunately, since MoE layers can activate different experts across iterations, the parameter trace cache is invalidated whenever the activated experts change; the warning messages report these trace-cache invalidations. In this case the warning is avoidable, since prefetching is disabled by setting "stage3_prefetch_bucket_size": 0, so only a minor fix is required. In general, however, inference speed will be very slow, as observed.

We have not previously tested ZeRO-3 with MoE, but given the interest we will prioritize this investigation now.
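
To illustrate the mechanism (a toy sketch, not DeepSpeed internals; the class and names below are invented for illustration): with top-k routing, the set of expert modules executed in a forward pass depends on the input, so a recorded module trace from one step may not match the next.

  # Toy illustration: the experts executed in a forward pass depend on the input,
  # so a per-step trace of executed modules can differ between iterations.
  import torch
  import torch.nn as nn

  class ToyMoE(nn.Module):
      def __init__(self, dim=16, n_experts=4, top_k=2):
          super().__init__()
          self.gate = nn.Linear(dim, n_experts)
          self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
          self.top_k = top_k

      def forward(self, x):
          scores = self.gate(x)                                   # (batch, n_experts)
          active = scores.topk(self.top_k, dim=-1).indices.unique().tolist()
          # Only the routed experts run, so the executed-module set varies per step.
          return sum(self.experts[i](x) for i in active), active

  moe = ToyMoE()
  for step in range(3):
      _, active = moe(torch.randn(2, 16))
      print(f"step {step}: active experts {sorted(active)}")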

hijkzzz commented 8 months ago

I got the error with "stage3_prefetch_bucket_size": 0 and ZeRO-3:

Invalidate trace cache @ step 1323: expected module 2476, but got module 2510                                    | 20/2466 [02:15<4:12:07,  6.18s/it, gpt_loss=1.28, loss_mean=1.22, balancing_loss=8]

[rank0]:[E ProcessGroupNCCL.cpp:754] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151178, OpType=_ALLGATHER_BASE, NumelIn=65536, NumelOut=262144, Timeout(ms)=1800000) ran for 1800326 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:768] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:774] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1282] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=151178, OpType=_ALLGATHER_BASE, NumelIn=65536, NumelOut=262144, Timeout(ms)=1800000) ran for 1800326 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:756 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f9ddd19c8f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(c10::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f2 (0x7f9d7ef58142 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x178 (0x7f9d7ef5e538 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x8e (0x7f9d7ef5eb2e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f9ddccb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f9e8ad78ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126660 (0x7f9e8ae0a660 in /usr/lib/x86_64-linux-gnu/libc.so.6)
BBerabi commented 8 months ago

I am also observing the same issue even with "stage3_prefetch_bucket_size": 0. The runtime error about inflight parameters does not occur, but the process just hangs indefinitely and eventually crashes with a timeout.

Did anyone manage to fine-tune Mixtral with ZeRO-3 and Hugging Face? Could you share your DeepSpeed config? @K-Nick @LZHgrla @ryandeng1

LZHgrla commented 8 months ago

@BBerabi You can try it with xtuner: https://github.com/InternLM/xtuner/tree/main/xtuner/configs/mixtral

But remember to use deepspeed_zero3 instead of deepspeed_zero3_offload.

mynewstart commented 8 months ago

I can fully fine-tune Mixtral 8x7B Instruct with DeepSpeed ZeRO-3 on 2 A100-80GB instances; the run doesn't hang and proceeds smoothly. I didn't change anything except disabling the evaluation part that computes perplexity on the validation set. The fine-tuned model looks normal, but I still don't know why the hang can happen. I'm providing my training environment for your reference. Transformers version: 4.36.2, DeepSpeed 0.12.5, DeepSpeed ZeRO-3 config:

  {
    "gradient_accumulation_steps": 8,
    "train_micro_batch_size_per_gpu": 4,
    "prescale_gradients": false,
    "zero_allow_untested_optimizer": true,
    "zero_optimization": {
      "stage": 3,
      "offload_param": {
          "device": "none"
      },
      "offload_optimizer": {
          "device": "none"
      },
      "stage3_param_persistence_threshold": 1.000000e+04,
      "stage3_max_live_parameters": 3.000000e+07,
      "stage3_prefetch_bucket_size": 3.000000e+07,
      "memory_efficient_linear": false
    },
    "steps_per_print": 1,
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": true,
    "bf16": {
      "enabled": true
    }
  }
awzhgw commented 8 months ago

I have the same question... how do I resolve it?

tohtana commented 8 months ago

Hi all, if you want to generate text with Mixtral, DeepSpeed-FastGen (DeepSpeed-MII) should be the first choice. The example is available here. I verified that Mixtral works just by modifying the model name.

It is easier to use "non-persistent" mode for testing purposes, but "persistent" mode will give you the best performance. Please refer to DeepSpeed-MII for more details.
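
For those who want to try it, a minimal sketch of the "non-persistent" mode mentioned above, assuming the mii.pipeline API shown in the DeepSpeed-MII examples (check the MII repo for the current usage):

  # Sketch: DeepSpeed-MII non-persistent text generation with Mixtral.
  import mii

  pipe = mii.pipeline("mistralai/Mixtral-8x7B-v0.1")
  responses = pipe(["DeepSpeed is"], max_new_tokens=20)
  print(responses[0])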

hijkzzz commented 8 months ago

Is there any progress?

tohtana commented 8 months ago

Hi @hijkzzz and all,

#4966 should have fixed this issue. You can find a working example there.

The PR was already merged into master. Please feel free to try, but I still recommend using DeepSpeed-FastGen for text generation. It is much faster and supports Mixtral.

xs1997zju commented 8 months ago

Hey @mynewstart, would you mind sharing your complete DeepSpeed config?

ftgreat commented 8 months ago

@tohtana In my testing of the Mixtral fine-tuning phase with ZeRO-3, the training process hung at step 5 on the same dataset. This patch does not seem to fix my hang issue during training; as you stated, it fixes the text-generation issue with ZeRO-3.

After debugging, I found that the hang is probably related to the following lines in the MixtralSparseMoeBlock implementation; the hang happens when some experts are assigned no tokens in the training batch. https://github.com/huggingface/transformers/blob/e547458c43dfdbbb8f6a7757237e234c44e20a8f/src/transformers/models/mixtral/modeling_mixtral.py#L823-L824

Could you please explain why this implementation causes a hang with ZeRO-3 (ZeRO-2 runs normally)? Thanks for your reply.

tohtana commented 8 months ago

Thank you for sharing the issue, @ftgreat. The same issue is reported at #4966. Let me take a look.

ftgreat commented 8 months ago

Thank you for sharing the issue, @ftgreat. The same issue is reported at #4966. Let me take a look.

@tohtana I wrote a monkey patch that uses a dense MoE implementation instead of the Mixtral sparse MoE. It tested OK for my cases, with no hangs: https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py

I would still like a detailed explanation of why the sparse MoE implementation causes this. Thanks.
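
For readers unfamiliar with the idea, here is a rough sketch of what a "dense MoE" forward looks like (every expert runs on every token, so all expert parameters participate in each forward/backward pass). This is only an illustration under that assumption, not ftgreat's actual patch; the function and argument names are made up.

  import torch

  def dense_moe_forward(hidden, gate, experts, top_k=2):
      # hidden: (tokens, dim); gate: nn.Linear(dim, n_experts); experts: list of FFN modules
      weights = torch.softmax(gate(hidden), dim=-1)           # (tokens, n_experts)
      topk_vals, topk_idx = weights.topk(top_k, dim=-1)
      mask = torch.zeros_like(weights).scatter(-1, topk_idx, 1.0)
      weights = weights * mask
      weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over top-k
      # Every expert is evaluated; non-selected experts contribute with weight 0,
      # so all expert parameters take part in every step.
      out = torch.zeros_like(hidden)
      for i, expert in enumerate(experts):
          out = out + weights[:, i:i + 1] * expert(hidden)
      return out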

tohtana commented 8 months ago

@ftgreat The root cause of this issue is that DeepSpeed tries to run reduce-scatter for only a part of the experts.

ZeRO-3 sets hooks on parameters to run reduce-scatter. However, a hook is not fired unless the corresponding expert is activated in the forward pass, and the data-parallel processes may activate different sets of experts. All processes need to join such a communication collective, but in this case the reduce-scatter is called only on some processes.

Since we already implemented an API to set a leaf module for ZeRO-3, the solution will be to delay the reduce-scatter until the backward pass of the leaf module finishes. I will work in this direction.
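
For context, a hedged sketch of how that leaf-module API might be used with Mixtral. It assumes deepspeed.utils.set_z3_leaf_modules and the transformers MixtralSparseMoeBlock class; verify the exact names and usage against the merged PRs and your installed versions.

  # Sketch: mark the sparse MoE block as a ZeRO-3 "leaf" module so ZeRO-3 hooks
  # are not installed on its children and all experts are handled together.
  from deepspeed.utils import set_z3_leaf_modules
  from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

  def mark_moe_leaf_modules(model):
      # Returns the modules that were flagged as ZeRO-3 leaf modules.
      return set_z3_leaf_modules(model, [MixtralSparseMoeBlock])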

penpaperkeycode commented 8 months ago

@ftgreat Hello, I would like to know whether your monkey patch achieves the same results as the original Mixtral forward.

Is this method currently the best approach?

ftgreat commented 7 months ago

https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py

I added some unit test cases: https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py#L69

tohtana commented 7 months ago

I've opened a PR (#5008) to fix the issue causing hangs during backward passes. Please feel free to test it with your model.

Sniper970119 commented 7 months ago

https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py

Add some unittest cases. https://github.com/ftgreat/llmkit/blob/main/huggingface/mixtral/mixtral_dense_moe_monkey_patch.py#L69

I tried the code and it runs without error, but the loss is always 0 and the grad is 1?

{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.0, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 2.5e-08, 'epoch': 0.0}
{'TFlops': 0.15630008137498733, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15629951739661085, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15629945559100883, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15629979895608306, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.1562995422905477, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.15630035091860164, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 5e-08, 'epoch': 0.0}
{'TFlops': 0.3167301641335448, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.31672936396061685, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.3167279821737141, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.31672982925892, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.3167278887625346, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.31672019927269446, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 7.500000000000001e-08, 'epoch': 0.0}
{'TFlops': 0.3167748692022585, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31677583885383076, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31677668333703635, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.3167755179866533, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31677014443953183, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
{'TFlops': 0.31678280112644813, 'total_grad': 1.0, 'loss': 0.0, 'learning_rate': 1e-07, 'epoch': 0.0}
nilsec commented 7 months ago

Any update on this?

hchoi-moveworks commented 6 months ago

Hey @tohtana,

DeepSpeed-FastGen would only be for inference, right? Does DeepSpeed support full standard fine-tuning of Mixtral / MoE models?

tohtana commented 6 months ago

We have an example to run training for Mixtral here: https://github.com/microsoft/DeepSpeed/pull/5008#issuecomment-1910607845

@Sniper970119 @nilsec @hchoi-moveworks Can you try this solution?

hchoi-moveworks commented 6 months ago

Thanks @tohtana !!

Would DeepSpeed also support CPU offloading for the Mixtral model?

If not, https://github.com/microsoft/DeepSpeed/pull/5008#issuecomment-1910607845 assumes that we need 2 A100 nodes, right?

tohtana commented 6 months ago

@hchoi-moveworks CPU offloading should work, but I haven't tried it. I think I needed four A100s to run the example.

Sniper970119 commented 6 months ago

We have an example to run training for Mixtral here: #5008 (comment)

@Sniper970119 @nilsec @hchoi-moveworks Can you try this solution?

Hi, I tried the script and got an error at line 132 (accelerator.backward(loss)):

  File "test_moe_train.py", line 132, in training_function
     accelerator.backward(loss)
   File ".../lib/python3.8/site-packages/accelerate/accelerator.py", line 1899, in backward
     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
   File ".../lib/python3.8/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
     self.engine.backward(loss, **kwargs)
   File "/.../lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
     ret_val = func(*args, **kwargs)
   File ".../lib/python3.8/site-packages/deepspeed/runtime/zero/stage3.py", line 2213, in backward
     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)

 RuntimeError: The size of tensor a (0) must match the size of tensor b (14336) at non-singleton dimension 1
 (the same RuntimeError is printed by every rank, interleaved in the original output)

and

deepspeed                     0.13.4
sentence-transformers         2.2.2
transformers                  4.38.1
transformers-stream-generator 0.0.4

What should I do? Or need more information?

Sniper970119 commented 6 months ago


@tohtana Could you help me?

hanxiaotian commented 4 months ago

I'm trying Mixtral inference with multiple processes; however, when each rank takes a different input, it hangs on some ranks. Has anyone encountered the same problem? Thanks.