hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: bugs surfaced while training MoE(Mixtral) #5426

Closed ericxsun closed 7 months ago

ericxsun commented 8 months ago

πŸ› Describe the bug

With applications/ColossalMoE on the main branch, I get the following error:

grad = grad.to(master_moe_param.dtype).to(master_moe_param.device)
AttributeError: 'NoneType' object has no attribute 'to'

start script:

NUM_GPU=2
MODEL="mistralai/Mixtral-8x7B-v0.1"
SEQ_LENGTH=128
BATCH_SIZE=1
LR=0.00001

# hybrid
python -m torch.distributed.run --nnodes=1 --nproc_per_node $NUM_GPU --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12346 \
    train.py \
    --num_epoch 1 \
    --model_name $MODEL \
    --plugin "hybrid" \
    --batch_size $BATCH_SIZE \
    --lr $LR \
    --zero_stage 2 \
    --pp_size 1 \
    --dp_size 1 \
    --ep_size 2

The full traceback:

Traceback (most recent call last):
  File "debug_moe.py", line 86, in <module>
    debug_main()
  File "debug_moe.py", line 82, in debug_main
    main(args, cfg)
  File "pretrain.py", line 308, in main
    raise e
  File "pretrain.py", line 286, in main
    optimizer.step()
  File ".local/lib/python3.10/site-packages/colossalai/zero/low_level/low_level_optim.py", line 616, in step

    grad = grad.to(master_moe_param.dtype).to(master_moe_param.device)
AttributeError: 'NoneType' object has no attribute 'to'

Without EPMixtralSparseMoeBlock in mixtral_policy, I ran into another problem: training hangs at the following point (screenshot omitted).

Environment

ericxsun commented 8 months ago

The feat/moe branch runs successfully. I believe that replace_moe_layer (https://github.com/hpcaitech/ColossalAI/blob/1d96a562bb73d33424a8f91ac7463fa4e3b7dada/applications/ColossalMoE/train_moe.py#L219) might be the crucial difference compared to EPMixtralSparseMoeBlock on the main branch.

cc @flybird11111
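For context, the general pattern here is recursive in-place replacement of MoE sub-modules in a PyTorch model. The sketch below is only a hypothetical illustration of that pattern; the class names and the builder callback are placeholders, not the actual ColossalAI implementation linked above.

```python
# Hypothetical sketch of recursive MoE-block replacement in a PyTorch model.
# `target_cls` and `build_replacement` are placeholders; the real
# replace_moe_layer in ColossalAI may work differently.
import torch.nn as nn

def replace_moe_layers(module: nn.Module, target_cls, build_replacement) -> None:
    """Walk the module tree and swap every `target_cls` child in place."""
    for name, child in module.named_children():
        if isinstance(child, target_cls):
            # build_replacement(child) returns the expert-parallel variant,
            # e.g. an EP block initialized from the dense block's weights.
            setattr(module, name, build_replacement(child))
        else:
            replace_moe_layers(child, target_cls, build_replacement)
```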

ericxsun commented 8 months ago

However, after implementing checkpointing on the feat/moe branch, I found that restoring from a previously saved checkpoint does not work correctly.

(screenshot omitted)

ericxsun commented 8 months ago

With replace_moe_layer as implemented in feat/moe and save_shard_model from the main branch, I can save and restore it correctly (screenshot omitted).

However, the saved checkpoint cannot be loaded with AutoModelForCausalLM:

from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# `checkpoint` is the path to the directory saved above
model = AutoModelForCausalLM.from_pretrained(checkpoint)

error:

were not used when initializing MixtralForCausalLM: [
'model.layers.0.block_sparse_moe.experts.wi_gate', 
'model.layers.0.block_sparse_moe.experts.wi_up',
'model.layers.0.block_sparse_moe.experts.wo',
'model.layers.0.block_sparse_moe.gate_weight',
....

With some fixes, I can now load it with transformers correctly:

In [3]: from transformers import AutoModelForCausalLM
   ...: from transformers import AutoTokenizer
In [4]: model = AutoModelForCausalLM.from_pretrained(checkpoint)
Loading checkpoint shards:   0%|                                                                                        | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                        | 1/2 [00:25<00:25, 25.15s/it]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:33<00:00, 16.75s/it]

and restore it correctly (screenshot omitted).
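The "were not used when initializing" warning above suggests the saved tensors follow ColossalAI's fused expert naming (wi_gate / wi_up / wo / gate_weight) rather than the per-expert names the HF Mixtral implementation expects. As a hedged illustration of how one might diagnose which keys a remapping fix needs to cover (the paths, shard pattern, and use of the meta device are assumptions, and this assumes a recent PyTorch):

```python
# Hypothetical diagnostic: list checkpoint keys the HF Mixtral model does not
# expect (and vice versa), to see what a key-remapping fix must cover.
import glob
import torch
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "path/to/saved/checkpoint"            # assumption: local directory

config = AutoConfig.from_pretrained(checkpoint)
with torch.device("meta"):                         # build the model without allocating real weights
    model = AutoModelForCausalLM.from_config(config)
expected = set(model.state_dict().keys())

saved = {}
for shard in sorted(glob.glob(f"{checkpoint}/pytorch_model*.bin")):  # sharded .bin checkpoint assumed
    saved.update(torch.load(shard, map_location="cpu"))              # loads shards into CPU RAM

print("unexpected keys:", sorted(set(saved) - expected)[:10])
print("missing keys   :", sorted(expected - set(saved))[:10])
```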

However, when attempting to load Mixtral-8x7B-v0.1, executing replace_moe_layer is excessively slow and runs out of memory (OOM):

File "models/mixtral_layer.py", line 63, in replace_moe_layer

replace_moe_layer(

File "models/mixtral_layer.py", line 63, in replace_moe_layer

replace_moe_layer(

File "models/mixtral_layer.py", line 63, in replace_moe_layer

replace_moe_layer(

File "models/mixtral_layer.py", line 55, in replace_moe_layer

model.block_sparse_moe = MixtralSparseMLP.from_native_module(

File "models/mixtral_layer.py", line 46, in from_native_module

sparse_mlp = SparseMLP(**moe_kwargs).to(dtype).to(device)

File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to

return self._apply(convert)

File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply

module._apply(fn)

File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply

param_applied = fn(param)

File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert

return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 3 has a total capacty of 79.32 GiB of which 41.56 MiB is free. Process 3898623 has 79.28 GiB memory in use

Env:

| ep_size | OOM: tried to allocate |
|---------|------------------------|
| 2       | 448M                   |
| 4       | 224M                   |
| 8       | 112M                   |

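A common way to avoid this kind of OOM during module replacement is to build the replacement block on CPU, free the original block's GPU memory, and only then move the new block onto the device. The sketch below is a hedged workaround, not the ColossalAI fix; SparseMLP, moe_kwargs, and original_block are placeholders taken from the traceback above.

```python
# Hypothetical low-memory variant of the replacement step from the traceback.
# SparseMLP / moe_kwargs / original_block are placeholders.
import gc
import torch

def build_sparse_mlp_low_mem(original_block, sparse_mlp_cls, moe_kwargs,
                             dtype=torch.bfloat16, device="cuda"):
    # 1. Construct and cast on CPU so no extra GPU memory is needed yet.
    sparse_mlp = sparse_mlp_cls(**moe_kwargs).to(dtype)
    # 2. (Copy / shard weights from original_block into sparse_mlp here.)
    # 3. Release the original block's GPU tensors before allocating new ones.
    original_block.to("cpu")
    del original_block
    gc.collect()
    torch.cuda.empty_cache()
    # 4. Only now move the replacement onto the GPU.
    return sparse_mlp.to(device)
```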
flybird11111 commented 8 months ago

Hi, what version of transformers have you installed?

ericxsun commented 8 months ago

> Hi, what version of transformers have you installed?

transformers 4.38.2

ericxsun commented 8 months ago

Regarding the main branch, would it be advisable to include the following code snippet within the def step(self, closure=None) function of LowLevelZeroOptimizer (https://github.com/hpcaitech/ColossalAI/blob/385e85afd460a1b9a947b09c9d0f7d2628c35ad2/colossalai/zero/low_level/low_level_optim.py#L613)?

grad = working_moe_param.grad
if grad is None:
    continue

to avoid this error:

    grad = grad.to(master_moe_param.dtype).to(master_moe_param.device)
AttributeError: 'NoneType' object has no attribute 'to'
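For reference, this is roughly where the guard would sit inside the MoE-gradient loop of step(); the loop structure and variable names below are assumptions for illustration, not the actual ColossalAI source.

```python
# Hedged sketch only: loop and variable names are assumptions.
for master_moe_param, working_moe_param in zip(master_moe_params, working_moe_params):
    grad = working_moe_param.grad
    if grad is None:
        # e.g. the expert received no tokens this step, so no gradient was produced
        continue
    grad = grad.to(master_moe_param.dtype).to(master_moe_param.device)
```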

I tested it: after adding if grad is None: continue, the main branch runs now. However, I cannot restore the checkpoint correctly:

(screenshot omitted)

Could you take a look? Thanks a lot. @flybird11111 @ver217

ericxsun commented 8 months ago

More experiments: the restoration appears to be correct when compared against the 5x8 GPU run.

Env: (screenshot omitted)

Env: (screenshot omitted)