The feat/moe branch runs successfully. I believe that the replace_moe_layer call
https://github.com/hpcaitech/ColossalAI/blob/1d96a562bb73d33424a8f91ac7463fa4e3b7dada/applications/ColossalMoE/train_moe.py#L219
might be the crucial difference compared to EPMixtralSparseMoeBlock in the main branch.
cc @flybird11111
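For context, a minimal sketch of the replacement pattern that the traceback further below suggests. The names block_sparse_moe and MixtralSparseMLP.from_native_module come from models/mixtral_layer.py in that traceback; the signature and the build_sparse_moe placeholder are my own assumptions, not the actual implementation.

import torch.nn as nn

# Hypothetical sketch of the recursive replacement pattern; the real code lives
# in models/mixtral_layer.py:replace_moe_layer (see the traceback below).
def replace_moe_layer(module: nn.Module, build_sparse_moe) -> None:
    for name, child in module.named_children():
        if name == "block_sparse_moe":
            # build_sparse_moe stands in for MixtralSparseMLP.from_native_module,
            # which constructs a SparseMLP and moves it to the child's dtype/device.
            setattr(module, name, build_sparse_moe(child))
        else:
            replace_moe_layer(child, build_sparse_moe)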
However, after implementing checkpointing in the feat/moe branch, I found that restoring from a previously saved checkpoint does not work correctly.
With replace_moe_layer implemented in feat/moe and save_shard_model from the main branch, I can save and restore the model correctly. However, the saved checkpoint cannot be loaded with AutoModelForCausalLM:
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(checkpoint)
error:
were not used when initializing MixtralForCausalLM: [
'model.layers.0.block_sparse_moe.experts.wi_gate',
'model.layers.0.block_sparse_moe.experts.wi_up',
'model.layers.0.block_sparse_moe.experts.wo',
'model.layers.0.block_sparse_moe.gate_weight',
....
With some fixes, I can now load it with transformers correctly:
In [3]: from transformers import AutoModelForCausalLM
   ...: from transformers import AutoTokenizer
In [4]: model = AutoModelForCausalLM.from_pretrained(checkpoint)
Loading checkpoint shards:   0%|                                                                                | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|████████████████████████████████████████                                        | 1/2 [00:25<00:25, 25.15s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:33<00:00, 16.75s/it]
and restore it correctly
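For illustration only (this is not necessarily the fix I applied): the unused keys above are fused expert tensors, while HF's MixtralForCausalLM expects per-expert modules named experts.{j}.w1/w2/w3 and a router weight gate.weight. One hypothetical conversion splits each fused tensor along the expert dimension and renames the keys. The mapping wi_gate -> w1, wi_up -> w3, wo -> w2, the stacking along dim 0, and the transposes are all assumptions about SparseMLP's weight layout and may need adjusting.

import torch

# Hypothetical key-remapping sketch; shapes, mappings, and transposes are assumptions.
def convert_fused_moe_keys(state_dict: dict, num_experts: int = 8) -> dict:
    new_sd = {}
    for key, tensor in state_dict.items():
        if key.endswith("block_sparse_moe.gate_weight"):
            # HF Mixtral names the router weight `gate.weight`; a transpose may be
            # needed depending on how gate_weight is stored.
            new_sd[key.replace("gate_weight", "gate.weight")] = tensor
        elif key.endswith("block_sparse_moe.experts.wi_gate"):
            prefix = key.rsplit(".", 1)[0]  # ...block_sparse_moe.experts
            for j in range(num_experts):
                new_sd[f"{prefix}.{j}.w1.weight"] = tensor[j].t().contiguous()
        elif key.endswith("block_sparse_moe.experts.wi_up"):
            prefix = key.rsplit(".", 1)[0]
            for j in range(num_experts):
                new_sd[f"{prefix}.{j}.w3.weight"] = tensor[j].t().contiguous()
        elif key.endswith("block_sparse_moe.experts.wo"):
            prefix = key.rsplit(".", 1)[0]
            for j in range(num_experts):
                new_sd[f"{prefix}.{j}.w2.weight"] = tensor[j].t().contiguous()
        else:
            new_sd[key] = tensor
    return new_sd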
However, when attempting to load Mixtral-8x7B-v0.1, executing replace_moe_layer is excessively slow and results in Out of Memory (OOM) errors.
File "models/mixtral_layer.py", line 63, in replace_moe_layer
replace_moe_layer(
File "models/mixtral_layer.py", line 63, in replace_moe_layer
replace_moe_layer(
File "models/mixtral_layer.py", line 63, in replace_moe_layer
replace_moe_layer(
File "models/mixtral_layer.py", line 55, in replace_moe_layer
model.block_sparse_moe = MixtralSparseMLP.from_native_module(
File "models/mixtral_layer.py", line 46, in from_native_module
sparse_mlp = SparseMLP(**moe_kwargs).to(dtype).to(device)
File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
return self._apply(convert)
File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
module._apply(fn)
File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
param_applied = fn(param)
File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 3 has a total capacty of 79.32 GiB of which 41.56 MiB is free. Process 3898623 has 79.28 GiB memory in use
Env:
| ep_size | OOM: tried to allocate |
| --- | --- |
| 2 | 448 MiB |
| 4 | 224 MiB |
| 8 | 112 MiB |
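For what it's worth, these sizes match a single fused expert projection for Mixtral-8x7B in bf16 (hidden_size 4096, intermediate_size 14336, 8 experts in total), scaled by the number of experts each rank holds (8 / ep_size), which suggests the OOM happens while allocating the fused expert weights for a rank's local experts. A quick check:

# Sanity check of the OOM allocation sizes above.
# Mixtral-8x7B config: hidden_size=4096, intermediate_size=14336, 8 experts, bf16 (2 bytes).
hidden, intermediate, num_experts, bytes_per_param = 4096, 14336, 8, 2

for ep_size in (2, 4, 8):
    experts_per_rank = num_experts // ep_size
    # One projection matrix (e.g. wi_gate) covering all of a rank's local experts.
    size_mib = experts_per_rank * hidden * intermediate * bytes_per_param / 2**20
    print(f"ep_size={ep_size}: {size_mib:.0f} MiB")  # 448, 224, 112 -> matches the table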
Hi, what version of transformers have you installed?
transformers 4.38.2
Regarding the main branch: would it be advisable to include the following code snippet within the def step(self, closure=None) function of LowLevelZeroOptimizer?
https://github.com/hpcaitech/ColossalAI/blob/385e85afd460a1b9a947b09c9d0f7d2628c35ad2/colossalai/zero/low_level/low_level_optim.py#L613
grad = working_moe_param.grad
if grad is None:
    continue
to avoid the following error:
grad = grad.to(master_moe_param.dtype).to(master_moe_param.device)
AttributeError: 'NoneType' object has no attribute 'to'
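For context, a hedged sketch of where the guard would sit, assuming the surrounding code near low_level_optim.py#L613 iterates over paired working/master MoE params; the real loop structure may differ.

def sync_moe_grads(working_moe_params, master_moe_params):
    """Hypothetical sketch of the grad-copy loop (structure assumed, not copied
    from the repo); shows where the None-guard would go."""
    for working_moe_param, master_moe_param in zip(working_moe_params, master_moe_params):
        grad = working_moe_param.grad
        if grad is None:
            # No gradient for this expert on this rank this step (e.g. it was not
            # routed to); skipping avoids:
            #   AttributeError: 'NoneType' object has no attribute 'to'
            continue
        master_moe_param.grad = grad.to(master_moe_param.dtype).to(master_moe_param.device)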
I tested it: after adding if grad is None: continue, the main branch can run now. However, I cannot restore the checkpoint correctly:
Could you take a look? Thanks a lot. @flybird11111 @ver217
More experiments: the restoration appears to be correct when compared with the 5x8-GPU run.
Env:
🐛 Describe the bug
With the main branch applications/ColossalMoE, I got the following error.
Start script:
The full trace:
Without EPMixtralSparseMoeBlock in mixtral_policy, I met another problem: it hangs at the following point:
Environment