wulaoshi opened this issue 1 year ago
I met the same error. Did you solve it?
No, not yet.
I met the same error with Mixtral-8x7B-v0.1:
File "pretrain.py", line 223, in main
booster.backward(loss, optimizer)
File ".local/lib/python3.10/site-packages/colossalai/booster/booster.py", line 167, in backward
optimizer.backward(loss)
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_optimizer.py", line 291, in backward
self.module.backward(loss)
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 331, in backward
self._post_backward()
File ".local/lib/python3.10/site-packages/colossalai/zero/gemini/gemini_ddp.py", line 314, in _post_backward
raise RuntimeError(
RuntimeError: ("ZERO DDP error: the synchronization of gradients doesn't exit properly.", 'The most possible reason is that the model is not compatible with GeminiDDP.\n', 'Reduction failed at followed parameters:\n\tmodel.layers.22.block_sparse_moe.experts.2.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.2.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.2.w3.weight\n\tmodel.layers.22.block_sparse_moe.experts.3.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.3.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.3.w3.weight\n\tmodel.layers.22.block_sparse_moe.experts.4.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.4.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.4.w3.weight\n\tmodel.layers.22.block_sparse_moe.experts.5.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.5.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.5.w3.weight\n\tmodel.layers.22.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.22.block_sparse_moe.experts.7.w1.weight\n\tmodel.layers.22.block_sparse_moe.experts.7.w2.weight\n\tmodel.layers.22.block_sparse_moe.experts.7.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.2.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.2.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.2.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.3.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.3.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.3.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.4.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.4.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.4.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.5.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.5.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.5.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.23.block_sparse_moe.experts.7.w1.weight\n\tmodel.layers.23.block_sparse_moe.experts.7.w2.weight\n\tmodel.layers.23.block_sparse_moe.experts.7.w3.weight\n\tmodel.layers.24.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.24.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.24.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.2.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.2.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.2.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.3.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.3.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.3.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.4.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.4.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.4.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.5.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.5.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.5.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.29.block_sparse_moe.experts.7.w1.weight\n\tmodel.layers.29.block_sparse_moe.experts.7.w2.weight\n\tmodel.layers.29.block_sparse_moe.experts.7.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.2.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.2.w2.weight\n\tmodel.layers.30.block_sparse_moe.e
xperts.2.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.3.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.3.w2.weight\n\tmodel.layers.30.block_sparse_moe.experts.3.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.4.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.4.w2.weight\n\tmodel.layers.30.block_sparse_moe.experts.4.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.5.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.5.w2.weight\n\tmodel.layers.30.block_sparse_moe.experts.5.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.30.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.30.block_sparse_moe.experts.7.w1.weight\n\tmodel.layers.30.block_sparse_moe.experts.7.w2.weight\n\tmodel.layers.30.block_sparse_moe.experts.7.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.2.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.2.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.2.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.3.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.3.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.3.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.4.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.4.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.4.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.5.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.5.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.5.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.6.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.6.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.6.w3.weight\n\tmodel.layers.31.block_sparse_moe.experts.7.w1.weight\n\tmodel.layers.31.block_sparse_moe.experts.7.w2.weight\n\tmodel.layers.31.block_sparse_moe.experts.7.w3.weight')
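Worth noting: every failing parameter above is a `block_sparse_moe.experts.*` weight, which points at Mixtral's top-k routing. Experts that no token is routed to in a given step receive no gradient, while GeminiDDP expects every parameter it tracks to take part in gradient reduction. Below is a minimal plain-PyTorch sketch of that situation; the `ToyMoE` module is hypothetical, only to show how an unselected expert ends up with `grad is None`:

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Hypothetical stand-in: two 'experts', but this forward only routes to expert 0."""
    def __init__(self):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(8, 8) for _ in range(2))

    def forward(self, x):
        return self.experts[0](x)  # expert 1 never runs this step

model = ToyMoE()
loss = model(torch.randn(4, 8)).sum()
loss.backward()

# Parameters that took no part in backward -- the same situation GeminiDDP
# reports for the untouched Mixtral expert weights above.
unused = [n for n, p in model.named_parameters() if p.requires_grad and p.grad is None]
print(unused)  # ['experts.1.weight', 'experts.1.bias']
```

With top-2 routing over 8 experts, any single step can easily leave several experts of a layer untouched, which matches the per-layer pattern in the error message.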
🐛 Describe the bug
I got an error when training BERT-large with GeminiDDP. Error location: self.optimizer.backward(loss). Error message: RuntimeError: ("ZERO DDP error: the synchronization of gradients doesn't exit properly.", 'The most possible reason is that the model is not compatible with ZeroDDP.\n', 'Reduction failed at followed parameters:\n\tbert.embeddings.word_embeddings.weight\n\tbert.embeddings.position_embeddings.weight\n\tbert.embeddings.token_type_embeddings.weight\n\tbert.embeddings.LayerNorm.weight\n\tbert.embeddings.LayerNorm.bias\n\tbert.encoder.layer.0.attention.self.query.weight\n\tbert.encoder.layer.0.attention.self.query.bias\n\tbert.encoder.layer.0.attention.self.key.weight\n\tbert.encoder.layer.0.attention.self.key.bias\n\tbert.encoder.layer.0.attention.self.value.weight\n\tbert.encoder.layer.0.attention.self.value.bias\n.......
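For reference, a minimal sketch of the training-step path that reaches self.optimizer.backward(loss), assuming ColossalAI's standard Booster/GeminiPlugin API; the model, optimizer, and batch below are hypothetical stand-ins, not the original training code:

```python
import torch
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})   # run under torchrun; newer releases drop `config`

model = nn.Linear(8, 2).cuda()            # stand-in for BERT-large / Mixtral
optimizer = HybridAdam(model.parameters(), lr=1e-5)

booster = Booster(plugin=GeminiPlugin())  # GeminiPlugin wraps the model in GeminiDDP
model, optimizer, *_ = booster.boost(model, optimizer)

x = torch.randn(4, 8).cuda()
loss = model(x).sum()
booster.backward(loss, optimizer)         # the call that raises the error above
optimizer.step()
optimizer.zero_grad()
```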
Then how should I modify it? Thanks.
Environment
No response