microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

How to load a 32-expert Swin-Transformer-MoE on a 2-GPU machine #248

Open · ywxsuperstar opened this issue 3 hours ago

ywxsuperstar commented 3 hours ago

Hi,

I have downloaded the checkpoint for a 32-expert Swin-Transformer-MoE. However, the checkpoint is split into per-rank sub-checkpoints, one for each of the 32 ranks it was trained on. I want to load these sub-checkpoints and fine-tune the model on a 2-GPU machine.

To do this, should I first gather the sub-checkpoints into a single checkpoint? I attempted to use the script from gather.py, but it did not work. Could you help me understand what went wrong?
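My understanding of what the gathering needs to do is roughly the sketch below: load all 32 per-rank files and concatenate the expert tensors along dim 0. The file-name pattern and the `"model"` key are my assumptions, not necessarily what gather.py expects:

```python
import torch

# Sketch: merge 32 per-rank sub-checkpoints into a single full checkpoint.
# Assumptions (mine): shards are named "ckpt.pt.rank{r}" and the weights
# live under state["model"].
num_ranks = 32
shards = [torch.load(f"ckpt.pt.rank{r}", map_location="cpu")["model"]
          for r in range(num_ranks)]

merged = {}
for key in shards[0]:
    if "._moe_layer.experts." in key:
        # Expert tensors are sharded along dim 0 (one expert slice per rank),
        # so concatenate the per-rank slices back together.
        merged[key] = torch.cat([s[key] for s in shards], dim=0)
    else:
        # Non-expert parameters are replicated across ranks; any copy works.
        merged[key] = shards[0][key]

torch.save({"model": merged}, "ckpt_merged.pt")
```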

Additionally, I checked the original code and found that the condition `if k.endswith('._num_global_experts')` never matches any key. Is this due to the format of the Swin-Transformer-MoE checkpoint? I'm quite confused about this.
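To double-check the checkpoint format, I listed the keys in one sub-checkpoint with a snippet like this (the file name and the `"model"` key are again my assumptions):

```python
import torch

state = torch.load("ckpt.pt.rank0", map_location="cpu")["model"]
# List MoE-related keys to see how (or whether) '_num_global_experts'
# is stored in this checkpoint format.
for k, v in state.items():
    if "_moe_layer" in k or k.endswith("._num_global_experts"):
        print(k, tuple(v.shape) if hasattr(v, "shape") else v)
```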

Thank you for your assistance!

ywxsuperstar commented 2 hours ago

If I load the Swin-Transformer-MoE checkpoint directly, the following error occurs.

Or, to be a bit more detailed:

```
    load_pretrained(config, model_without_ddp, logger)
[rank0]:   File "/ai_home/data/private/ywx/Swin-Transformer/utils_moe.py", line 217, in load_pretrained
[rank0]:     msg = model.load_state_dict(state_dict, strict=False)
[rank0]:   File "/opt/conda/envs/tutel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
[rank0]:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for SwinTransformerMoE:
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 3072, 768]) from checkpoint, the shape in current model is torch.Size([32, 3072, 768]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 3072, 768]) from checkpoint, the shape in current model is torch.Size([32, 3072, 768]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 3072]) from checkpoint, the shape in current model is torch.Size([32, 3072]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 768]) from checkpoint, the shape in current model is torch.Size([32, 768]).
```
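Judging by the shapes in the trace, each sub-checkpoint stores only its own expert slice ([1, ...] on dim 0), while my current model builds all 32 experts locally ([32, ...]). The biases additionally change rank: [1, 1, N] per shard versus [32, N] in the model. So on top of the concatenation in the sketch from my previous comment, the biases would presumably need a reshape as well; continuing that sketch (this is my guess, not a verified fix):

```python
# Continuing the earlier sketch: weights only need torch.cat along dim 0,
# but bias shards are [1, 1, N] each, while the model expects [32, N].
for key in merged:
    if ".experts.batched_fc" in key and "bias" in key:
        merged[key] = merged[key].reshape(len(shards), -1)  # [32,1,N] -> [32,N]
```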