microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

How to load a 32-expert Swin-Transformer-MoE on a 2-GPU machine #248

Open · ywxsuperstar opened this issue 3 hours ago

ywxsuperstar commented 3 hours ago

Hi,

I have downloaded the checkpoint for a 32-expert Swin-Transformer-MoE. However, the checkpoint is split into per-rank sub-checkpoints, one for each of the 32 ranks it was trained on. I want to load these sub-checkpoints and fine-tune the model on a 2-GPU machine.

To do this, should I first gather the sub-checkpoints into a single checkpoint? I attempted to use the script from gather.py, but it did not work. Could you help me understand what went wrong?
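My understanding of what the gathering needs to do is roughly the sketch below: load all 32 per-rank files and concatenate the expert tensors along dim 0. The file-name pattern and the `"model"` key are my assumptions, not necessarily what gather.py expects:

```python
import torch

# Sketch: merge 32 per-rank sub-checkpoints into a single full checkpoint.
# Assumptions (mine): shards are named "ckpt.pt.rank{r}" and the weights
# live under state["model"].
num_ranks = 32
shards = [torch.load(f"ckpt.pt.rank{r}", map_location="cpu")["model"]
          for r in range(num_ranks)]

merged = {}
for key in shards[0]:
    if "._moe_layer.experts." in key:
        # Expert tensors are sharded along dim 0 (one expert slice per rank),
        # so concatenate the per-rank slices back together.
        merged[key] = torch.cat([s[key] for s in shards], dim=0)
    else:
        # Non-expert parameters are replicated across ranks; any copy works.
        merged[key] = shards[0][key]

torch.save({"model": merged}, "ckpt_merged.pt")
```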

Additionally, I checked the original code and found that the condition `if k.endswith('._num_global_experts')` never matches any key. Is this due to the format of the Swin-Transformer-MoE checkpoint? I'm quite confused about this.
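To double-check the checkpoint format, I listed the keys in one sub-checkpoint with a snippet like this (the file name and the `"model"` key are again my assumptions):

```python
import torch

state = torch.load("ckpt.pt.rank0", map_location="cpu")["model"]
# List MoE-related keys to see how (or whether) '_num_global_experts'
# is stored in this checkpoint format.
for k, v in state.items():
    if "_moe_layer" in k or k.endswith("._num_global_experts"):
        print(k, tuple(v.shape) if hasattr(v, "shape") else v)
```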

Thank you for your assistance!

ywxsuperstar commented 2 hours ago

If I load the Swin-Transformer-MoE checkpoint directly, the following error occurs.

Or, to be a bit more detailed:

```
    load_pretrained(config, model_without_ddp, logger)
[rank0]:   File "/ai_home/data/private/ywx/Swin-Transformer/utils_moe.py", line 217, in load_pretrained
[rank0]:     msg = model.load_state_dict(state_dict, strict=False)
[rank0]:   File "/opt/conda/envs/tutel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
[rank0]:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for SwinTransformerMoE:
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.3.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.5.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.7.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.9.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.11.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.13.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.15.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]: size mismatch for layers.2.blocks.17.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 3072, 768]) from checkpoint, the shape in current model is torch.Size([32, 3072, 768]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 3072, 768]) from checkpoint, the shape in current model is torch.Size([32, 3072, 768]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 3072]) from checkpoint, the shape in current model is torch.Size([32, 3072]).
[rank0]: size mismatch for layers.3.blocks.1.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 768]) from checkpoint, the shape in current model is torch.Size([32, 768]).
```
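Judging by the shapes in the trace, each sub-checkpoint stores only its own expert slice ([1, ...] on dim 0), while my current model builds all 32 experts locally ([32, ...]). The biases additionally change rank: [1, 1, N] per shard versus [32, N] in the model. So on top of the concatenation in the sketch from my previous comment, the biases would presumably need a reshape as well; continuing that sketch (this is my guess, not a verified fix):

```python
# Continuing the earlier sketch: weights only need torch.cat along dim 0,
# but bias shards are [1, 1, N] each, while the model expects [32, N].
for key in merged:
    if ".experts.batched_fc" in key and "bias" in key:
        merged[key] = merged[key].reshape(len(shards), -1)  # [32,1,N] -> [32,N]
```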