ywxsuperstar opened this issue 4 days ago
If I load the Swin-Transformer-MoE checkpoint directly, an error occurs.
Or, in a bit more detail, the failure comes from load_pretrained(config, model_without_ddp, logger):

[rank0]:   File "/ai_home/data/private/ywx/Swin-Transformer/utils_moe.py", line 217, in load_pretrained
[rank0]:     msg = model.load_state_dict(state_dict, strict=False)
[rank0]:   File "/opt/conda/envs/tutel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
[rank0]:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for SwinTransformerMoE:
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).

(The same four size mismatches are reported for layers.2.blocks.3, 5, 7, 9, 11, 13, 15, and 17, and for layers.3.blocks.1, where the weight shapes are [1, 3072, 768] in the checkpoint vs [32, 3072, 768] in the model, and the bias shapes are [1, 1, 3072] vs [32, 3072] and [1, 1, 768] vs [32, 768].)
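The [1, ...] vs [32, ...] pattern suggests the file holds only the single expert owned by one of the 32 training ranks, while a 1-GPU run expects all 32 experts locally. Below is a minimal diagnostic sketch for listing the mismatching expert parameters; it assumes the release wraps its weights under a "model" key, and the helper name is made up for illustration, not part of the repo:

import torch

def report_expert_shape_mismatches(ckpt_path, model):
    """Print every MoE expert parameter whose checkpoint shape differs from the model's."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)          # assumed: weights wrapped under a "model" key
    model_state = model.state_dict()
    for name, tensor in state_dict.items():
        if "_moe_layer.experts" in name and name in model_state:
            ck, cur = tuple(tensor.shape), tuple(model_state[name].shape)
            if ck != cur:
                print(f"{name}: checkpoint {ck} vs model {cur}")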
The pretrained checkpoint may be an old one that was compatible with a legacy Tutel version. Can you provide the checkpoint link you used, and if possible the Swin command you use to load it?
Hi, I loaded the checkpoint from https://github.com/SwinTransformer/storage/releases/download/v2.0.2/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.zip.
I used this command:

torchrun --nproc_per_node=1 --nnode=1 --master_port 12347 main_moe.py --cfg configs/swinmoe/swin_moe_small_patch4_window12_192_32expert_32gpu_1k 128.yaml --data-path imagenet --batch-size 128 --pretrained swin_moe_small_patch4_window12_192_32expert_32gpu_22k/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth
(For "swin_moe_small_patch4_window12_192_32expert_32gpu_1k", I used imagenet 1k to fintuning, and I only modify the dataset)
If you have any suggestions or improvements, please let me know. Thank you!
I just merged a PR (https://github.com/microsoft/tutel/pull/249) to ensure checkpoint compatibility with the legacy format.
Can you upgrade your Tutel installation and follow the new steps to convert the Swin checkpoints, and see if it works?
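For intuition about what the conversion has to produce (given the shapes in the error above), here is a rough sketch of gathering per-rank expert tensors into one state dict. This is only an illustration under the assumptions that the shards are named "<base>.rank{r}", carry a leading local-expert dimension, and wrap their weights under a "model" key; the supported path is the Tutel conversion steps linked above, not this code:

import torch

def gather_expert_tensors(base_path, world_size):
    """Illustrative only: merge per-rank expert shards into a single state dict."""
    merged = {}
    for r in range(world_size):
        shard = torch.load(f"{base_path}.rank{r}", map_location="cpu")
        state = shard.get("model", shard)                 # assumed "model" wrapper key
        for k, v in state.items():
            if "_moe_layer.experts" in k:
                merged.setdefault(k, []).append(v)        # collect each rank's local expert(s)
            else:
                merged[k] = v                             # dense weights are assumed identical across ranks
    for k, parts in merged.items():
        if isinstance(parts, list):
            stacked = torch.cat(parts, dim=0)             # e.g. 32 x [1, 1536, 384] -> [32, 1536, 384]
            if "bias" in k:
                stacked = stacked.reshape(stacked.shape[0], -1)  # e.g. [32, 1, 1536] -> [32, 1536]
            merged[k] = stacked
    return merged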
Thank you for your modifications! I upgraded the Tutel installation and followed the steps to merge the checkpoint. However, the issue still persists: when I print the merged checkpoint, its shapes still do not match the model.
Error:

[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).

(The same three size mismatches are reported for layers.2.blocks.3, 5, 7, 9, 11, 13, 15, and 17, and for layers.3.blocks.1, where the weight shapes are [1, 3072, 768] vs [32, 3072, 768] and the fc1_bias shape is [1, 1, 3072] vs [32, 3072].)
So it seems the merge produced incorrect dimensions, and not all of the expert parameters were combined properly?
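One way to check this without instantiating the model is to look at the leading dimension of each merged expert tensor, which should equal the expected number of local experts. A small sketch (the helper name and the "model" wrapper key are assumptions):

import collections
import torch

def expert_count_histogram(ckpt_path):
    """Count the leading (expert) dimension of every batched expert parameter."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)
    hist = collections.Counter()
    for k, v in state.items():
        if "_moe_layer.experts.batched_fc" in k:
            hist[v.shape[0]] += 1
    return hist   # ideally a single key equal to the expected local-expert count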
I think you may not have followed the instructions correctly. The example in the tutorial should work exactly for your zip file. You should end up with a new checkpoint folder new_swin_moe_small_for_2_gpus/ containing 3 files in total. Please check whether your new checkpoint folder matches the file list below:
$ ls -ls new_swin_moe_small_for_2_gpus/
total 2121952
202872 -rw-r--r-- 1 root root 207739264 Oct 28 04:20 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.master
959540 -rw-r--r-- 1 root root 982561440 Oct 28 04:19 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.rank0
959540 -rw-r--r-- 1 root root 982561440 Oct 28 04:19 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.rank1
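For reference, a split checkpoint like this is typically consumed per process: each rank loads the shard named after its own global rank. A minimal sketch using the ".rank{r}" naming from the listing above (illustrative only, not the repo's actual loader in utils_moe.py):

import os
import torch
import torch.distributed as dist

def load_my_shard(base_path):
    """Load the checkpoint shard belonging to this process, e.g. '<base>.rank0' on rank 0."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    shard_path = f"{base_path}.rank{rank}"
    if not os.path.exists(shard_path):
        raise FileNotFoundError(f"missing shard for rank {rank}: {shard_path}")
    return torch.load(shard_path, map_location="cpu")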
Feel free to let us know if the issue is still unsolved.
Thank you very much. The issue has been solved.
Hi,
I have downloaded the checkpoint for the 32-expert Swin-Transformer-MoE. However, it is split into per-rank sub-checkpoints distributed across 32 ranks. I want to load these sub-checkpoints and fine-tune the model on a 2-GPU machine.
To do this, should I first gather the sub-checkpoints into a single checkpoint? I tried the gather.py script, but it did not work. Could you help me understand what went wrong?
Additionally, I checked the original code and found that the condition "if k.endswith('._num_global_experts')" returns False. Is this due to the format of the Swin-Transformer-MoE checkpoint? I'm quite confused about this.
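One way to investigate this locally is to list, for one of the rank shards, every key ending in '._num_global_experts'. A minimal sketch (the helper name and the "model" wrapper key are assumptions); if the list comes back empty, the shards were most likely saved by an older Tutel that did not record this key, which would explain the condition evaluating to False:

import torch

def find_num_global_experts_keys(shard_path):
    """List state-dict keys ending in '._num_global_experts' in one checkpoint shard."""
    shard = torch.load(shard_path, map_location="cpu")
    state = shard.get("model", shard)
    hits = [k for k in state if k.endswith("._num_global_experts")]
    print(f"{shard_path}: found {len(hits)} '_num_global_experts' keys")
    for k in hits:
        print("  ", k, state[k])
    return hits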
Thank you for your assistance!