ywxsuperstar opened this issue 4 days ago
If I load the Swin-Transformer-MoE checkpoint directly, an error occurs.
Or, in a bit more detail, the failure comes from load_pretrained(config, model_without_ddp, logger):

[rank0]:   File "/ai_home/data/private/ywx/Swin-Transformer/utils_moe.py", line 217, in load_pretrained
[rank0]:     msg = model.load_state_dict(state_dict, strict=False)
[rank0]:   File "/opt/conda/envs/tutel/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
[rank0]:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for SwinTransformerMoE:
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_bias: copying a param with shape torch.Size([1, 1, 384]) from checkpoint, the shape in current model is torch.Size([32, 384]).

(The same four size mismatches are reported for layers.2.blocks.3, 5, 7, 9, 11, 13, 15, and 17, and for layers.3.blocks.1, where the weight shapes are [1, 3072, 768] in the checkpoint vs [32, 3072, 768] in the model, and the bias shapes are [1, 1, 3072] vs [32, 3072] and [1, 1, 768] vs [32, 768].)
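The [1, ...] vs [32, ...] pattern suggests the file holds only the single expert owned by one of the 32 training ranks, while a 1-GPU run expects all 32 experts locally. Below is a minimal diagnostic sketch for listing the mismatching expert parameters; it assumes the release wraps its weights under a "model" key, and the helper name is made up for illustration, not part of the repo:

import torch

def report_expert_shape_mismatches(ckpt_path, model):
    """Print every MoE expert parameter whose checkpoint shape differs from the model's."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)          # assumed: weights wrapped under a "model" key
    model_state = model.state_dict()
    for name, tensor in state_dict.items():
        if "_moe_layer.experts" in name and name in model_state:
            ck, cur = tuple(tensor.shape), tuple(model_state[name].shape)
            if ck != cur:
                print(f"{name}: checkpoint {ck} vs model {cur}")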
The pretrained checkpoint may be an old one that was compatible with a legacy Tutel version. Can you provide the checkpoint link you used, and if possible the Swin command you use to load it?
Hi, I loaded the checkpoint from https://github.com/SwinTransformer/storage/releases/download/v2.0.2/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.zip.
I used this command:

torchrun --nproc_per_node=1 --nnode=1 --master_port 12347 main_moe.py --cfg configs/swinmoe/swin_moe_small_patch4_window12_192_32expert_32gpu_1k 128.yaml --data-path imagenet --batch-size 128 --pretrained swin_moe_small_patch4_window12_192_32expert_32gpu_22k/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth
(For "swin_moe_small_patch4_window12_192_32expert_32gpu_1k", I used imagenet 1k to fintuning, and I only modify the dataset)
If you have any suggestions or improvements, please let me know. Thank you!
I just merged a PR (https://github.com/microsoft/tutel/pull/249) to ensure checkpoint compatibility with the legacy format.
Can you upgrade your Tutel installation and follow the new steps to convert the Swin checkpoints, and see if it works?
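For intuition about what the conversion has to produce (given the shapes in the error above), here is a rough sketch of gathering per-rank expert tensors into one state dict. This is only an illustration under the assumptions that the shards are named "<base>.rank{r}", carry a leading local-expert dimension, and wrap their weights under a "model" key; the supported path is the Tutel conversion steps linked above, not this code:

import torch

def gather_expert_tensors(base_path, world_size):
    """Illustrative only: merge per-rank expert shards into a single state dict."""
    merged = {}
    for r in range(world_size):
        shard = torch.load(f"{base_path}.rank{r}", map_location="cpu")
        state = shard.get("model", shard)                 # assumed "model" wrapper key
        for k, v in state.items():
            if "_moe_layer.experts" in k:
                merged.setdefault(k, []).append(v)        # collect each rank's local expert(s)
            else:
                merged[k] = v                             # dense weights are assumed identical across ranks
    for k, parts in merged.items():
        if isinstance(parts, list):
            stacked = torch.cat(parts, dim=0)             # e.g. 32 x [1, 1536, 384] -> [32, 1536, 384]
            if "bias" in k:
                stacked = stacked.reshape(stacked.shape[0], -1)  # e.g. [32, 1, 1536] -> [32, 1536]
            merged[k] = stacked
    return merged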
Thank you for your modifications! I upgraded the Tutel installation and followed the steps to merge the checkpoint. However, the issue still persists: when I print the merged checkpoint, its shapes still do not match the model.
Error:

[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_w: copying a param with shape torch.Size([1, 1536, 384]) from checkpoint, the shape in current model is torch.Size([32, 1536, 384]).
[rank0]:   size mismatch for layers.2.blocks.1.mlp._moe_layer.experts.batched_fc1_bias: copying a param with shape torch.Size([1, 1, 1536]) from checkpoint, the shape in current model is torch.Size([32, 1536]).

(The same three size mismatches are reported for layers.2.blocks.3, 5, 7, 9, 11, 13, 15, and 17, and for layers.3.blocks.1, where the weight shapes are [1, 3072, 768] vs [32, 3072, 768] and the fc1_bias shape is [1, 1, 3072] vs [32, 3072].)
So it seems the merge produced incorrect dimensions, and not all of the expert parameters were combined properly?
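One way to check this without instantiating the model is to look at the leading dimension of each merged expert tensor, which should equal the expected number of local experts. A small sketch (the helper name and the "model" wrapper key are assumptions):

import collections
import torch

def expert_count_histogram(ckpt_path):
    """Count the leading (expert) dimension of every batched expert parameter."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model", ckpt)
    hist = collections.Counter()
    for k, v in state.items():
        if "_moe_layer.experts.batched_fc" in k:
            hist[v.shape[0]] += 1
    return hist   # ideally a single key equal to the expected local-expert count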
I think you may not have followed the instructions correctly. The example in the tutorial should work exactly for your zip file. You should end up with a new checkpoint folder new_swin_moe_small_for_2_gpus/ containing 3 files in total. Please check whether your new checkpoint folder matches the file list below:
$ ls -ls new_swin_moe_small_for_2_gpus/
total 2121952
202872 -rw-r--r-- 1 root root 207739264 Oct 28 04:20 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.master
959540 -rw-r--r-- 1 root root 982561440 Oct 28 04:19 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.rank0
959540 -rw-r--r-- 1 root root 982561440 Oct 28 04:19 swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth.rank1
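For reference, a split checkpoint like this is typically consumed per process: each rank loads the shard named after its own global rank. A minimal sketch using the ".rank{r}" naming from the listing above (illustrative only, not the repo's actual loader in utils_moe.py):

import os
import torch
import torch.distributed as dist

def load_my_shard(base_path):
    """Load the checkpoint shard belonging to this process, e.g. '<base>.rank0' on rank 0."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    shard_path = f"{base_path}.rank{rank}"
    if not os.path.exists(shard_path):
        raise FileNotFoundError(f"missing shard for rank {rank}: {shard_path}")
    return torch.load(shard_path, map_location="cpu")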
Feel free to let us know if the issue is still unsolved.
Thank you very much. The issue has been solved.
Hi,
I have downloaded the checkpoint for the 32-expert Swin-Transformer-MoE. However, it is split into per-rank sub-checkpoints distributed across 32 ranks. I want to load these sub-checkpoints and fine-tune the model on a 2-GPU machine.
To do this, should I first gather the sub-checkpoints into a single checkpoint? I tried the gather.py script, but it did not work. Could you help me understand what went wrong?
Additionally, I checked the original code and found that the condition "if k.endswith('._num_global_experts')" returns False. Is this due to the format of the Swin-Transformer-MoE checkpoint? I'm quite confused about this.
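One way to investigate this locally is to list, for one of the rank shards, every key ending in '._num_global_experts'. A minimal sketch (the helper name and the "model" wrapper key are assumptions); if the list comes back empty, the shards were most likely saved by an older Tutel that did not record this key, which would explain the condition evaluating to False:

import torch

def find_num_global_experts_keys(shard_path):
    """List state-dict keys ending in '._num_global_experts' in one checkpoint shard."""
    shard = torch.load(shard_path, map_location="cpu")
    state = shard.get("model", shard)
    hits = [k for k in state if k.endswith("._num_global_experts")]
    print(f"{shard_path}: found {len(hits)} '_num_global_experts' keys")
    for k in hits:
        print("  ", k, state[k])
    return hits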
Thank you for your assistance!