microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

Can this package support a one-GPU machine? #206

Open momo1986 opened 1 year ago

momo1986 commented 1 year ago

Hi, dear Tutel team.

I have run the script with some small modifications:

python -u main_moe.py --cfg configs/swinmoe/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.yaml --data-path /data/user1/junyan/datasets/ImageNet/ImageNet_Val --batch-size 128 --resume checkpoints/swin_moe_small_patch4_window12_192_32expert_32gpu_22k/swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth

However, I received the following error message:

File "main_moe.py", line 374, in main(config) File "main_moe.py", line 141, in main max_accuracy = load_checkpoint(config, model_without_ddp, optimizer, lr_scheduler, loss_scaler, logger) File "/data/user1/junyan/adv_training/Swin-Transformer/utils_moe.py", line 45, in load_checkpoint msg = model.load_state_dict(checkpoint['model'], strict=False) File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1039, in load_state_dict load(self) File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1037, in load load(child, prefix + name + '.') File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1037, in load load(child, prefix + name + '.') File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1037, in load load(child, prefix + name + '.') [Previous line repeated 3 more times] File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1034, in load state_dict, prefix, local_metadata, True, missing_keys, unexpected_keys, error_msgs) File "/root/.local/lib/python3.6/site-packages/tutel/impls/moe_layer.py", line 54, in _load_from_state_dict assert buff_name in state_dict, "Could not find parameter %s in state_dict." % buff_name AssertionError: Could not find parameter layers.2.blocks.1.mlp._moe_layer.experts.batched_fc2_bias in state_dict.

I have only one GPU, and I am not sure whether multiple GPUs are essential for this task. Is it possible to run it on one GPU? Furthermore, how can I resolve this error?

I am looking forward to your response.

Thanks a lot.

Best Regards!

momo1986 commented 1 year ago

Thanks for your kind comment.

ghostplant commented 1 year ago

One GPU per machine? Can you explain how many machines you'd like to run it on? Or do you just want to run it using 1 GPU on 1 machine?

momo1986 commented 1 year ago

Hi @ghostplant,

I have several different one-GPU machines. To save computation resources, running the program on a one-GPU machine would be economical for me. Actually, I mainly study some specific properties of MoE, so if it is OK, I just want to run it using 1 GPU on 1 machine, as you mentioned.

ghostplant commented 1 year ago

If you run it on a one-GPU machine, you need to make sure the GPU memory is large enough to hold the parameters of all 32 experts. To convert swin_moe_small_patch4_window12_192_32expert_32gpu_22k.pth for single-GPU use, you can follow the utility here; its second example merges 32 different checkpoint files into a single checkpoint file that a single GPU can load.
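The linked utility is the intended tool for this conversion. Purely as an illustration of the idea, the sketch below shows roughly what such a merge amounts to; the shard file names, the 'model' key, and using dim 0 as the expert axis are assumptions here, not the actual Tutel implementation.

```python
# Illustration only -- not the actual Tutel utility. The idea: load the 32
# per-GPU checkpoint shards and concatenate each MoE expert tensor along the
# expert dimension so a single GPU can load all 32 experts at once.
import torch

shard_paths = [f"rank_{r}/swin_moe_checkpoint.pth" for r in range(32)]  # hypothetical names
shards = [torch.load(p, map_location="cpu")["model"] for p in shard_paths]

merged = dict(shards[0])  # non-expert weights are replicated, so take rank 0's copy
for key in shards[0]:
    if "._moe_layer.experts." in key:
        # Each rank holds only its own experts; stitch them back together.
        merged[key] = torch.cat([s[key] for s in shards], dim=0)

torch.save({"model": merged}, "swin_moe_merged_single_gpu.pth")
```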

momo1986 commented 1 year ago

Hi, @ghostplant. Thanks for your guidance. Can this package support testing on ImageNet with a single-GPU machine? Does the user need to implement this manually, or is there a relevant demo? Thanks & Regards! Momo
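For context, a minimal sketch of what single-GPU evaluation typically requires (this is an assumption about the setup, not a documented Tutel recipe): create a one-process torch.distributed group so the MoE layer has a communication context, then load the merged checkpoint and run the usual ImageNet validation loop.

```python
# Single-GPU setup sketch (an assumption, not a documented Tutel recipe):
# initialize a one-process distributed group before building the Swin-MoE model,
# since the MoE layer communicates through torch.distributed.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(
    backend="nccl" if torch.cuda.is_available() else "gloo",
    rank=0,
    world_size=1,
)

# From here: build the model, load the merged single-GPU checkpoint
# (e.g. the output of the merge sketch above), and run ImageNet validation
# on a single device as usual.
```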