microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

SMoE or X-MoE network: how to evaluate, save, and resume training #17

Closed · randomtutu closed this 1 year ago

randomtutu commented 1 year ago

Hello there,

I'm interested in using the X-MoE network, and I have some questions about how to evaluate its performance on a validation set, how to save checkpoints, and how to resume training from a saved checkpoint.

Evaluation on a validation set: Could you provide some guidance on how to evaluate the X-MoE network on a validation set? I'm using Distributed Data Parallel (DDP) mode, and I'm wondering whether I need to run evaluation on all devices or only on one.

Saving checkpoints: What is the recommended way to save the X-MoE model's checkpoints during training? Given that each GPU has its own experts plus the shared parameters, should I save all the parameters on every device, or is there an API that centralizes the parameters and saves them once to avoid redundancy?

Resuming training from a saved checkpoint: What is the recommended way to resume training the X-MoE model from a saved checkpoint? Is there a specific API or command I should use?

Thank you in advance for your help. I'm looking forward to using the X-MoE network in my projects.

BaohaoLiao commented 1 year ago

I'm also interested in reproducing the fine-tuning results from the X-MoE paper. If you could release the related scripts and pre-trained models, that would be perfect.

shumingma commented 1 year ago


Evaluation: You can check the code for more details on evaluating and generating with MoE models. Evaluating on a single device is feasible as long as that device has enough GPU memory.
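
For a concrete starting point, here is a minimal single-device evaluation sketch. It is not torchscale API: `model` stands for any torchscale-built MoE language model, `valid_loader` for a DataLoader yielding `(input_ids, labels)` batches, and the per-token cross-entropy objective is an assumption about the task.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, valid_loader, device="cuda"):
    """Average per-token validation loss on a single device.

    Assumptions: `model` returns logits of shape [batch, seq, vocab],
    and `valid_loader` yields (input_ids, labels) tensor pairs.
    """
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for input_ids, labels in valid_loader:
        input_ids, labels = input_ids.to(device), labels.to(device)
        logits = model(input_ids)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            labels.reshape(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += labels.numel()
    model.train()
    return total_loss / total_tokens
```

If you do evaluate under DDP instead, each rank typically sees only its shard of the validation set, so you would need to all-reduce `total_loss` and `total_tokens` across ranks before dividing.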

Checkpoint: Here is an example of saving and loading MoE checkpoints. The dense part and the expert parts are stored separately to avoid redundancy.
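
To make the dense/expert split concrete, below is a hedged sketch of saving and resuming under DDP-style expert parallelism. Everything here is hypothetical rather than torchscale's actual checkpoint API: the `is_expert` name test, the file layout, and the function names are all assumptions, so adapt them to however your model actually marks its expert parameters.

```python
import os
import torch
import torch.distributed as dist

def is_expert(name: str) -> bool:
    # Assumption: expert parameters are identifiable by a name substring.
    # Replace with however your model actually tags expert parameters.
    return "expert" in name

def save_moe_checkpoint(model, optimizer, step, ckpt_dir):
    """Save the replicated dense weights once (rank 0) and each rank's experts.

    `model` is the unwrapped module (call .module on a DDP wrapper first);
    assumes torch.distributed is already initialized.
    """
    rank = dist.get_rank()
    state = model.state_dict()
    dense = {k: v for k, v in state.items() if not is_expert(k)}
    experts = {k: v for k, v in state.items() if is_expert(k)}
    os.makedirs(ckpt_dir, exist_ok=True)
    if rank == 0:
        # Dense weights are identical on every rank, so save them only once.
        # Note: a full implementation would also shard the expert part of the
        # optimizer state per rank; this sketch keeps it on rank 0 for brevity.
        torch.save(
            {"model": dense, "optimizer": optimizer.state_dict(), "step": step},
            os.path.join(ckpt_dir, f"checkpoint_{step}.pt"),
        )
    # Each rank owns different experts, so every rank saves its own shard.
    torch.save(experts, os.path.join(ckpt_dir, f"checkpoint_{step}_rank{rank}_experts.pt"))
    dist.barrier()  # make sure all shards are on disk before moving on

def load_moe_checkpoint(model, optimizer, step, ckpt_dir):
    """Resume: merge the shared dense part with this rank's expert shard."""
    rank = dist.get_rank()
    dense_ckpt = torch.load(
        os.path.join(ckpt_dir, f"checkpoint_{step}.pt"), map_location="cpu"
    )
    expert_state = torch.load(
        os.path.join(ckpt_dir, f"checkpoint_{step}_rank{rank}_experts.pt"),
        map_location="cpu",
    )
    model.load_state_dict({**dense_ckpt["model"], **expert_state})
    optimizer.load_state_dict(dense_ckpt["optimizer"])
    return dense_ckpt["step"]
```

This mirrors the redundancy point above: the dense part is written once instead of once per rank, while each expert shard is written exactly once by the rank that owns it.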