I'm also interested in reproducing the fine-tuned results from the X-MoE paper. If you could release the related scripts and pre-trained models, that would be perfect.
Hello there,
I'm interested in using the XMOE network, and I have some questions about how to evaluate its performance on a validation set, how to save checkpoints, and how to resume training from a saved checkpoint.
Evaluation on Validation Set: Could you please provide some guidance on how to evaluate the XMOE network on a validation set? Also, I'm using Distributed Data Parallel (DDP) mode, and I'm wondering whether I need to run evaluation on all devices or only on one.
Saving Checkpoints: How can I save the XMOE model's checkpoints during training, and what is the recommended way of doing this? Given that each GPU holds its own experts plus the shared parameters, should I save all parameters on every device, or is there an API that consolidates them to avoid redundancy?
Resuming Training from a Saved Checkpoint: How can I resume training the XMOE model from a saved checkpoint? Is there a specific API or command I should use?
Thank you in advance for your help. I'm looking forward to using the XMOE network in my projects.
Evaluation: You can check the code for more details on evaluation/generation with MoE models. Evaluating on a single device should be feasible as long as there is enough GPU memory.
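For what it's worth, a minimal sketch of single-device validation might look like the following. The batch layout, the model's forward signature, and the `evaluate` helper itself are assumptions for illustration, not the actual X-MoE API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, valid_loader, device):
    # Hypothetical helper: run validation on one device only. With DDP,
    # evaluating on a single rank is fine as long as the whole model
    # (including all experts) fits in that GPU's memory.
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for batch in valid_loader:
        inputs = batch["input"].to(device)    # assumed batch layout
        targets = batch["target"].to(device)
        logits = model(inputs)                # assumed forward signature
        loss = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            targets.view(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += targets.numel()
    model.train()
    return total_loss / total_tokens
```

Note that this assumes the full model (all experts) has been loaded onto the single device. During expert-parallel training the forward pass involves all-to-all communication, so if you instead evaluate inside the distributed setup, every rank has to enter the forward pass together.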
Checkpoint: Here is an example of saving and loading MoE checkpoints. The dense part and the expert parts are stored separately to avoid redundancy.
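To make that split concrete, here is a rough sketch, assuming a plain PyTorch module and an initialized `torch.distributed` process group. The name-based expert filter, the file layout, and both helper functions are assumptions for illustration, not the library's actual API:

```python
import torch
import torch.distributed as dist

def is_expert_param(name):
    # Assumption: expert parameters are identifiable by name; adjust this
    # filter to match how experts are tagged in your X-MoE implementation.
    return ".experts." in name

def save_moe_checkpoint(model, optimizer, step, prefix):
    rank = dist.get_rank()
    state = model.state_dict()
    dense = {k: v for k, v in state.items() if not is_expert_param(k)}
    experts = {k: v for k, v in state.items() if is_expert_param(k)}
    if rank == 0:
        # The dense (shared) part is replicated on every rank: save it once.
        torch.save({"model": dense, "step": step}, f"{prefix}-shared.pt")
    # Each rank owns a different slice of the experts, so every rank writes
    # its own shard. The optimizer state also differs per rank (it covers
    # that rank's experts), so it goes into the per-rank file as well.
    torch.save({"model": experts, "optimizer": optimizer.state_dict()},
               f"{prefix}-rank{rank}.pt")

def load_moe_checkpoint(model, optimizer, prefix):
    # Resume: every rank reads the shared file plus its own expert shard.
    rank = dist.get_rank()
    shared = torch.load(f"{prefix}-shared.pt", map_location="cpu")
    shard = torch.load(f"{prefix}-rank{rank}.pt", map_location="cpu")
    model.load_state_dict({**shared["model"], **shard["model"]})
    optimizer.load_state_dict(shard["optimizer"])
    return shared["step"]
```

With this layout, resuming training just means having each rank load the shared file plus its own shard before re-entering the training loop; the number of ranks at load time has to match the number at save time, since the expert shards are tied to ranks.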