microsoft / Tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation
MIT License

Question: Dictionary of Optimal Parallelism & Pipelining #247

Closed hikettei closed 2 months ago

hikettei commented 2 months ago

Hello.

I have been exploring the Tutel library and reading the original paper to understand the methods it uses. Specifically, I am interested in Section 3.3, "Dictionary of Optimal Parallelism & Pipelining". It describes dynamically searching for the optimal {r, d, a} configuration for a given capacity value using ternary search. However, I have not been able to find the corresponding code in the repository.

Additionally, as shown on page 11 of these slides, there seems to be a method for adjusting the r and d values based on cap_factor. Is there sample code or a module in the Tutel repository that implements this functionality? If so, could you point me to where I can find it?
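
To make the question concrete, here is a minimal sketch of the kind of dictionary being asked about. It is not code from Tutel. The names ternary_search_int, build_dictionary and toy_step_cost are hypothetical, the cost model is made up for illustration, and the search assumes the cost is unimodal in d for a fixed r. A real version would time actual training steps instead of calling a formula.

```python
from typing import Callable, Dict, Iterable, Tuple


def ternary_search_int(cost: Callable[[int], float], lo: int, hi: int) -> int:
    """Return an integer in [lo, hi] minimizing cost, assuming cost is unimodal."""
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if cost(m1) < cost(m2):
            hi = m2 - 1  # the minimum lies strictly left of m2
        else:
            lo = m1 + 1  # the minimum lies strictly right of m1
    # At most three candidates remain; check them directly.
    return min(range(lo, hi + 1), key=cost)


def toy_step_cost(r: int, d: int, cap_factor: float) -> float:
    """Hypothetical cost model (illustration only, not measured from Tutel)."""
    comm = 8.0 * cap_factor / r          # less communication with larger r
    compute = 0.5 * r                    # some overhead growing with r
    pipeline = 4.0 * cap_factor / d + 0.3 * d  # unimodal in d
    return comm + compute + pipeline


def build_dictionary(
    cap_factors: Iterable[float],
    r_choices: Iterable[int],
    d_max: int,
    cost: Callable[[int, int, float], float],
) -> Dict[float, Tuple[int, int]]:
    """Map each capacity factor to the best (r, d) found for it."""
    r_list = list(r_choices)
    table: Dict[float, Tuple[int, int]] = {}
    for f in cap_factors:
        best = None
        for r in r_list:
            # Ternary search over d for this fixed r and capacity factor.
            d = ternary_search_int(lambda dd: cost(r, dd, f), 1, d_max)
            c = cost(r, d, f)
            if best is None or c < best[0]:
                best = (c, r, d)
        table[f] = (best[1], best[2])
    return table


table = build_dictionary([1, 2, 4], [1, 2, 4, 8], 16, toy_step_cost)
for cap_factor, (r, d) in table.items():
    print(f"cap_factor={cap_factor}: r={r}, d={d}")
```

At run time, such a table would be looked up with the current capacity factor, and the stored (r, d) pair applied to the next step.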

I look forward to hearing from you.

ghostplant commented 2 months ago

Hi, you can see the code here: https://github.com/microsoft/tutel/blob/main/tutel/examples/helloworld_switch.py#L85-L88. It shows a simple way to enumerate all the different parallelism combinations in round-robin order, so that each step switches to an option different from the one used in the previous step. In other words, passing ..., adaptive_r=r, a2a_ffn_overlap_degree=o lets you instantly activate any one of the parallel options through a dynamic scheduler of your own design.
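
The round-robin switching described in this reply can be sketched roughly as follows. This is a simplified illustration, not the code from helloworld_switch.py. fake_moe_forward is a hypothetical stand-in for a model call that accepts the adaptive_r and a2a_ffn_overlap_degree keyword arguments, and the candidate values for r and o are placeholders.

```python
import itertools
from typing import Iterator, List, Sequence, Tuple


def parallel_options(r_choices: Sequence[int], o_choices: Sequence[int]) -> List[Tuple[int, int]]:
    """Enumerate every (adaptive_r, a2a_ffn_overlap_degree) combination."""
    return list(itertools.product(r_choices, o_choices))


def round_robin(options: Sequence[Tuple[int, int]]) -> Iterator[Tuple[int, int]]:
    """Cycle through the options forever, one option per training step."""
    return itertools.cycle(options)


def fake_moe_forward(x: float, adaptive_r: int, a2a_ffn_overlap_degree: int) -> str:
    # Hypothetical stand-in: a real model would run the MoE layer here.
    return f"x={x} r={adaptive_r} o={a2a_ffn_overlap_degree}"


# Placeholder candidate values, chosen only for illustration.
options = parallel_options(r_choices=[1, 2, 4], o_choices=[1, 2])
scheduler = round_robin(options)

history = []
for step in range(8):
    r, o = next(scheduler)  # a different option than the previous step
    history.append((r, o))
    print(step, fake_moe_forward(0.0, adaptive_r=r, a2a_ffn_overlap_degree=o))
```

A scheduler of your own design would replace round_robin, for example by timing each option for a few steps and then keeping the fastest one.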

hikettei commented 2 months ago

Okay, thank you for your prompt reply!