shawntan / scattermoe

Triton-based implementation of Sparse Mixture of Experts.
Apache License 2.0

Experts with different capacity #6

Closed CanyonWind closed 3 months ago

CanyonWind commented 3 months ago

Hi, thanks for the great work. I'm wondering whether it's possible to use ScatterMoE with experts of different capacities.

The scatter2scatter implementation is tightly coupled to the aggregated (stacked) expert weights. If each expert has a different capacity (input/output dimensions), is there any way to work around this so that scattermoe still applies? Many thanks.
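For context, a minimal sketch in plain PyTorch (not the scattermoe API; shapes are illustrative) of why stacking expert weights into one tensor forces every expert to share the same dimensions:

```python
import torch

# Illustrative sizes: E experts, model dim M, hidden dim N.
E, M, N = 8, 512, 1024

# Grouped expert weights live in one dense tensor, so every expert
# must share the same (M, N); a per-expert N_i cannot be stacked.
expert_weights = torch.randn(E, M, N)

# Grouped matmul over all experts at once, the pattern a fused kernel exploits.
x = torch.randn(E, 16, M)          # 16 tokens routed to each expert
y = torch.bmm(x, expert_weights)   # (E, 16, N)
```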

shawntan commented 3 months ago

How would you re-aggregate the outputs of the experts if they have different output dimensions?

CanyonWind commented 3 months ago

Each expert is an MLP rather than a single linear layer, e.g. expert_i = linear [M, N_i] + activation + linear [N_i, M], and the N_i differ across experts. Would it be possible to use scattermoe in this case?
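For concreteness, a sketch of the kind of heterogeneous experts meant here (plain PyTorch, hypothetical sizes):

```python
import torch.nn as nn

M = 512
hidden_sizes = [256, 512, 1024, 2048]   # N_i differs per expert (hypothetical)

experts = nn.ModuleList(
    nn.Sequential(nn.Linear(M, N_i), nn.GELU(), nn.Linear(N_i, M))
    for N_i in hidden_sizes
)
```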

shawntan commented 3 months ago

We envisioned experts that share a similar architecture composed of linear transforms, so this might be harder to achieve. Something that may be possible with some effort is granulating the different capacity sizes into multiples of a common block size, then adjusting the routing accordingly.
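One way to read this suggestion (a sketch under assumptions, in plain PyTorch rather than scattermoe code): split each expert's hidden dimension N_i into k_i = N_i / B blocks of a common width B. Because a two-layer MLP with an elementwise activation decomposes along its hidden dimension, an expert's output is the sum over its blocks, so routing a token to expert i is equivalent to routing it to all k_i of its equally sized "block-experts" and sharing the gate weight. The block width B, hidden sizes, and GELU activation below are hypothetical.

```python
import torch
import torch.nn.functional as F

B, M = 256, 512                       # common block width, model dim
hidden_sizes = [256, 512, 1024]       # N_i, each a multiple of B

# Flatten every expert into k_i = N_i // B equally sized block-experts.
w1_blocks, w2_blocks, owner = [], [], []
for i, N_i in enumerate(hidden_sizes):
    W1 = torch.randn(M, N_i) / M**0.5     # up-projection of expert i
    W2 = torch.randn(N_i, M) / N_i**0.5   # down-projection of expert i
    for W1_b, W2_b in zip(W1.split(B, dim=1), W2.split(B, dim=0)):
        w1_blocks.append(W1_b)
        w2_blocks.append(W2_b)
        owner.append(i)                   # block -> original expert

W1s = torch.stack(w1_blocks)              # (num_blocks, M, B): equal shapes now
W2s = torch.stack(w2_blocks)              # (num_blocks, B, M)

# Routing a token to expert i == routing it to every block owned by i,
# then summing the block outputs.
x = torch.randn(M)
i = 2                                     # chosen expert
blocks = [b for b, o in enumerate(owner) if o == i]
y_blocks = sum(F.gelu(x @ W1s[b]) @ W2s[b] for b in blocks)

# Reference: the unsplit expert gives the same result.
W1 = torch.cat([W1s[b] for b in blocks], dim=1)
W2 = torch.cat([W2s[b] for b in blocks], dim=0)
y_ref = F.gelu(x @ W1) @ W2
assert torch.allclose(y_blocks, y_ref, atol=1e-5)
```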

CanyonWind commented 3 months ago

I see, thanks