Closed CanyonWind closed 3 months ago
How would you re-aggregate the outputs of the experts if they have different output dimensions? Each expert is an MLP rather than a single linear layer, e.g. `expert_i = linear [M, N_i] + activation + linear [N_i, M]`, where `N_i` differs across experts. Would it still be possible to use ScatterMoE in this case?
We envisioned building experts with similar architectures composed of linear transforms, so this might be harder to achieve. Something that may be possible with some effort is granulating the different capacity sizes into multiples of a common block size, then adjusting the routing accordingly.
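To illustrate why the block-granulation idea can work: because the activation is elementwise and the down projection is linear, an expert with hidden size `N_i` can be split along its hidden dimension into `N_i / B` uniform "virtual experts" of size `B`, whose outputs sum to exactly the original expert's output. A token routed to expert `i` would then be routed to all of its virtual blocks. A minimal NumPy sketch (function names and the ReLU choice are illustrative, not part of ScatterMoE's API):

```python
import numpy as np

def split_expert(w_up, w_down, block):
    """Split one expert's [M, N_i] up-projection and [N_i, M] down-projection
    into N_i // block uniform virtual experts of hidden size `block`."""
    n = w_up.shape[1]
    assert n % block == 0, "capacity must be a multiple of the block size"
    ups = np.split(w_up, n // block, axis=1)      # column blocks of the up proj
    downs = np.split(w_down, n // block, axis=0)  # matching row blocks of the down proj
    return list(zip(ups, downs))

def expert_forward(x, w_up, w_down):
    # the full expert: linear -> ReLU -> linear
    return np.maximum(x @ w_up, 0.0) @ w_down

def blocked_forward(x, virtual_experts):
    # summing the virtual experts' outputs reproduces the full expert exactly,
    # since ReLU is elementwise and the down projection is linear
    return sum(np.maximum(x @ u, 0.0) @ d for u, d in virtual_experts)
```

With this decomposition, every virtual expert has the same shape, so the uniform-capacity kernel applies; the routing just needs to replicate each token's assignment across the blocks belonging to its chosen expert.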
I see, thanks
Hi, thanks for the great work. I'm wondering whether it's possible to leverage ScatterMoE with varying expert capacities. The `scatter2scatter` implementation is tightly coupled to the aggregated expert weights. If each expert has a different capacity (input/output dimensions), is there any way to tweak things so ScatterMoE still works? Many thanks.