microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

updt init #210

Open vchiley opened 1 year ago

vchiley commented 1 year ago
vchiley commented 11 months ago

> I also found a few issues in some other changes, where I am still investigating whether they have an influence on existing features.

Any updates on this?

ghostplant commented 11 months ago

For Pyramid MoE, I think different MoE layers don't share the same global expert counts, so it will be incompatible with a lot of those cases.

vchiley commented 10 months ago

> For Pyramid MoE, I think different MoE layers don't share the same global expert counts, so it will be incompatible with a lot of those cases.

We're talking about this line: `self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)`

`MOELayer.global_expert_count` is a static method which takes in the args of the specific MoE layer. `self._num_global_experts` is then set per layer, and one MoE layer's `_num_global_experts` shouldn't interfere with the next layer's, i.e. it shouldn't affect Pyramid MoE in any way.

Am I missing something?
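Just to illustrate what I mean, here is a rough sketch, not Tutel's actual implementation; `MOELayerSketch` is a made-up name and it assumes the global count is simply the local expert count times the process-group size:

```python
# Rough sketch only; not the Tutel implementation. Assumes each rank in
# `group` hosts `num_local_experts` experts of this layer.
import torch
import torch.distributed as dist

class MOELayerSketch(torch.nn.Module):
    @staticmethod
    def global_expert_count(num_local_experts, group=None):
        world_size = dist.get_world_size(group) if dist.is_initialized() else 1
        return num_local_experts * world_size

    def __init__(self, num_local_experts, group=None):
        super().__init__()
        self.num_local_experts = num_local_experts
        self.group = group
        # Plain per-instance attribute: each MoE layer computes its own value
        # from its own args, so layers with different expert counts
        # (e.g. Pyramid MoE) don't interfere with each other.
        self._num_global_experts = MOELayerSketch.global_expert_count(
            num_local_experts, group)
```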

vchiley commented 10 months ago

> `self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)`

Is no longer keeping `num_global_experts` as a buffer the only issue with the PR?

We could remove it from this PR and open a follow-up PR.
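For context on what the buffer question changes in practice, a generic sketch (not the PR diff; the class names are made up):

```python
import torch

class WithBuffer(torch.nn.Module):
    def __init__(self, count):
        super().__init__()
        # Registered buffer: appears in state_dict(), so it gets written to
        # and read back from checkpoint files.
        self.register_buffer('num_global_experts', torch.tensor(count))

class WithPlainAttribute(torch.nn.Module):
    def __init__(self, count):
        super().__init__()
        # Plain attribute: recomputed at construction time, never serialized.
        self._num_global_experts = count

print(list(WithBuffer(8).state_dict()))          # ['num_global_experts']
print(list(WithPlainAttribute(8).state_dict()))  # []
```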

ghostplant commented 10 months ago

> `self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)`
>
> Is no longer keeping `num_global_experts` as a buffer the only issue with the PR?
>
> We could remove it from this PR and open a follow-up PR.

Sure. The `non-ffn-ending-bias-training` feature is relatively more acceptable. However, for this feature, what I worry about is whether it may have a negative impact on the 2 things below:

If your changes can avoid the above problems, we are happy to accept your requests.

ghostplant commented 10 months ago

To keep the original design unchanged, and to stay compatible with legacy checkpoint files as well, I suggest your modification at least follow this: if `bias=True` (which should be the default setting), the parameter registration, seeding order, and seed-initialization design should all be no different from the original's in any case.
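A minimal illustration of why the order matters (my own example, not code from this repo): with a shared seed, the order in which parameters are created determines which random values each one receives, so reordering registration silently changes the initialization.

```python
import torch

def build(order):
    torch.manual_seed(0)
    m = torch.nn.Module()
    for name, shape in order:
        # Each parameter draws the next values from the shared RNG stream.
        m.register_parameter(name, torch.nn.Parameter(torch.randn(shape)))
    return m

a = build([('weight', (2, 2)), ('bias', (2,))])
b = build([('bias', (2,)), ('weight', (2, 2))])
print(torch.equal(a.weight, b.weight))  # False: same seed, different draw order
```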

vchiley commented 10 months ago

All tensor naming is still the same. If `bias=True`, the generated state dict will be the same.

I only made the layer init more conventional (along with using a more conventional `reset_parameters`), e.g. like a `torch.nn.Linear` layer. (Standard PyTorch attaches params in `__init__` instead of in an auxiliary `self.update(ctx)` fn. After params are attached, standard PyTorch calls `reset_parameters`.)

I just moved where parameter init happens; I didn't change how init happens. The seeding is unchanged. The parameters are still initialized in an `nn.Linear` layer and copied over.
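For reference, the conventional pattern being described looks roughly like this (a generic sketch modeled on `torch.nn.Linear`, not the actual diff; `ExpertFFN` is just an illustrative name):

```python
import math
import torch

class ExpertFFN(torch.nn.Module):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        # Parameters are attached directly in __init__ ...
        self.weight = torch.nn.Parameter(torch.empty(out_features, in_features))
        if bias:
            self.bias = torch.nn.Parameter(torch.empty(out_features))
        else:
            self.register_parameter('bias', None)
        # ... and then filled in by reset_parameters, as torch.nn.Linear does.
        self.reset_parameters()

    def reset_parameters(self):
        # Same default scheme as torch.nn.Linear.
        torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = torch.nn.init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
            torch.nn.init.uniform_(self.bias, -bound, bound)
```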

I'll open a new PR without `self._num_global_experts`.