microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

updt init #210

Open vchiley opened 1 year ago

vchiley commented 1 year ago
vchiley commented 11 months ago

> I also found a few issues in some other changes, where I am still investigating whether they have an influence on existing features.

Any updates on this?

ghostplant commented 11 months ago

For Pyramid MoE, I think different MoE layers don't share the same global expert counts, so it will be incompatible with a lot of those cases.

vchiley commented 10 months ago

> For Pyramid MoE, I think different MoE layers don't share the same global expert counts, so it will be incompatible with a lot of those cases.

We're talking about this line: `self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)`

`MOELayer.global_expert_count` is a static method which takes in the args of the specific MoE layer. `self._num_global_experts` is then set per layer, and one MoE layer's `_num_global_experts` shouldn't interfere with the next layer's, i.e. it shouldn't affect Pyramid MoE in any way.

Am I missing something?
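Just to illustrate what I mean, here is a rough sketch, not Tutel's actual implementation; `MOELayerSketch` is a made-up name and it assumes the global count is simply the local expert count times the process-group size:

```python
# Rough sketch only; not the Tutel implementation. Assumes each rank in
# `group` hosts `num_local_experts` experts of this layer.
import torch
import torch.distributed as dist

class MOELayerSketch(torch.nn.Module):
    @staticmethod
    def global_expert_count(num_local_experts, group=None):
        world_size = dist.get_world_size(group) if dist.is_initialized() else 1
        return num_local_experts * world_size

    def __init__(self, num_local_experts, group=None):
        super().__init__()
        self.num_local_experts = num_local_experts
        self.group = group
        # Plain per-instance attribute: each MoE layer computes its own value
        # from its own args, so layers with different expert counts
        # (e.g. Pyramid MoE) don't interfere with each other.
        self._num_global_experts = MOELayerSketch.global_expert_count(
            num_local_experts, group)
```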

vchiley commented 10 months ago

> `self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)`

Is no longer keeping `num_global_experts` as a buffer the only issue with the PR?

We could remove it from this PR and open a follow-up PR.
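For context on what the buffer question changes in practice, a generic sketch (not the PR diff; the class names are made up):

```python
import torch

class WithBuffer(torch.nn.Module):
    def __init__(self, count):
        super().__init__()
        # Registered buffer: appears in state_dict(), so it gets written to
        # and read back from checkpoint files.
        self.register_buffer('num_global_experts', torch.tensor(count))

class WithPlainAttribute(torch.nn.Module):
    def __init__(self, count):
        super().__init__()
        # Plain attribute: recomputed at construction time, never serialized.
        self._num_global_experts = count

print(list(WithBuffer(8).state_dict()))          # ['num_global_experts']
print(list(WithPlainAttribute(8).state_dict()))  # []
```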

ghostplant commented 10 months ago

> `self._num_global_experts = MOELayer.global_expert_count(self.num_local_experts, self.group)`
>
> Is no longer keeping `num_global_experts` as a buffer the only issue with the PR?
>
> We could remove it from this PR and open a follow-up PR.

Sure. The `non-ffn-ending-bias-training` feature is relatively more acceptable. However, for this feature, what I worry about is whether it may have a negative impact on the 2 things below:

If your changes can avoid the above problems, we are happy to accept your requests.

ghostplant commented 10 months ago

To keep the original design unchanged, and to stay compatible with legacy checkpoint files as well, I suggest your modification at least follow this: if `bias=True` (which should be the default setting), the parameter registration, seeding order, and seed-initialization design should all be no different from the original's in any case.
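A minimal illustration of why the order matters (my own example, not code from this repo): with a shared seed, the order in which parameters are created determines which random values each one receives, so reordering registration silently changes the initialization.

```python
import torch

def build(order):
    torch.manual_seed(0)
    m = torch.nn.Module()
    for name, shape in order:
        # Each parameter draws the next values from the shared RNG stream.
        m.register_parameter(name, torch.nn.Parameter(torch.randn(shape)))
    return m

a = build([('weight', (2, 2)), ('bias', (2,))])
b = build([('bias', (2,)), ('weight', (2, 2))])
print(torch.equal(a.weight, b.weight))  # False: same seed, different draw order
```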

vchiley commented 10 months ago

All tensor naming is still the same. If `bias=True`, the generated state dict will be the same.

I only made the layer init more conventional (along with using a more conventional `reset_parameters`), e.g. like a `torch.nn.Linear` layer. (Standard PyTorch attaches params in `__init__` instead of in an auxiliary `self.update(ctx)` fn. After params are attached, standard PyTorch calls `reset_parameters`.)

I just moved where parameter init happens; I didn't change how init happens. The seeding is unchanged. The parameters are still initialized in an `nn.Linear` layer and copied over.
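For reference, the conventional pattern being described looks roughly like this (a generic sketch modeled on `torch.nn.Linear`, not the actual diff; `ExpertFFN` is just an illustrative name):

```python
import math
import torch

class ExpertFFN(torch.nn.Module):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        # Parameters are attached directly in __init__ ...
        self.weight = torch.nn.Parameter(torch.empty(out_features, in_features))
        if bias:
            self.bias = torch.nn.Parameter(torch.empty(out_features))
        else:
            self.register_parameter('bias', None)
        # ... and then filled in by reset_parameters, as torch.nn.Linear does.
        self.reset_parameters()

    def reset_parameters(self):
        # Same default scheme as torch.nn.Linear.
        torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = torch.nn.init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
            torch.nn.init.uniform_(self.bias, -bound, bound)
```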

I'll open a new PR without `self._num_global_experts`.