microsoft / tutel

Tutel MoE: An Optimized Mixture-of-Experts Implementation

Non-surface function utilities only work for contiguous input data #218

Open lyd126 opened 9 months ago

lyd126 commented 9 months ago

According to the paper, when the 'expert' count is set to 1, the score (scores = F.softmax(logits_w_noise, dim=1)) should always equal 1. Consequently, the output variable "y" (y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype), in moe_layer.py, line 304) should equal the input variable "x". However, in my experiment, "x" and "y" are sometimes different. The difference first appears in "ctx.config.func_fwd(g, i, l, reshaped_input, dispatched_input, extra=[ctx.config.indices[0].size(0), ctx.config.aligned_dim, ctx.config.capacity])" in fast_dispatch.py, line 28, and the root cause is "tutel_custom_kernel.invoke(inputs, extra, blocks, ctx)" in jit_compiler.py, line 33. How can I fix this problem?
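To illustrate the expectation, here is a minimal standalone check (plain PyTorch only, not Tutel code): with a single expert the gate produces one logit per token, so the softmax score is exactly 1 and scaling by it should be a no-op.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(6, 1)            # 6 tokens, a single expert
scores = F.softmax(logits, dim=1)     # softmax over one column is exactly 1
assert torch.all(scores == 1.0)

x = torch.randn(6, 1024)
y = scores * x                        # weighting by the score changes nothing
assert torch.equal(x, y)
```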

ghostplant commented 9 months ago

Can you explain why you expect "x == y" for y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype)?

lyd126 commented 9 months ago

Thank you for your reply. To clarify: when running the example code (# Input Example: import torch; x = torch.ones([6, 1024], device='cuda:0') ...), with the 'expert' count set to 1 and no activation function, I observe that 'scores' always equals 1 and the output variable 'y' is the same as the input variable 'x' (y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype)). The exact moe_layer construction is reproduced in the snippet below.

However, when I replace the fully connected layer in my own model with exactly the same MoE layer, keeping the same input and weights as in the example code, the results "y" are entirely different from those in the example. I have identified that the discrepancy occurs specifically at the line y = fast_encode(x.to(logits_dtype), crit, self.is_postscore).to(x.dtype). I believe the two models should produce the same output, but the results are completely different, and I am unsure where the issue lies. Please refer to the attached file for specific details: Issue_detial.docx. Extremely grateful for your response.
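Here is that construction, with the imports and forward call added for completeness (the identity activation_fn stands in for "no activation function"):

```python
import torch
from tutel import moe as tutel_moe

x = torch.ones([6, 1024], device='cuda:0')

moe_layer = tutel_moe.moe_layer(
    gate_type={'type': 'top', 'k': 1, 'capacity_factor': 0, 'gate_noise': 1.0},
    experts={'type': 'ffn', 'count_per_node': 1, 'hidden_size_per_expert': 32,
             'output_dim': 32, 'activation_fn': lambda x: x},
    model_dim=x.shape[-1],
    scan_expert_func=lambda name, param: setattr(param, 'skip_allreduce', True),
).to('cuda:0')

y = moe_layer(x)
```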

ghostplant commented 9 months ago

Can you set gate_noise = 0 for both and check if they produce the same results?

lyd126 commented 9 months ago

Thank you very much for your prompt reply. I set gate_noise to 0 as you suggested, but the result is the same as before: the two outputs still differ. (When the expert count is 1, the softmax result is always 1, so perhaps changing the gate settings cannot affect the final result in this case?)

ghostplant commented 9 months ago

OK, can you help provide the following? For both setups, please add this line right after y = fast_encode(..):

In example code:

...
torch.save([x, crit, y], 'test_cast_example.py')

In model code:

...
torch.save([x, crit, y], 'test_cast_model.py')

That will help us reproduce the issue and look into what happens in your case. BTW, I assume you are using the default setting of self.is_postscore, which equals True, right?
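For example, once both files exist, the two dumps could be compared like this (just a sketch, assuming both were saved from the same device and dtype):

```python
import torch

# Load the tensors dumped by the torch.save calls suggested above.
x_ex, crit_ex, y_ex = torch.load('test_cast_example.py')
x_md, crit_md, y_md = torch.load('test_cast_model.py')

# If both setups really behave the same, the dispatched outputs should match.
print('inputs  match:', torch.allclose(x_ex, x_md))
print('outputs match:', torch.allclose(y_ex, y_md))
```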

lyd126 commented 9 months ago

Thank you again for your reply. I saved the results as you suggested; please see the attachments. As you said, self.is_postscore always equals True. In addition, I would also like to ask what self.is_postscore does. test_cast_example.zip test_cast_model.zip

ghostplant commented 9 months ago

Hi, the current fast_encode(x, ..) requires x to be contiguous, and your model case does not satisfy that, so you can get the correct result by calling fast_encode(x.contiguous(), ..). If you use MoELayer directly, it casts the input to contiguous beforehand (https://github.com/microsoft/tutel/blob/main/tutel/impls/moe_layer.py#L247), so that path does not hit this problem.

Thanks for this finding, since you are using the internal function utilities directly. You are welcome to create a PR that makes the input contiguous inside this function, so the assumption is always guaranteed.
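For illustration, a small plain-PyTorch example of the pitfall (not Tutel code): views produced by transpose/permute/slicing are often non-contiguous, and .contiguous() restores a dense layout without changing values.

```python
import torch

x = torch.randn(1024, 6).t()       # a transposed view is not contiguous
print(x.is_contiguous())           # False -- the situation hit in the model code

x_fixed = x.contiguous()           # same values, dense memory layout
print(x_fixed.is_contiguous())     # True -- safe to pass into fast_encode(...)
assert torch.equal(x, x_fixed)     # contents are identical
```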

lyd126 commented 9 months ago

I made the changes you mentioned and the problem was solved perfectly. I'd also like to ask: in the top-k=1 case, if I want to ignore the score, i.e. compute y = expert_1(x) + expert_2(x) + ... + expert_n(x) instead of y = score_1 * expert_1(x) + score_2 * expert_2(x) + ... + score_n * expert_n(x), how should I set that up?

ghostplant commented 9 months ago

For now, the score tensor is applied to either x or y, which is selected by is_postscore. Do you want to never use the score tensor at all? If so, the gating section becomes useless.

To force that, please do: scores = torch.ones_like(scores); moe.top_k_routing(scores, top_k=k); ..

lyd126 commented 9 months ago

I hope to use the scores to determine which expert each token goes to, i.e. y = expert_n(x) with n = argmax(softmax(score_1, score_2, ...)), but to ignore the scores in the output, i.e. y = expert_n(x) instead of y = score_n * expert_n(x). Is this possible?
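In other words, something like this toy sketch (plain PyTorch with two dummy experts, not Tutel code):

```python
import torch
import torch.nn.functional as F

experts = [torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)]   # two dummy experts
x = torch.randn(4, 8)
logits = torch.randn(4, len(experts))

# Use the softmaxed scores only to pick the expert for each token...
n = F.softmax(logits, dim=1).argmax(dim=1)

# ...but do not scale the chosen expert's output by its score.
y = torch.stack([experts[idx](row) for idx, row in zip(n.tolist(), x)])
```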

ghostplant commented 9 months ago

For your purpose, I think you need to delete self.gates_ from L125 and L129, and rebuild.

lyd126 commented 9 months ago

Thank you very much, the problem has been solved perfectly~~