kamanphoebe / Look-into-MoEs

A Closer Look into Mixture-of-Experts in Large Language Models
https://arxiv.org/abs/2406.18219
MIT License
37 stars, 0 forks

What is the difference between import models from transformers library and your implementation? #1

Open David-Li0406 opened 3 weeks ago

David-Li0406 commented 3 weeks ago

Hi,

Thanks for sharing the code; I am currently building on it. I only need to get the gate scores/indices from each model (part of "Norms of expert outputs and gate scores" in dynamic_analysis.ipynb).

In this case, is there any difference between using the model implementations from your repository (modeling_xxx.py) and importing directly from transformers? I ask because using your model implementation triggers a CUDA error for me.

Thanks for your help!

kamanphoebe commented 3 weeks ago

Thanks for your interest in our work!

Our modeling_xxx.py scripts are modified from the original model implementations on the HuggingFace Hub. Specifically, we modify the forward() function of the CausalLM class to add an extra argument, decoder_layer_idx, which implements the second pass with top_k=ALL, as described in Section 5 of our paper.
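For intuition, the idea behind the second pass can be sketched with a toy MoE layer whose gate either keeps the usual top-k routing or activates all experts. This is a minimal illustration, not the repository's actual code: the class, the `use_all_experts` flag, and all dimensions here are made up for the example (the real scripts thread a `decoder_layer_idx` argument through the HuggingFace model instead).

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    """Toy MoE layer for illustration only (not the repo's implementation)."""

    def __init__(self, hidden=8, n_experts=4, top_k=2):
        super().__init__()
        self.gate = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x, use_all_experts=False):
        # x: (tokens, hidden); router logits: (tokens, n_experts)
        probs = torch.softmax(self.gate(x), dim=-1)
        # Second pass: route through ALL experts instead of only top_k.
        k = probs.shape[-1] if use_all_experts else self.top_k
        scores, indices = torch.topk(probs, k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):                 # loop for clarity, not speed
            for slot in range(k):
                e = indices[t, slot].item()
                out[t] += scores[t, slot] * self.experts[e](x[t])
        return out, scores, indices
```

With `use_all_experts=True`, the returned scores cover the full softmax distribution (they sum to 1 per token), which is what lets the paper compare the selected experts against the ones the router would normally drop.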

Hence, if you only need the original top_k gate scores/indices (i.e., the second pass is unnecessary), importing from transformers should behave the same as using our code, as long as (1) your model checkpoints are compatible with it, and (2) any differences in variable names are handled.
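The top-k gate scores/indices themselves are just a softmax over the router logits followed by a top-k selection, so they can be recovered from either implementation once you have the logits. A minimal sketch of that computation (the function name and example logits are invented for illustration):

```python
import torch

def topk_gating(router_logits, top_k=2):
    """Derive gate scores and expert indices from raw router logits."""
    # Normalize logits into a probability distribution over experts.
    probs = torch.softmax(router_logits, dim=-1)
    # Keep the top_k experts per token; scores come back sorted descending.
    scores, indices = torch.topk(probs, top_k, dim=-1)
    return scores, indices

# One token routed over four experts.
logits = torch.tensor([[2.0, 0.5, 1.0, -1.0]])
scores, indices = topk_gating(logits, top_k=2)
# indices → experts 0 and 2, the two largest logits
```

In transformers, the raw logits can typically be obtained by passing `output_router_logits=True` to the forward call of MoE models such as Mixtral; check the model class you use, as support varies.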

Also, could you share the error message (and which model script triggers it) so that we can fix any potential bugs in our code? Thanks ;)

David-Li0406 commented 3 weeks ago

Thanks for the prompt response! The error appeared when I ran the "Norms of Expert Outputs and Gate Scores" part of dynamic_analysis.ipynb:

[screenshot of the error message]