microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License

[Question] what are the usages of multiway_network.py? #15

Closed · yiqiwang8177 closed this issue 1 year ago

yiqiwang8177 commented 1 year ago

Dear torchscale developers & researchers,

Thank you for making the implementation of torchscale public.

I have a question regarding the multiway_network usage in torchscale. BeitV3.py line 32 is the only place I found where a multiway wrapper is used, and it returns a multiway network that splits the input into two parts and applies two modules to them (in that example, position embeddings). Does that mean multiway_network only supports splitting a feature into two parts and applying an operation to each?
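For reference, here is a minimal sketch of what I understand the wrapper to be doing (simplified and renamed from multiway_network.py, so the exact names and signatures may differ):

```python
import copy

import torch
import torch.nn as nn


class MultiwaySketch(nn.Module):
    """Simplified sketch: two copies of a module, input split at a position."""

    def __init__(self, module, dim=1):
        super().__init__()
        self.dim = dim
        self.A = module                 # first branch
        self.B = copy.deepcopy(module)  # second branch
        self.split_position = -1        # -1 means: send everything to A

    def forward(self, x):
        if self.split_position == -1:
            return self.A(x)
        x1, x2 = torch.split(
            x,
            [self.split_position, x.size(self.dim) - self.split_position],
            dim=self.dim,
        )
        return torch.cat([self.A(x1), self.B(x2)], dim=self.dim)
```

So it looks hard-wired to exactly two branches, which is what prompted my question.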

According to Feedforward_network.py line 55, we could potentially have many FFN experts, so it is very likely that moe_counts > 2.

Then, how is multiway_network helpful for training a multiway transformer? I would expect it to provide a number of copies equal to moe_counts, not just 2.

I think I have probably misunderstood some part of the code. Could you provide some guidance or a reference?

Thank you very much!

NormXU commented 1 year ago

From my understanding, they are actually two different tricks.

Feedforward_network.py line 55 is used to initialize X-MoE in the FFN; each FFN can have, say, 32 experts.
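To illustrate, here is a rough top-1 routing sketch of what an FFN with many experts looks like. This is not the actual X-MoE code (which uses a more sophisticated router); the names and shapes here are made up:

```python
import torch
import torch.nn as nn


class MoEFFNSketch(nn.Module):
    """Rough sketch: route each token to one of `num_experts` FFN experts."""

    def __init__(self, embed_dim, ffn_dim, num_experts=32):
        super().__init__()
        self.gate = nn.Linear(embed_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(embed_dim, ffn_dim),
                nn.GELU(),
                nn.Linear(ffn_dim, embed_dim),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (batch, seq, embed_dim)
        flat = x.reshape(-1, x.size(-1))         # (tokens, embed_dim)
        expert_idx = self.gate(flat).argmax(-1)  # top-1 expert per token
        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(flat[mask])
        return out.reshape(x.shape)
```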

BeitV3.py line 32, on the other hand, keeps two sets of FFN parameters: one for the text modality, the other for the image modality. Their outputs can then be fused by the shared FFN on top, which is called the VL-FFN in the BEiT-3 paper.

Overall, X-MoE expands the width of an encoder for a single modality, while the multiway design is for multiple modalities.
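As a usage sketch, continuing the MultiwaySketch defined in the question above (shapes here are made up): text tokens go first in the sequence, image tokens after, and the split position is set to the text length.

```python
ffn = nn.Linear(64, 64)             # stand-in for a full FFN block
multiway_ffn = MultiwaySketch(ffn)  # branch A: text, branch B: image

text = torch.randn(2, 10, 64)       # (batch, text_len, dim)
image = torch.randn(2, 196, 64)     # (batch, image_len, dim)
tokens = torch.cat([text, image], dim=1)

multiway_ffn.split_position = text.size(1)  # text -> A, image -> B
out = multiway_ffn(tokens)                  # (2, 206, 64)
```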

shumingma commented 1 year ago

Yes, moe_count is only used for the X-MoE implementation, while the multiway is for multimodal modeling as described in the BEiT-3 paper.

We will update the README to make it clear.