microsoft / mttl

Building modular LMs with parameter-efficient fine-tuning.

Not sure about the details of arrow routing #107

Open Weifan1226 opened 2 weeks ago

Weifan1226 commented 2 weeks ago

Hi Team!

I have recently been working on implementing the Arrow Routing algorithm in my project. However, I'm facing a challenge due to my limited expertise, particularly in understanding the concept of "token in layer l" within the algorithm. My current understanding is that the hidden state post the attention layer serves as h_l. However, the output of a transformer layer is typically structured as [batch_size, tokens, hidden_size]. I am uncertain about how to proceed from this point.

Additionally, I am seeking clarity on the phrase "it routes differently in every layer and token, increasing the overall model expressivity." Does this imply a per-token routing mechanism? My current interpretation is that the output of each transformer layer determines the LoRA adjustments subsequently applied to the transformer layer, roughly as in the sketch below.
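To show where I am stuck, here is a shape-level sketch of my current understanding. All names here are hypothetical and not from the mttl codebase:

```python
import torch

# Hypothetical shapes; none of these names come from the mttl codebase.
batch_size, seq_len, hidden_size, num_experts = 2, 5, 4096, 8
h_l = torch.randn(batch_size, seq_len, hidden_size)   # the "token in layer l"?
prototypes = torch.randn(num_experts, hidden_size)    # one vector per LoRA expert

logits = h_l @ prototypes.T          # [batch_size, seq_len, num_experts]
weights = logits.softmax(dim=-1)     # an independent routing decision per token?
```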

[Screenshot 2024-09-01 23:25:38]

I would appreciate any guidance or insights you could provide to help me better understand these aspects of the Arrow Routing algorithm.

Thank you!

Best regards, Fanjunduo Wei

Weifan1226 commented 1 week ago

I think I now understand the "per token" concept; it is just like an MoE router. But my new question is: if the Arrow router selects LoRA A based on token a and LoRA B based on token b, then after processing token a with the base model + A, do I need to remove LoRA A's weights, load LoRA B into the base model, and then process token b?

pclucas14 commented 1 week ago

Hi!

Thanks for your interest in our work. Let me try to clarify Arrow routing. Just like with an MoE router, each token at each layer is routed individually. One difference, however, is that in a typical MoE, tokens are routed to MLP/FFN experts, whereas our experts are simple linear layers.

For example, on the Mistral model, say that we train LoRAs on the gate_proj, up_proj and down_proj of the MLP layer. Then, for each of gate_proj, up_proj and down_proj we have a LoRA adapter, meaning that within a given MLP block each token gets routed 3 times, once for each linear layer. In a standard MoE, by contrast, each token would be routed only once per MLP block, to an entire MLP expert. A sketch of this is below.
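Here is a minimal sketch of a per-token routed LoRA linear layer, assuming Arrow-style routing where each expert's prototype is the top right singular vector of its LoRA update and tokens are scored by the absolute dot product with it. Class and variable names are illustrative, not mttl's API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArrowRoutedLinear(nn.Module):
    """One base linear layer plus a library of LoRA experts, routed per token.

    prototypes[e] is meant to hold the top right singular vector of expert e's
    LoRA update (B_e @ A_e). All names are illustrative, not mttl's API.
    """
    def __init__(self, base, lora_A, lora_B, prototypes, top_k=2):
        super().__init__()
        self.base = base                                # nn.Linear(in_f, out_f)
        self.lora_A = nn.Parameter(lora_A)              # [E, in_f, r]
        self.lora_B = nn.Parameter(lora_B)              # [E, r, out_f]
        self.register_buffer("prototypes", prototypes)  # [E, in_f]
        self.top_k = top_k

    def forward(self, x):                               # x: [batch, seq, in_f]
        # Arrow score |h . v_e|: abs() because a singular vector's sign is arbitrary.
        scores = (x @ self.prototypes.T).abs()          # [batch, seq, E]
        top_val, top_idx = scores.topk(self.top_k, dim=-1)
        gate = F.softmax(top_val, dim=-1)               # per-token mixing weights
        # Simple dense version: compute every expert, then mask to the top-k.
        # (An efficient implementation would gather only the selected experts.)
        expert_out = torch.einsum("bsi,eir,ero->bseo", x, self.lora_A, self.lora_B)
        mix = torch.zeros_like(scores).scatter_(-1, top_idx, gate)
        return self.base(x) + torch.einsum("bse,bseo->bso", mix, expert_out)
```

In the Mistral example, gate_proj, up_proj and down_proj would each be wrapped like this, each with its own expert library and prototypes, which is why a token gets three independent routing decisions per MLP block.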

I am not sure I fully understood the last point regarding LoRA As and Bs. Whenever a token is routed to a given LoRA expert, that token will be processed by both the A and B projections of the LoRA adapter.
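On the weight-swapping question from the previous comment: nothing needs to be merged into or removed from the base weights, because the expert contributions can be computed on the fly. A toy check, continuing from the hypothetical ArrowRoutedLinear sketch above:

```python
# Continuing from the (hypothetical) ArrowRoutedLinear sketch above:
layer = ArrowRoutedLinear(
    base=nn.Linear(16, 16),
    lora_A=torch.randn(4, 16, 8),    # a library of 4 experts, rank 8
    lora_B=torch.randn(4, 8, 16),
    prototypes=torch.randn(4, 16),
    top_k=1,
)
x = torch.randn(1, 2, 16)  # tokens a and b in the same sequence
y = layer(x)               # each token may select a different expert,
print(y.shape)             # torch.Size([1, 2, 16]); no adapters were (un)loaded
```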

Hopefully that clarifies a few things! Lucas

herkerser commented 3 days ago

Hi Team!

I have recently been working on implementing the Arrow Routing algorithm in my project. However, I'm facing a challenge due to my limited expertise, particularly in understanding the concept of the "first right singular vector" within the algorithm. The algorithm uses the V matrix, but when I skimmed the code I found that the matrix U is used to compute the top vector. Why is that?
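To illustrate what I mean, here is a minimal example. My guess is that it comes down to a transposition convention: the right singular vectors of a matrix are the left singular vectors (the U) of its transpose, so if the code decomposes the update in the transposed (input-by-output) layout, reading U would be equivalent. But please correct me if I'm wrong:

```python
import torch

W = torch.randn(32, 16)  # e.g. a LoRA update B @ A, [out_features, in_features]

# Right singular vectors of W are the rows of Vh:
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
v_top = Vh[0]  # first right singular vector, lives in the input space

# The right singular vectors of W are the left singular vectors of W.T,
# so decomposing the transposed matrix and reading U gives the same vector:
U_t, S_t, Vh_t = torch.linalg.svd(W.T, full_matrices=False)
u_top = U_t[:, 0]

# Identical up to the (arbitrary) sign of a singular vector:
print(torch.allclose(v_top.abs(), u_top.abs(), atol=1e-4))  # True
```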