pjlab-sys4nlp / llama-moe

⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
https://arxiv.org/abs/2406.16554
Apache License 2.0
849 stars · 44 forks

How many llama models are used for constructing llama-moe? Is the MoE built from multiple llama models or a single one? #55

Open ZeyuTeng96 opened 8 months ago

ZeyuTeng96 commented 8 months ago

  1. How many llama models are used when constructing llama-moe? Is the MoE built from multiple llama models or from a single one?
  2. Does this repo split one llama model's FFN layers (via different partitioning methods) into multiple FFNs that act as experts, and then combine the remaining layers and weights of that llama model with the partitioned FFNs and gates to form an MoE model?
  3. Do you support merging the FFN layers of multiple llama-architecture models to build an MoE on top of one base llama model?
Spico197 commented 8 months ago

Hi there, thanks for your interest in this project ❤️

  1. LLaMA-MoE is constructed from ONE llama2-7B model.
  2. Yes, you are right. We only partition llama2's FFN layers into multiple experts, then initialize a gate for token routing (a rough sketch of the idea follows below).
  3. Currently this repo does not support that. But since all the candidate models would share the same architecture, I think it would not be difficult to implement.
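Below is a minimal, illustrative sketch of the idea in point 2: slicing one dense LLaMA FFN's intermediate dimension into pieces that act as experts, plus a newly initialized router for top-k token routing. The class names, hyperparameters, and routing loop here are assumptions for illustration, not the repo's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlicedExpert(nn.Module):
    """One expert holding a contiguous slice of the original FFN's neurons."""

    def __init__(self, hidden_size: int, slice_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, slice_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, slice_size, bias=False)
        self.down_proj = nn.Linear(slice_size, hidden_size, bias=False)

    def forward(self, x):
        # SwiGLU, same as the dense LLaMA FFN but restricted to a slice of neurons.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class MoEFFN(nn.Module):
    """Replaces one dense FFN with `num_experts` sliced experts plus a new gate."""

    def __init__(self, hidden_size=4096, intermediate_size=11008, num_experts=8, top_k=2):
        super().__init__()
        assert intermediate_size % num_experts == 0  # assumes an even split
        slice_size = intermediate_size // num_experts
        self.experts = nn.ModuleList(
            SlicedExpert(hidden_size, slice_size) for _ in range(num_experts)
        )
        # Only the router is new and randomly initialized; the expert weights
        # would be copied from the corresponding slices of the dense FFN.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = F.softmax(self.router(x), dim=-1)
        weights, expert_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```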
ZeyuTeng96 commented 8 months ago

Thanks a lot! Have you considered creating a WeChat group later so everyone can discuss MoE together?

Sniper970119 commented 7 months ago

> Hi there, thanks for your interest in this project ❤️
>
> 1. LLaMA-MoE is constructed from ONE llama2-7B model.
> 2. Yes, you are right. We only partition llama2's FFN layers into multiple experts, then initialize a gate for token routing.
> 3. Currently this repo does not support that. But since all the candidate models would share the same architecture, I think it would not be difficult to implement.

Hello, I have some other questions. Regarding the gate in item 2: we ran some experiments and found that the gradients become very large. We are not sure whether this is caused by the gate initialization (the rest of the model's parameters are already well trained, but the gate is randomly initialized, so we suspect this mismatch may cause abnormal gradients). Could you share your initialization strategies for the gate? Also, does the early warmup stage need tricks such as layer-wise learning rates to speed up gate updates, or freezing the other layers?
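(For illustration only, not necessarily what LLaMA-MoE does: a common conservative choice in MoE codebases is to initialize the router with a small-standard-deviation normal, so the initial routing logits stay near zero, the softmax starts close to uniform, and early gradients tend to stay tame.)

```python
import torch.nn as nn

def init_router(router: nn.Linear, std: float = 0.02) -> None:
    """Small-std init keeps initial routing logits near zero (near-uniform softmax)."""
    nn.init.normal_(router.weight, mean=0.0, std=std)
    if router.bias is not None:
        nn.init.zeros_(router.bias)
```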

Spico197 commented 7 months ago

We did test freezing the other parameters first and pre-training only the gates. However, as more tokens were consumed during continual pre-training, the two-stage pre-training showed no advantage. So we kept things simple and trained the whole model without any special gating tricks.
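A minimal sketch of the gate-only first stage described above, assuming the gate parameters can be identified by a name substring such as "gate" (illustrative, not the repo's training script):

```python
def freeze_all_but_gates(model, gate_keyword: str = "gate"):
    """Stage 1: train only the routers/gates; everything else is frozen."""
    for name, param in model.named_parameters():
        param.requires_grad = gate_keyword in name

def unfreeze_all(model):
    """Stage 2: re-enable all parameters for full continual pre-training."""
    for param in model.parameters():
        param.requires_grad = True
```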

Sniper970119 commented 7 months ago

> We did test freezing the other parameters first and pre-training only the gates. However, as more tokens were consumed during continual pre-training, the two-stage pre-training showed no advantage. So we kept things simple and trained the whole model without any special gating tricks.

Roughly how many tokens does it take for the two approaches to become basically equivalent? And if the gate is not handled specially, at how many tokens does the loss drop to a reasonably low level? Right now my loss is around 4.x, and the gradient norm is in the several-thousand range and still rising. Based on previous experience, gradients that large seem wrong.
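(A generic diagnostic, not part of this repo: comparing the gradient norm of the gate parameters against the rest of the model after loss.backward() shows whether the randomly initialized gate is the source of the large gradients; clipping is a common mitigation. The "gate" name filter below is an assumption.)

```python
import torch

def grad_norms(model, gate_keyword: str = "gate"):
    """Return (gate_grad_norm, other_grad_norm) after loss.backward()."""
    gate_sq, other_sq = 0.0, 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        sq = p.grad.detach().float().pow(2).sum().item()
        if gate_keyword in name:
            gate_sq += sq
        else:
            other_sq += sq
    return gate_sq ** 0.5, other_sq ** 0.5

# Typical use inside the training loop:
#   loss.backward()
#   gate_norm, other_norm = grad_norms(model)
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```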

Spico197 commented 7 months ago

Hi there~ For the multi-stage pre-training comparison, it took about 20B tokens. It may take about 20~30B tokens to reach a relatively low loss value (around 2.1). However, spending 20B tokens on gate pre-training alone may not be an efficient recipe (the loss converges within 5~10B tokens), so you could try different settings to find a better one.

Sniper970119 commented 7 months ago

Thank you very much for the answer. May I ask a few more questions?

On my side, the gradients right after initialization look quite problematic, but I haven't observed any issue with the loss in the short term.

(screenshot attached)

Looking forward to your response~