pjlab-sys4nlp / llama-moe

⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024)
https://arxiv.org/abs/2406.16554
Apache License 2.0
883 stars, 46 forks

How to split "down" by "up" when using clustering to construct experts? #56

Closed Attention-is-All-I-Need closed 8 months ago

Attention-is-All-I-Need commented 10 months ago

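[image]

The FFN layer in LLaMA consists of three parts: up, down, and gate. According to this passage in the technical report, after the up weights are clustered with k-means following the MoEfication method, are the down weights split according to the clustering result of up? And do the gate weights need a separate k-means clustering? How exactly is the split of down performed, and in which part of the code is it implemented? Also, why are the down weights split according to up rather than by running k-means on down itself? Thanks very much!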

Spico197 commented 10 months ago

cc @DaizeDong

DaizeDong commented 10 months ago

Thank you for your interest in the project!

1. Are the weights of down and gate split based on the clustering results of up?

Yes. Both are split according to the clustering result of up; no separate clustering is needed.

2. Why is the weight of down split according to up, instead of performing k-means clustering on down?

Because the MoEfication method splits the experts by up, and we follow its approach. Our understanding of the reason is that the split up, down, and gate weights are all controlled by the same routing (gating) network. Clustering each weight separately cannot preserve the original correspondence between neurons and parameters, so all weights must be split according to the clustering result of a single weight.

In fact, we also tried splitting all weights according to the clustering results of gate or down. The results differed little from those obtained by clustering on up, so we chose up as the final scheme.
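The point about preserving the neuron-to-parameter correspondence can be illustrated with a small, self-contained sketch. It is not the repository's code: the weights are random, the cluster labels are made up, and all experts are summed (as if the router selected every expert) purely to compare against the dense FFN. With one shared set of labels the experts reproduce the dense output; with a separate, hypothetical clustering for down, each neuron's activation gets paired with another neuron's output weights and the result changes.

```python
import math
import random


def silu(x):
    return x / (1.0 + math.exp(-x))


def ffn(x, gate, up, down):
    """SwiGLU FFN over per-neuron weight vectors.

    gate[j] and up[j] are the input weights of intermediate neuron j;
    down[j] is the output vector that neuron j contributes.
    """
    out = [0.0] * len(x)
    for g_vec, u_vec, d_vec in zip(gate, up, down):
        g = sum(w * xi for w, xi in zip(g_vec, x))
        u = sum(w * xi for w, xi in zip(u_vec, x))
        h = silu(g) * u  # activation of this neuron
        for k, d in enumerate(d_vec):
            out[k] += h * d
    return out


def group(vectors, labels, num_experts):
    """Partition per-neuron vectors by cluster label, keeping neuron order."""
    return [[v for v, l in zip(vectors, labels) if l == e]
            for e in range(num_experts)]


def experts_sum(x, gate, up, down, gate_up_labels, down_labels, num_experts):
    """Sum the outputs of all experts (every expert treated as selected)."""
    g = group(gate, gate_up_labels, num_experts)
    u = group(up, gate_up_labels, num_experts)
    d = group(down, down_labels, num_experts)
    total = [0.0] * len(x)
    for e in range(num_experts):
        for k, v in enumerate(ffn(x, g[e], u[e], d[e])):
            total[k] += v
    return total


def max_abs_diff(a, b):
    return max(abs(p - q) for p, q in zip(a, b))


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
hidden, inter, num_experts = 4, 8, 2
gate, up, down = (rand_matrix(inter, hidden) for _ in range(3))
x = [random.gauss(0, 1) for _ in range(hidden)]

labels_from_up = [0, 1, 0, 1, 1, 0, 0, 1]    # stand-in for cluster ids from up
labels_from_down = [1, 1, 0, 0, 1, 0, 1, 0]  # hypothetical separate clustering

dense = ffn(x, gate, up, down)
shared = experts_sum(x, gate, up, down,
                     labels_from_up, labels_from_up, num_experts)
separate = experts_sum(x, gate, up, down,
                       labels_from_up, labels_from_down, num_experts)

err_shared = max_abs_diff(dense, shared)      # floating-point noise only
err_separate = max_abs_diff(dense, separate)  # neurons are mis-paired
print(err_shared, err_separate)
```

Which weight supplies the labels does not matter for this equivalence, only that the same labels are applied to all three weights.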

Attention-is-All-I-Need commented 10 months ago

Thank you for the answer! Is there specific code that implements splitting the down and gate weights based on the clustering result of the up weights? Or, how exactly is the split carried out? For example, how do the down weights correspond to the weights of the different experts in up? @DaizeDong

DaizeDong commented 10 months ago

Is there specific code that implements splitting the down and gate weights based on the clustering result of the up weights? Or, how exactly is the split carried out?

You can follow the relevant part of the README. The detailed steps are:

  1. Set proj_type=up_proj in scripts/moefication/split/run_split_clustering.sh and run the script. The clustering results for up are saved under save_path; each file stores, for every neuron of up (each row of its weight), the ID of the cluster it belongs to.

  2. In scripts/moefication/convert/run_convert.sh, set split_file_path to the path containing the clustering results (the save_path above), and set proj_type=up_proj to select which clustering result to read. Then run the script. This produces a model in which all weights (gate, up, down) are split according to the clustering result of up.

The code that splits the down and gate weights based on the clustering result of up is here: llama-moe/smoe/utils/expert_construction/convert_llama_moe.py
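As a rough illustration of what step 2 amounts to, the sketch below applies one list of per-neuron cluster IDs to all three weights. It is not taken from convert_llama_moe.py: the function names are hypothetical, nested lists stand in for tensors, and it assumes the usual PyTorch `nn.Linear` layout, where gate and up have shape (intermediate, hidden) and down has shape (hidden, intermediate). Under that layout a neuron is a row of gate and up but a column of down.

```python
def expert_indices(labels, num_experts):
    """Map each expert id to the intermediate-neuron indices assigned to it."""
    groups = [[] for _ in range(num_experts)]
    for neuron, expert in enumerate(labels):
        groups[expert].append(neuron)
    return groups


def select_rows(matrix, idx):
    return [matrix[i] for i in idx]


def select_cols(matrix, idx):
    return [[row[j] for j in idx] for row in matrix]


def split_ffn(gate_w, up_w, down_w, labels, num_experts):
    """Split one FFN into experts using a single set of cluster labels.

    gate_w, up_w: (intermediate, hidden); down_w: (hidden, intermediate).
    """
    experts = []
    for idx in expert_indices(labels, num_experts):
        experts.append({
            "gate": select_rows(gate_w, idx),  # neuron = row
            "up": select_rows(up_w, idx),      # neuron = row
            "down": select_cols(down_w, idx),  # neuron = column
        })
    return experts


# Toy weights: 4 intermediate neurons, hidden size 2.
up_w = [[0, 1], [2, 3], [4, 5], [6, 7]]
gate_w = [[10, 11], [12, 13], [14, 15], [16, 17]]
down_w = [[10, 11, 12, 13],
          [20, 21, 22, 23]]
labels = [1, 0, 1, 0]  # stand-in for the cluster ids produced in step 1

experts = split_ffn(gate_w, up_w, down_w, labels, num_experts=2)
for e, weights in enumerate(experts):
    print(e, weights)
```

Here expert 0 receives neurons 1 and 3, so it gets rows 1 and 3 of gate and up together with columns 1 and 3 of down; this is how the down weights line up with the experts defined by up.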