Attention-is-All-I-Need closed this issue 8 months ago
cc @DaizeDong
Thank you for your interest in the project!
1. Are the weights of down and gate split based on the clustering results of up?
Yes. Both are split according to the clustering result on up; no separate clustering is needed.
2. Why is the weight of down split according to up, instead of performing k-means clustering on down?
Because the MoEfication method splits the experts by up, and we followed it. Our understanding of the reason is that, after the split, the up, down and gate weights of an expert are all controlled by the same gating network. Clustering each weight separately cannot preserve the original correspondence between neurons and parameters, so all weights must be split according to the clustering result of a single weight.
In fact, we also tried splitting all weights according to the clustering results on gate or on down. The results differed little from those obtained by clustering on up, so we chose up as the final scheme.
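To make the mapping argument concrete, here is a minimal sketch in plain Python. It is not code from the repository; it assumes the usual LLaMA-style FFN, down(silu(gate(x)) * up(x)), and views it neuron by neuron. All names (expert_output, shared, down_own) and the toy sizes are hypothetical. It compares a split where all three weights share one partition against a split where down uses its own, separately obtained partition.

```python
import math
import random

def silu(z):
    return z / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def expert_output(x, gate_rows, up_rows, down_cols, gate_up_ids, down_ids):
    """Sum neuron contributions silu(gate_i . x) * (up_i . x) * down_col_j.

    gate_up_ids selects the gate/up rows; down_ids selects the down columns
    that are paired with them position by position.
    """
    out = [0.0] * len(x)
    for i, j in zip(gate_up_ids, down_ids):
        h = silu(dot(gate_rows[i], x)) * dot(up_rows[i], x)
        for k in range(len(out)):
            out[k] += h * down_cols[j][k]
    return out

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

random.seed(0)
hidden, inter = 4, 6  # toy sizes, chosen only for illustration
gate_rows = [rand_vec(hidden) for _ in range(inter)]
up_rows = [rand_vec(hidden) for _ in range(inter)]
down_cols = [rand_vec(hidden) for _ in range(inter)]  # column i of the down weight
x = rand_vec(hidden)

# Dense FFN: every neuron, each paired with its own down column.
all_ids = list(range(inter))
dense = expert_output(x, gate_rows, up_rows, down_cols, all_ids, all_ids)

# One shared partition (e.g. from clustering up) applied to all three weights.
shared = [[0, 2, 5], [1, 3, 4]]
same = [0.0] * hidden
for ids in shared:
    same = add(same, expert_output(x, gate_rows, up_rows, down_cols, ids, ids))

# Hypothetical separate partition for down: neurons get paired with the
# wrong down columns inside each expert.
down_own = [[1, 3, 4], [0, 2, 5]]
mixed = [0.0] * hidden
for ids, d_ids in zip(shared, down_own):
    mixed = add(mixed, expert_output(x, gate_rows, up_rows, down_cols, ids, d_ids))

err_shared = max(abs(a - b) for a, b in zip(dense, same))
err_mixed = max(abs(a - b) for a, b in zip(dense, mixed))
print("shared partition error:", err_shared)
print("separate partition error:", err_mixed)
```

With a shared partition, activating all experts reproduces the dense output up to floating-point rounding, because each neuron keeps its own gate row, up row and down column. With a separate partition for down, the experts no longer contain matching triples and the output changes.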
Thank you for the answer! Is there specific implementation code for splitting the weights of 'down' and 'gate' based on the clustering results of the 'up' weights? If not, how exactly is the split implemented? For example, how do the weights of 'down' correspond to the different expert weights in 'up'?
I would like to know if there is any specific implementation code for dividing the weights of 'down' and 'gate' based on the clustering results of 'up' weights. Or how is the division implemented specifically?
You can follow the corresponding part of the readme. The detailed steps are:
1. Set proj_type=up_proj in scripts/moefication/split/run_split_clustering.sh and run the script. The clustering results on up will be saved under save_path, where each file stores the cluster id of the neuron corresponding to each row of up.
2. In scripts/moefication/convert/run_convert.sh, set split_file_path to the save_path above and set proj_type=up_proj to specify which clustering result to read, then run the script. You will get a model in which all weights (gate, up, down) are split by the clustering result on up.
The code that splits the weights of 'down' and 'gate' based on the clustering results of the 'up' weights is here: llama-moe/smoe/utils/expert_construction/convert_llama_moe.py
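For readers who want a picture of the indexing before opening that file, the split can be sketched roughly as below. This is an illustrative sketch, not the code in convert_llama_moe.py. It assumes the PyTorch nn.Linear weight layout (out_features, in_features), so up and gate have one row per intermediate neuron and down has one column per intermediate neuron. The function names and the toy matrices are hypothetical.

```python
from collections import defaultdict

def group_neurons_by_cluster(labels):
    """Return, for each cluster id in ascending order, its neuron indices."""
    groups = defaultdict(list)
    for neuron_idx, cluster_id in enumerate(labels):
        groups[cluster_id].append(neuron_idx)
    return [groups[c] for c in sorted(groups)]

def select_rows(matrix, idx):
    return [matrix[i] for i in idx]

def select_cols(matrix, idx):
    return [[row[j] for j in idx] for row in matrix]

def split_ffn(gate, up, down, labels):
    """Split gate/up by rows and down by columns with the same neuron indices."""
    experts = []
    for idx in group_neurons_by_cluster(labels):
        experts.append({
            "neurons": idx,
            "gate": select_rows(gate, idx),   # rows of gate
            "up": select_rows(up, idx),       # rows of up
            "down": select_cols(down, idx),   # columns of down
        })
    return experts

# Toy weights: hidden size 2, intermediate size 4.
gate = [[1, 2], [3, 4], [5, 6], [7, 8]]          # shape (4, 2)
up = [[10, 20], [30, 40], [50, 60], [70, 80]]    # shape (4, 2)
down = [[1, 2, 3, 4], [5, 6, 7, 8]]              # shape (2, 4)

# One cluster id per row of up, as produced by clustering on up.
labels = [1, 0, 1, 0]

experts = split_ffn(gate, up, down, labels)
for e, expert in enumerate(experts):
    print(e, expert["neurons"], expert["up"], expert["down"])
```

The cluster labels are computed once (here, on up) and the resulting index lists are reused for gate and down, so row i of up, row i of gate and column i of down always land in the same expert.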
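The FFN layer of LLaMA contains three parts: up, down and gate. According to this passage in the technical report, after the MoEfication method runs k-means clustering on the weights of up, are the weights of down split according to the clustering result of up? And do the weights of gate need a separate k-means clustering? How exactly is the split of down performed, and in which part of the code is it implemented? Also, why is the weight of down split according to up rather than by running k-means clustering on down? Thanks very much!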