Open jasperzhong opened 3 years ago
Reading Section 5. One sentence in it left a strong impression:
As a general rule, whenever a dimension that is partitioned across cores must be summed, then an all-reduce operation is added for both the forward and backward pass.
That does seem to hold. Worth taking the time to really internalize this sentence.
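To convince myself of the rule, here is a minimal sketch in plain Python that simulates it for the forward pass only. The contracted dimension of a matmul is split across two "cores", each core computes a partial product, and a simulated all-reduce sums the partials. Everything here (`NUM_CORES`, `split_columns`, `split_rows`, `all_reduce_sum`, the toy matrices) is a hypothetical name or value of mine, not the paper's or Mesh TensorFlow's API, and the "cores" are just list entries.

```python
def matmul(a, b):
    """Plain dense matmul on nested lists: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(m, parts):
    """Split a matrix into `parts` contiguous column shards."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` contiguous row shards."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def all_reduce_sum(partials):
    """Simulated all-reduce: every core ends up with the elementwise sum."""
    total = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    return [total for _ in partials]


NUM_CORES = 2  # assumption: two simulated cores
x = [[1, 2, 3, 4], [5, 6, 7, 8]]       # 2 x 4 activations
w = [[1, 0], [0, 1], [2, 1], [1, 2]]   # 4 x 2 weights

# The summed (contracted) dimension of size 4 is partitioned across cores.
x_shards = split_columns(x, NUM_CORES)  # each core holds a 2 x 2 slice of x
w_shards = split_rows(w, NUM_CORES)     # each core holds a 2 x 2 slice of w

# Each core only sees part of the sum, so its local result is incomplete.
partials = [matmul(xs, ws) for xs, ws in zip(x_shards, w_shards)]

# The all-reduce completes the sum and leaves the full result on every core.
per_core = all_reduce_sum(partials)
reference = matmul(x, w)
```

After the all-reduce, every entry of `per_core` matches `reference`, while each entry of `partials` alone does not. The backward pass is not shown here; the quoted rule says it needs an all-reduce as well.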
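The Discussion in Section 7 is very interesting.
On the sparse model vs. dense model debate: put simply, sparse models are more efficient, in that they need less time to reach the same accuracy.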
https://arxiv.org/pdf/2101.03961.pdf
Strange, how did I not link this paper????