jasperzhong / read-papers-and-code

My paper/code reading notes in Chinese

arXiv '21 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity #235

Open jasperzhong opened 3 years ago

jasperzhong commented 3 years ago

https://arxiv.org/pdf/2101.03961.pdf

Strange, how come I never filed this paper here????

jasperzhong commented 3 years ago

I'm reading Section 5. One sentence in it really stuck with me:

> As a general rule, whenever a dimension that is partitioned across cores must be summed, then an all-reduce operation is added for both the forward and backward pass.

That does seem right. Worth taking a moment to really internalize this sentence.
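To make the rule concrete, here is a minimal sketch (my own, not from the paper) of Megatron-style tensor parallelism in PyTorch. It assumes a `torch.distributed` process group is already initialized and `world_size` ranks each hold one weight shard; all class and function names are hypothetical. When the contracted dimension of a matmul is partitioned across ranks, the partial products must be summed, which surfaces as an all-reduce in the forward pass; the matching backward all-reduce comes from the paired column-parallel layer, whose replicated input receives gradient contributions from every rank.

```python
import torch
import torch.distributed as dist


class ReduceForward(torch.autograd.Function):
    """All-reduce (sum) in forward, identity in backward."""
    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        return x

    @staticmethod
    def backward(ctx, grad):
        return grad


class ReduceBackward(torch.autograd.Function):
    """Identity in forward, all-reduce (sum) in backward."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad):
        grad = grad.clone()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        return grad


def row_parallel_linear(x_shard, w_shard):
    # x_shard: (batch, d_in / world_size), w_shard: (d_in / world_size, d_out).
    # The contracted dimension d_in is partitioned across ranks, so the
    # partial products must be summed: an all-reduce in the forward pass.
    return ReduceForward.apply(x_shard @ w_shard)


def column_parallel_linear(x, w_shard):
    # x is replicated; in the backward pass the input gradients contributed
    # by every rank's weight shard must be summed: an all-reduce in backward.
    return ReduceBackward.apply(x) @ w_shard
```

The symmetry is the point: the column-parallel half contributes the backward all-reduce and the row-parallel half the forward one, which together match the quoted rule.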

jasperzhong commented 3 years ago

The Discussion in Section 7 is very interesting.

On the sparse vs. dense model debate: put simply, sparse models are more efficient, reaching the same accuracy in less training time.
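For reference, what "sparse" means here: a Switch Transformer routes each token to exactly one expert FFN (top-1 routing), so total parameters grow with the number of experts while per-token FLOPs stay roughly constant. Below is a minimal single-device sketch of the routing idea; class and variable names are my own, and the paper's capacity factor, expert parallelism, and load-balancing auxiliary loss are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchFFN(nn.Module):
    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        # Top-1 routing: pick the highest-probability expert per token.
        gate, expert_idx = F.softmax(self.router(x), dim=-1).max(dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Each token runs through exactly one expert; scaling by the
                # gate probability lets gradients flow back to the router.
                out[mask] = gate[mask].unsqueeze(1) * expert(x[mask])
        return out
```

Total parameters scale with `num_experts`, but each token still pays for only one expert's FFN, which is where the efficiency argument comes from.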