Closed by htwang14 3 years ago
Hi Haotao,
Yes, we include GroupScaling and mention it in the implementation details section (Appendix). We found that for LM and MT tasks it helps stabilize training, so we added it there. Feel free to change it or raise any questions.
Best, Sheng
Thank you!
Hi Sheng,
I see there is a GroupScaling1D operation at the very beginning of the PN layer. If I understand correctly, it scales the input features across the channel dimension. I'm a bit confused about why this is necessary, since the operation does not seem to be mentioned in the paper. Is it a standard way to preprocess features in NLP tasks?
Thanks in advance!
Haotao
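
For readers following the thread, here is a minimal sketch of what a group-scaling step across the channel dimension could look like. It is not the repository's code. It assumes the input has shape (T, B, C), that channels are split into equal groups, and that each group is divided by the square root of its mean squared value. The function name `group_scaling_1d` and the defaults `num_groups=4` and `eps=1e-5` are hypothetical choices for illustration, and NumPy is used instead of the project's framework to keep the example self-contained.

```python
import numpy as np


def group_scaling_1d(x, num_groups=4, eps=1e-5):
    """Scale each channel group by the root of its second moment.

    Hypothetical helper for illustration only.
    x: array of shape (T, B, C) = (sequence length, batch, channels).
    """
    T, B, C = x.shape
    if C % num_groups != 0:
        raise ValueError("C must be divisible by num_groups")

    # Split the channel dimension into (num_groups, channels_per_group).
    grouped = x.reshape(T, B, num_groups, C // num_groups)

    # Second moment (mean of squares) within each group; no mean is subtracted.
    moment2 = np.mean(grouped * grouped, axis=-1, keepdims=True)

    # Divide every channel by the root second moment of its group.
    scaled = grouped / np.sqrt(moment2 + eps)
    return scaled.reshape(T, B, C)


# Usage: after scaling, each group has a mean squared value close to 1.
x = np.arange(1, 49, dtype=float).reshape(2, 3, 8)
y = group_scaling_1d(x, num_groups=4)
print(y.shape)
```

Under these assumptions the step only rescales magnitudes within each group and does not center the features, which fits the reply above that it is there to stabilize training rather than as standard NLP preprocessing.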