Hello, thanks for sharing the code! Have you tried T-Fixup initialization on language-modeling tasks with encoder-only transformers like BERT? Since there is no decoder in that setting, do you have any suggestions on how to initialize the encoder so that layer norm can be removed, as shown in the paper?
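
For context, here is roughly what I had in mind: a minimal sketch that assumes the paper's encoder-side rule (Xavier everywhere, N(0, d^{-1/2}) for input embeddings, then scaling the attention value/output projections, the feed-forward weights, and the embeddings by 0.67·N^{-1/4}) carries over unchanged when the decoder is absent. The function name `tfixup_init_encoder` and the use of `nn.TransformerEncoderLayer` are just for illustration, not from your repo:

```python
import torch.nn as nn


def tfixup_init_encoder(embedding: nn.Embedding,
                        encoder: nn.TransformerEncoder,
                        d_model: int) -> None:
    """Tentative encoder-only T-Fixup (assumption: the paper's
    encoder-side scaling rule applies as-is without a decoder)."""
    num_layers = len(encoder.layers)
    scale = 0.67 * num_layers ** -0.25  # 0.67 * N^(-1/4)

    # Input embeddings: Gaussian with std d^(-1/2), then scaled.
    nn.init.normal_(embedding.weight, mean=0.0, std=d_model ** -0.5)
    embedding.weight.data.mul_(scale)

    for layer in encoder.layers:
        attn = layer.self_attn
        # Xavier for the packed q/k/v projection and the output projection.
        nn.init.xavier_uniform_(attn.in_proj_weight)
        nn.init.xavier_uniform_(attn.out_proj.weight)
        # Scale only the value projection (last d_model rows of the
        # packed weight) and the output projection.
        attn.in_proj_weight.data[2 * d_model:].mul_(scale)
        attn.out_proj.weight.data.mul_(scale)

        # Feed-forward weights: Xavier, then scaled.
        for lin in (layer.linear1, layer.linear2):
            nn.init.xavier_uniform_(lin.weight)
            lin.weight.data.mul_(scale)

        # T-Fixup trains without layer norm; replacing the norms with
        # identities here just to illustrate (a custom layer without
        # LayerNorm would be the real implementation).
        layer.norm1 = nn.Identity()
        layer.norm2 = nn.Identity()


# Illustrative usage:
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
enc = nn.TransformerEncoder(layer, num_layers=6)
emb = nn.Embedding(30522, 512)
tfixup_init_encoder(emb, enc, d_model=512)
```

Does that look like the right adaptation, or does the scaling constant need to change when there is no decoder?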