Closed: W4ngatang closed this issue 6 years ago.
This should be done ELMo-style, and only for ELMo. We should also add a flag-protected skip connection between the input and output of our pretrained BiLSTM. @W4ngatang?
I think the only skip connection is between the input and the output of the RNN/Transformer, where the input is either just the ELMo charCNN output (if we don't use ELMo) or a mixture of all the ELMo layers (if we do).
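For reference, a minimal sketch of what such a flag-protected skip connection could look like; the module name, the `skip_embs` flag, and the choice to concatenate (rather than add) are illustrative assumptions, not the repo's actual implementation.

```python
import torch
import torch.nn as nn

class SkipConnectionEncoder(nn.Module):
    """Wraps a sentence encoder (BiLSTM/Transformer) with an optional skip
    connection from its input representation to its output. The input is
    either the ELMo charCNN output or a mixture of all ELMo layers, as
    discussed above. Names and the concatenation choice are assumptions."""

    def __init__(self, encoder: nn.Module, skip_embs: bool = False):
        super().__init__()
        self.encoder = encoder
        self.skip_embs = skip_embs  # flag-protected skip connection

    def forward(self, embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        out = self.encoder(embs, mask)            # (batch, seq, d_hid)
        if self.skip_embs:
            # Concatenate the encoder input onto its output so downstream
            # layers can fall back on the (possibly mixed) ELMo embeddings.
            out = torch.cat([out, embs], dim=-1)  # (batch, seq, d_hid + d_emb)
        return out
```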
Implemented.
Insert learnable layer-scaling parameters, to be learned once the LSTM weights (trained on the LM) are frozen for the eval tasks.
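A minimal sketch of such ELMo-style layer-scaling (scalar mix) parameters that stay trainable while the pretrained LSTM weights are frozen; the class name and the softmax-plus-gamma parameterization follow the ELMo paper, but this is an illustrative assumption, not the repo's exact code.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style scalar mix: one softmax-normalized weight per layer plus a
    global scale gamma. These parameters remain trainable on eval tasks even
    after the LSTM weights themselves are frozen."""

    def __init__(self, n_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(n_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq, d_hid) tensors, one per layer.
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * t for w, t in zip(weights, layer_outputs))
        return self.gamma * mixed


# Usage sketch: freeze the LM-pretrained LSTM; only the scalar mix parameters
# (and any downstream task heads) keep receiving gradients. The LSTM sizes
# here are placeholders.
lstm = nn.LSTM(128, 256, num_layers=2, batch_first=True, bidirectional=True)
for p in lstm.parameters():
    p.requires_grad = False
mix = ScalarMix(n_layers=3)  # e.g. charCNN output + two LSTM layers
```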