Implements a pre-layernorm self-attention layer that can be enabled for the attention-based model by setting `use_pre_layernorm=True` in the config file.
According to this paper, pre-layernorm transformers are less sensitive to hyperparameters and therefore require less hyperparameter optimization (HPO).
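For reference, a minimal sketch of what a pre-layernorm self-attention block looks like, assuming a PyTorch model; the class and argument names below (`PreLNSelfAttention`, `d_model`, `n_heads`) are illustrative only and do not correspond to the actual identifiers in this repo.

```python
import torch
import torch.nn as nn


class PreLNSelfAttention(nn.Module):
    """Self-attention block that applies LayerNorm *before* attention
    (pre-LN) rather than after the residual addition (post-LN)."""

    def __init__(self, d_model: int, n_heads: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN: normalize the input, attend, then add the residual.
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + self.dropout(attn_out)
```

The only difference from the post-LN variant is where the normalization sits: inside the residual branch instead of after the residual sum, which is what keeps gradients better behaved and reduces sensitivity to learning-rate and warmup settings.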
The test breaks because the dataset has already been generated at v2.1.0 while the training config is still at v2.0.0; it will pass once the updated datasets with the additional stats are copied over.