google-research / adapter-bert


How to implement adapters in case of pre norm #7

Open · rabeehkarimimahabadi opened this issue 3 years ago

rabeehkarimimahabadi commented 3 years ago

Hi, I have a model in which layer normalization happens first and the residual add comes afterwards (pre-norm). In the paper you discussed the post-norm case; could you tell me how I can implement adapters for this case? Thank you.
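To make the difference concrete, here is a toy sketch of the two orderings, with a single linear layer standing in for the attention/feed-forward sublayer (the sizes are just placeholders):

from torch import nn

d_model = 512  # placeholder size
sublayer = nn.Linear(d_model, d_model)  # stand-in for self-attention or feed-forward
layer_norm = nn.LayerNorm(d_model)
dropout = nn.Dropout(0.1)

def post_norm_block(hidden_states):
    # Post-norm (the case discussed in the paper): residual add first, then layer norm.
    return layer_norm(hidden_states + dropout(sublayer(hidden_states)))

def pre_norm_block(hidden_states):
    # Pre-norm (my model below): layer norm first, residual add at the end, no norm after the add.
    return hidden_states + dropout(sublayer(layer_norm(hidden_states)))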

I mark the lines where the normalization and the add operation happen with ** to describe the model better:

from torch import nn

class LayerFF(nn.Module):
    def __init__(self, config):
        super().__init__()
        if config.feed_forward_proj == "relu":
            self.DenseReluDense = DenseReluDense(config)
        elif config.feed_forward_proj == "gated-gelu":
            self.DenseReluDense = DenseGatedGeluDense(config)
        else:
            raise ValueError(
                f"{self.config.feed_forward_proj} is not supported. Choose between `relu` and `gated-gelu`"
            )

        self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(self, hidden_states):
        **forwarded_states = self.layer_norm(hidden_states)**
        forwarded_states = self.DenseReluDense(forwarded_states)
        **hidden_states = hidden_states + self.dropout(forwarded_states)**
        return hidden_states

class LayerSelfAttention(nn.Module):
    def __init__(self, config, has_relative_attention_bias=False):
        super().__init__()
        self.SelfAttention = Attention(config, has_relative_attention_bias=has_relative_attention_bias)
        self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        position_bias=None,
        head_mask=None,
        past_key_value=None,
        use_cache=False,
        output_attentions=False,
    ):
        **normed_hidden_states = self.layer_norm(hidden_states)**
        attention_output = self.SelfAttention(
            normed_hidden_states,
            mask=attention_mask,
            position_bias=position_bias,
            head_mask=head_mask,
            past_key_value=past_key_value,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        **hidden_states = hidden_states + self.dropout(attention_output[0])**
        outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output them
        return outputs

neilhoulsby commented 3 years ago

For different architectures you'll need to experiment a little with where to put the adapters. The first thing to try is placing them before the layer norm layers. The second is to place them at the end, after dropout, possibly with an additional layer norm afterwards.
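To illustrate, here is a rough sketch of what those two options could look like inside the pre-norm feed-forward block quoted above. The Adapter module is the usual bottleneck adapter; adapter_dim and the adapter_placement flag are just illustrative choices, and which placement works best (and whether the extra layer norm helps) is something to determine empirically:

from torch import nn


class Adapter(nn.Module):
    # Bottleneck adapter: down-project, nonlinearity, up-project, with an internal skip connection.
    def __init__(self, d_model, adapter_dim=64):
        super().__init__()
        self.down = nn.Linear(d_model, adapter_dim)
        self.up = nn.Linear(adapter_dim, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))


class LayerFFWithAdapter(nn.Module):
    # The pre-norm feed-forward block from above, with both suggested placements sketched in.
    def __init__(self, config, adapter_placement="after_dropout"):
        super().__init__()
        self.DenseReluDense = DenseReluDense(config)  # as in the original block, defined elsewhere
        self.layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(config.dropout_rate)
        self.adapter = Adapter(config.d_model)
        # Optional extra norm for the second placement ("possibly with an additional layer norm afterwards").
        self.adapter_layer_norm = LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.adapter_placement = adapter_placement

    def forward(self, hidden_states):
        if self.adapter_placement == "before_layer_norm":
            # Option 1: adapter before the layer norm (here it also feeds the residual path;
            # one could instead apply it only on the normed branch).
            hidden_states = self.adapter(hidden_states)
        forwarded_states = self.layer_norm(hidden_states)
        forwarded_states = self.DenseReluDense(forwarded_states)
        forwarded_states = self.dropout(forwarded_states)
        if self.adapter_placement == "after_dropout":
            # Option 2: adapter at the end, after dropout, followed by the optional extra
            # layer norm, just before the residual add.
            forwarded_states = self.adapter_layer_norm(self.adapter(forwarded_states))
        hidden_states = hidden_states + forwarded_states
        return hidden_states

The same pattern would apply analogously inside the self-attention block.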

rabeehkarimimahabadi commented 3 years ago

Hi, thank you for the response. I sometimes get a huge drop in performance when I unfreeze the model's layer norms. Could you give me some intuition about how adapter layers interact with layer norms? Thanks.