huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

SwiGLU activation function #20403

Closed drAbreu closed 1 year ago

drAbreu commented 2 years ago

Feature request

Since it has recently been used in PaLM, and several papers report improved performance with it, it would be good to have access to a SwiGLU implementation as an activation function.
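
For reference, the SwiGLU feed-forward block from the GLU Variants paper (Shazeer, 2020), which PaLM uses without biases, can be sketched roughly like this (the class and dimension names below are only illustrative, not an existing transformers API):

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SwiGLUFeedForward(nn.Module):
      # FFN_SwiGLU(x) = (silu(x W) * (x V)) W2
      def __init__(self, d_model, d_ff, bias=False):
          super().__init__()
          self.w = nn.Linear(d_model, d_ff, bias=bias)    # gate projection
          self.v = nn.Linear(d_model, d_ff, bias=bias)    # value projection
          self.w2 = nn.Linear(d_ff, d_model, bias=bias)   # output projection

      def forward(self, x):
          return self.w2(F.silu(self.w(x)) * self.v(x))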

Motivation

I am building a biomedical RoBERTa-based model with a specific biomedical vocabulary. It could be seen as a PubMedBERT version with a RoBERTa architecture and a BPE vocabulary.

Since RoBERTa is already a few years old, I also want to add recent improvements to the architecture and training.

I have tried to build a RoBERTa model with two extra features: removing the bias from the FFN layers and adding the SwiGLU activation to them.

My approach has been to copy the code of modeling_roberta.py and modify its RobertaIntermediate class into an EXcellRobertaIntermediate class that includes the SwiGLU activation and passes bias=config.dense_layer_bias to the nn.Linear instantiation.

This works well for a first training run of the model. However, when loading the model back I run into problems. The first problem was that the model config has activation=swiglu, and there is some ContextManager that does not allow that option. I used a dirty workaround: keeping activation=gelu in the config while keeping SwiGLU in the code. This works and the model trains... but if I then want to continue training it or use it for fine-tuning, the extra layers created for SwiGLU are dropped. Here is an example output:

from smtag.excell_roberta.modeling_excell_roberta import EXcellRobertaForMaskedLM
model = EXcellRobertaForMaskedLM.from_pretrained('/app/excell-roberta-training/checkpoint-50/')

  loading configuration file /app/excell-roberta-training/checkpoint-50/config.json
  Model config EXcellRobertaConfig {
    "architectures": [
      "EXcellRobertaForMaskedLM"
    ],
    "attention_probs_dropout_prob": 0.1,
    "bias_dense_layers": false,
    "bias_norm": false,
    "bos_token_id": 0,
    "classifier_dropout": null,
    "dense_layer_bias": false,
    "eos_token_id": 1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-12,
    "max_position_embeddings": 514,
    "model_type": "roberta",
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "pad_token_id": 3,
    "position_embedding_type": "absolute",
    "sep_token_id": 1,
    "swiglu": true,
    "tokenizer_class": "RobertaTokenizerFast",
    "torch_dtype": "float32",
    "transformers_version": "4.20.0",
    "type_vocab_size": 1,
    "use_cache": true,
    "vocab_size": 64000
  }

  loading weights file /app/excell-roberta-training/checkpoint-50/pytorch_model.bin
  Some weights of the model checkpoint at /app/excell-roberta-training/checkpoint-50/ were not used when initializing EXcellRobertaForMaskedLM: ['roberta.encoder.layer.2.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.0.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.3.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.11.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.8.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.7.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.9.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.5.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.6.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.4.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.1.intermediate.intermediate_dense.weight', 'roberta.encoder.layer.10.intermediate.intermediate_dense.weight']
  - This IS expected if you are initializing EXcellRobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  - This IS NOT expected if you are initializing EXcellRobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
  All the weights of EXcellRobertaForMaskedLM were initialized from the model checkpoint at /app/excell-roberta-training/checkpoint-50/.
  If your task is similar to the task the model of the checkpoint was trained on, you can already use EXcellRobertaForMaskedLM for predictions without further training.

  model(**excell("acetyltransferase is something that should give extra subtokens to the tokenizer", truncation=True, padding="max_length", return_tensors='pt'))

  MaskedLMOutput(loss=None, logits=tensor([[[-0.1479,  0.3992, -0.3396,  ..., -0.3373, -0.8730, -0.7037],
           [ 0.1812,  0.5421, -0.4052,  ..., -0.0612, -0.6076, -1.0300],
           [-0.1578,  0.6487, -0.8400,  ...,  0.0745, -0.6941, -0.7082],
           ...,
           [-0.2610,  0.6921, -0.6040,  ..., -0.0400, -0.6101, -0.9326],
           [-0.2610,  0.6921, -0.6040,  ..., -0.0400, -0.6101, -0.9326],
           [-0.2610,  0.6921, -0.6040,  ..., -0.0400, -0.6101, -0.9326]]],
         grad_fn=<AddBackward0>), hidden_states=None, attentions=None)

I would like to check with you whether there is a recommended way to do this, or whether it is possible at all without big modifications to transformers.

Once the model is published, we plan to eventually submit a request to add it to the library.

I would also be happy to contribute the SwiGLU activation itself, if that is possible. The main issue I see is that a SwiGLU block requires instantiating an extra nn.Linear, which changes the calling convention compared to the other activation functions.

I will also be happy to contribute on this topic.

Your contribution

I have added two main modifications to the original code of RoBERTa:

First, I wrote the SwiGLU class. I know that this is not the place to define it, but it has only been a test so far.

  import torch.nn as nn
  import torch.nn.functional as F

  class SwiGLU(nn.Module):
      def forward(self, x):
          # split the input in two halves along the last dimension and gate one half with silu of the other
          x, gate = x.chunk(2, dim=-1)
          return F.silu(gate) * x
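
Note that this chunk-based SwiGLU halves the last dimension of its input, which is why the intermediate module below needs an extra projection back up. A quick shape check (the tensor sizes here are just illustrative):

  import torch

  x = torch.randn(2, 10, 3072)   # e.g. the output of the first dense projection
  y = SwiGLU()(x)
  print(y.shape)                 # torch.Size([2, 10, 1536]), half the input width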

The other modification is:

  import torch
  import torch.nn as nn

  from transformers.activations import ACT2FN

  class EXcellRobertaIntermediate(nn.Module):
      def __init__(self, config):
          super().__init__()
          self.dense = nn.Linear(config.hidden_size, config.intermediate_size, bias=config.dense_layer_bias)
          self.swiglu = config.swiglu
          if self.swiglu:
              self.intermediate_act_fn = SwiGLU()
              # SwiGLU halves the last dimension, so project back up to intermediate_size
              self.intermediate_dense = nn.Linear(config.intermediate_size // 2, config.intermediate_size, bias=config.dense_layer_bias)
          elif isinstance(config.hidden_act, str):
              self.intermediate_act_fn = ACT2FN[config.hidden_act]
          else:
              self.intermediate_act_fn = config.hidden_act

      def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
          hidden_states = self.dense(hidden_states)
          hidden_states = self.intermediate_act_fn(hidden_states)
          if self.swiglu:
              hidden_states = self.intermediate_dense(hidden_states)
          return hidden_states
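
As a quick sanity check, the class keeps the output width at intermediate_size, so the downstream RobertaOutput layer does not need to change (the _Cfg stand-in below is only for illustration, not the real EXcellRobertaConfig):

  import torch

  class _Cfg:  # minimal stand-in exposing only the fields used above
      hidden_size = 768
      intermediate_size = 3072
      dense_layer_bias = False
      swiglu = True
      hidden_act = "gelu"

  layer = EXcellRobertaIntermediate(_Cfg())
  print(layer(torch.randn(1, 5, 768)).shape)  # torch.Size([1, 5, 3072])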

I would be happy to contribute the SwiGLU activation and eventually bring the entire model to transformers.

drAbreu commented 2 years ago

Indeed, I was able to solve the issue with loading the SwiGLU layers using the ugly fix of keeping the activation function in EXcellRobertaConfig set to gelu, while adding a swiglu parameter that, if set to True, overrides the activation function. I am not sure whether this is a recommended procedure... I would expect that it is not.
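
Concretely, the configuration side of that workaround looks roughly like this (a sketch only; the swiglu and dense_layer_bias attributes are the actual additions):

  from transformers import RobertaConfig

  class EXcellRobertaConfig(RobertaConfig):
      def __init__(self, swiglu=True, dense_layer_bias=False, **kwargs):
          # hidden_act is left at its default ("gelu") so that config validation passes;
          # the swiglu flag then overrides the activation inside the modeling code.
          super().__init__(**kwargs)
          self.swiglu = swiglu
          self.dense_layer_bias = dense_layer_bias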

I would be happy to get any comments on this and to contribute the addition of SwiGLU as an activation function.

sgugger commented 2 years ago

Hi there! We would recommend you modify the modeling file to suit your needs; you can then include it with your checkpoint using the custom model on the Hub feature.
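
A rough sketch of that workflow, assuming the custom classes live in their own module (the repo id below is only a placeholder):

  # in the file defining the custom classes
  EXcellRobertaConfig.register_for_auto_class()
  EXcellRobertaForMaskedLM.register_for_auto_class("AutoModelForMaskedLM")

  model.push_to_hub("your-username/excell-roberta")  # placeholder repo id

  # downstream users can then load the custom architecture with
  # AutoModelForMaskedLM.from_pretrained("your-username/excell-roberta", trust_remote_code=True)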

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Linhvjc commented 9 months ago

Awesome! Thanks for your contributions @drAbreu

VincentCCandela commented 4 months ago

AlphaFold 3 uses SwiGLU, so it really should be officially implemented at this point.