Closed: drAbreu closed this issue 1 year ago
Indeed, I was able to solve the issue with the loading of the SwiGLU layers using the ugly fix of keeping the activation function definition in `EXcellRobertaConfig` as `gelu`, while adding a `swiglu` parameter that, if set to `True`, overrides the activation function.
I am not sure whether this is a recommended procedure... I would expect that it is not.
I would be happy to get any comments on this and to contribute the addition of SwiGLU as an activation function.
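For concreteness, a minimal sketch of what this workaround can look like (the `excell-roberta` model type and the defaults here are placeholders, not the actual code):

```python
from transformers import RobertaConfig


class EXcellRobertaConfig(RobertaConfig):
    model_type = "excell-roberta"  # placeholder model type

    def __init__(self, swiglu=False, **kwargs):
        # hidden_act stays at its default ("gelu") so that config validation
        # passes; the boolean flag below is what the modeling code checks to
        # override the activation with SwiGLU.
        super().__init__(**kwargs)
        self.swiglu = swiglu
```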
Hi there! We would recommend that you modify the modeling file to suit your needs; you can then include it with your checkpoint using the custom model on the Hub feature.
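For anyone landing here, that route looks roughly like the sketch below, assuming the custom config and model classes from this thread live in a local `modeling_excell_roberta.py` (all names here are hypothetical; see the "Sharing custom models" docs for the full workflow):

```python
from transformers import AutoModelForMaskedLM

# Hypothetical local module containing the custom classes discussed in this thread.
from modeling_excell_roberta import EXcellRobertaConfig, EXcellRobertaForMaskedLM

# Register the classes so the custom code is uploaded next to the weights.
EXcellRobertaConfig.register_for_auto_class()
EXcellRobertaForMaskedLM.register_for_auto_class("AutoModelForMaskedLM")

model = EXcellRobertaForMaskedLM(EXcellRobertaConfig(swiglu=True))
model.push_to_hub("my-org/excell-roberta")  # hypothetical repo id

# Downstream users then opt in to the custom code explicitly:
model = AutoModelForMaskedLM.from_pretrained(
    "my-org/excell-roberta", trust_remote_code=True
)
```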
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Awesome! Thanks for your contributions @drAbreu
AlphaFold 3 uses SwiGLU, so it really should be officially implemented at this point.
Feature request
Since it has been recently used in PaLM and several papers report its better performance, it would be good to have access to a SwiGLU implementation as an activation function.
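For reference, the SwiGLU variant from the GLU-variants / PaLM papers gates a SiLU (Swish) projection with a second linear projection. A minimal sketch (the names and the bias-free choice are illustrative only):

```python
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """SwiGLU(x) = SiLU(x @ W) * (x @ V), as in the GLU-variants paper."""

    def __init__(self, in_features, hidden_features, bias=False):
        super().__init__()
        self.w = nn.Linear(in_features, hidden_features, bias=bias)
        self.v = nn.Linear(in_features, hidden_features, bias=bias)

    def forward(self, x):
        # SiLU-gated linear unit: one projection is gated by the other.
        return F.silu(self.w(x)) * self.v(x)
```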
Motivation
I am building a biomedical RoBERTa-based model with specific biomedical vocabulary. It could be seen as a PubMedBERT version with a RoBERTa architecture and BPE vocabulary.
Since RoBERTa is already a few years old, I also want to add recent improvements to the architecture and training.
I have tried to generate a RoBERTa model with two extra features myself: one is to remove the bias from the FFN layers, and the other is to add the SwiGLU activation to them.
My approach has been to copy the code of `modeling_roberta.py` and modify its `RobertaIntermediate` class into an `EXcellRobertaIntermediate` class that includes the `swiglu` activation and a `bias=config.dense_layer_bias` argument in the `nn.Linear` instantiation. This works well for a first training of the model. However, when loading the model I run into problems. The first problem was that the model config has `activation=swiglu`, and there is some ContextManager that does not allow for that option. I did a dirty workaround, keeping `activation=gelu` while keeping the SwiGLU in the code. This works and the model trains... but if I then want to further train it or use it for fine-tuning, it drops the extra layers generated by the SwiGLU. Here is an example output:

I would like to check with you whether there is a better way to do this, or whether it is possible at all without big modifications to transformers.
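For context, a rough sketch of the kind of modified intermediate layer described above; `swiglu` and `dense_layer_bias` are config attributes introduced in this thread, not existing transformers options:

```python
import torch.nn as nn
import torch.nn.functional as F
from transformers.activations import ACT2FN


class EXcellRobertaIntermediate(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(
            config.hidden_size, config.intermediate_size, bias=config.dense_layer_bias
        )
        self.swiglu = getattr(config, "swiglu", False)
        if self.swiglu:
            # Extra gate projection: this is the layer whose weights get
            # dropped when the checkpoint is reloaded as a plain RoBERTa model.
            self.gate = nn.Linear(
                config.hidden_size, config.intermediate_size, bias=config.dense_layer_bias
            )
        else:
            self.intermediate_act_fn = ACT2FN[config.hidden_act]

    def forward(self, hidden_states):
        if self.swiglu:
            return F.silu(self.dense(hidden_states)) * self.gate(hidden_states)
        return self.intermediate_act_fn(self.dense(hidden_states))
```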
We plan to eventually, once the model is published, submit a request to add it to the library.
I would also be happy to contribute the SwiGLU activation itself, if that is possible. The main issue I see here is that instantiating a SwiGLU class requires instantiating an extra `nn.Linear` layer, which changes the behavior compared to the typical callables used for other activation functions. I would be happy to contribute on this topic as well.
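To illustrate the mismatch: the entries in transformers' `ACT2FN` are parameter-free callables looked up by name, while a SwiGLU block owns its own `nn.Linear` weights and therefore needs layer sizes at construction time (this sketch reuses the `SwiGLU` module sketched earlier in the thread):

```python
import torch
from transformers.activations import ACT2FN

hidden_states = torch.randn(2, 16, 768)  # (batch, seq_len, hidden_size)

# A standard activation is stateless and can be applied to any tensor shape.
y_gelu = ACT2FN["gelu"](hidden_states)

# SwiGLU must be instantiated with layer sizes because it owns two nn.Linear
# projections, so it belongs in the modeling code rather than in ACT2FN.
# (SwiGLU here is the module sketched earlier in this thread.)
y_swiglu = SwiGLU(in_features=768, hidden_features=3072, bias=False)(hidden_states)
```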
Your contribution
I have added two main modifications to the original code of RoBERTa:
First, I generated the class `SwiGLU`. I know that this is not the place to define this class, but it has been a test so far. The other modification is the `EXcellRobertaIntermediate` class described above.
I would be happy to contribute the SwiGLU activation and eventually to bring the entire model to transformers.