keras-team / keras-nlp

Modular Natural Language Processing workflows with Keras
Apache License 2.0

Adding the ELECTRA Model #107

Open Stealth-py opened 2 years ago

Stealth-py commented 2 years ago

While looking around, I found the ELECTRA paper. It shows that replacing masked language modeling (MLM) with replaced token detection (RTD) gave better GLUE scores than BERT. Might be worth taking note of / adding it here. We could also look at other models like this as future additions, I think. Would love to know your views on this!

Stealth-py commented 2 years ago

This also ties into the plan to integrate parts of the TensorFlow Model Garden, as mentioned in #77, so I thought it was worth noting.

chenmoneygithub commented 2 years ago

@Stealth-py Thanks for opening this feature request!

ELECTRA's proposed training flow is pretty interesting and promising. However, I have a question: other than the pretrained model, what components do we want to deliver here?

abheesht17 commented 2 years ago

Weighing in on this, I think ELECTRA's architecture is pretty much the same as BERT's. This is what is said in the paper: "Our model architecture and most hyperparameters are the same as BERT’s." [Section 3.1]

So, I reckon the TransformerEncoder we already have in the repo is sufficient. Correct me if I'm wrong.

We could maybe add a layer for the pretraining task, replaced token detection (don't we already have a layer for dynamically generating masks? I'm thinking of this issue: https://github.com/keras-team/keras-nlp/issues/54). I think we could reuse that same layer, so we may not have to add any new layer at all; the building blocks for ELECTRA are already present. 🤔

Edit:

Hugging Face reference: https://github.com/huggingface/transformers/blob/70851a6bf0bf0cf39cb4919c1d3fd66affe9f0db/src/transformers/models/electra/modeling_electra.py#L443

Here, they've even mentioned that they simply copied everything from the BERT class.
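To make the point above concrete, here is a rough sketch of an ELECTRA-style encoder assembled purely from existing KerasNLP building blocks (the vocabulary size, dimensions, and layer count below are placeholder values, not ELECTRA's published configuration):

```python
import keras_nlp
from tensorflow import keras

# Placeholder hyperparameters, not ELECTRA's published configuration.
VOCAB_SIZE = 30522
MAX_LENGTH = 128
HIDDEN_DIM = 256
NUM_LAYERS = 4
NUM_HEADS = 4

token_ids = keras.Input(shape=(MAX_LENGTH,), dtype="int32", name="token_ids")

# Reuse the generic layers already in the repo.
x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=MAX_LENGTH,
    embedding_dim=HIDDEN_DIM,
)(token_ids)
for _ in range(NUM_LAYERS):
    x = keras_nlp.layers.TransformerEncoder(
        intermediate_dim=4 * HIDDEN_DIM,
        num_heads=NUM_HEADS,
    )(x)

# Architecturally this is just a small BERT-style encoder; only the
# pretraining objective (RTD instead of MLM) would differ.
electra_encoder = keras.Model(token_ids, x, name="electra_encoder")
```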

Stealth-py commented 2 years ago

@chenmoneygithub Rather than the pretrained model, we could do something like what @abheesht17 proposed above, i.e., add an encoder layer for RTD. We could also add an example for it, like we do for BERT.

chenmoneygithub commented 2 years ago

IIUC, the layer is a classification head? Are there special operations inside the layer?

mattdangerw commented 2 years ago

I think an examples/electra directory is something we would love to have at some point! It is probably still a little early for us...

The main issue here is that we would like to improve our BERT example a bit and have it set the standard for example code for training these larger models from scratch. We'd like to clean it up a little, include some directions for kicking off training jobs on GCP, and add TPU support.

Once we feel good about the state of that one, we will definitely horizontally scale from there, and show complete training for a number of common architectures.

So hopefully soon on that front!

In the meantime, if we feel there is a smaller version of this architecture that could be shown in a Colab-sized example on keras.io, that could be an option.

Stealth-py commented 2 years ago

@chenmoneygithub, I'm not completely sure, but I believe it is a classification head, and I don't think it needs any special operations inside.
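For concreteness, the head being discussed would just be a per-token binary classifier applied to the encoder's sequence output; a minimal sketch (the names and dimensions are illustrative, not an existing KerasNLP API):

```python
from tensorflow import keras

HIDDEN_DIM = 256  # must match the encoder's output dimension

# Hypothetical replaced-token-detection head: one logit per position,
# predicting whether that token was replaced.
rtd_head = keras.Sequential(
    [
        keras.layers.Dense(HIDDEN_DIM, activation="gelu"),
        keras.layers.Dense(1),  # binary logit per token
    ],
    name="rtd_head",
)

# Given encoder outputs of shape (batch, seq_len, HIDDEN_DIM), this yields
# logits of shape (batch, seq_len, 1), trained with binary cross-entropy
# against "was this token replaced?" labels.
```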

@mattdangerw, yeah, that makes sense. It would be a bit distracting to work on different architectures right now, that's true. I'll look into how we can enhance the current BERT implementation, and if we do get to examples for other architectures later on, I can try to work on that as well :D

mattdangerw commented 2 years ago

Sounds good. Yeah, it's really an open question to me whether there could be a good https://keras.io/examples/-sized demonstration of a simplified ELECTRA model.

If we think so, that work could start today. It would probably need to use only a small amount of training data, but it should still show the generator/discriminator setup for the pre-training task.
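To illustrate what that generator/discriminator setup involves, here is a simplified sketch of one pre-training step. The `generator` (a small masked LM returning per-token vocabulary logits) and `discriminator` (encoder plus per-token binary head) are assumed to exist, e.g. built from the layers sketched above; the paper samples from the generator's output distribution rather than taking the argmax, and weights the discriminator loss with λ = 50:

```python
import tensorflow as tf
from tensorflow import keras

MASK_ID = 103  # placeholder [MASK] token id; depends on the tokenizer
LAMBDA = 50.0  # discriminator loss weight used in the ELECTRA paper


@tf.function
def electra_train_step(token_ids, mask_positions, optimizer):
    """One simplified ELECTRA pre-training step on a batch of int32 token ids.

    `mask_positions` is a boolean tensor marking the positions selected
    for corruption (typically ~15% of tokens).
    """
    # 1. Mask the selected positions for the generator input.
    masked_ids = tf.where(mask_positions, MASK_ID, token_ids)

    with tf.GradientTape() as tape:
        # 2. Generator does ordinary masked language modeling.
        gen_logits = generator(masked_ids)  # (batch, seq, vocab)
        per_token_mlm_loss = keras.losses.sparse_categorical_crossentropy(
            token_ids, gen_logits, from_logits=True
        )
        gen_loss = tf.reduce_sum(
            tf.where(mask_positions, per_token_mlm_loss, 0.0)
        ) / tf.reduce_sum(tf.cast(mask_positions, "float32"))

        # 3. Replace masked positions with the generator's predictions.
        #    (Simplification: argmax instead of sampling from the softmax.)
        predicted = tf.argmax(gen_logits, axis=-1, output_type=tf.int32)
        corrupted_ids = tf.where(mask_positions, predicted, token_ids)

        # 4. Discriminator predicts, per token, whether it was replaced.
        #    A prediction matching the original counts as "not replaced".
        is_replaced = tf.cast(tf.not_equal(corrupted_ids, token_ids), "float32")
        disc_logits = tf.squeeze(discriminator(corrupted_ids), axis=-1)
        disc_loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(
                labels=is_replaced, logits=disc_logits
            )
        )

        # 5. Combined objective: MLM loss + weighted RTD loss.
        loss = gen_loss + LAMBDA * disc_loss

    variables = generator.trainable_variables + discriminator.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```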

ddofer commented 1 year ago

One option to "break it down" a bit would be for the baseline "replaced token" generator to do random or uniform sampling (and add a model as the generator function in a future issue/task). That would give us all the necessary components (RTD, the binary pretraining task) without needing to handle a self-looping/extra model for the generation step.

In ProteinBERT (a Keras-based protein language model), we got equivalent results with this approach using ELECTRA-style pretraining: https://github.com/nadavbra/protein_bert/blob/master/proteinbert/pretraining.py#L194
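For reference, the uniform-random replacement baseline described above could look roughly like this (a NumPy sketch; the function name and arguments are hypothetical, and special tokens such as [CLS]/[SEP]/[PAD] are ignored for brevity):

```python
import numpy as np


def corrupt_tokens(token_ids, vocab_size, replace_prob=0.15, rng=None):
    """Uniform-random replacement baseline for replaced token detection.

    Returns the corrupted sequence and the per-token binary labels
    ("was this token replaced?") used for the discriminator-only task.
    """
    rng = rng or np.random.default_rng()
    token_ids = np.asarray(token_ids)
    # Pick ~replace_prob of positions and swap in uniformly random tokens.
    replace_mask = rng.random(token_ids.shape) < replace_prob
    random_tokens = rng.integers(0, vocab_size, size=token_ids.shape)
    corrupted = np.where(replace_mask, random_tokens, token_ids)
    # A replacement that happens to sample the original token is not
    # counted as replaced, matching the RTD labeling convention.
    labels = (corrupted != token_ids).astype("int32")
    return corrupted, labels
```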

shivance commented 1 year ago

Hi @jbischof @mattdangerw, do you think this is the right time to add the ELECTRA model to KerasNLP? If yes, I would love to take this up.

mattdangerw commented 1 year ago

@shivance I actually think the thing to lead with here would be a "pre-training electra" example on keras.io. We could show the GAN-like setup used by ELECTRA with our low-level preprocessing and transformer modeling KerasNLP layers.

This would be genuinely useful for users, I think, as it is really the pre-training setup that makes ELECTRA unique.

I'm not sure adding the pre-trained backbones is quite as important, as I don't think the model will perform better than some of the other encoder models we have in the repo (e.g., XLM-RoBERTa, and DeBERTaV3, which was also trained with the ELECTRA objective). But we can assess that after we get the example up!

If adding the example would be of interest to you, feel free to open up an issue and take this on! I would suggest...