gmihaila / ml_things

This is where I put things I find useful that speed up my work with Machine Learning. Ever looked through your old projects to reuse those cool functions you created before? Well, this repo is designed to be a Python library of reusable functions from my previous projects. I also share some notebook tutorials and Python code snippets.
https://gmihaila.github.io
Apache License 2.0

Question on Inner Workings of BERT #12

Closed samuelgoodall closed 2 years ago

samuelgoodall commented 3 years ago

First of all, thanks for the great breakdown of the Hugging Face BERT model. From reading the Annotated Transformer and the BERT paper, it seems they use vanilla Transformer encoders. But in the Hugging Face implementation there is an extra layer for the self-attention (BertSelfOutput). Is there a specific reason for the additional linear layer? Many thanks in advance!

gmihaila commented 3 years ago

Hi @samuelgoodall ! Thank you for your great question!

If you closely follow the rest of the diagram, you'll notice a number of linear layers added at the beginning of each block. While there is no formal explanation for them, in my opinion they help with the stability and robustness of the model.
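For reference, here is a simplified sketch of the `BertSelfOutput` module, paraphrased from the Hugging Face `transformers` source (details may differ between versions): the extra linear layer projects the attention output before the residual connection and layer norm.

```python
import torch.nn as nn

class BertSelfOutput(nn.Module):
    """Projection applied to the self-attention output (simplified sketch)."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dense(hidden_states)   # the "additional" linear layer
        hidden_states = self.dropout(hidden_states)
        # residual connection back to the attention input, then layer norm
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states
```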

Let me know if you have any other questions!

samuelgoodall commented 3 years ago

Hi @gmihaila, thanks for the explanation. I have another question. For pretraining I couldn't find any validation loss curves online. Is a validation set not necessary for pretraining language models, and if so, why? Many thanks in advance!

gmihaila commented 3 years ago

@samuelgoodall Thank you for your interest in my tutorials!

A validation set is necessary! I also mention in my tutorials that I skipped the validation curves for simplicity, but I strongly recommend using a validation set in your own experiments.

The purpose of those tutorials is the model and the code itself, not data partitioning, best data practices, or getting the best performance on the movie reviews dataset.
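For anyone reading this later, a rough sketch of how a validation loss could be tracked alongside a pretraining loop (the `evaluate` helper and `eval_dataloader` are placeholder names I'm introducing here, not from the tutorials):

```python
import torch

def evaluate(model, eval_dataloader, device):
    """Average validation loss over a fixed held-out set."""
    model.eval()
    total_loss, n_batches = 0.0, 0
    with torch.no_grad():
        for batch in eval_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)   # Hugging Face models return a loss when labels are in the batch
            total_loss += outputs.loss.item()
            n_batches += 1
    model.train()
    return total_loss / max(n_batches, 1)
```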

samuelgoodall commented 3 years ago

Hi @gmihaila, thanks for the quick response. One last question. I am currently pretraining a BERT-like model with ELECTRA pretraining on the OpenWebText corpus. My question is how one would go about creating a validation set for the discriminator. Should I just split the text corpus like a normal dataset and evaluate on that? Or should I use text from a different dataset, to avoid having a validation set that is too similar to the training set? And because ELECTRA pretraining involves two models, is it sufficient to just feed in different text, or do I have to use a different generator too? Many thanks in advance!

gmihaila commented 3 years ago

Great questions @samuelgoodall!

I think it is fine to use a validation set from the OpenWebText corpus. You're training on OpenWebText, so you want to make sure you don't overfit on it, and that's what the validation data will tell you. Just make sure the data loader for validation yields the same examples every time: validation data should never change, so you can properly evaluate how well the model is training. I wonder why you are concerned about this? Are you planning to use ELECTRA on a different dataset?
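As an illustration, one way to carve out a fixed validation slice (this assumes the Hugging Face `datasets` library and its `openwebtext` dataset; adapt to however you actually load the corpus):

```python
from datasets import load_dataset

# Load OpenWebText and hold out a small, fixed slice for validation.
raw = load_dataset("openwebtext", split="train")
splits = raw.train_test_split(test_size=0.005, seed=42)  # seeded => reproducible split
train_ds, valid_ds = splits["train"], splits["test"]
```

Because the split is seeded, the same validation examples come back at every evaluation step.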

I can't remember what data was originally used to train ELECTRA. If it's very similar to OpenWebText, I think it's fine to keep the generator the same. If the datasets are not the same, you might want to pretrain or fine-tune the generator as well.

It looks to me like you have two options: start from scratch with a new generator, discriminator, and tokenizer and train on OpenWebText, or start from the original pretrained ELECTRA and do extended pretraining on OpenWebText.
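If you go the extended-training route, a hedged sketch of loading the released small ELECTRA checkpoints with `transformers` (pick the checkpoint size that matches your setup):

```python
from transformers import (
    ElectraForMaskedLM,      # the generator (masked language modeling head)
    ElectraForPreTraining,   # the discriminator (replaced-token detection head)
    ElectraTokenizerFast,
)

# Start extended pretraining from the released small checkpoints instead of from scratch.
tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
generator = ElectraForMaskedLM.from_pretrained("google/electra-small-generator")
discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")
```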

I would be very interested in what works best, if you don't mind sharing.

samuelgoodall commented 3 years ago

Hi @gmihaila, I am currently testing what effect the components of the Reformer model have on the training performance of ELECTRA pretraining, with a focus on the question answering downstream task. As there aren't any pretrained models for my use case, I will have to pretrain from scratch. To get something to compare against, I modified the Hugging Face ELECTRA model to incorporate LSH attention, axial positional embeddings, and feed-forward chunking.

To train from scratch, I haggled my way through eBay and managed to get an RTX 3090 for a reasonable price. With ELECTRA pretraining I can train 200,000 steps of a comparably small model in 13-20 hours, depending on which attention I use. In the paper they trained for a million steps, but I think for comparing the different attention/embedding variants it will suffice. I managed to pretrain a model with a modified version of the very good Electra PyTorch repository, and on SQuAD I get an F1 score of 72 for the unmodified ELECTRA model.

I will try to do a split of the OpenWebText dataset to get the validation loss. Since the generator improves over time, should I take the generator from a trained model so I can keep it fixed? As soon as I have new results, I'll share them with you. Thanks for the quick responses and helpful answers :D !

gmihaila commented 3 years ago

@samuelgoodall Wow, nice work! Thank you for sharing this with me!

So in the validation step you must have 2 outputs for each example: