qiaowangli / Advanced_AutoCodeCompletion

Term Project of SENG480B, Fall 2021.

BERT Notes #14

Open BraidenKirkland opened 2 years ago

BraidenKirkland commented 2 years ago

BERT

Training

Two Phases

  1. Pretrain BERT to understand language/context
  2. Fine tune BERT to understand a specific task

Pretraining: BERT learns by training on two unsupervised tasks simultaneously (masked language modeling and next-sentence prediction). This step helps BERT understand bidirectional context within a sentence.

Fine Tuning: How do we use the pretrained language model for a specific task?

Links: BERT Video, BERT Article, Google Paper

BraidenKirkland commented 2 years ago

Key Points from This Article

Preparing the Dataset

  1. They have a text vector
  2. The text vector gets encoded into integer token ids
  3. They then mask 15% of the input token ids in a sequence at random (see the sketch after this list)
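A minimal sketch of that masking step, assuming NumPy arrays of token ids and a hypothetical `mask_token_id`; the standard BERT recipe also replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here.

```python
# Minimal sketch (not the article's exact code): randomly mask ~15% of token ids.
# mask_token_id is a hypothetical id for the [mask] token in the vocabulary.
import numpy as np

def mask_tokens(token_ids, mask_token_id, mask_rate=0.15, seed=0):
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    selected = rng.random(token_ids.shape) < mask_rate     # choose ~15% of positions
    labels = np.where(selected, token_ids, -1)             # -1 marks positions not predicted
    masked_ids = np.where(selected, mask_token_id, token_ids)
    return masked_ids, labels
```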

BERT Pretraining Model for Masked Language Modeling

They show an example with the following sample tokens; the model was trained for 5 epochs.

sample_tokens = vectorize_layer(["I have watched this [mask] and it was awesome"])

The masked sequence was {'input_text': 'i have watched this [mask] and it was awesome'}

Result (after the 3rd epoch): ['movie', 'film', 'was', 'is', 'series'] with probabilities [0.4759983, 0.18642229, 0.045611132, 0.028308254, 0.027862877]. See the link for the printout of results after each epoch; you can see the probability of 'movie' increase between epochs. This result makes sense, although I am surprised that 'was' and 'is' were predicted with higher probability than 'series'.
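For reference, a hypothetical sketch of how the top-k predictions at the [mask] position can be read off the MLM output; `mlm_model`, `mask_token_id`, and `id2token` are assumptions, not names from the article.

```python
# Hypothetical sketch: decode the top-k predictions at the [mask] position.
import numpy as np

def top_k_at_mask(mlm_model, token_ids, mask_token_id, id2token, k=5):
    token_ids = np.asarray(token_ids)                      # shape (1, seq_len)
    probs = mlm_model.predict(token_ids)[0]                # shape (seq_len, vocab_size)
    mask_pos = np.where(token_ids[0] == mask_token_id)[0][0]
    top_ids = np.argsort(probs[mask_pos])[::-1][:k]        # highest-probability token ids
    return [(id2token[i], float(probs[mask_pos][i])) for i in top_ids]
```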

Fine Tuning

After this, they fine-tuned the sentiment classification model by adding a pooling layer and a dense layer on top of the pretrained BERT features. For optimization, it looks like they used Adam with binary cross-entropy loss and evaluated based on accuracy.

from tensorflow import keras  # import assumed; classifer_model is built earlier in the article

# Compile the fine-tuning classifier with Adam and binary cross-entropy
optimizer = keras.optimizers.Adam()
classifer_model.compile(
    optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
)
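For context, a hedged sketch of what that classification head might look like (a pooling layer plus a dense sigmoid output on top of the pretrained BERT features); the function, argument names, and `max_len` are assumptions, not the article's exact code.

```python
from tensorflow import keras

def build_classifier(pretrained_bert_features, max_len=256):
    # pretrained_bert_features is assumed to be a Keras model/layer that maps
    # integer token ids to (batch, seq_len, hidden) feature tensors.
    inputs = keras.Input(shape=(max_len,), dtype="int64")
    features = pretrained_bert_features(inputs)
    pooled = keras.layers.GlobalMaxPooling1D()(features)           # pooling layer
    outputs = keras.layers.Dense(1, activation="sigmoid")(pooled)  # binary sentiment output
    return keras.Model(inputs, outputs)
```

The single sigmoid unit matches the binary cross-entropy loss used in the compile step above.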

Adam Optimization Algorithm
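For reference, one step of the standard Adam update, sketched in NumPy; the default hyperparameters shown here match the usual Keras defaults.

```python
# One Adam update step (standard algorithm, not code from the article).
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1 - beta1) * grad            # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment (variance) estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```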

End-to-end Model

Finally, they created an end-to-end model that accepts raw strings as its input.
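A hedged sketch of how such an end-to-end model could be wired together, reusing the article's `vectorize_layer` and `classifer_model` names; the exact wiring here is an assumption.

```python
from tensorflow import keras

def build_end_to_end(vectorize_layer, classifer_model):
    # Wrap text vectorization and the trained classifier into one model
    # so that raw strings can be fed in directly.
    raw_inputs = keras.Input(shape=(1,), dtype="string")
    token_ids = vectorize_layer(raw_inputs)      # raw strings -> integer token ids
    outputs = classifer_model(token_ids)         # token ids -> sentiment probability
    return keras.Model(raw_inputs, outputs)
```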

samsara138 commented 2 years ago

I will use the structure in this article and the Jupyter notebook example BERT Fine-Tuning Sentence Classification. Putting it here so I don't lose the link again.

BraidenKirkland commented 2 years ago

Thank you. I will take a look through those articles.

BraidenKirkland commented 2 years ago

Attention Is All You Need

This paper proposes a new neural network architecture called the Transformer, which is based solely on attention mechanisms. The result is a model of superior quality that requires significantly less training time because it allows for significantly more parallelization. This is an improvement over recurrent models, which are constrained by sequential computation.

Model Architecture

Structure of most neural sequence transduction models: X -> Encoder -> Z -> Decoder -> Y

  1. Encoder and Decoder Stacks
    • The encoder has 6 identical layers, and each layer is composed of a multi-head self-attention mechanism and a position-wise feed-forward network
    • The decoder also has 6 identical layers; each layer contains the two sub-layers of the encoder layer plus a third sub-layer which performs multi-head attention over the output of the encoder stack
  2. Attention
    • A query and a set of key-value pairs are mapped to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values.
      a. Scaled Dot-Product Attention
        • A softmax function is applied to the scaled dot products of the queries and keys to obtain the weights on the values (see the sketch after this list)
      b. Multi-head Attention
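For reference, the scaled dot-product attention computation from the paper, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, sketched in NumPy with illustrative shapes.

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # weighted sum of the values
```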