qiaowangli / Advanced_AutoCodeCompletion

Term Project of SENG480B, Fall 2021.

BERT Notes #14

Open BraidenKirkland opened 3 years ago

BraidenKirkland commented 3 years ago

BERT

Training

Two Phases

  1. Pretrain BERT to understand language/context
  2. Fine tune BERT to understand a specific task

Pretraining: BERT learns by training on two unsupervised tasks simultaneously, masked language modeling and next sentence prediction. This step helps BERT understand bidirectional context within a sentence.
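As a rough illustration of the second pretraining task (my own sketch, not code from the article), next sentence prediction builds sentence pairs where sentence B is either the true next sentence or a random one:

```python
import random

# Hedged sketch of building next-sentence-prediction (NSP) training pairs.
# Half the time sentence B is the real next sentence (label 1 = IsNext);
# otherwise it is a random sentence from the corpus (label 0 = NotNext).
def make_nsp_pair(sentences, i):
    sentence_a = sentences[i]
    if random.random() < 0.5 and i + 1 < len(sentences):
        sentence_b, label = sentences[i + 1], 1          # IsNext
    else:
        # In practice you would avoid accidentally picking the true next sentence.
        sentence_b, label = random.choice(sentences), 0  # NotNext
    return sentence_a, sentence_b, label
```

The masked language modeling half of pretraining is sketched further down under "Preparing the Dataset".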

Fine Tuning: How do we apply the learned language understanding to a specific task?

Links: BERT Video, BERT Article, Google Paper

BraidenKirkland commented 3 years ago

Key Points from This Article

Preparing the Dataset

  1. They have a text vector
  2. The text vector gets encoded into integer token ids
  3. They then mask 15% of the input token ids in each sequence at random (roughly as in the sketch below)
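A minimal sketch of what that masking step could look like, assuming integer token ids and a reserved [mask] token id. The names are illustrative, and the full BERT recipe also replaces some masked positions with random or unchanged tokens, which this simplification skips:

```python
import numpy as np

MASK_TOKEN_ID = 1  # assumed id of the [mask] token in the vocabulary

def mask_token_ids(token_ids, mask_rate=0.15, mask_token_id=MASK_TOKEN_ID):
    """Mask ~15% of token ids at random and return (masked_ids, labels)."""
    token_ids = np.asarray(token_ids)
    # Boolean mask that is True at roughly 15% of positions
    mask_positions = np.random.rand(*token_ids.shape) < mask_rate
    # Labels keep the original id at masked positions and -1 (ignore) elsewhere
    labels = np.where(mask_positions, token_ids, -1)
    # Replace the masked positions with the [mask] token id
    masked_ids = np.where(mask_positions, mask_token_id, token_ids)
    return masked_ids, labels
```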

BERT Pretraining Model for Masked Language Modeling

They show an example with the following sample tokens, using a model trained for 5 epochs: `sample_tokens = vectorize_layer(["I have watched this [mask] and it was awesome"])`

The masked sequence was {'input_text': 'i have watched this [mask] and it was awesome'}

Result (after the 3rd epoch): ['movie', 'film', 'was', 'is', 'series'] with probabilities [0.4759983, 0.18642229, 0.045611132, 0.028308254, 0.027862877]. See the link for the printout of results after each epoch; you can see the probability of 'movie' increase with each epoch. This result makes sense, although I am surprised that 'was' and 'is' were predicted with higher probability than 'series'.
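For reference, a hedged sketch of how those top-5 predictions could be read off the MLM output; `mlm_model`, `vectorize_layer`, `mask_token_id`, and the `id2token` lookup are assumed names, not necessarily the article's:

```python
import numpy as np

sample_tokens = vectorize_layer(["I have watched this [mask] and it was awesome"])
predictions = mlm_model.predict(sample_tokens)        # shape: (1, seq_len, vocab_size)

# Locate the masked position and take the 5 most probable vocabulary ids there
masked_index = np.where(np.asarray(sample_tokens)[0] == mask_token_id)[0][0]
probs = predictions[0, masked_index]
top5 = probs.argsort()[-5:][::-1]
print([id2token[i] for i in top5], probs[top5])
```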

Fine Tuning

After this, they fine-tuned a sentiment classification model by adding a pooling layer and a dense layer on top of the pretrained BERT features. For optimization it looks like they used Adam with binary cross-entropy loss and evaluated based on accuracy.
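A rough sketch of what that classification head could look like; `pretrained_bert_model` and `MAX_LEN` are placeholder names for the frozen MLM encoder and the sequence length, not the article's exact code:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Hedged sketch: pooling + dense layer on top of the pretrained BERT features
inputs = layers.Input((MAX_LEN,), dtype="int64")
sequence_output = pretrained_bert_model(inputs)                 # (batch, MAX_LEN, hidden)
pooled_output = layers.GlobalMaxPooling1D()(sequence_output)    # pool over the token axis
outputs = layers.Dense(1, activation="sigmoid")(pooled_output)  # binary sentiment score
classifier_model = keras.Model(inputs, outputs)
```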

```python
# Compile the classifier: Adam optimizer, binary cross-entropy loss, accuracy metric
optimizer = keras.optimizers.Adam()
classifier_model.compile(
    optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"]
)
```

Adam Optimization Algorithm

End-to-end Model

Finally, they created an end-to-end model that accepts raw strings as its input.
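A hedged sketch of how that wrapper could be assembled, reusing the `vectorize_layer` and `classifier_model` names from the snippets above:

```python
import numpy as np
from tensorflow import keras

# Chain the text-vectorization layer and the fine-tuned classifier so the
# combined model accepts raw strings directly.
raw_inputs = keras.Input(shape=(1,), dtype="string")
token_ids = vectorize_layer(raw_inputs)     # raw text -> integer token ids
outputs = classifier_model(token_ids)       # token ids -> sentiment probability
end_to_end_model = keras.Model(raw_inputs, outputs)

end_to_end_model.predict(np.array([["I have watched this movie and it was awesome"]]))
```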

samsara138 commented 3 years ago

I will use the structure in this article and the Jupyter notebook example BERT Fine-Tuning Sentence Classification. Putting it here so I don't lose the link again.

BraidenKirkland commented 3 years ago

Thank you. I will take a look through those articles.

BraidenKirkland commented 3 years ago

Attention Is All You Need

This paper proposes a new neural network architecture called the Transformer, which is based solely on attention mechanisms. The result is a model of superior quality that requires much less training time because it allows for significantly more parallelization. This is an improvement over recurrent models, which are constrained by sequential computation.

Model Architecture

Structure of most neural sequence transduction models: X -> Encoder -> Z -> Decoder -> Y

  1. Encoder and Decoder Stacks
    • The encoder is a stack of 6 identical layers, and each layer is composed of a multi-head self-attention mechanism followed by a feed-forward network
    • The decoder is also a stack of 6 identical layers, and each layer has the two sub-layers of the encoder layer plus a third sub-layer that performs multi-head attention over the output of the encoder stack
  2. Attention
    • A query and a set of key-value pairs are mapped to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values.
    a. Scaled Dot-Product Attention
      • The dot products of the query with all keys are divided by sqrt(d_k), and a softmax function is applied to obtain the weights on the values (see the sketch after this list)
    b. Multi-head Attention
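To make item 2a concrete, here is a minimal numpy sketch of scaled dot-product attention as described in the paper, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K have shape (..., seq_len, d_k); V has shape (..., seq_len, d_v)."""
    d_k = Q.shape[-1]
    # Similarity of every query to every key, scaled by sqrt(d_k)
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    # Softmax over the keys gives the weights on the values
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Output is the weighted sum of the values
    return weights @ V
```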