Transformers
In each encoder, a self-attention layer is followed by a feed-forward network.
In self-attention, there are three projection matrices producing query (Q), key (K), and value (V) for each token.
Attention weights = softmax(Q*K^T / sqrt(d_k)); use these weights to take a weighted sum of all the values for each token.
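A minimal NumPy sketch of scaled dot-product attention (the 1/sqrt(d_k) scaling follows the original Transformer paper); the function name and shapes are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # each query scored against every key
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                               # weighted sum of values per token
```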
Have multiple heads at each layer (separate Q, K, V matrices per head).
Concatenate the output from each attention head, multiply it by another matrix W_o to get a representation of the required size, and then pass it through the feed-forward network.
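Continuing the NumPy sketch above (reusing scaled_dot_product_attention); it assumes d_model is divisible by num_heads, and the per-head column slicing is one illustrative way to split the projections:

```python
def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); W_q/W_k/W_v/W_o: (d_model, d_model)."""
    d_model = X.shape[-1]
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    head_outputs = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)   # each head works on its own slice of Q, K, V
        head_outputs.append(scaled_dot_product_attention(Q[:, cols], K[:, cols], V[:, cols]))
    concat = np.concatenate(head_outputs, axis=-1)   # concatenate all head outputs
    return concat @ W_o                              # project back to the required size with W_o
```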
Also add positional encodings (sinusoidal)
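A sketch of the sinusoidal encoding from the original paper (assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe                                          # added element-wise to the token embeddings
```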
On the decoder side, use encoder-decoder attention.
In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation. (Masked Self Attention)
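A NumPy sketch of that masking step: future positions get -inf scores, so they receive zero weight after the softmax:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Decoder self-attention: each position may only attend to itself and earlier positions."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal = future positions
    scores[future] = -np.inf                                  # -inf becomes zero weight after softmax
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```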
The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
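A single-head sketch of that wiring, reusing scaled_dot_product_attention from above; parameter names are illustrative:

```python
def encoder_decoder_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    """Queries come from the decoder layer below; keys and values from the encoder stack output."""
    Q = decoder_states @ W_q      # (target_len, d_k)
    K = encoder_output @ W_k      # (source_len, d_k)
    V = encoder_output @ W_v      # (source_len, d_v)
    return scaled_dot_product_attention(Q, K, V)
```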
BERT
Standard transformer language models are trained only in the forward (left-to-right) manner; BERT is instead trained as a masked LM, so it can use context from both directions.
Mask a word and use the output at that position to predict the masked word.
Sometimes it randomly replaces a word with another word and asks the model to predict the correct word in that position.
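A toy sketch of this corruption scheme (BERT masks roughly 15% of positions, split 80% [MASK] / 10% random token / 10% left unchanged); the helper name is illustrative:

```python
import random

def mask_for_mlm(tokens, vocab, mask_prob=0.15):
    """Pick ~15% of positions as prediction targets; 80% -> [MASK], 10% -> random token, 10% unchanged."""
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                         # the model must predict the original token here
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # random replacement
            # else: keep the original token, but it is still a prediction target
    return corrupted, targets
```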
To make BERT better at handling relationships between multiple sentences, the pre-training process includes an additional task: Given two sentences (A and B), is B likely to be the sentence that follows A, or not?
Can also be used like ELMo (Embeddings from Language Models): extract contextualized embeddings from its layers and feed them to a downstream model.
As input, BERT takes the sum of a token embedding (word/sub-word embedding), a position embedding, and a segment embedding.
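A sketch of how the three embeddings combine (they are summed element-wise); the lookup-table shapes here are assumptions for illustration:

```python
import numpy as np

def bert_input_representation(token_ids, segment_ids, token_emb, position_emb, segment_emb):
    """Input = token embedding + position embedding + segment embedding.
    token_emb: (vocab, d); position_emb: (max_len, d); segment_emb: (2, d) for sentences A/B."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + position_emb[positions] + segment_emb[segment_ids]
```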
GPT2
GPT-2 is built from transformer decoder blocks; BERT, on the other hand, uses transformer encoder blocks.
Questions Misc:
What is your definition of success for this role?
How much independent action vs working off a provided list is expected?
Is there a conference/travel budget and what are the rules to use it?
Can I contribute to FOSS projects? Are there any approvals needed?
You must be getting ideas of your own; how do you get to work on them?
Does the company invest in developers?
Paid conferences (and paid time)
Courses
What does a typical day look like in the Applied Scientist role?