In each encoder layer, self-attention is followed by a feed-forward network.
In self-attention there are three projection matrices: query (Q), key (K), and value (V).
Attention weights are computed as softmax(QK^T / sqrt(d_k)); these weights are then used to take a weighted sum of all the values for each token (see the sketch after these notes).
Each layer has multiple attention heads, i.e., multiple sets of Q, K, V matrices.
Concatenate the outputs of all attention heads, multiply the result by another matrix W_o to project it back to the model dimension, and then pass it through the feed-forward network.
Sinusoidal positional encodings are also added to the input embeddings.
On the decoder side, an encoder-decoder attention layer is used in addition to masked self-attention.
In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation. (Masked Self Attention)
The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
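A minimal NumPy sketch of the pieces above: scaled dot-product attention with an optional causal mask, plus sinusoidal positional encodings. This is an illustrative sketch, not a reference implementation; it is single-head and unbatched, and it feeds the same array in as Q, K, and V instead of applying learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q, K, V: (seq_len, d_k) arrays for a single head, no batching."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarity scores
    if causal:
        # Masked self-attention (decoder side): future positions get -inf,
        # so softmax assigns them zero weight.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)         # attention weights per query token
    return weights @ V                         # weighted sum of the values

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy usage: 4 tokens, d_model = 8; the same array stands in for Q, K and V.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) + sinusoidal_positional_encoding(4, 8)
out = scaled_dot_product_attention(x, x, x, causal=True)
print(out.shape)  # (4, 8)
```

For multi-head attention, this runs once per head with separate learned Q/K/V projections; the heads' outputs are concatenated and multiplied by W_o.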
BERT
Standard language-model Transformers (e.g., GPT) are trained only in the forward (left-to-right) direction; BERT is instead trained as a Masked LM, so it can use context from both sides.
A word is masked, and the output at that position is used to predict the masked word.
Sometimes a word is instead randomly replaced with another word and the model is asked to predict the correct word at that position (see the masking sketch after these notes).
To make BERT better at handling relationships between multiple sentences, the pre-training process includes an additional task: Given two sentences (A and B), is B likely to be the sentence that follows A, or not?
BERT can also be used like ELMo (Embeddings from Language Models), i.e., as a source of contextual embeddings for a downstream model.
As input, BERT takes the sum of a token embedding (word/sub-word embedding), a position embedding, and a segment embedding.
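A rough Python sketch of the masking scheme described above, assuming the commonly cited BERT recipe (15% of tokens are selected; of those, 80% become [MASK], 10% become a random token, 10% are left unchanged); the toy vocabulary and sentence are made up for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]",
                select_prob=0.15, mask_prob=0.8, random_prob=0.1):
    """Return (corrupted_tokens, labels); labels holds the original token at
    each selected position and None elsewhere (positions ignored by the MLM loss)."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < select_prob:
            labels.append(tok)                          # model must predict this token
            r = random.random()
            if r < mask_prob:
                corrupted.append(mask_token)            # 80%: replace with [MASK]
            elif r < mask_prob + random_prob:
                corrupted.append(random.choice(vocab))  # 10%: random replacement
            else:
                corrupted.append(tok)                   # 10%: keep the original token
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

# Toy vocabulary and sentence, purely for illustration.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
print(mask_tokens("the cat sat on the mat".split(), vocab))
```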
GPT2
GPT-2 is built from transformer decoder blocks (masked self-attention, so it generates text autoregressively), whereas BERT uses transformer encoder blocks.
SVM
Word2Vec
CBOW: learns to predict a word from its context; several times faster to train than skip-gram, with slightly better accuracy for frequent words.
Skip-gram: predicts the context given a word; works well with small amounts of training data and represents even rare words or phrases well.
Sub-sampling: high-frequency words often provide little information, so words with frequency above a certain threshold may be sub-sampled to increase training speed.
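A small Python sketch of the frequency-based sub-sampling rule from the word2vec paper, where a word with relative frequency f(w) above the threshold t is kept with probability sqrt(t / f(w)); the toy corpus and threshold below are illustrative.

```python
import math
import random
from collections import Counter

def subsample(tokens, t=1e-5):
    """Randomly drop high-frequency tokens; rare tokens are always kept."""
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for tok in tokens:
        freq = counts[tok] / total                        # relative frequency f(w)
        keep_prob = math.sqrt(t / freq) if freq > t else 1.0
        if random.random() < keep_prob:
            kept.append(tok)
    return kept

# Toy corpus; a large threshold is used so the effect is visible here
# (real corpora typically use t around 1e-5).
corpus = ("the cat sat on the mat because the mat was warm " * 100).split()
print(len(corpus), len(subsample(corpus, t=0.1)))  # frequent words like "the" get thinned out
```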
GloVe
Word2vec only uses local context windows, while GloVe also uses global corpus-level co-occurrence statistics.
GloVe combines two ideas: global matrix factorization and local context-window methods.
Instead of extracting the embeddings from a neural network that is designed to perform a different task, like predicting the focus word from its context (CBOW) or predicting neighboring words (Skip-Gram), the embeddings are optimized directly so that the dot product of two word vectors equals the log of the number of times the two words occur near each other.
E.g., if "cat" and "dog" co-occur 10 times in the corpus: Vector(cat) · Vector(dog) ≈ log(10).
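For reference, the GloVe objective is a weighted least-squares problem over the co-occurrence counts X_ij, with per-word biases and a weighting function f that down-weights rare pairs:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```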
FastText
fastText represents each word as a bag of character n-grams (plus the word itself).
A skip-gram model is then trained to learn embeddings for these n-grams; a word's vector is the sum of its n-gram vectors.
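A small Python sketch of the character n-gram decomposition, with the usual < and > word-boundary markers; in fastText the word vector is the sum of the vectors of these n-grams (plus the whole word itself).

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams, fastText-style, with < and > boundary markers."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```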
Linear Regression
Bagging
Bagging (bootstrap aggregating) trains each base learner on a sample drawn with replacement from the training data and then aggregates their predictions (majority vote or averaging).
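A minimal NumPy sketch of the bootstrap sampling step (sampling with replacement) that bagging applies before training each base learner; the data here are placeholders.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw a sample of the same size as the data, with replacement."""
    n = len(X)
    idx = rng.integers(0, n, size=n)   # indices may repeat; some rows are never drawn
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)
y = np.arange(10)
Xb, yb = bootstrap_sample(X, y, rng)
print(sorted(yb))  # duplicates appear; on average ~37% of rows are left out ("out-of-bag")
```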
RF
Random forest improves on bagging by adding extra randomness: each tree is still grown on a bootstrap sample of the instances, but at each split only a random subset of the features is considered as split candidates.
The benefit is that random forest decorrelates the trees.
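A minimal usage sketch, assuming scikit-learn is available; max_features controls how many randomly chosen features are considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# max_features="sqrt": at each split only sqrt(n_features) randomly chosen
# features are considered, which decorrelates the individual trees.
clf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```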
Boosting
Boosting builds an ensemble of weak learners iteratively. In each iteration a new learner is added while all existing learners are kept unchanged. Learners are weighted based on their performance (e.g., accuracy), and after a weak learner is added the data are re-weighted: examples that are misclassified gain weight, while examples that are correctly classified lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified.
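A compact AdaBoost-style sketch of the reweighting loop described above, using depth-1 decision trees (stumps) from scikit-learn as weak learners and labels in {-1, +1}; an illustrative sketch rather than a reference implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)                        # labels in {-1, +1}

n, M = len(X), 25
w = np.full(n, 1.0 / n)                              # start with uniform example weights
learners, alphas = [], []

for _ in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum() / w.sum()               # weighted error of this weak learner
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))  # better learner => larger vote
    w *= np.exp(-alpha * y * pred)                   # misclassified examples gain weight
    w /= w.sum()                                     # renormalize
    learners.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of all weak learners.
F = sum(a * h.predict(X) for a, h in zip(alphas, learners))
print("train accuracy:", np.mean(np.sign(F) == y))
```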
Difference b/w RF and Boosting
RF grows trees in parallel, while Boosting is sequential
RF mainly reduces variance, while Boosting reduces error mainly by reducing bias.
Decision Tree
AdaBoost
LSTM
PCA
GAN loss
K means
RCNN