In each encoder layer, self-attention is followed by a feed-forward network.
In self-attention there are three projection matrices: query (Q), key (K), and value (V).
Attention weights are computed as softmax(QK^T / sqrt(d_k)); these weights are then used to take a weighted sum of all the values for each token (see the sketch after these notes).
Each layer has multiple attention heads, i.e., multiple sets of Q, K, V matrices.
Concatenate the outputs of all attention heads, multiply the result by another matrix W_o to project it back to the model dimension, and then pass it through the feed-forward network.
Sinusoidal positional encodings are also added to the input embeddings.
On the decoder side, an encoder-decoder attention layer is used in addition to masked self-attention.
In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation. (Masked Self Attention)
The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
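A minimal NumPy sketch of the pieces above: scaled dot-product attention with an optional causal mask, plus sinusoidal positional encodings. This is an illustrative sketch, not a reference implementation; it is single-head and unbatched, and it feeds the same array in as Q, K, and V instead of applying learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q, K, V: (seq_len, d_k) arrays for a single head, no batching."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarity scores
    if causal:
        # Masked self-attention (decoder side): future positions get -inf,
        # so softmax assigns them zero weight.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)         # attention weights per query token
    return weights @ V                         # weighted sum of the values

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy usage: 4 tokens, d_model = 8; the same array stands in for Q, K and V.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) + sinusoidal_positional_encoding(4, 8)
out = scaled_dot_product_attention(x, x, x, causal=True)
print(out.shape)  # (4, 8)
```

For multi-head attention, this runs once per head with separate learned Q/K/V projections; the heads' outputs are concatenated and multiplied by W_o.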
BERT
Standard language-model Transformers (e.g., GPT) are trained only in the forward (left-to-right) direction; BERT is instead trained as a Masked LM, so it can use context from both sides.
A word is masked, and the output at that position is used to predict the masked word.
Sometimes a word is instead randomly replaced with another word and the model is asked to predict the correct word at that position (see the masking sketch after these notes).
To make BERT better at handling relationships between multiple sentences, the pre-training process includes an additional task: Given two sentences (A and B), is B likely to be the sentence that follows A, or not?
BERT can also be used like ELMo (Embeddings from Language Models), i.e., as a source of contextual embeddings for a downstream model.
As input, BERT takes the sum of a token embedding (word/sub-word embedding), a position embedding, and a segment embedding.
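A rough Python sketch of the masking scheme described above, assuming the commonly cited BERT recipe (15% of tokens are selected; of those, 80% become [MASK], 10% become a random token, 10% are left unchanged); the toy vocabulary and sentence are made up for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]",
                select_prob=0.15, mask_prob=0.8, random_prob=0.1):
    """Return (corrupted_tokens, labels); labels holds the original token at
    each selected position and None elsewhere (positions ignored by the MLM loss)."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < select_prob:
            labels.append(tok)                          # model must predict this token
            r = random.random()
            if r < mask_prob:
                corrupted.append(mask_token)            # 80%: replace with [MASK]
            elif r < mask_prob + random_prob:
                corrupted.append(random.choice(vocab))  # 10%: random replacement
            else:
                corrupted.append(tok)                   # 10%: keep the original token
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

# Toy vocabulary and sentence, purely for illustration.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
print(mask_tokens("the cat sat on the mat".split(), vocab))
```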
GPT2
GPT-2 is built from transformer decoder blocks (masked self-attention, so it generates text autoregressively), whereas BERT uses transformer encoder blocks.
SVM
Word2Vec
CBOW: learns to predict a word from its context; several times faster to train than skip-gram, with slightly better accuracy for frequent words.
Skip-gram: predicts the context given a word; works well with small amounts of training data and represents even rare words or phrases well.
Sub-sampling: high-frequency words often provide little information, so words with frequency above a certain threshold may be sub-sampled to increase training speed.
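A small Python sketch of the frequency-based sub-sampling rule from the word2vec paper, where a word with relative frequency f(w) above the threshold t is kept with probability sqrt(t / f(w)); the toy corpus and threshold below are illustrative.

```python
import math
import random
from collections import Counter

def subsample(tokens, t=1e-5):
    """Randomly drop high-frequency tokens; rare tokens are always kept."""
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for tok in tokens:
        freq = counts[tok] / total                        # relative frequency f(w)
        keep_prob = math.sqrt(t / freq) if freq > t else 1.0
        if random.random() < keep_prob:
            kept.append(tok)
    return kept

# Toy corpus; a large threshold is used so the effect is visible here
# (real corpora typically use t around 1e-5).
corpus = ("the cat sat on the mat because the mat was warm " * 100).split()
print(len(corpus), len(subsample(corpus, t=0.1)))  # frequent words like "the" get thinned out
```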
GloVe
Word2vec only uses local context windows, while GloVe also uses global corpus-level co-occurrence statistics.
GloVe combines two ideas: global matrix factorization and local context-window methods.
Instead of extracting the embeddings from a neural network that is designed to perform a different task, like predicting the focus word from its context (CBOW) or predicting neighboring words (Skip-Gram), the embeddings are optimized directly so that the dot product of two word vectors equals the log of the number of times the two words occur near each other.
E.g., if "cat" and "dog" co-occur 10 times in the corpus: Vector(cat) · Vector(dog) ≈ log(10).
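For reference, the GloVe objective is a weighted least-squares problem over the co-occurrence counts X_ij, with per-word biases and a weighting function f that down-weights rare pairs:

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```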
FastText
fastText represents each word as a bag of character n-grams (plus the word itself).
A skip-gram model is then trained to learn embeddings for these n-grams; a word's vector is the sum of its n-gram vectors.
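A small Python sketch of the character n-gram decomposition, with the usual < and > word-boundary markers; in fastText the word vector is the sum of the vectors of these n-grams (plus the whole word itself).

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams, fastText-style, with < and > boundary markers."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(w[i:i + n] for i in range(len(w) - n + 1))
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```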
Linear Regression
Bagging
Bagging (bootstrap aggregating) trains each base learner on a sample drawn with replacement from the training data and then aggregates their predictions (majority vote or averaging).
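A minimal NumPy sketch of the bootstrap sampling step (sampling with replacement) that bagging applies before training each base learner; the data here are placeholders.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw a sample of the same size as the data, with replacement."""
    n = len(X)
    idx = rng.integers(0, n, size=n)   # indices may repeat; some rows are never drawn
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)
y = np.arange(10)
Xb, yb = bootstrap_sample(X, y, rng)
print(sorted(yb))  # duplicates appear; on average ~37% of rows are left out ("out-of-bag")
```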
RF
Random forest improves on bagging by adding extra randomness: each tree is still grown on a bootstrap sample of the instances, but at each split only a random subset of the features is considered as split candidates.
The benefit is that random forest decorrelates the trees.
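A minimal usage sketch, assuming scikit-learn is available; max_features controls how many randomly chosen features are considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# max_features="sqrt": at each split only sqrt(n_features) randomly chosen
# features are considered, which decorrelates the individual trees.
clf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```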
Boosting
Boosting builds an ensemble of weak learners iteratively. In each iteration a new learner is added while all existing learners are kept unchanged. Learners are weighted based on their performance (e.g., accuracy), and after a weak learner is added the data are re-weighted: examples that are misclassified gain weight, while examples that are correctly classified lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified.
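A compact AdaBoost-style sketch of the reweighting loop described above, using depth-1 decision trees (stumps) from scikit-learn as weak learners and labels in {-1, +1}; an illustrative sketch rather than a reference implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = np.where(y01 == 1, 1, -1)                        # labels in {-1, +1}

n, M = len(X), 25
w = np.full(n, 1.0 / n)                              # start with uniform example weights
learners, alphas = [], []

for _ in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum() / w.sum()               # weighted error of this weak learner
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))  # better learner => larger vote
    w *= np.exp(-alpha * y * pred)                   # misclassified examples gain weight
    w /= w.sum()                                     # renormalize
    learners.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of all weak learners.
F = sum(a * h.predict(X) for a, h in zip(alphas, learners))
print("train accuracy:", np.mean(np.sign(F) == y))
```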
Difference b/w RF and Boosting
RF grows trees in parallel, while Boosting is sequential
RF mainly reduces variance, while Boosting reduces error mainly by reducing bias.
Decision Tree
AdaBoost
LSTM
PCA
GAN loss
K means
RCNN