SVMs are large margin classifiers; best used after feature scaling.
Soft margin classification: allow some instances in the middle of the street (margin violations).
Reducing C makes the street larger (more margin violations, more regularization).
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, random_state=42))
svm_clf.fit(X, y)
svm_clf.predict(X_new)
svm_clf.decision_function(X_new)
For nonlinear data (works for small datasets), add polynomial features:
make_pipeline(PolynomialFeatures(degree=3), StandardScaler(), LinearSVC(C=10, max_iter=10_000, random_state=42))
The other option, for larger datasets, is the kernel trick: SVC(kernel="poly", degree=3, coef0=1, C=5). coef0 controls how much the model is influenced by high-degree terms versus low-degree terms.
Or similarity features (RBF kernel; good for small or medium-sized nonlinear training sets): SVC(kernel="rbf", gamma=5, C=0.001)
A small gamma value makes the bell-shaped curve wider. If the model is underfitting, increase γ.
String kernels for text, DNA
ϵ (called tol in Scikit-Learn) sets the stopping tolerance for high-precision needs. In SVM regression, a separate ϵ (the epsilon hyperparameter) controls the width of the street (the margin).
Use the LinearSVR class to perform linear SVM regression (SVR with a kernel for nonlinear regression).
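A minimal SVM-regression sketch under those assumptions (X, y are assumed to exist; the hyperparameter values are illustrative):

from sklearn.svm import LinearSVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# epsilon controls the width of the street; a wider street tolerates more instances inside it
svm_reg = make_pipeline(StandardScaler(), LinearSVR(epsilon=0.5, random_state=42))
svm_reg.fit(X, y)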
TODO: Under the Hood of Linear SVM Classifiers and understand Dual Problem.
Using a QP solver is one way to train an SVM. Another is to use gradient descent to minimize the hinge loss or the squared hinge loss
Mercer’s theorem: if a kernel K(a, b) satisfies certain conditions, there exists a mapping φ to another space such that K(a, b) = φ(a)ᵀφ(b).
Decision trees: white-box models; prediction is fast, roughly O(log₂ m).
Likely to overfit, so control max_features, max_depth, min_samples_split, ... Increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model.
from sklearn.tree import export_graphviz  # to export the tree to a .dot file
To display it in a notebook:
from graphviz import Source
Source.from_file("iris_tree.dot")
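A hedged export sketch (assumes a tree_clf trained on the usual iris petal features):

export_graphviz(
    tree_clf,                       # a trained DecisionTreeClassifier (assumption)
    out_file="iris_tree.dot",
    feature_names=["petal length (cm)", "petal width (cm)"],
    class_names=iris.target_names,  # assumes the iris dataset object is available
    rounded=True,
    filled=True,
)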
gini=0 for pure node in decision tree
An alternative to Gini impurity is entropy (criterion="entropy").
The CART algorithm produces binary trees; ID3 can produce nodes with more than two children.
tree_clf.predict_proba ## for classification
CART cost function (classification): J(k, t_k) = (m_left/m)·G_left + (m_right/m)·G_right, where G measures the impurity of each subset and m the number of instances.
DecisionTreeClassifier for classification; DecisionTreeRegressor for regression.
χ2 test (chi-squared test), are used to estimate the probability that the improvement is purely the result of chance
CART cost function for regression: J(k, t_k) = (m_left/m)·MSE_left + (m_right/m)·MSE_right (minimize MSE instead of impurity).
Decision trees are sensitive to training set rotation; one mitigation is applying PCA first.
Decision trees have high variance; training is also stochastic unless you set the random_state hyperparameter.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

estimator = []
estimator.append(('LR', LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=200)))
estimator.append(('SVC', SVC(gamma='auto', probability=True)))
estimator.append(('DTC', DecisionTreeClassifier()))

# Voting Classifier with hard voting
vot_hard = VotingClassifier(estimators=estimator, voting='hard')
vot_hard.fit(X_train, y_train)
for name, clf in vot_hard.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))
Soft voting often achieves higher performance than hard voting because it gives more weight to highly confident votes:
voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True
Same algorithm on different random subsets: when sampling is performed with replacement, the method is called bagging (short for bootstrap aggregating); when sampling is performed without replacement, it is called pasting. Bagging generally yields better models.
max_features="sqrt" in bagging classfier for limiting featues
RandomForestClassifier(n_estimators=500, max_leaf_nodes=16)
Extra-Trees are equivalent to using splitter="random" when creating a DecisionTreeClassifier (use the ExtraTreesClassifier class).
rnd_clf.feature_importances_
AdaBoost: the algorithm increases the relative weight of the training instances that the previous predictor misclassified.
Predictor weight: α_j = η · log((1 − r_j) / r_j), where r_j is the predictor's weighted error rate.
Then, for every misclassified instance, the old weight is multiplied by exp(α_j), and all instance weights are normalized.
There is a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function).
AdaBoostClassifier( DecisionTreeClassifier(max_depth=1), n_estimators=30, learning_rate=0.5, random_state=42)
Scikit-Learn’s GradientBoostingRegressor
if you set the n_iter_no_change hyperparameter to an integer value, say 10, then the GradientBoostingRegressor will automatically stop adding more trees during training if it sees that the last 10 trees didn’t help.
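For instance (illustrative hyperparameter values; X, y assumed):

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, learning_rate=0.05, n_estimators=500,
                                 n_iter_no_change=10, random_state=42)  # stops adding trees if the last 10 don't help
gbrt.fit(X, y)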
Histogram-based gradient boosting (HGB): works by binning the input features, replacing them with integers. The number of bins is controlled by the max_bins hyperparameter.
Example:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import HistGradientBoostingRegressor

hgb_reg = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ["ocean_proximity"]), remainder="passthrough"),
    HistGradientBoostingRegressor(categorical_features=[0], random_state=42)
)
hgb_reg.fit(housing, housing_labels)
Stacking: replace the voting aggregation with a model that learns the final prediction (a blender, or meta-learner).
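A minimal stacking sketch with scikit-learn's StackingClassifier (the base estimators and X_train, y_train are assumptions):

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stacking_clf = StackingClassifier(
    estimators=[("lr", LogisticRegression()), ("svc", SVC(probability=True))],
    final_estimator=RandomForestClassifier(random_state=43),  # the blender / meta-learner
    cv=5)
stacking_clf.fit(X_train, y_train)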
Dimensionality reduction, e.g. for MNIST: merge neighboring pixels, remove white borders.
Manifold hypothesis: most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.
X = [...]  # create a small 3D dataset
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt[0]  # first principal component
c2 = Vt[1]  # second principal component
# project the training set onto the plane defined by the first two principal components:
W2 = Vt[:2].T
X2D = X_centered @ W2
Or with Scikit-Learn:
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
pca.explained_variance_ratio_
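To choose the number of dimensions, you can pass a variance ratio to n_components instead (a sketch; X_train assumed):

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)          # keep enough components to preserve 95% of the variance
X_reduced = pca.fit_transform(X_train)
print(pca.n_components_)              # number of components actually kept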
from sklearn.datasets import make_blobs
X, y = make_blobs([...])
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)
kmeans.cluster_centers_
Soft clustering: the score can be the distance between the instance and each centroid: kmeans.transform(X_new).round(2)
model’s inertia, which is the sum of the squared distances between the instances and their closest centroids.
score() -> negative inertia
K-means can be accelerated by avoiding many unnecessary distance calculations (Elkan's algorithm).
Also MiniBatchKMeans for large datasets.
Plotting inertia vs. k produces an elbow.
Silhouette score = mean silhouette coefficient over all instances; each coefficient is (b − a) / max(a, b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster.
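A quick sketch with scikit-learn (uses X and the fitted kmeans from above):

from sklearn.metrics import silhouette_score

silhouette_score(X, kmeans.labels_)  # mean silhouette coefficient; closer to +1 is better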
k-means does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes
Image segmentation: color, semantic, or instance segmentation; prep the image with X = image.reshape(-1, 3)
Label propagation is a semi-supervised machine learning algorithm that assigns labels to previously unlabeled data points. How it works:
Graph creation: the algorithm starts by building a graph that connects all examples (rows) in the dataset based on their distance (e.g., Euclidean distance). Nodes represent the training data; edges represent similarities between them.
Soft labels: nodes carry soft labels (label distributions) based on the labels of nearby examples; these soft labels are propagated along the edges to connected nodes.
Iterative process: this is repeated for a fixed number of iterations, gradually strengthening the labels assigned to unlabeled examples.
Scikit-Learn also offers two classes that can propagate labels automatically: LabelSpreading and LabelPropagation in the sklearn.semi_supervised package.
DBSCAN has a fit_predict() method but no predict() method.
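A minimal DBSCAN sketch (X assumed; eps and min_samples are illustrative):

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.05, min_samples=5)
dbscan.fit(X)        # no predict(); use fit_predict() or read dbscan.labels_
dbscan.labels_       # label -1 marks instances considered anomalies (in no cluster)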
Agglomerative clustering: a hierarchy of clusters is built from the bottom up (like tiny bubbles merging).
The balanced iterative reducing and clustering using hierarchies (BIRCH) algorithm was designed specifically for very large datasets,
Affinity propagation and Spectral clustering only for small datasets
from sklearn.mixture import GaussianMixture
gm = GaussianMixture(n_components=3, n_init=10)
gm.fit(X)
uses expectation-maximization (EM) algorithm
Gaussian Mixtures for Anomaly Detection
densities = gm.score_samples(X)
density_threshold = np.percentile(densities, 2)
anomalies = X[densities < density_threshold]
find the model that minimizes a theoretical information criterion, such as the Bayesian information criterion (BIC) or the Akaike information criterion (AIC)
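With the fitted GaussianMixture above, both criteria are available directly:

gm.bic(X)   # Bayesian information criterion: log(m)*p - 2*log(L_hat)
gm.aic(X)   # Akaike information criterion:   2*p - 2*log(L_hat)
# m = number of instances, p = number of learned parameters, L_hat = maximized likelihood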
word “probability” is used to describe how plausible a future outcome x is (knowing the parameter values θ), while the word “likelihood” is used to describe how plausible a particular set of parameter values θ are, after the outcome x is known.
The likelihood function L(θ | x) measures how plausible a particular set of parameter values θ is, given the observed outcome x. Maximum likelihood estimation (MLE) picks the θ that maximizes the likelihood (in practice the log-likelihood, which is easier to work with). It is a fundamental tool in statistical inference for finding the best-fitting model parameters from observed data.
Fast-MCD (minimum covariance determinant) Implemented by the EllipticEnvelope class, this algorithm is useful for outlier detection, in particular to clean up a dataset. It assumes that the normal instances (inliers) are generated from a single Gaussian distribution (not a mixture).
Warren McCulloch and the mathematician Walter Pitts, in their landmark paper “A Logical Calculus of the Ideas Immanent in Nervous Activity”.
σ(z) = 1 / (1 + exp(–z)), also called the sigmoid function.
The hyperbolic tangent function: tanh(z) = 2σ(2z) – 1
- output value ranges from –1 to 1 (instead of 0 to 1 in the case of the sigmoid function).
ReLU(z) = max(0, z)
softplus activation function, which is a smooth variant of ReLU: softplus(z) = log(1 + exp(z)). Softplus is close to 0 when z is negative, and close to z when z is positive.
mlp_reg = MLPRegressor(hidden_layer_sizes=[50, 50, 50], random_state=42)
#Scikit-Learn has an MLPClassifier class in the sklearn.neural_network package. It is almost identical to the MLPRegressor
Keras
model = tf.keras.Sequential()
model.add(tf.keras.layers.Input(shape=[28, 28]))
model.add(tf.keras.layers.Flatten())  # a batch of shape [32, 28, 28] is reshaped to [32, 784]; i.e., for input X it computes X.reshape(-1, 784)
model.add(tf.keras.layers.Dense(250, activation="relu"))
model.add(tf.keras.layers.Dense(200, activation="relu"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))
model.summary()
hidden1 = model.layers[1]
hidden1.get_weights()  # weights, biases
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
# for binary classification use a sigmoid output; for finer control use optimizer=tf.keras.optimizers.SGD(learning_rate=...); for one-hot targets use loss="categorical_crossentropy"
history = model.fit(X_train, y_train, epochs=25, validation_data=(X_valid, y_valid))
# you can also use validation_split=0.1; if some classes are underrepresented, set class_weight
History object: history.history contains the loss and extra metrics measured at the end of each epoch.
model.predict(X_new)
returns one probability per class; use y_proba.argmax(axis=-1) to get the predicted class
Sequential API
tf.random.set_seed(42)
norm_layer = tf.keras.layers.Normalization(input_shape=X_train.shape[1:])
model = tf.keras.Sequential([
    norm_layer,
    tf.keras.layers.Dense(60, activation="relu"),
    tf.keras.layers.Dense(60, activation="relu"),
    tf.keras.layers.Dense(60, activation="relu"),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
norm_layer.adapt(X_train)
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
mse_test, rmse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3]
y_pred = model.predict(X_new)
Wide & Deep neural network: learn both deep patterns (using the deep path) and simple rules (through the short path)
Example (Functional API):
from tensorflow import keras
from tensorflow.keras import layers

# Define input layer
inputs = keras.Input(shape=(784,))
# Deep path
dense1 = layers.Dense(64, activation="relu")(inputs)
dense2 = layers.Dense(64, activation="relu")(dense1)
deep_output = layers.Dense(10)(dense2) # Output for deep patterns
# Short path (simple rules)
short_output = layers.Dense(10)(inputs) # Output for simple rules
# Combine both paths
combined_output = layers.Add()([deep_output, short_output])
# Create the model
model = keras.Model(inputs=inputs, outputs=combined_output, name="custom_model")
or
concat_layer = tf.keras.layers.Concatenate()
followed by concat = concat_layer([normalized, hidden2])
Additional example
input_wide = tf.keras.layers.Input(shape=[5]) # features 0 to 4
input_deep = tf.keras.layers.Input(shape=[6]) # features 2 to 7
norm_layer_wide = tf.keras.layers.Normalization()
norm_layer_deep = tf.keras.layers.Normalization()
norm_wide = norm_layer_wide(input_wide)
norm_deep = norm_layer_deep(input_deep)
hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
hidden2 = tf.keras.layers.Dense(30, activation="relu")(hidden1)
concat = tf.keras.layers.concatenate([norm_wide, hidden2])
output = tf.keras.layers.Dense(1)(concat)
model = tf.keras.Model(inputs=[input_wide, input_deep], outputs=[output])
Now imagine you need two outputs, e.g. classification and regression (or a main output plus an auxiliary output):
[...] # Same as above, up to the main output layer
output = tf.keras.layers.Dense(1)(concat)
aux_output = tf.keras.layers.Dense(1)(hidden2)
model = tf.keras.Model(inputs=[input_wide, input_deep],
outputs=[output, aux_output])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss=("mse", "mse"), loss_weights=(0.9, 0.1), optimizer=optimizer,
metrics=["RootMeanSquaredError"]) #or loss {"output": "mse", "aux_output": "mse"}
Subclassing API:
class WideAndDeepModel(tf.keras.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
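        # continuation sketch (hedged): same wide & deep wiring as the functional example above
        super().__init__(**kwargs)
        self.norm_layer_wide = tf.keras.layers.Normalization()
        self.norm_layer_deep = tf.keras.layers.Normalization()
        self.hidden1 = tf.keras.layers.Dense(units, activation=activation)
        self.hidden2 = tf.keras.layers.Dense(units, activation=activation)
        self.main_output = tf.keras.layers.Dense(1)
        self.aux_output = tf.keras.layers.Dense(1)

    def call(self, inputs):
        input_wide, input_deep = inputs
        norm_wide = self.norm_layer_wide(input_wide)
        norm_deep = self.norm_layer_deep(input_deep)
        hidden1 = self.hidden1(norm_deep)
        hidden2 = self.hidden2(hidden1)
        concat = tf.keras.layers.concatenate([norm_wide, hidden2])
        return self.main_output(concat), self.aux_output(hidden2)

model = WideAndDeepModel(30, activation="relu", name="my_cool_model")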
A subclassed model cannot be cloned using tf.keras.models.clone_model(), and when you call its summary() method you only get a list of layers, without any information on how they are connected to each other.
model.save("my_keras_model", save_format="tf") # save_format="h5"
Callbacks
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_checkpoints", save_weights_only=True)
history = model.fit([...], callbacks=[checkpoint_cb])
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
pip install -q -U tensorboard-plugin-profile
Activation function choice: ReLU remains a good general default, and many libraries provide ReLU-specific optimizations. However, Swish is probably a better default for more complex tasks, and you can even try parametrized Swish with a learnable β parameter for the most complex tasks. Mish may give you slightly better results, but it requires a bit more compute. If you care a lot about runtime latency, then you may prefer leaky ReLU, or parametrized leaky ReLU for more complex tasks.
Batch normalization
four parameter vectors are learned in each batch-normalized layer: γ (the output scale vector) and β (the output offset vector) are learned through regular backpropagation, and μ (the final input mean vector) and σ (the final input standard deviation vector) are estimated using an exponential moving average.
Batch normalization also acts as a regularizer, reducing the need for other regularization techniques (such as dropout).
If we define W′ = γ⊗W/σ and b′ = γ⊗(b – μ)/σ + β, the layer's equation simplifies to XW′ + b′. So if we replace the previous layer's weights and biases (W and b) with the updated W′ and b′, the BN layer can be folded into the previous layer for faster inference.
each BN layer adds four parameters per input: γ, β, μ, and σ
The last two parameters, μ and σ, are the moving averages; they are not affected by backpropagation, so Keras calls them “non-trainable
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    ...
The momentum hyperparameter of BatchNormalization is typically close to 1; for example, 0.9, 0.99, or 0.999 (more 9s for larger datasets and smaller mini-batches).
hyperparameter is axis: it determines which axis should be normalized. It defaults to –1, meaning that by default it will normalize the last axis
Another technique is to clip the gradients during backpropagation so they never exceed some threshold: gradient clipping.
Clipping by value may change the direction of the gradient vector; if you want to ensure clipping does not change the direction, clip by norm by setting clipnorm instead of clipvalue.
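For example (threshold value illustrative):

optimizer = tf.keras.optimizers.SGD(clipvalue=1.0)   # clip each gradient component to [-1.0, 1.0]
# or, to preserve the gradient's direction:
optimizer = tf.keras.optimizers.SGD(clipnorm=1.0)
model.compile(loss="mse", optimizer=optimizer)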
Transfer learning: Try freezing all the reused layers first (i.e., make their weights non-trainable so that gradient descent won’t modify
try dropping the top hidden layer(s) and freezing all the remaining hidden layers again. You can iterate until you find the right number of layers to reuse.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False
After unfreezing the reused layers, it is usually a good idea to reduce the learning rate, once again to avoid damaging the reused weights.
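A sketch of that step (continues the model_B_on_A example above; the loss assumes a binary task):

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True                                   # unfreeze the reused layers

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-4)      # much lower learning rate
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
                     metrics=["accuracy"])                   # must recompile after changing trainable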
unsupervised model, such as an autoencoder or a generative adversarial network (GAN)
Old technique: unsupervised pretraining—typically with restricted Boltzmann machines (RBMs; see the notebook at https://homl.info/extra-anns)—was the norm for deep nets
greedy layer-wise pretraining in DNNs
Momentum: regular gradient descent (θ ← θ – η∇θJ(θ)) is ignorant of previous gradients, so momentum optimization keeps a momentum vector m: m ← βm – η∇θJ(θ), then θ ← θ + m.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
Nesterov momentum optimization, measures the gradient of the cost function not at the local position θ but slightly ahead in the direction of the momentum, at θ + βm
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
AdaGrad
It often stops too early, so you should not use it to train deep neural networks.
The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations, as opposed to all the gradients since the beginning of training. It does so by using exponential decay in the first step.
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
Adam which stands for adaptive moment estimation, combines the ideas of momentum optimization and RMSProp: just like momentum optimization, it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.
The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often initialized to 0.999. As earlier, the smoothing term ε is usually initialized to a tiny number such as 10–7
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
Nadam optimization is Adam optimization plus the Nesterov trick, so it will often converge slightly faster than Adam.
AdamW is a variant of Adam that integrates a regularization technique called weight decay.
Implementing power scheduling in Keras is the easiest option—just set the decay hyperparameter when creating an optimizer:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)
def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train, y_train, [...], callbacks=[lr_scheduler])
ReduceLROnPlateau callback. For example, if you pass the following callback to the fit() method, it will multiply the learning rate by 0.5 whenever the best validation loss does not improve for five consecutive epochs
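A sketch of that callback (factor and patience from the description; the fit() arguments are illustrative):

lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
history = model.fit(X_train, y_train, epochs=50, callbacks=[lr_scheduler],
                    validation_data=(X_valid, y_valid))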
both ℓ1 and ℓ2 regularization, use tf.keras.regularizers.l1_l2()
Use Python’s functools.partial() function, which lets you create a thin wrapper for any callable, with some default argument values:
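For example, a sketch of a regularized Dense factory built with partial() (the ℓ2 factor is illustrative):

from functools import partial

RegularizedDense = partial(tf.keras.layers.Dense,
                           activation="relu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=tf.keras.regularizers.l2(0.01))
dense1 = RegularizedDense(100)   # same as Dense(100, activation="relu", ...) with the defaults above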
Dropout: there are 2^N possible subnetworks (where N is the total number of droppable neurons).
More generally, we need to divide the connection weights by the keep probability (1 – p) during training.
If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax")
])
Monte Carlo (MC) dropout can boost the performance of any trained dropout model without having to retrain it or even modify it at all.
It’s also useful to know exactly which other classes are most likely.
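A minimal MC dropout sketch (uses the trained dropout model above; 100 forward passes is illustrative):

import numpy as np

y_probas = np.stack([model(X_test, training=True)   # training=True keeps dropout active at inference
                     for _ in range(100)])
y_proba = y_probas.mean(axis=0)                      # average the 100 stochastic predictions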
Max-norm regularization: compute ∥w∥₂ after each training step and rescale w if needed (w ← w·r/∥w∥₂).
dense = tf.keras.layers.Dense( 100, activation="relu", kernel_initializer="he_normal", kernel_constraint=tf.keras.constraints.max_norm(1.))
The table of default DNN configurations at the end of the Training Deep Neural Networks chapter is a useful reference.
If you need a sparse model, you can use ℓ1 regularization (and optionally zero out the tiny weights after training).
ML papers are released along with their implementations, and sometimes even with pretrained models. Check out https://paperswithcode.com/ to easily find them.
@ operator was added in Python 3.5, for matrix multiplication: it is equivalent to calling the tf.matmul()
Decentralize data access: data mesh
AI strategy depends on data strategy
Operations (running the business) vs analytical (optimizing the business)
Gen AI: generates data for the operational side of the business, which analytics then analyzes (the reverse of before, where analytics was used to drive the future of the business).
In the 90s the warehouse was the engine: a collection of files; DP used to manage the data and onboard requests.
Data lives in files: S3, Parquet.
Data mesh: individual pieces of data as a model, e.g. a list of customers churning. Data plus context.
Over time (~2010): Hadoop adoption, cheap storage, unstructured data. Move away from modelling: schema-less on write, schema only on read/at runtime.
Starburst abstraction: Presto, a scalable SQL engine; correlations; hides complexity.
Classic DS: Deterministic, logical, accurate. Cluster/Logistic/Regression. Objective decision.
GenAI: Probabilistic, creative, may not be factual, prone to hallucinations. Subjective decision.
Data Discovery > Model Development > Model Deployment
Data management: query response time matters.
Prelude / Prior Work
free courses
Hands-On Machine Learning (Aurélien Géron) notes
notebooks are here
Examples
Detecting credit card fraud: this is anomaly detection, which can be tackled using isolation forests, Gaussian mixture models, or autoencoders.
A common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes.
Performance
Cross val
Confusion matrix (confused one with another)
Down (rows) is actual (negative then positive); right (columns) is predicted (negative then positive).
Recall R = TP / (TP + FN) (bottom-right cell over the bottom row).
Precision P = TP / (TP + FP) (bottom-right cell over the right column).
from sklearn.metrics import precision_score, recall_score # precision_score(y_train_5, y_train_pred)
F1 = 2·P·R / (P + R)
from sklearn.metrics import f1_score
shoplift detect - high recall
video is safe for kids - high precision
PR curve
from sklearn.metrics import precision_recall_curve # precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
predict more than threshold
sgd_clf.decision_function([some_digit]) > threshold
to return a bool class.
The ROC curve plots the true positive rate (another name for recall) against the false positive rate (FPR), or fallout.
ROC curve plots sensitivity (recall) versus 1 – specificity.
A good classifier stays as far toward the top-left corner as possible.
Classifiers
from sklearn.ensemble import RandomForestClassifier # forest_clf = RandomForestClassifier(random_state=42) # y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
The above lacks a decision_function() method compared to SGDClassifier. SGD and SVC are binary classifiers (multiclass is handled via one-versus-the-rest (OvR), also called one-versus-all (OvA), or one-versus-one (OvO) with N × (N – 1) / 2 classifiers). LogisticRegression, RandomForestClassifier, and GaussianNB handle multiple classes natively.
from sklearn.multiclass import OneVsRestClassifier # ovr_clf = OneVsRestClassifier(SVC(random_state=42))
One way to improve accuracy is to scale the inputs:
X_train_scaled = StandardScaler().fit_transform(X_train.astype("float64"))
confusion matrix
to make the errors stand out more, you can try putting zero weight on the correct predictions.
# sample_weight = (y_train_pred != y_train)
could engineer new features that would help the classifier
Data augmentation: add more rotated/skewed images.
Multilabel Classification
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, np.c_[y_train_large, y_train_odd])
models can be organized in a chain: when a model makes a prediction, it uses the input features plus all the predictions of the models that come before it in the chain.
from sklearn.multioutput import ClassifierChain # chain_clf = ClassifierChain(SVC(), cv=3, random_state=42)
Multioutput classification: e.g. image denoising — the noisy image is the input and the clean image is the target; can use KNeighborsClassifier().
math
The normalized vector v̂ ("v hat") is the unit vector that points in the same direction as v.
dot product # np.dot(u, v)
The projection of vector v onto u's axis has coordinate (u·v)/∥u∥.
Matrix multiplication:
Python 3.5 has the @ infix operator for matrix multiplication, and NumPy 1.10 added support for it. A @ D is equivalent to np.matmul(A, D):
derivative cheatsheet
Grad descent
Without feature scaling, the cost function's contours are elongated. Optimal learning rate: neither too small (many iterations) nor too large (bouncing between valleys).
Partial derivatives of the cost function form the gradient vector: ∇θ MSE(θ) = (2/m) Xᵀ(Xθ − y).
Gradient descent step: θ(next step) = θ − η ∇θ MSE(θ).
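A batch gradient descent sketch for linear regression (assumes X_b is X with a bias column of 1s and a single feature, so θ has 2 parameters):

import numpy as np

eta = 0.1                              # learning rate
n_epochs = 1000
m = len(X_b)
theta = np.random.randn(2, 1)          # random initialization
for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # gradient of MSE(theta)
    theta = theta - eta * gradients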
simulated annealing - gradually reduce learning rate
example epoch in lin regression
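A stochastic gradient descent sketch with a simple learning schedule (X_b, y, m as in the batch GD sketch above; one epoch = m iterations):

n_epochs = 50
t0, t1 = 5, 50                          # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)           # random initialization
for epoch in range(n_epochs):
    for iteration in range(m):
        random_index = np.random.randint(m)          # pick one instance at random
        xi = X_b[random_index : random_index + 1]
        yi = y[random_index : random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)
        eta = learning_schedule(epoch * m + iteration)   # gradually reduce the learning rate
        theta = theta - eta * gradients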
train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=np.linspace(0.01, 1.0, 40), cv=5,
    scoring="neg_root_mean_squared_error")
train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)
plt.plot(train_sizes, train_errors, "r-+", linewidth=2, label="train")
plt.plot(train_sizes, valid_errors, "b-", linewidth=3, label="valid")
...
plt.show()
from copy import deepcopy
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

X_train, y_train, X_valid, y_valid = [...]  # split the quadratic dataset

preprocessing = make_pipeline(PolynomialFeatures(degree=90, include_bias=False), StandardScaler())
X_train_prep = preprocessing.fit_transform(X_train)
X_valid_prep = preprocessing.transform(X_valid)
sgd_reg = SGDRegressor(penalty=None, eta0=0.002, random_state=42)
n_epochs = 500
best_valid_rmse = float('inf')

for epoch in range(n_epochs):
    sgd_reg.partial_fit(X_train_prep, y_train)
    y_valid_predict = sgd_reg.predict(X_valid_prep)
    val_error = mean_squared_error(y_valid, y_valid_predict, squared=False)
    if val_error < best_valid_rmse:
        best_valid_rmse = val_error
        best_model = deepcopy(sgd_reg)