SVMs are large margin classifiers; best used after feature scaling.
Soft margin classification: allow some instances in the middle of the street (margin violations).
Reducing C makes the street larger (more margin violations, more regularization).
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1, random_state=42))
svm_clf.fit(X, y)
svm_clf.predict(X_new)
svm_clf.decision_function(X_new)
For nonlinear data (works for small datasets), add polynomial features:
make_pipeline(PolynomialFeatures(degree=3), StandardScaler(), LinearSVC(C=10, max_iter=10_000, random_state=42))
The other option, for larger datasets, is the kernel trick: SVC(kernel="poly", degree=3, coef0=1, C=5). coef0 controls how much the model is influenced by high-degree terms versus low-degree terms.
Or similarity features (RBF kernel; good for small or medium-sized nonlinear training sets): SVC(kernel="rbf", gamma=5, C=0.001)
A small gamma value makes the bell-shaped curve wider. If the model is underfitting, increase γ.
String kernels for text, DNA
ϵ (called tol in Scikit-Learn) sets the stopping tolerance for high-precision needs. In SVM regression, a separate ϵ (the epsilon hyperparameter) controls the width of the street (the margin).
Use the LinearSVR class to perform linear SVM regression (SVR with a kernel for nonlinear regression).
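A minimal SVM-regression sketch under those assumptions (X, y are assumed to exist; the hyperparameter values are illustrative):

from sklearn.svm import LinearSVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# epsilon controls the width of the street; a wider street tolerates more instances inside it
svm_reg = make_pipeline(StandardScaler(), LinearSVR(epsilon=0.5, random_state=42))
svm_reg.fit(X, y)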
TODO: Under the Hood of Linear SVM Classifiers and understand Dual Problem.
Using a QP solver is one way to train an SVM. Another is to use gradient descent to minimize the hinge loss or the squared hinge loss
Mercer’s theorem: if a kernel K(a, b) satisfies certain conditions, there exists a mapping φ to another space such that K(a, b) = φ(a)ᵀφ(b).
Decision trees: white-box models; prediction is fast, roughly O(log₂ m).
Likely to overfit, so control max_features, max_depth, min_samples_split, ... Increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model.
from sklearn.tree import export_graphviz  # to export the tree to a .dot file
To display it in a notebook:
from graphviz import Source
Source.from_file("iris_tree.dot")
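A hedged export sketch (assumes a tree_clf trained on the usual iris petal features):

export_graphviz(
    tree_clf,                       # a trained DecisionTreeClassifier (assumption)
    out_file="iris_tree.dot",
    feature_names=["petal length (cm)", "petal width (cm)"],
    class_names=iris.target_names,  # assumes the iris dataset object is available
    rounded=True,
    filled=True,
)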
gini=0 for pure node in decision tree
An alternative to Gini impurity is entropy (criterion="entropy").
The CART algorithm produces binary trees; ID3 can produce nodes with more than two children.
tree_clf.predict_proba ## for classification
CART cost function (classification): J(k, t_k) = (m_left/m)·G_left + (m_right/m)·G_right, where G measures the impurity of each subset and m the number of instances.
DecisionTreeClassifier for classification; DecisionTreeRegressor for regression.
χ2 test (chi-squared test), are used to estimate the probability that the improvement is purely the result of chance
CART cost function for regression: J(k, t_k) = (m_left/m)·MSE_left + (m_right/m)·MSE_right (minimize MSE instead of impurity).
Decision trees are sensitive to training set rotation; one mitigation is applying PCA first.
Decision trees have high variance; training is also stochastic unless you set the random_state hyperparameter.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

estimator = []
estimator.append(('LR', LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=200)))
estimator.append(('SVC', SVC(gamma='auto', probability=True)))
estimator.append(('DTC', DecisionTreeClassifier()))

# Voting Classifier with hard voting
vot_hard = VotingClassifier(estimators=estimator, voting='hard')
vot_hard.fit(X_train, y_train)
for name, clf in vot_hard.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))
Soft voting often achieves higher performance than hard voting because it gives more weight to highly confident votes:
voting_clf.voting = "soft"
voting_clf.named_estimators["svc"].probability = True
Same algorithm on different random subsets: when sampling is performed with replacement, the method is called bagging (short for bootstrap aggregating); when sampling is performed without replacement, it is called pasting. Bagging generally yields better models.
max_features="sqrt" in bagging classfier for limiting featues
RandomForestClassifier(n_estimators=500, max_leaf_nodes=16)
Extra-Trees are equivalent to using splitter="random" when creating a DecisionTreeClassifier (use the ExtraTreesClassifier class).
rnd_clf.feature_importances_
AdaBoost: the algorithm increases the relative weight of the training instances that the previous predictor misclassified.
Predictor weight: α_j = η · log((1 − r_j) / r_j), where r_j is the predictor's weighted error rate.
Then, for every misclassified instance, the old weight is multiplied by exp(α_j), and all instance weights are normalized.
There is a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function).
AdaBoostClassifier( DecisionTreeClassifier(max_depth=1), n_estimators=30, learning_rate=0.5, random_state=42)
Scikit-Learn’s GradientBoostingRegressor
if you set the n_iter_no_change hyperparameter to an integer value, say 10, then the GradientBoostingRegressor will automatically stop adding more trees during training if it sees that the last 10 trees didn’t help.
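For instance (illustrative hyperparameter values; X, y assumed):

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, learning_rate=0.05, n_estimators=500,
                                 n_iter_no_change=10, random_state=42)  # stops adding trees if the last 10 don't help
gbrt.fit(X, y)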
Histogram-based gradient boosting (HGB): works by binning the input features, replacing them with integers. The number of bins is controlled by the max_bins hyperparameter.
Example:
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import HistGradientBoostingRegressor

hgb_reg = make_pipeline(
    make_column_transformer((OrdinalEncoder(), ["ocean_proximity"]), remainder="passthrough"),
    HistGradientBoostingRegressor(categorical_features=[0], random_state=42)
)
hgb_reg.fit(housing, housing_labels)
Stacking: replace the voting aggregation with a model that learns the final prediction (a blender, or meta-learner).
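A minimal stacking sketch with scikit-learn's StackingClassifier (the base estimators and X_train, y_train are assumptions):

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stacking_clf = StackingClassifier(
    estimators=[("lr", LogisticRegression()), ("svc", SVC(probability=True))],
    final_estimator=RandomForestClassifier(random_state=43),  # the blender / meta-learner
    cv=5)
stacking_clf.fit(X_train, y_train)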
Dimensionality reduction, e.g. for MNIST: merge neighboring pixels, remove white borders.
Manifold hypothesis: most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.
X = [...]  # create a small 3D dataset
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered)
c1 = Vt[0]  # first principal component
c2 = Vt[1]  # second principal component
# project the training set onto the plane defined by the first two principal components:
W2 = Vt[:2].T
X2D = X_centered @ W2
Or with Scikit-Learn:
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
pca.explained_variance_ratio_
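To choose the number of dimensions, you can pass a variance ratio to n_components instead (a sketch; X_train assumed):

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)          # keep enough components to preserve 95% of the variance
X_reduced = pca.fit_transform(X_train)
print(pca.n_components_)              # number of components actually kept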
from sklearn.datasets import make_blobs
X, y = make_blobs([...])
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)
kmeans.cluster_centers_
Soft clustering: the score can be the distance between the instance and each centroid: kmeans.transform(X_new).round(2)
model’s inertia, which is the sum of the squared distances between the instances and their closest centroids.
score() -> negative inertia
K-means can be accelerated by avoiding many unnecessary distance calculations (Elkan's algorithm).
Also MiniBatchKMeans for large datasets.
Plotting inertia vs. k produces an elbow.
Silhouette score = mean silhouette coefficient over all instances; each coefficient is (b − a) / max(a, b), where a is the mean intra-cluster distance and b is the mean distance to the nearest other cluster.
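A quick sketch with scikit-learn (uses X and the fitted kmeans from above):

from sklearn.metrics import silhouette_score

silhouette_score(X, kmeans.labels_)  # mean silhouette coefficient; closer to +1 is better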
k-means does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes
Image segmentation: color, semantic, or instance segmentation; prep the image with X = image.reshape(-1, 3)
Label propagation is a semi-supervised machine learning algorithm that assigns labels to previously unlabeled data points. How it works:
Graph creation: the algorithm starts by building a graph that connects all examples (rows) in the dataset based on their distance (e.g., Euclidean distance). Nodes represent the training data; edges represent similarities between them.
Soft labels: nodes carry soft labels (label distributions) based on the labels of nearby examples; these soft labels are propagated along the edges to connected nodes.
Iterative process: this is repeated for a fixed number of iterations, gradually strengthening the labels assigned to unlabeled examples.
Scikit-Learn also offers two classes that can propagate labels automatically: LabelSpreading and LabelPropagation in the sklearn.semi_supervised package.
DBSCAN has a fit_predict() method but no predict() method.
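A minimal DBSCAN sketch (X assumed; eps and min_samples are illustrative):

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.05, min_samples=5)
dbscan.fit(X)        # no predict(); use fit_predict() or read dbscan.labels_
dbscan.labels_       # label -1 marks instances considered anomalies (in no cluster)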
Agglomerative clustering: a hierarchy of clusters is built from the bottom up (like tiny bubbles merging).
The balanced iterative reducing and clustering using hierarchies (BIRCH) algorithm was designed specifically for very large datasets,
Affinity propagation and Spectral clustering only for small datasets
from sklearn.mixture import GaussianMixture
gm = GaussianMixture(n_components=3, n_init=10)
gm.fit(X)
uses expectation-maximization (EM) algorithm
Gaussian Mixtures for Anomaly Detection
densities = gm.score_samples(X)
density_threshold = np.percentile(densities, 2)
anomalies = X[densities < density_threshold]
find the model that minimizes a theoretical information criterion, such as the Bayesian information criterion (BIC) or the Akaike information criterion (AIC)
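With the fitted GaussianMixture above, both criteria are available directly:

gm.bic(X)   # Bayesian information criterion: log(m)*p - 2*log(L_hat)
gm.aic(X)   # Akaike information criterion:   2*p - 2*log(L_hat)
# m = number of instances, p = number of learned parameters, L_hat = maximized likelihood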
word “probability” is used to describe how plausible a future outcome x is (knowing the parameter values θ), while the word “likelihood” is used to describe how plausible a particular set of parameter values θ are, after the outcome x is known.
The likelihood function L(θ | x) measures how plausible a particular set of parameter values θ is, given the observed outcome x. Maximum likelihood estimation (MLE) picks the θ that maximizes the likelihood (in practice the log-likelihood, which is easier to work with). It is a fundamental tool in statistical inference for finding the best-fitting model parameters from observed data.
Fast-MCD (minimum covariance determinant) Implemented by the EllipticEnvelope class, this algorithm is useful for outlier detection, in particular to clean up a dataset. It assumes that the normal instances (inliers) are generated from a single Gaussian distribution (not a mixture).
Warren McCulloch and the mathematician Walter Pitts, in their landmark paper “A Logical Calculus of the Ideas Immanent in Nervous Activity”.
σ(z) = 1 / (1 + exp(–z)), also called the sigmoid function.
The hyperbolic tangent function: tanh(z) = 2σ(2z) – 1
- output value ranges from –1 to 1 (instead of 0 to 1 in the case of the sigmoid function).
ReLU(z) = max(0, z)
softplus activation function, which is a smooth variant of ReLU: softplus(z) = log(1 + exp(z)). Softplus is close to 0 when z is negative, and close to z when z is positive.
mlp_reg = MLPRegressor(hidden_layer_sizes=[50, 50, 50], random_state=42)
#Scikit-Learn has an MLPClassifier class in the sklearn.neural_network package. It is almost identical to the MLPRegressor
Keras
model = tf.keras.Sequential()
model.add(tf.keras.layers.Input(shape=[28, 28]))
model.add(tf.keras.layers.Flatten())  # a batch of shape [32, 28, 28] is reshaped to [32, 784]; i.e., for input X it computes X.reshape(-1, 784)
model.add(tf.keras.layers.Dense(250, activation="relu"))
model.add(tf.keras.layers.Dense(200, activation="relu"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))
model.summary()
hidden1 = model.layers[1]
hidden1.get_weights()  # weights, biases
model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
# for binary classification use a sigmoid output; for finer control use optimizer=tf.keras.optimizers.SGD(learning_rate=...); for one-hot targets use loss="categorical_crossentropy"
history = model.fit(X_train, y_train, epochs=25, validation_data=(X_valid, y_valid))
# you can also use validation_split=0.1; if some classes are underrepresented, set class_weight
History object: history.history contains the loss and extra metrics measured at the end of each epoch.
model.predict(X_new)
returns one probability per class; use y_proba.argmax(axis=-1) to get the predicted class
Sequential API
tf.random.set_seed(42)
norm_layer = tf.keras.layers.Normalization(input_shape=X_train.shape[1:])
model = tf.keras.Sequential([
    norm_layer,
    tf.keras.layers.Dense(60, activation="relu"),
    tf.keras.layers.Dense(60, activation="relu"),
    tf.keras.layers.Dense(60, activation="relu"),
    tf.keras.layers.Dense(1)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
model.compile(loss="mse", optimizer=optimizer, metrics=["RootMeanSquaredError"])
norm_layer.adapt(X_train)
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid))
mse_test, rmse_test = model.evaluate(X_test, y_test)
X_new = X_test[:3]
y_pred = model.predict(X_new)
Wide & Deep neural network: learn both deep patterns (using the deep path) and simple rules (through the short path)
Example (Functional API):
from tensorflow import keras
from tensorflow.keras import layers

# Define input layer
inputs = keras.Input(shape=(784,))
# Deep path
dense1 = layers.Dense(64, activation="relu")(inputs)
dense2 = layers.Dense(64, activation="relu")(dense1)
deep_output = layers.Dense(10)(dense2) # Output for deep patterns
# Short path (simple rules)
short_output = layers.Dense(10)(inputs) # Output for simple rules
# Combine both paths
combined_output = layers.Add()([deep_output, short_output])
# Create the model
model = keras.Model(inputs=inputs, outputs=combined_output, name="custom_model")
or
concat_layer = tf.keras.layers.Concatenate()
followed by concat = concat_layer([normalized, hidden2])
Additional example
input_wide = tf.keras.layers.Input(shape=[5]) # features 0 to 4
input_deep = tf.keras.layers.Input(shape=[6]) # features 2 to 7
norm_layer_wide = tf.keras.layers.Normalization()
norm_layer_deep = tf.keras.layers.Normalization()
norm_wide = norm_layer_wide(input_wide)
norm_deep = norm_layer_deep(input_deep)
hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
hidden2 = tf.keras.layers.Dense(30, activation="relu")(hidden1)
concat = tf.keras.layers.concatenate([norm_wide, hidden2])
output = tf.keras.layers.Dense(1)(concat)
model = tf.keras.Model(inputs=[input_wide, input_deep], outputs=[output])
Now imagine you need two outputs, e.g. classification and regression (or a main output plus an auxiliary output):
[...] # Same as above, up to the main output layer
output = tf.keras.layers.Dense(1)(concat)
aux_output = tf.keras.layers.Dense(1)(hidden2)
model = tf.keras.Model(inputs=[input_wide, input_deep],
outputs=[output, aux_output])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
model.compile(loss=("mse", "mse"), loss_weights=(0.9, 0.1), optimizer=optimizer,
metrics=["RootMeanSquaredError"]) #or loss {"output": "mse", "aux_output": "mse"}
Subclassing API:
class WideAndDeepModel(tf.keras.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
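        # continuation sketch (hedged): same wide & deep wiring as the functional example above
        super().__init__(**kwargs)
        self.norm_layer_wide = tf.keras.layers.Normalization()
        self.norm_layer_deep = tf.keras.layers.Normalization()
        self.hidden1 = tf.keras.layers.Dense(units, activation=activation)
        self.hidden2 = tf.keras.layers.Dense(units, activation=activation)
        self.main_output = tf.keras.layers.Dense(1)
        self.aux_output = tf.keras.layers.Dense(1)

    def call(self, inputs):
        input_wide, input_deep = inputs
        norm_wide = self.norm_layer_wide(input_wide)
        norm_deep = self.norm_layer_deep(input_deep)
        hidden1 = self.hidden1(norm_deep)
        hidden2 = self.hidden2(hidden1)
        concat = tf.keras.layers.concatenate([norm_wide, hidden2])
        return self.main_output(concat), self.aux_output(hidden2)

model = WideAndDeepModel(30, activation="relu", name="my_cool_model")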
A subclassed model cannot be cloned using tf.keras.models.clone_model(), and when you call its summary() method you only get a list of layers, without any information on how they are connected to each other.
model.save("my_keras_model", save_format="tf") # save_format="h5"
Callbacks
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_checkpoints", save_weights_only=True)
history = model.fit([...], callbacks=[checkpoint_cb])
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
pip install -q -U tensorboard-plugin-profile
Activation function choice: ReLU remains a good general default, and many libraries provide ReLU-specific optimizations. However, Swish is probably a better default for more complex tasks, and you can even try parametrized Swish with a learnable β parameter for the most complex tasks. Mish may give you slightly better results, but it requires a bit more compute. If you care a lot about runtime latency, then you may prefer leaky ReLU, or parametrized leaky ReLU for more complex tasks.
Batch normalization
four parameter vectors are learned in each batch-normalized layer: γ (the output scale vector) and β (the output offset vector) are learned through regular backpropagation, and μ (the final input mean vector) and σ (the final input standard deviation vector) are estimated using an exponential moving average.
Batch normalization also acts as a regularizer, reducing the need for other regularization techniques (such as dropout).
If we define W′ = γ⊗W/σ and b′ = γ⊗(b – μ)/σ + β, the layer's equation simplifies to XW′ + b′. So if we replace the previous layer's weights and biases (W and b) with the updated W′ and b′, the BN layer can be folded into the previous layer for faster inference.
each BN layer adds four parameters per input: γ, β, μ, and σ
The last two parameters, μ and σ, are the moving averages; they are not affected by backpropagation, so Keras calls them “non-trainable
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    ...
The momentum hyperparameter of BatchNormalization is typically close to 1; for example, 0.9, 0.99, or 0.999 (more 9s for larger datasets and smaller mini-batches).
hyperparameter is axis: it determines which axis should be normalized. It defaults to –1, meaning that by default it will normalize the last axis
Another technique is to clip the gradients during backpropagation so they never exceed some threshold: gradient clipping.
Clipping by value may change the direction of the gradient vector; if you want to ensure clipping does not change the direction, clip by norm by setting clipnorm instead of clipvalue.
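For example (threshold value illustrative):

optimizer = tf.keras.optimizers.SGD(clipvalue=1.0)   # clip each gradient component to [-1.0, 1.0]
# or, to preserve the gradient's direction:
optimizer = tf.keras.optimizers.SGD(clipnorm=1.0)
model.compile(loss="mse", optimizer=optimizer)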
Transfer learning: Try freezing all the reused layers first (i.e., make their weights non-trainable so that gradient descent won’t modify
try dropping the top hidden layer(s) and freezing all the remaining hidden layers again. You can iterate until you find the right number of layers to reuse.
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False
After unfreezing the reused layers, it is usually a good idea to reduce the learning rate, once again to avoid damaging the reused weights.
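A sketch of that step (continues the model_B_on_A example above; the loss assumes a binary task):

for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True                                   # unfreeze the reused layers

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-4)      # much lower learning rate
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
                     metrics=["accuracy"])                   # must recompile after changing trainable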
unsupervised model, such as an autoencoder or a generative adversarial network (GAN)
Old technique: unsupervised pretraining—typically with restricted Boltzmann machines (RBMs; see the notebook at https://homl.info/extra-anns)—was the norm for deep nets
greedy layer-wise pretraining in DNNs
Momentum: regular gradient descent (θ ← θ – η∇θJ(θ)) is ignorant of previous gradients, so momentum optimization keeps a momentum vector m: m ← βm – η∇θJ(θ), then θ ← θ + m.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
Nesterov momentum optimization, measures the gradient of the cost function not at the local position θ but slightly ahead in the direction of the momentum, at θ + βm
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
AdaGrad
It often stops too early, so you should not use it to train deep neural networks.
The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations, as opposed to all the gradients since the beginning of training. It does so by using exponential decay in the first step.
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
Adam which stands for adaptive moment estimation, combines the ideas of momentum optimization and RMSProp: just like momentum optimization, it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients.
The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often initialized to 0.999. As earlier, the smoothing term ε is usually initialized to a tiny number such as 10–7
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
Nadam optimization is Adam optimization plus the Nesterov trick, so it will often converge slightly faster than Adam.
AdamW is a variant of Adam that integrates a regularization technique called weight decay.
Implementing power scheduling in Keras is the easiest option—just set the decay hyperparameter when creating an optimizer:
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)
def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X_train, y_train, [...], callbacks=[lr_scheduler])
ReduceLROnPlateau callback. For example, if you pass the following callback to the fit() method, it will multiply the learning rate by 0.5 whenever the best validation loss does not improve for five consecutive epochs
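A sketch of that callback (factor and patience from the description; the fit() arguments are illustrative):

lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)
history = model.fit(X_train, y_train, epochs=50, callbacks=[lr_scheduler],
                    validation_data=(X_valid, y_valid))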
both ℓ1 and ℓ2 regularization, use tf.keras.regularizers.l1_l2()
Use Python’s functools.partial() function, which lets you create a thin wrapper for any callable, with some default argument values:
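For example, a sketch of a regularized Dense factory built with partial() (the ℓ2 factor is illustrative):

from functools import partial

RegularizedDense = partial(tf.keras.layers.Dense,
                           activation="relu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=tf.keras.regularizers.l2(0.01))
dense1 = RegularizedDense(100)   # same as Dense(100, activation="relu", ...) with the defaults above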
Dropout: there are 2^N possible subnetworks (where N is the total number of droppable neurons).
More generally, we need to divide the connection weights by the keep probability (1 – p) during training.
If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax")
])
Monte Carlo (MC) dropout can boost the performance of any trained dropout model without having to retrain it or even modify it at all.
It’s also useful to know exactly which other classes are most likely.
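A minimal MC dropout sketch (uses the trained dropout model above; 100 forward passes is illustrative):

import numpy as np

y_probas = np.stack([model(X_test, training=True)   # training=True keeps dropout active at inference
                     for _ in range(100)])
y_proba = y_probas.mean(axis=0)                      # average the 100 stochastic predictions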
Max-norm regularization: compute ∥w∥₂ after each training step and rescale w if needed (w ← w·r/∥w∥₂).
dense = tf.keras.layers.Dense( 100, activation="relu", kernel_initializer="he_normal", kernel_constraint=tf.keras.constraints.max_norm(1.))
The table of default DNN configurations at the end of the Training Deep Neural Networks chapter is a useful reference.
If you need a sparse model, you can use ℓ1 regularization (and optionally zero out the tiny weights after training).
ML papers are released along with their implementations, and sometimes even with pretrained models. Check out https://paperswithcode.com/ to easily find them.
@ operator was added in Python 3.5, for matrix multiplication: it is equivalent to calling the tf.matmul()
Decentralize data access: data mesh
AI strategy depends on data strategy
Operations (running the business) vs analytical (optimizing the business)
Gen AI: generates data for the operational side of the business, which analytics then analyzes (the reverse of before, where analytics was used to drive the future of the business).
In the 90s the warehouse was the engine: a collection of files; DP used to manage the data and onboard requests.
Data lives in files: S3, Parquet.
Data mesh: individual pieces of data as a model, e.g. a list of customers churning. Data plus context.
Over time (~2010): Hadoop adoption, cheap storage, unstructured data. Move away from modelling: schema-less on write, schema only on read/at runtime.
Starburst abstraction: Presto, a scalable SQL engine; correlations; hides complexity.
Classic DS: Deterministic, logical, accurate. Cluster/Logistic/Regression. Objective decision.
GenAI: Probabilistic, creative, may not be factual, prone to hallucinations. Subjective decision.
Data Discovery > Model Development > Model Deployment
Data management: query response time matters.
Prelude / Prior Work
free courses
Hands-On Machine Learning (Aurélien Géron) notes
notebooks are here
Examples
Detecting credit card fraud: this is anomaly detection, which can be tackled using isolation forests, Gaussian mixture models, or autoencoders.
A common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes.
Performance
Cross val
Confusion matrix (confused one with another)
Down (rows) is actual (negative then positive); right (columns) is predicted (negative then positive).
Recall R = TP / (TP + FN) (bottom-right cell over the bottom row).
Precision P = TP / (TP + FP) (bottom-right cell over the right column).
from sklearn.metrics import precision_score, recall_score # precision_score(y_train_5, y_train_pred)
F1 = 2·P·R / (P + R)
from sklearn.metrics import f1_score
shoplift detect - high recall
video is safe for kids - high precision
PR curve
from sklearn.metrics import precision_recall_curve # precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
predict more than threshold
sgd_clf.decision_function([some_digit]) > threshold
to return a bool class.
The ROC curve plots the true positive rate (another name for recall) against the false positive rate (FPR), or fallout.
ROC curve plots sensitivity (recall) versus 1 – specificity.
A good classifier stays as far toward the top-left corner as possible.
Classifiers
from sklearn.ensemble import RandomForestClassifier # forest_clf = RandomForestClassifier(random_state=42) # y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
The above lacks a decision_function() method compared to SGDClassifier. SGD and SVC are binary classifiers (multiclass is handled via one-versus-the-rest (OvR), also called one-versus-all (OvA), or one-versus-one (OvO) with N × (N – 1) / 2 classifiers). LogisticRegression, RandomForestClassifier, and GaussianNB handle multiple classes natively.
from sklearn.multiclass import OneVsRestClassifier # ovr_clf = OneVsRestClassifier(SVC(random_state=42))
One way to improve accuracy is to scale the inputs:
X_train_scaled = StandardScaler().fit_transform(X_train.astype("float64"))
confusion matrix
to make the errors stand out more, you can try putting zero weight on the correct predictions.
# sample_weight = (y_train_pred != y_train)
could engineer new features that would help the classifier
Data augmentation: add more rotated/skewed images.
Multilabel Classification
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, np.c_[y_train_large, y_train_odd])
models can be organized in a chain: when a model makes a prediction, it uses the input features plus all the predictions of the models that come before it in the chain.
from sklearn.multioutput import ClassifierChain # chain_clf = ClassifierChain(SVC(), cv=3, random_state=42)
Multioutput classification: e.g. image denoising — the noisy image is the input and the clean image is the target; can use KNeighborsClassifier().
math
The normalized vector v̂ ("v hat") is the unit vector that points in the same direction as v.
dot product # np.dot(u, v)
The projection of vector v onto u's axis has coordinate (u·v)/∥u∥.
Matrix multiplication:
Python 3.5 has the @ infix operator for matrix multiplication, and NumPy 1.10 added support for it. A @ D is equivalent to np.matmul(A, D):
derivative cheatsheet
Grad descent
Without feature scaling, the cost function's contours are elongated. Optimal learning rate: neither too small (many iterations) nor too large (bouncing between valleys).
Partial derivatives of the cost function form the gradient vector: ∇θ MSE(θ) = (2/m) Xᵀ(Xθ − y).
Gradient descent step: θ(next step) = θ − η ∇θ MSE(θ).
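A batch gradient descent sketch for linear regression (assumes X_b is X with a bias column of 1s and a single feature, so θ has 2 parameters):

import numpy as np

eta = 0.1                              # learning rate
n_epochs = 1000
m = len(X_b)
theta = np.random.randn(2, 1)          # random initialization
for epoch in range(n_epochs):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # gradient of MSE(theta)
    theta = theta - eta * gradients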
simulated annealing - gradually reduce learning rate
example epoch in lin regression
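A stochastic gradient descent sketch with a simple learning schedule (X_b, y, m as in the batch GD sketch above; one epoch = m iterations):

n_epochs = 50
t0, t1 = 5, 50                          # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)           # random initialization
for epoch in range(n_epochs):
    for iteration in range(m):
        random_index = np.random.randint(m)          # pick one instance at random
        xi = X_b[random_index : random_index + 1]
        yi = y[random_index : random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)
        eta = learning_schedule(epoch * m + iteration)   # gradually reduce the learning rate
        theta = theta - eta * gradients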
train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=np.linspace(0.01, 1.0, 40), cv=5,
    scoring="neg_root_mean_squared_error")
train_errors = -train_scores.mean(axis=1)
valid_errors = -valid_scores.mean(axis=1)
plt.plot(train_sizes, train_errors, "r-+", linewidth=2, label="train")
plt.plot(train_sizes, valid_errors, "b-", linewidth=3, label="valid")
...
plt.show()
from copy import deepcopy
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

X_train, y_train, X_valid, y_valid = [...]  # split the quadratic dataset

preprocessing = make_pipeline(PolynomialFeatures(degree=90, include_bias=False), StandardScaler())
X_train_prep = preprocessing.fit_transform(X_train)
X_valid_prep = preprocessing.transform(X_valid)
sgd_reg = SGDRegressor(penalty=None, eta0=0.002, random_state=42)
n_epochs = 500
best_valid_rmse = float('inf')

for epoch in range(n_epochs):
    sgd_reg.partial_fit(X_train_prep, y_train)
    y_valid_predict = sgd_reg.predict(X_valid_prep)
    val_error = mean_squared_error(y_valid, y_valid_predict, squared=False)
    if val_error < best_valid_rmse:
        best_valid_rmse = val_error
        best_model = deepcopy(sgd_reg)