solomonxie commented 6 years ago

By having a wild thought on start a career on Artificial intelligence, started today to do a tiny research on how to study AI. It turns out it consist of a shit black hole of knowledges, which means I have to study all of those things, not mention even some basic terms are causing a headache to me, such like Machine Learning, Deep Learning, Data Mining, Big Data, Classification etc etc. Thus I tend to write down some notes and random thoughts here helping me to organise it better. Saying, a tiny pen is better than a great brain.

Prerequisites

Math

[x] Linear Algebra (Finished Basics) PCA, SVD, Eigen Decomposition, LU Decomposition, QR Decomposition, Norms, Projection, Symmetric Matrix, Orthogonal Matrix, Matrix Operations, Vector Space,
[ ] Differential Calculus
[ ] Integral Calculus
[ ] Multivariable Calculus 微分和积分、偏微分、向量值函数、方向梯度、海森、雅可比、拉普拉斯、拉格朗日分布
[ ] Statistics & Probability 组合、概率规则和公理、贝叶斯定理、随机变量、方差和期望、条件和联合分布、标准分布（伯努利、二项式、多项式、均匀和高斯）、矩母函数（Moment Generating Functions）、最大似然估计（MLE）、先验和后验、最大后验估计（MAP）和抽样方法。

Computer Science

[ ] Data Structures
[ ] Algorithm & Optimization 这对理解我们的机器学习算法的计算效率和可扩展性以及利用我们的数据集中稀疏性很重要。需要的知识有数据结构（二叉树、散列、堆、栈等）、动态规划、随机和子线性算法、图论、梯度/随机下降和原始对偶方法。

Programming (Python)

[x] Jupyter Notebook
[ ] matplotlib
[ ] Numpy
[ ] scikit-learn
[ ] pandas

solomonxie commented 6 years ago

Differ all the basic terms for starters

Artificial intelligence, is a superior area of all those below. But an AI could be human programmed to say something or recognise something, like coding line by line.
Machine Learning, could be so much better than human coded apps to do something. It lets machine itself to learn things and recognise things. So AI could whether has ML or not.
Deep Learning, a deeper science based logic for Machine Learning. Seems have something to do with neural something.
Data Mining, machine learning is based of a shit load of data. Like a human learning something needs to encounter many things until he gets it. So Data mining is organising wild informations and provides to Machine learning.
Big Data, means a load of informations, and the skill of Big data means how to handle with the data and how we can get some decisions from it.

solomonxie commented 6 years ago

Seems Kaggle and Ali Tianchi is a very good start for a beginner like me, and also it will be a good profile tag for a career starter who's without a solid CS background.

solomonxie commented 6 years ago

「Data mining」 or 「web crawling」 is not necessary for Machine Learning.

For instance web crawler is only getting EXTERNAL data. But lots of company themselves have already had big data set, like clients' information, patiences' informatin. Only thing they want you to do is how to analyse the data.

solomonxie commented 6 years ago

TL;DR. Archive Link: Before starting Machine Learning

solomonxie commented 6 years ago

Examples of some categories of ML

Unsupervised Learning examples

Only received data without any label or instruction, and you ask it to give you the answer.

Customer. You give a whole book of customer contact list to it, and ask it to tell you what is each customer's salary?
Language. So you're in a totally strange country and know nothing to their language, and there's no dictionary or any book as reference to your own language, then how do you understand them? And that is what Unsupervised Learning is gonna do to solve.

Reinforcement Learning examples

It will receive only SOME info, not by you but by itself, with some practice.

A toddler touch hot cup. Reinforcement learning is similar to Child Learning. A kid has curiosity to a hot cup and once he touches it he feels pain. So next time he remembers if he sees some white smoke come from the cup, he doesn't touch it.
Playing game. You let the computer to play a game without telling it what's the move, and it starts to try every possible move, practise and practise, it remembers every 'fault move' it made and avoid it happens next time.

Supervised Learning examples

You feed it info with labels, right or wrong, then give a new data let it decide it's right or wrong.

Spam and Ham emails. You give ton's of emails with classification label on each of it, let it learn , and let it predict a new email is a spam or ham.

solomonxie commented 6 years ago

ML 机器学习Python基础库安装

强烈建议在Virtualenv虚拟环境下安装所有的包，这样既不会和什么Anaconda冲突，又不会发生装不上、装错了、误删除等乌七八糟的事情。

建议进入Python3的虚拟环境：

$ source ~/VIRTUALENV-PATH/venv3/bin/activate

必备库安装

Numpy安装:

$ pip install numpy

# 或使用国内源
$ pip install numpy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Pandas安装:

$ pip install pandas

# 或使用国内源
$ pip install pandas -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Scikit-Learn安装：

$ pip install scikit-learn

# 或使用国内源
$ pip install scikit-learn -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

matplotlib安装:

$ pip install matplotlib

# 或使用国内源
$ pip install matplotlib -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

scipy和seaborn安装：

$ pip install seaborn scipy

# 或使用国内源
$ pip install seaborn scipy  -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

solomonxie commented 5 years ago

Hardwares for Deep Learning

Commands

Check OS: cat /etc/os-release
Check Storage: df -h
Check CPU: cat /proc/cpuinfo
Check RAM: cat /proc/meminfo

Check GPU:

$ sudo apt-get install nvidia-smi
$ nvidia-smi -stats

Google Colab (Jupyter Notebook)

Link: https://colab.research.google.com/notebooks/welcome.ipynb

OS: Ubuntu 17.10 Artful Aardvark
GPU: (12hrs) NVIDIA Tesla K80 GPU × 1
CPU: Intel(R) Xeon(R) CPU @ 2.30GHz × 1 Core
RAM: 13GB
Disk: 40GB

Kaggle Kernel (Jupyter Notebook)

Refer to Kaggle: How to use Kaggle - Kenels Link: https://www.kaggle.com/kernels

OS: Debian GNU/Linux 8 (jessie)
GPU: (Queue needed) NVidia Tesla K80 GPU × 1
CPU: Intel(R) Xeon(R) CPU @ 2.30GHz × 2 Cores
RAM: 25.5GB
Disk: 5.2GB

Azure Notebooks (Jupyter Notebook)

Link: https://notebooks.azure.com/Microsoft/libraries/samples

OS: Ubuntu 16.04.5 LTS
GPU: N/A
CPU: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz × 2 Cores
RAM: 4GB
Disk: 20GB

AWS p2

Refer to AWS: Amazon EC2 P2 Instances

GPU: NVIDIA Tesla K80 GPU × (1 to 16)
CPU: Intel Xeon® E5-2686 v4 × (4 to 64)
RAM: 61G or 488G or 732G

P2 instances provide up to 16 NVIDIA K80 GPUs, 64 vCPUs and 732 GiB of host memory, with a combined 192 GB of GPU memory, 40 thousand parallel processing cores, 70 teraflops of single precision floating point performance, and over 23 teraflops of double precision floating point performance. P2 instances also offer GPUDirect™ (peer-to-peer GPU communication) capabilities for up to 16 GPUs, so that multiple GPUs can work together within a single host.

DIY

GPU

NVIDIA Tesla K80 GPU: RMB 30,000, VRAM 24GB (GDDR5), 256 Bit

CPU

Intel Xeon® E5-2686 v4: RMB 12,000, 8 Cores, 1.7GHz

solomonxie commented 5 years ago

Build Simple Model for ML (Decision Tree)

Steps of building a model

Refer to Kaggle: Your First Machine Learning Model

The steps to building and using a model are:

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
Fit: Capture patterns from provided data. This is the heart of modeling.
Predict: Just what it sounds like
Evaluate: Determine how accurate the model's predictions are.

Underfitting and Overfitting

Refer to Kaggle: Underfitting and Overfitting

Overfitting: The model matches the training data almost perfectly, but does poorly in predicting new data.
Underfitting: The model fails to capture important patterns & distinctions in the data, so it performs poorly even in training data.

As the Decision Tree Model's depth goes deeper, the Underfitting goes lesser, but there is "turn" at which it starts Overfitting and the error goes larger.

For finding out the "turning point", we need to test out some depths , namely the max_leaf_nodes. At which the error start to turn descending to ascending, we will choose that depth as the best depth for training data in Decision Tree Model.

solomonxie commented 5 years ago

Random Forest Model

A "Forest" has a lot of "Trees".

The Random Forest model randomly creates a lot of Decision Trees, and average all predictions from each Decision Tree to get a better result.

solomonxie commented 5 years ago

Handle Missing Data

Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values.

Refer to Kaggle: Handling Missing Values

Solution 1: Drop Columns with Missing Values

In many cases, you'll have both a training dataset and a test dataset. You will want to drop the same columns in both DataFrames.

So, it's somewhat usually not the best solution. However, it can be useful when most values in a column are missing.

Solution 2: Imputation

Imputation fills in the missing value with some number. Imputation is the standard approach, and it usually works well.

solomonxie commented 5 years ago

XGBoost

Refer to Kaggle: XGBoost

solomonxie commented 5 years ago

Partial Dependence Plots

Extract insights from your models. Insights many didn't even realize were possible.

Refer to Kaggle: Partial Dependence Plots

Partial dependence plots are a great way (though not the only way) to extract insights from complex models. These can be incredibly powerful for communicating those insights to colleagues or non-technical users.

solomonxie commented 5 years ago

Cross-Validation

Refer to Kaggle: Cross-Validation

On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where model quality scores would be least reliable with train-test split. So, if your dataset is smaller, you should run cross-validation.

solomonxie commented 5 years ago

Notes on Coursera ML Andrew Ng

In general, any machine learning problem can be assigned to one of two broad classifications: Supervised learning and Unsupervised learning.

Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Supervised learning: "Right answers" given.

It can be a Regression problem, which predict continuous valued output,
also can be a Classification problem, which predict discrete valued output.

Unsupervised learning:

is to find out some structure in a dataset, and find out clusters is a big part of the work. With unsupervised learning there is NO feedback based on the prediction results.

There's also Non-clustering problem for unsupervised learning

like the "cocktail party algorithm".

Octave

Octave is much more faster to implement a prototype than other languages. We can first use Octave to test our ideas, models, and transfer it into other languages when it's success.

Linear regression model

Cost function

Contour Plot

Gradient Descent intuition

Gradient descent algorithm

The gradient descent algorithm is: repeat until convergence: where j=0,1 represents the feature index number.

simultaneously update the parameters θ1, θ2...

"Batch" Gradient descent

"Batch": Each step of gradient descent computes ALL the training data.

solomonxie commented 5 years ago

Data Scientist

Data Science from Scratch First Principles with Python by Joel Grus

Data science lies at the intersection of:

Hacking skills
Math and statistics knowledge
Substantive expertise

solomonxie commented 5 years ago

One-hot Encoding

Refer to Quora: What is one-hot encoding and when is it used in data science? Refer to youtube: A demo of One Hot Encoding (TensorFlow Tip of the Week)

"Encoding" is to take a number to represent a categorical value.

Lable encoding:

But the problem is, those values would rather be nominal values instead of ordinal values, because we can't say one is greater or smaller than another. And there it comes One-hot encoding, which we only take numbers 1 & 0 to represent the categorical value, but in a TABLE:

Note that, the the result of encoding will be as a VECTOR.

In this case, the value for sample-1 is encoded to a vector [1,0,0,0], and value of sample-5 is encoded to vector [0,0,0,1].

solomonxie commented 5 years ago

R语言入门

solomonxie commented 5 years ago

Book: Intro to Statistical Learning (ISL)

Model selection: 「Prediction」 vs. 「Inference」

If our goal is to understand the data, namely the inference, then it's better to choose the model with more interpretability.
if our goal is to predict, then it's good to go with models that have more flexibility but less understandable, sometime it may refer to "non-linear" models.

Model selection: 「Flexibility」 vs. 「Interpretability」

snip20181018_5

「Clustering」

Cluster Analysis is NOT to compare input and output(x & y), but to compare two variables (x1 & x2), because in the case there's no clear associate for x & y, we have left no choice but to analyze variables themselves.

Model selection: 「Quantitative」 vs. 「Qualitative」

Numerical: Linear Regression
Categorical: Logistic Regression, boosting, KNN

Model selection: 「Train MAE」 vs. 「Test MAE」

In general, we do not really care how well the method works training on the Training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously Unseen Test data.

We’d like to select the model for which the average of this quantity—the test MSE—is as small as possible.

There is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.

Regardless of whether overfitting has occurred, we almost always want the Training MSE to be SMALLER than the Test MSE, because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.

「Bias-Variance」 Trade-off

The relationship between bias, Variance and Test MSE

Evaluate Classification: 「Training Error Rate」 vs. 「Test error」

Training Error Rate:

Test Error:

「Bayes Classifier」

Bayes decision boundary: The Bayes classifier’s prediction is determined by the Bayes decision boundary; an observation that falls on the orange side of the boundary will be assigned to the orange class, and similarly an observation on the blue side of the boundary will be assigned to the blue class.

Bayes Error Rate: The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate.

Note that: In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods.

「K-Nearest Neighbors」 (KNN Classifier)

Many approaches attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with highest estimated probability. One such method is the K-nearest neighbors (KNN) classifier.

KNN applies Bayes rule and classifies the test observation x0 to the class with the largest probability.

In the left-hand panel, we have plotted a small training data set consisting of six blue and six orange observations. Our goal is to make a prediction for the point labeled by the black cross. Suppose that we choose K = 3. Then KNN will first identify the three observations that are closest to the cross. This neighborhood is shown as a circle. It consists of two blue points and one orange point, resulting in estimated probabilities of 2/3 for the blue class and 1/3 for the orange class. Hence KNN will predict that the black cross belongs to the blue class. In the right-hand panel of Figure 2.14 we have applied the KNN approach with K = 3 at all of the possible values for X1 and X2, and have drawn in the corresponding KNN decision boundary.

solomonxie commented 5 years ago

Book: ISL chapter 3 Linear Regression

「Simple Linear Regression」 (SLR)

which is to examine how ONE variable effect the result.

Linear Regression: Assume there is linear relationship between x and y. so that's to be: in which the intercept & slope are unknown constants.

For the two constants or coefficients, we need to estimate them from dataset, as close as it can be.

Least squares is the easiest way to describe the closeness

Residual sum of squares (RSS):

「Multiple Linear Regression」(MLR)

Examine how multiple variables make a compact effect on the result.

solomonxie commented 5 years ago

Regret

Refer to wiki: Loss function - Regret

Leonard J. Savage argued that using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret.

`0-1 loss function`

solomonxie commented 5 years ago

`Covariance` & `Correlation`

Refer to wiki: Covariance Refer to wiki: Correlation and dependence Refer to wiki: Covariance and correlation

solomonxie commented 5 years ago

Bias

Simple Linear Regression:

y = β0 + β1 * x

β0 is called the intercept because it determines where the line intercepts the y axis. In machine learning we can call this the bias, because it is added to offset all predictions that we make.

solomonxie commented 5 years ago

Stochastic Gradient Descent

Refer to: Linear Regression Tutorial Using Gradient Descent for Machine Learning

Gradient Descent is the process of minimizing a function by following the gradients of the cost function.

In Machine learning we can use a similar technique called stochastic gradient descent to minimize the error of a model on our training data.

Iteration: The way this works is that each training instance is shown to the model one at a time. The model makes a prediction for a training instance, the error is calculated and the model is updated in order to reduce the error for the next prediction.

This procedure can be used to find the set of coefficients in a model that result in the smallest error for the model on the training data. Each iteration the coefficients, called weights (w) in machine learning language are updated using the equation:

w = w – alpha * delta

Where w is the coefficient or weight being optimized, alpha is a learning rate that you must configure (e.g. 0.1) and gradient is the error for the model on the training data attributed to the weight.

solomonxie commented 5 years ago

Variables

"The statistical perspective frames data in the context of a hypothetical function (f) that the machine learning algorithm is trying to learn. That is, given some input variables (input), what is the predicted output variable (output)."

solomonxie commented 5 years ago

Hardware for Kaggle competition

I was shocked when I saw the response to a question that the minimum PC specs for running kaggle competition:

AWS EC2 p2 instance pricing

DigitalOcean droplet pricing

Buy a PC

solomonxie commented 5 years ago

Statistical Learning (OneNote) [DRAFT]

Basics

Learning Goal

WE FIRST ASSUME THERE EXISTS AN ALMIGHTY FUNCTION Y=f(x) FOR EVERY (x, y)

Model Selection

Interpretability
- Interpretable (Easy to understand)
- Predictable (More precise)
Labeled
- Supervised (With right answer, labeled) -> For prediction problem
  - Quantitative -> Regression Models
  - Qualitative -> Classification Models
- Unsupervised (Without right answer) -> For clustering analysis
  - Neural Network
Numeric
- Regression (Quantitative/Numerical Variables)
  - Linear Regression
    - Simple Linear Regression (One variable)
    - Multiple Linear Regression (Multiple variables)
- Classification (Qualitative/Categorical Variables)
  - Classifier
    - Logistic Regression
  - Bayes Classifier
  - KNN Classifier
Accuracy (Bias-Variance Trade-off)
- Regression settings
  - MSE
  - AVE
- Classification settings
  - Classifier
  - Bayes Classifier
  - KNN Classifier

Statistics

Plots

Distribution

Inferential Statistics

Probability

Hypothesis Test [DRAFT]

Random Variable

Bayesian Theorem [DRAFT]

Linear Regression

Mo: Linear Regression

Linear Least Squares

Linear least squares is ONE WAY to estimate a Linear function by finding the minimum value of squared-residuals.

The logic is:

Since we assume it's a linear function, so the only UNKNOWNS are coefficients β1 & β0
Our mission is to guess the two values which produce the minimum value of summed squared-residuals

Main Formulations of Linear least squares

Ordinary Least Squares (OLS), unweighted ○ Simple linear regression (SLR) ○ Multiple linear regression (MLR)
Weighted Least Squares (WLS)
Generalized Least Squares (GLS) Alternative Formulations
Iteratively reweighted least squares (IRLS)
Instrumental variables regression (IVR)
Total least squares (TLS)

Numerical methods for linear least squares

Inverting the matrix of the normal equations
Orthogonal decomposition methods

Regularized Linear Regression Modified version of Ordinary linear regression, not only minimize cost function, but also reduce complexity of the model.

Lasso Regression (L1 Regularization): minimize Absolute Sum of Coefficients
Ridge Regression (L2 Regularization): minimize Squared Absolute Sum of Coefficients

Gradient Descent

Gradient Descent Methods

Batch Gradient Descent (BGD) -> for small dataset ○ Parameters β1/β0 start from 0 ○ Iterate from 0 and update parameters every time ○ Stop until cost function cost(β1, β0) ≈ 0
Stochastic Gradient Descent (SGD) -> for large dataset ○ Parameters β1/β0 start from a random number ○ Random walk

Standard Procedures of Gradient Descent

Setup initial value of parameters β1 and β0
Form a cost function with β1, β0: cost(β1, β0)
Calculate derivative (slope) of cost function: delta = cost(..)'
Update parameters β1, β0 with improvement: β? = β? - (alpha * delta)
Next iteration with NEW parameters until cost(β1, β0) ≈ 0

Batch Gradient Descent (BGD) Calculating the derivative from all training data before calculating an update.

Asdf
asdfas

Stochastic Gradient Descent (SGD) Calculating the derivative from each training data instance and calculating the update immediately.

Asdfa
asfdsafd

Model Accuracy

Bias-Variance Trade-Off The prediction error for any machine learning algorithm can be broken down into three parts:

Bias Error: error caused by choosing models on interpretability ○ High-Bias models: Linear Regression, Logistic Regression … ○ Low-Bias models: Decision Trees, KNN, SVM
Variance Error: error caused by choosing models on flexibility
Irreducible Error (ε): cannot be reduced regardless of what algorithm is used. ○ High-Variance models: Decision trees ○ Low-Variance models: Linear Regression

Hypothesis Test for ML

Some examples of statistical hypothesis tests and their distributions from which critical values can be calculated are as follows: • Z-Test: Gaussian distribution (Normal Distribution). • Student t-Test: Student’s t-distribution. • Chi-Squared Test: Chi-Squared (𝜲²) distribution. • ANOVA: F-distribution (Fisher–Snedecor distribution).

Features Selection [DRAFT]

Features selection approaches

Stepwise Regression ○ Forward selection ○ Backward selection ○ Bidirectional elmination
LASSO

❶ Stepwise Regression Main approaches:

Forward Selection
Backward Selection
Bidirectional Elimination

❷ LASSO (least absolute shrinkage & selection operator)

Classification

Classification Basics

Encodings of Categories

Code with order/rank
Binary code: yes or no
One-hot Encoding

Why NOT Linear regression? Because the probability must fall between 0 and 1, but Linear regression is not sensible and may lead the result below 0 or above 1. To avoid that, we MUST model p(X) using a function gives output between 0 and 1. Many functions meet this description, logistic function in Logistic Regression is one of them.

Logistic Regression

is to predict the probability of a categorical dependent variable, which is a binary variable

Linear Discriminant Analysis

LDA (linear discriminant analysis) is an alternative to Logistic regression for the following reasons:

More reliable on handling more than 2 response classes
More stable if dataset size n is small
More stable if the classes are well-separated

solomonxie / blog-in-the-issues

Machine Learning diary 机器学习日记 #38

Prerequisites

Math

Computer Science

Programming (Python)

Differ all the basic terms for starters

「Data mining」 or 「web crawling」 is not necessary for Machine Learning.

TL;DR. Archive Link: Before starting Machine Learning

Examples of some categories of ML

Unsupervised Learning examples

Reinforcement Learning examples

Supervised Learning examples

ML 机器学习Python基础库安装

必备库安装

Hardwares for Deep Learning

Commands

Google Colab (Jupyter Notebook)

Kaggle Kernel (Jupyter Notebook)

Azure Notebooks (Jupyter Notebook)

AWS p2

DIY

GPU

CPU

Build Simple Model for ML (Decision Tree)

Steps of building a model

Underfitting and Overfitting

Random Forest Model

Handle Missing Data

Solution 1: Drop Columns with Missing Values

Solution 2: Imputation

XGBoost

Partial Dependence Plots

Cross-Validation

Notes on Coursera ML Andrew Ng

Supervised learning: "Right answers" given.

Unsupervised learning:

There's also Non-clustering problem for unsupervised learning

Octave

Linear regression model

Cost function

Contour Plot

Gradient Descent intuition

Gradient descent algorithm

"Batch" Gradient descent

Data Scientist

One-hot Encoding

R语言入门

Book: Intro to Statistical Learning (ISL)

Model selection: 「Prediction」 vs. 「Inference」

Model selection: 「Flexibility」 vs. 「Interpretability」

「Clustering」

Model selection: 「Quantitative」 vs. 「Qualitative」

Model selection: 「Train MAE」 vs. 「Test MAE」

「Bias-Variance」 Trade-off

Evaluate Classification: 「Training Error Rate」 vs. 「Test error」

「Bayes Classifier」

「K-Nearest Neighbors」 (KNN Classifier)

Book: ISL chapter 3 Linear Regression

「Simple Linear Regression」 (SLR)

「Multiple Linear Regression」(MLR)

Regret

0-1 loss function

Covariance & Correlation

Bias

Stochastic Gradient Descent

Variables

Hardware for Kaggle competition

AWS EC2 p2 instance pricing

DigitalOcean droplet pricing

Buy a PC

Statistical Learning (OneNote) [DRAFT]

Basics

Learning Goal

Model Selection

Statistics

Plots

Distribution

Inferential Statistics

Probability

`0-1 loss function`

`Covariance` & `Correlation`