solomonxie / blog-in-the-issues

A personalised tech-blog, notebook, diary, presentation and introduction.
https://solomonxie.github.io

Machine Learning diary 机器学习日记 #38

Open solomonxie opened 6 years ago

solomonxie commented 6 years ago

Having had a wild thought about starting a career in Artificial Intelligence, I began today with a tiny bit of research on how to study AI. It turns out to be one giant black hole of knowledge, which means I have to study all of those things, not to mention that even some basic terms give me a headache, such as Machine Learning, Deep Learning, Data Mining, Big Data, Classification, etc. So I'm going to write down some notes and random thoughts here to help me organise it better. As the saying goes, a tiny pen is better than a great brain.

Prerequisites

Math

Computer Science

Programming (Python)

image

solomonxie commented 6 years ago

Differentiate all the basic terms for starters

solomonxie commented 6 years ago

It seems Kaggle and Ali Tianchi are a very good start for a beginner like me, and they will also make a good profile tag for a career starter without a solid CS background.

solomonxie commented 6 years ago

「Data mining」 or 「web crawling」 is not necessary for Machine Learning.

For instance, a web crawler only collects EXTERNAL data, but many companies already have big data sets of their own, like clients' information or patients' information. The only thing they want you to do is analyse that data.

solomonxie commented 6 years ago

TL;DR. Archive Link: Before starting Machine Learning

solomonxie commented 6 years ago

Examples of some categories of ML

Unsupervised Learning examples

It receives only data, without any label or instruction, and you ask it to give you the answer.

Reinforcement Learning examples

It receives only SOME information, gathered not from you but by itself, through practice.

Supervised Learning examples

You feed it data with labels (right or wrong), then give it new data and let it decide whether that is right or wrong.

solomonxie commented 6 years ago

ML: Installing the Basic Python Libraries for Machine Learning

It is strongly recommended to install all the packages inside a Virtualenv virtual environment, so they won't conflict with things like Anaconda, and you avoid the whole mess of failed installs, wrong installs, accidental deletions, and so on.

Activate a Python 3 virtual environment first:

$ source ~/VIRTUALENV-PATH/venv3/bin/activate

Installing the essential libraries

Install NumPy:

$ pip install numpy

# Or use a mirror inside China
$ pip install numpy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Install Pandas:

$ pip install pandas

# Or use a mirror inside China
$ pip install pandas -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Install Scikit-Learn:

$ pip install scikit-learn

# Or use a mirror inside China
$ pip install scikit-learn -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Install matplotlib:

$ pip install matplotlib

# Or use a mirror inside China
$ pip install matplotlib -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Install SciPy & Seaborn:

$ pip install seaborn scipy

# Or use a mirror inside China
$ pip install seaborn scipy  -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
solomonxie commented 6 years ago

Hardware for Deep Learning

Commands

Google Colab (Jupyter Notebook)

Link: https://colab.research.google.com/notebooks/welcome.ipynb

Kaggle Kernel (Jupyter Notebook)

Refer to Kaggle: How to use Kaggle - Kernels. Link: https://www.kaggle.com/kernels

Azure Notebooks (Jupyter Notebook)

Link: https://notebooks.azure.com/Microsoft/libraries/samples

AWS p2

Refer to AWS: Amazon EC2 P2 Instances

P2 instances provide up to 16 NVIDIA K80 GPUs, 64 vCPUs and 732 GiB of host memory, with a combined 192 GB of GPU memory, 40 thousand parallel processing cores, 70 teraflops of single precision floating point performance, and over 23 teraflops of double precision floating point performance. P2 instances also offer GPUDirect™ (peer-to-peer GPU communication) capabilities for up to 16 GPUs, so that multiple GPUs can work together within a single host.

image

DIY

GPU

CPU

solomonxie commented 6 years ago

Build Simple Model for ML (Decision Tree)

Steps of building a model

Refer to Kaggle: Your First Machine Learning Model

The steps to building and using a model are:
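The list itself isn't reproduced here; roughly, a model is defined, fitted, used to predict, and evaluated. Below is a minimal sketch of those steps with scikit-learn's DecisionTreeRegressor; the file name and column names ("Price", "Rooms", etc.) are hypothetical placeholders, not from the Kaggle tutorial.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

home_data = pd.read_csv("train.csv")               # hypothetical file
y = home_data["Price"]                             # hypothetical target column
X = home_data[["Rooms", "Bathroom", "Landsize"]]   # hypothetical features

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

model = DecisionTreeRegressor(random_state=1)      # 1. Define
model.fit(train_X, train_y)                        # 2. Fit
predictions = model.predict(val_X)                 # 3. Predict
print(mean_absolute_error(val_y, predictions))     # 4. Evaluate
```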

Underfitting and Overfitting

Refer to Kaggle: Underfitting and Overfitting

As the Decision Tree model's depth grows, underfitting decreases, but there is a "turn" at which it starts overfitting and the error grows larger.

To find that "turning point", we need to test several depths, i.e. several values of max_leaf_nodes. The depth at which the error turns from descending to ascending is the one we choose as the best depth for the training data in the Decision Tree model.

image
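A rough way to locate that turning point is to try a few values of max_leaf_nodes and keep the one with the smallest validation MAE. A sketch, reusing train_X / val_X / train_y / val_y from the previous snippet:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree limited to `max_leaf_nodes` and score it on validation data
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

scores = {n: get_mae(n, train_X, val_X, train_y, val_y)
          for n in [5, 50, 500, 5000]}
best_size = min(scores, key=scores.get)   # smallest validation MAE wins
```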

solomonxie commented 6 years ago

Random Forest Model

A "Forest" has a lot of "Trees".

The Random Forest model randomly builds a lot of Decision Trees and averages the predictions from all of them to get a better result.
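A minimal sketch with scikit-learn, reusing the train/validation split defined earlier:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest = RandomForestRegressor(random_state=1)
forest.fit(train_X, train_y)
print(mean_absolute_error(val_y, forest.predict(val_X)))
```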

solomonxie commented 6 years ago

Handle Missing Data

Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values.

Refer to Kaggle: Handling Missing Values

Solution 1: Drop Columns with Missing Values

In many cases you'll have both a training dataset and a test dataset, and you will want to drop the same columns in both DataFrames.

So this is usually not the best solution. However, it can be useful when most values in a column are missing.

Solution 2: Imputation

Imputation fills in the missing value with some number. Imputation is the standard approach, and it usually works well.
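A minimal imputation sketch with scikit-learn's SimpleImputer (available in reasonably recent versions); X_train and X_test are hypothetical numeric DataFrames with missing values:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
# Fit on the training data, then apply the same statistics to the test data
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train),
                               columns=X_train.columns)
X_test_imputed = pd.DataFrame(imputer.transform(X_test),
                              columns=X_test.columns)
```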

solomonxie commented 6 years ago

XGBoost

Refer to Kaggle: XGBoost

image
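For reference, a minimal XGBoost regression sketch, assuming the xgboost package is installed and reusing train_X / train_y / val_X from the earlier snippets (not the exact code from the Kaggle lesson):

```python
from xgboost import XGBRegressor

xgb = XGBRegressor(n_estimators=500, learning_rate=0.05)
xgb.fit(train_X, train_y)
preds = xgb.predict(val_X)
```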

solomonxie commented 6 years ago

Partial Dependence Plots

Extract insights from your models. Insights many didn't even realize were possible.

Refer to Kaggle: Partial Dependence Plots

image

Partial dependence plots are a great way (though not the only way) to extract insights from complex models. These can be incredibly powerful for communicating those insights to colleagues or non-technical users.
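One way to draw such a plot with a recent scikit-learn (the Kaggle lesson used different tooling); `forest` and val_X come from the earlier sketches and "Rooms" is a hypothetical feature name:

```python
from sklearn.inspection import PartialDependenceDisplay

# Shows how the predicted value changes as "Rooms" varies, averaged over the data
PartialDependenceDisplay.from_estimator(forest, val_X, ["Rooms"])
```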

solomonxie commented 6 years ago

Cross-Validation

Refer to Kaggle: Cross-Validation

image

On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where model quality scores would be least reliable with train-test split. So, if your dataset is smaller, you should run cross-validation.
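A minimal cross-validation sketch, assuming X and y as defined earlier; scikit-learn reports negative MAE for this scorer, so the sign is flipped:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

scores = -1 * cross_val_score(RandomForestRegressor(random_state=0),
                              X, y, cv=5,
                              scoring="neg_mean_absolute_error")
print(scores.mean())   # average MAE across the 5 folds
```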

solomonxie commented 6 years ago

Notes on Coursera ML Andrew Ng

In general, any machine learning problem can be assigned to one of two broad classifications: Supervised learning and Unsupervised learning.

Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Supervised learning: "Right answers" given.

Unsupervised learning:

Unsupervised learning is about finding some structure in a dataset, and finding clusters is a big part of that work. With unsupervised learning there is NO feedback based on the prediction results.

There are also non-clustering problems in unsupervised learning, like the "cocktail party algorithm".

Octave

Octave is much faster for implementing a prototype than other languages. We can first use Octave to test our ideas and models, then port them to other languages once they succeed.

Linear regression model

image

Cost function

image

Contour Plot

image

Gradient Descent intuition

image

Gradient descent algorithm

The gradient descent algorithm is: repeat until convergence: image where j=0,1 represents the feature index number.

Simultaneously update the parameters θ0, θ1, ... image
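Since the update rule above is embedded as an image, here it is written out for reference, in the standard form used in the course (α is the learning rate):

```latex
\text{repeat until convergence:}\quad
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
\qquad (\text{simultaneously for } j = 0, 1)
```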

"Batch" Gradient descent

"Batch": Each step of gradient descent computes ALL the training data.

solomonxie commented 6 years ago

Data Scientist

Data Science from Scratch First Principles with Python by Joel Grus

Data science lies at the intersection of:

solomonxie commented 6 years ago

One-hot Encoding

Refer to Quora: What is one-hot encoding and when is it used in data science? Refer to youtube: A demo of One Hot Encoding (TensorFlow Tip of the Week)

"Encoding" is to take a number to represent a categorical value.

Label encoding: image

But the problem is that those values should be nominal rather than ordinal, because we can't say one is greater or smaller than another. That is where One-hot encoding comes in: we use only the numbers 1 and 0 to represent the categorical value, but in a TABLE:

image

Note that the result of the encoding will be a VECTOR.

In this case, the value for sample-1 is encoded to a vector [1,0,0,0], and value of sample-5 is encoded to vector [0,0,0,1].

image
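A minimal one-hot encoding sketch with pandas; the column name "color" and its values are made-up examples:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})
one_hot = pd.get_dummies(df["color"])
# Each row becomes a 0/1 vector; with columns ordered alphabetically
# as [blue, green, red], "red" encodes to [0, 0, 1].
print(one_hot)
```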

solomonxie commented 6 years ago

Getting Started with the R Language

solomonxie commented 6 years ago

Book: Intro to Statistical Learning (ISL)

Model selection: 「Prediction」 vs. 「Inference」

Model selection: 「Flexibility」 vs. 「Interpretability」

snip20181018_5

「Clustering」

Cluster Analysis is NOT about comparing input and output (x & y), but about comparing two variables (x1 & x2): in this case there is no clear association between x and y, so we have no choice but to analyze the variables themselves.

image

Model selection: 「Quantitative」 vs. 「Qualitative」

Model selection: 「Train MSE」 vs. 「Test MSE」

In general, we do not really care how well the method works training on the Training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously Unseen Test data.

We’d like to select the model for which the average of this quantity—the test MSE—is as small as possible.

There is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.

image

Regardless of whether overfitting has occurred, we almost always expect the Training MSE to be SMALLER than the Test MSE, because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.

「Bias-Variance」 Trade-off

The relationship between bias, Variance and Test MSE

image

Evaluate Classification: 「Training Error Rate」 vs. 「Test error」

Training Error Rate: image

Test Error: image

「Bayes Classifier」

image

image

Bayes decision boundary: The Bayes classifier’s prediction is determined by the Bayes decision boundary; an observation that falls on the orange side of the boundary will be assigned to the orange class, and similarly an observation on the blue side of the boundary will be assigned to the blue class.

Bayes Error Rate: The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. image

Note that: In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods.

「K-Nearest Neighbors」 (KNN Classifier)

Many approaches attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with highest estimated probability. One such method is the K-nearest neighbors (KNN) classifier.

image

KNN applies Bayes rule and classifies the test observation x0 to the class with the largest probability.

image

In the left-hand panel, we have plotted a small training data set consisting of six blue and six orange observations. Our goal is to make a prediction for the point labeled by the black cross. Suppose that we choose K = 3. Then KNN will first identify the three observations that are closest to the cross. This neighborhood is shown as a circle. It consists of two blue points and one orange point, resulting in estimated probabilities of 2/3 for the blue class and 1/3 for the orange class. Hence KNN will predict that the black cross belongs to the blue class. In the right-hand panel of Figure 2.14 we have applied the KNN approach with K = 3 at all of the possible values for X1 and X2, and have drawn in the corresponding KNN decision boundary.
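For reference, the KNN conditional-probability estimate that the figures above illustrate can be written in the usual ISL notation, where N0 is the set of the K points nearest to x0:

```latex
\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)
```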

solomonxie commented 6 years ago

Book: ISL chapter 3 Linear Regression

「Simple Linear Regression」 (SLR)

which examines how ONE variable affects the result.

Linear Regression: assume there is a linear relationship between x and y, so that it takes the form: image, in which the intercept & slope are unknown constants.

These two constants, or coefficients, need to be estimated from the dataset as closely as possible.

Least squares is the easiest way to describe the closeness

Residual sum of squares (RSS): image image

image

image
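Since the formulas above are embedded as images, here is a reference sketch of the simple linear regression model, the RSS, and the least-squares estimates in standard notation:

```latex
Y \approx \beta_0 + \beta_1 X, \qquad
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2},
\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}
```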

「Multiple Linear Regression」(MLR)

Examine how multiple variables together affect the result.

image image image

solomonxie commented 6 years ago

Regret

Refer to wiki: Loss function - Regret

Leonard J. Savage argued that when using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret.

0-1 loss function

image
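Written out (a standard form, using the indicator function I; not transcribed from the image):

```latex
L(\hat{y}, y) = I(\hat{y} \neq y) =
\begin{cases}
0 & \text{if } \hat{y} = y \\
1 & \text{if } \hat{y} \neq y
\end{cases}
```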

solomonxie commented 6 years ago

Covariance & Correlation

Refer to wiki: Covariance Refer to wiki: Correlation and dependence Refer to wiki: Covariance and correlation
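For quick reference, the standard definitions (my own summary, not quoted from the linked wiki pages):

```latex
\operatorname{cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big],
\qquad
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
```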

solomonxie commented 6 years ago

Bias

Simple Linear Regression:

y = β0 + β1 * x

β0 is called the intercept because it determines where the line intercepts the y axis. In machine learning we can call this the bias, because it is added to offset all predictions that we make.

solomonxie commented 6 years ago

Stochastic Gradient Descent

Refer to: Linear Regression Tutorial Using Gradient Descent for Machine Learning

Gradient Descent is the process of minimizing a function by following the gradients of the cost function.

In Machine learning we can use a similar technique called stochastic gradient descent to minimize the error of a model on our training data.

Iteration: The way this works is that each training instance is shown to the model one at a time. The model makes a prediction for a training instance, the error is calculated and the model is updated in order to reduce the error for the next prediction.

This procedure can be used to find the set of coefficients in a model that result in the smallest error for the model on the training data. Each iteration, the coefficients, called weights (w) in machine learning language, are updated using the equation:

w = w – alpha * delta

Where w is the coefficient or weight being optimized, alpha is a learning rate that you must configure (e.g. 0.1), and delta is the gradient of the error for the model on the training data attributed to that weight.
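A toy sketch of stochastic gradient descent for simple linear regression (y = b0 + b1 * x); the data points, learning rate, and epoch count are made up for illustration:

```python
data = [(1, 1), (2, 3), (4, 3), (3, 2), (5, 5)]   # (x, y) pairs
b0, b1 = 0.0, 0.0
alpha = 0.01

for epoch in range(20):
    for x, y in data:                 # one training instance at a time
        pred = b0 + b1 * x
        error = pred - y              # "delta" for this instance
        b0 = b0 - alpha * error       # w = w - alpha * delta
        b1 = b1 - alpha * error * x
print(b0, b1)
```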

solomonxie commented 6 years ago

Variables

"The statistical perspective frames data in the context of a hypothetical function (f) that the machine learning algorithm is trying to learn. That is, given some input variables (input), what is the predicted output variable (output)."

image

image

image

solomonxie commented 6 years ago

Hardware for Kaggle competition

I was shocked when I saw the response to a question about the minimum PC specs for running a Kaggle competition: image

AWS EC2 p2 instance pricing

image

DigitalOcean droplet pricing

image

Buy a PC

image

solomonxie commented 5 years ago

Statistical Learning (OneNote) [DRAFT]

Basics

Learning Goal

WE FIRST ASSUME THERE EXISTS AN ALMIGHTY FUNCTION Y=f(x) FOR EVERY (x, y)

image

image

image

Model Selection

Statistics

Plots

image

Distribution

image

image

Inferential Statistics

image

image

Probability

image

Hypothesis Test [DRAFT]

Random Variable

image

Bayesian Theorem [DRAFT]

Linear Regression

Mo: Linear Regression

image

image

image

image

Linear Least Squares

Linear least squares is ONE WAY to estimate a Linear function by finding the minimum value of squared-residuals.

The logic is:

Main Formulations of Linear least squares

Numerical methods for linear least squares

image

Regularized Linear Regression: a modified version of ordinary linear regression that not only minimizes the cost function but also reduces the complexity of the model.

image
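As a sketch of what "minimize cost plus complexity" means, the ridge and lasso objectives in standard notation (λ is the tuning parameter, introduced here only for illustration):

```latex
\text{Ridge:}\quad \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2,
\qquad
\text{Lasso:}\quad \mathrm{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|
```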

Gradient Descent

Gradient Descent Methods

Standard Procedures of Gradient Descent

Batch Gradient Descent (BGD) Calculating the derivative from all training data before calculating an update.

Stochastic Gradient Descent (SGD) Calculating the derivative from each training data instance and calculating the update immediately.

Model Accuracy

Bias-Variance Trade-Off: The prediction error for any machine learning algorithm can be broken down into three parts: bias error, variance error, and irreducible error.

image

image

Hypothesis Test for ML

Some examples of statistical hypothesis tests and the distributions from which their critical values can be calculated are as follows:

• Z-Test: Gaussian distribution (Normal distribution).
• Student's t-Test: Student's t-distribution.
• Chi-Squared Test: Chi-Squared (χ²) distribution.
• ANOVA: F-distribution (Fisher–Snedecor distribution).

image

Features Selection [DRAFT]

Features selection approaches

❶ Stepwise Regression Main approaches:

❷ LASSO (least absolute shrinkage & selection operator)

Classification

Classification Basics

Encodings of Categories

Why NOT Linear regression? Because the probability must fall between 0 and 1, but linear regression is not constrained that way and may give results below 0 or above 1. To avoid that, we MUST model p(X) using a function that gives output between 0 and 1. Many functions meet this description; the logistic function used in Logistic Regression is one of them.
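The logistic function referred to above, in its usual one-predictor form (a reference sketch, not transcribed from the book's images):

```latex
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
```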

Logistic Regression

Logistic Regression is used to predict the probability of a categorical dependent variable, which is a binary variable.

image

image

Linear Discriminant Analysis

LDA (linear discriminant analysis) is an alternative to Logistic regression for the following reasons: