Artificial Intelligence is the umbrella area over all those below. But an AI could be programmed by a human to say something or recognise something, coded line by line.
Machine Learning could do much better than human-coded apps: it lets the machine itself learn things and recognise things. So an AI may or may not use ML.
Deep Learning is a deeper, science-based logic for Machine Learning. It seems to have something to do with neural networks.
Data Mining: machine learning is based on a shitload of data, like a human who needs to encounter many things before he gets something. So data mining is organising wild information and providing it to machine learning.
Big Data means a huge load of information, and the skill of Big Data is how to handle the data and how we can get decisions from it.
Kaggle and Ali Tianchi seem to be a very good start for a beginner like me, and also a good profile tag for a career starter without a solid CS background.
For instance, a web crawler only gets EXTERNAL data. But lots of companies already have big data sets of their own, like clients' information or patients' information. The only thing they want you to do is analyse the data.
Unsupervised learning: it only receives data without any label or instruction, and you ask it to give you the answer.
Reinforcement learning: it receives only SOME info, not from you but gathered by itself, through practice.
Supervised learning: you feed it data with labels, right or wrong, then give it new data and let it decide whether that's right or wrong.
It is strongly recommended to install all packages inside a Virtualenv virtual environment, so they won't conflict with things like Anaconda, and you avoid all the messy failed installs, wrong installs, and accidental deletions.
Recommended: activate the Python 3 virtual environment:
$ source ~/VIRTUALENV-PATH/venv3/bin/activate
Numpy
Install:
$ pip install numpy
# or use a mirror inside China
$ pip install numpy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
Pandas
Install:
$ pip install pandas
# or use a mirror inside China
$ pip install pandas -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
Scikit-Learn
Install:
$ pip install scikit-learn
# or use a mirror inside China
$ pip install scikit-learn -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
matplotlib
Install:
$ pip install matplotlib
# or use a mirror inside China
$ pip install matplotlib -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
scipy and seaborn
Install:
$ pip install seaborn scipy
# or use a mirror inside China
$ pip install seaborn scipy -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
# Check the OS release
cat /etc/os-release
# Check disk usage
df -h
# Check CPU info
cat /proc/cpuinfo
# Check memory info
cat /proc/meminfo
# nvidia-smi ships with the NVIDIA driver utilities, not as its own package
$ sudo apt-get install nvidia-utils-<version>
$ nvidia-smi stats
Link: https://colab.research.google.com/notebooks/welcome.ipynb
Refer to Kaggle: How to use Kaggle - Kernels Link: https://www.kaggle.com/kernels
Link: https://notebooks.azure.com/Microsoft/libraries/samples
Refer to AWS: Amazon EC2 P2 Instances
P2 instances provide up to 16 NVIDIA K80 GPUs, 64 vCPUs and 732 GiB of host memory, with a combined 192 GB of GPU memory, 40 thousand parallel processing cores, 70 teraflops of single precision floating point performance, and over 23 teraflops of double precision floating point performance. P2 instances also offer GPUDirect™ (peer-to-peer GPU communication) capabilities for up to 16 GPUs, so that multiple GPUs can work together within a single host.
NVIDIA Tesla K80 GPU: RMB 30,000, VRAM 24GB (GDDR5), 256-bit
Intel Xeon® E5-2686 v4: RMB 12,000, 8 cores, 1.7GHz
Refer to Kaggle: Your First Machine Learning Model
The steps to building and using a model are: Define, Fit, Predict, and Evaluate.
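A minimal sketch of those four steps with scikit-learn. The file path and column names (train.csv, Price, Rooms, Bathroom, Landsize) are hypothetical placeholders, not from these notes:

```python
# Define -> Fit -> Predict -> Evaluate with a Decision Tree
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv('train.csv')               # hypothetical dataset
y = df['Price']                             # hypothetical target column
X = df[['Rooms', 'Bathroom', 'Landsize']]   # hypothetical feature columns
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

model = DecisionTreeRegressor(random_state=1)   # 1. Define
model.fit(train_X, train_y)                     # 2. Fit
preds = model.predict(val_X)                    # 3. Predict
print(mean_absolute_error(val_y, preds))        # 4. Evaluate
```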
Refer to Kaggle: Underfitting and Overfitting
As the Decision Tree Model's depth goes deeper, underfitting decreases, but there is a "turn" at which it starts overfitting and the error grows larger.
To find the "turning point", we need to test out several tree sizes via the max_leaf_nodes parameter. At the point where the error turns from descending to ascending, we choose that size as the best one for the Decision Tree Model, as shown in the sketch below.
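A sketch of that search, in the spirit of the Kaggle exercise; it reuses the train/validation split from the sketch above:

```python
# Search for the max_leaf_nodes value where validation error stops
# descending and starts ascending (the "turning point")
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

for leaf_nodes in [5, 50, 500, 5000]:
    mae = get_mae(leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"max_leaf_nodes: {leaf_nodes}\tvalidation MAE: {mae:.0f}")
```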
A "Forest" has a lot of "Trees".
The Random Forest
model randomly creates a lot of Decision Trees
, and average all predictions from each Decision Tree to get a better result.
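A minimal sketch, reusing the same split as above; with default parameters a Random Forest often already beats a single tuned tree:

```python
# Average many randomized Decision Trees instead of tuning one
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest = RandomForestRegressor(random_state=1)
forest.fit(train_X, train_y)
print(mean_absolute_error(val_y, forest.predict(val_X)))
```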
Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values.
Refer to Kaggle: Handling Missing Values
In many cases you'll have both a training dataset and a test dataset, and you will want to drop the same columns in both DataFrames. Still, dropping columns is usually not the best solution; it can be useful, though, when most values in a column are missing.
Imputation fills in the missing value with some number. Imputation is the standard approach, and it usually works well.
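A sketch of both options, assuming X_train and X_test are numeric DataFrames with some missing values (recent scikit-learn exposes SimpleImputer; older tutorials used sklearn.preprocessing.Imputer):

```python
from sklearn.impute import SimpleImputer

# Option 1: drop the SAME columns with missing values in both DataFrames
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)

# Option 2: imputation -- fill each missing entry with the column mean
imputer = SimpleImputer(strategy='mean')
imputed_X_train = imputer.fit_transform(X_train)
imputed_X_test = imputer.transform(X_test)   # reuse the training-set means
```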
Extract insights from your models. Insights many didn't even realize were possible.
Refer to Kaggle: Partial Dependence Plots
Partial dependence plots are a great way (though not the only way) to extract insights from complex models. These can be incredibly powerful for communicating those insights to colleagues or non-technical users.
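A minimal sketch with scikit-learn's partial dependence tooling (the from_estimator API exists in recent versions; older ones use plot_partial_dependence instead). The fitted data and the hypothetical 'Rooms' feature are reused from the sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

gbr = GradientBoostingRegressor().fit(train_X, train_y)
# How does the prediction change as 'Rooms' varies, all else averaged out?
PartialDependenceDisplay.from_estimator(gbr, val_X, features=['Rooms'])
plt.show()
```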
Refer to Kaggle: Cross-Validation
On small datasets, the extra computational burden of running cross-validation isn't a big deal. These are also the problems where model quality scores would be least reliable with train-test split. So, if your dataset is smaller, you should run cross-validation.
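A sketch of 5-fold cross-validation on the full X and y from the first sketch, instead of a single train-test split:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# 5 folds: each fold takes a turn as the held-out validation set
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=5, scoring='neg_mean_absolute_error')
print('MAE per fold:', -scores)
print('Mean MAE:', -scores.mean())
```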
In general, any machine learning problem can be assigned to one of two broad classifications: Supervised learning and Unsupervised learning.
Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Regression: predict continuous valued output. Classification: predict discrete valued output.
Unsupervised learning is to find some structure in a dataset, and finding clusters is a big part of the work. With unsupervised learning there is NO feedback based on the prediction results, like the "cocktail party algorithm".
Octave is much faster for implementing a prototype than other languages. We can first use Octave to test our ideas and models, and port them to other languages once they succeed.
The gradient descent algorithm is: repeat until convergence:

θj := θj − α * ∂/∂θj J(θ0, θ1)

where j = 0, 1 represents the feature index number, and the parameters θ0, θ1 must be updated simultaneously.
"Batch": each step of gradient descent uses ALL the training data.
Data Science from Scratch: First Principles with Python, by Joel Grus
Data science lies at the intersection of: hacking skills, math and statistics knowledge, and substantive expertise.
Refer to Quora: What is one-hot encoding and when is it used in data science?
Refer to youtube: A demo of One Hot Encoding (TensorFlow Tip of the Week)
"Encoding" is to take a number to represent a categorical value.
Label encoding:
But the problem is that those values ought to be nominal rather than ordinal, because we can't say one category is greater or smaller than another.
And there comes One-hot encoding, in which we take only the numbers 1 & 0 to represent a categorical value, but in a TABLE:
Note that the result of the encoding will be a VECTOR. In this case, the value of sample-1 is encoded to the vector [1,0,0,0], and the value of sample-5 to the vector [0,0,0,1].
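A small sketch contrasting the two encodings on a made-up color column:

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green', 'yellow']})

# Label encoding: one integer per category (implies a spurious order)
df['color_label'] = df['color'].astype('category').cat.codes

# One-hot encoding: one 0/1 column per category; each row becomes a vector
one_hot = pd.get_dummies(df['color'], prefix='color')
print(one_hot)
```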
If the goal is inference, then it's better to choose the model with more interpretability.
Cluster Analysis is NOT to compare input and output (x & y), but to compare two variables (x1 & x2), because in that case there is no clear association for x & y, so we are left with no choice but to analyze the variables themselves.
In general, we do not really care how well the method works on the Training data. Rather, we are interested in the accuracy of the predictions that we obtain when we apply our method to previously Unseen Test data.
We’d like to select the model for which the average of this quantity—the test MSE—is as small as possible.
There is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.
Regardless of whether overfitting has occurred, we almost always expect the Training MSE to be SMALLER than the Test MSE, because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
The relationship between Bias, Variance and Test MSE
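ISLR captures this relationship in the decomposition of the expected test MSE at a point x0:

```latex
E\left(y_0 - \hat{f}(x_0)\right)^2
  = \mathrm{Var}\left(\hat{f}(x_0)\right)
  + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2
  + \mathrm{Var}(\epsilon)
```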
Training Error Rate: the fraction of training observations that are misclassified, (1/n) Σ I(yi ≠ ŷi).
Test Error: the average misclassification over test observations, Ave(I(y0 ≠ ŷ0)).
Bayes decision boundary: The Bayes classifier’s prediction is determined by the Bayes decision boundary; an observation that falls on the orange side of the boundary will be assigned to the orange class, and similarly an observation on the blue side of the boundary will be assigned to the blue class.
Bayes Error Rate: The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate.
Note that: In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods.
Many approaches attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with highest estimated probability. One such method is the K-nearest neighbors (KNN) classifier.
KNN estimates the conditional probability of class j as the fraction of the K nearest neighbors whose response equals j, Pr(Y = j | X = x0) = (1/K) Σ i∈N0 I(yi = j), then applies Bayes rule and classifies the test observation x0 to the class with the largest probability.
In the left-hand panel, we have plotted a small training data set consisting of six blue and six orange observations. Our goal is to make a prediction for the point labeled by the black cross. Suppose that we choose K = 3. Then KNN will first identify the three observations that are closest to the cross. This neighborhood is shown as a circle. It consists of two blue points and one orange point, resulting in estimated probabilities of 2/3 for the blue class and 1/3 for the orange class. Hence KNN will predict that the black cross belongs to the blue class. In the right-hand panel of Figure 2.14 we have applied the KNN approach with K = 3 at all of the possible values for X1 and X2, and have drawn in the corresponding KNN decision boundary.
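A sketch of the same idea with scikit-learn, using a made-up six-point, two-class dataset and K = 3 (the numbers are illustrative, not the ones from Figure 2.14):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [6, 6]])
y = np.array(['blue', 'blue', 'blue', 'orange', 'orange', 'orange'])

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3
knn.fit(X, y)
print(knn.predict([[3, 3]]))         # predicted class for the "black cross"
print(knn.predict_proba([[3, 3]]))   # estimated probability for each class
```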
which is to examine how ONE variable affects the result.
Linear Regression: assume there is an approximately linear relationship between x and y, that is Y ≈ β0 + β1 * X, in which the intercept β0 and the slope β1 are unknown constants.
For the two constants, or coefficients, we need to estimate them from the dataset so that the fitted line is as close to the data as it can be. Least squares is the easiest way to describe that closeness, via the Residual sum of squares (RSS):

RSS = e1² + e2² + … + en², where ei = yi − ŷi
Examine how multiple variables jointly affect the result.
Refer to wiki: Loss function - Regret
Leonard J. Savage argued that using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret.
0-1 loss function: L(y, ŷ) = I(y ≠ ŷ), i.e. 1 for a wrong prediction and 0 for a correct one.
Covariance & Correlation
Refer to wiki: Covariance
Refer to wiki: Correlation and dependence
Refer to wiki: Covariance and correlation
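For quick reference, the defining formulas (σX and σY are the standard deviations of X and Y):

```latex
\mathrm{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big],
\qquad
\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \in [-1, 1]
```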
Simple Linear Regression: y = β0 + β1 * x
β0 is called the intercept because it determines where the line intercepts the y axis. In machine learning we can call this the bias, because it is added to offset all predictions that we make.
Refer to: Linear Regression Tutorial Using Gradient Descent for Machine Learning
Gradient Descent is the process of minimizing a function by following the gradients of the cost function.
In Machine learning we can use a similar technique called stochastic gradient descent to minimize the error of a model on our training data.
Iteration: The way this works is that each training instance is shown to the model one at a time. The model makes a prediction for a training instance, the error is calculated and the model is updated in order to reduce the error for the next prediction.
This procedure can be used to find the set of coefficients in a model that result in the smallest error for the model on the training data.
Each iteration, the coefficients, called weights (w) in machine learning language, are updated using the equation:

w = w – alpha * delta

where w is the coefficient or weight being optimized, alpha is a learning rate that you must configure (e.g. 0.1), and delta is the error gradient for the model on the training data attributed to that weight.
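A sketch of that update loop for simple linear regression y = b0 + b1 * x, updating after EACH training instance; the tiny dataset and alpha = 0.01 are made up:

```python
# Made-up training data: pairs of (x, y)
data = [(1, 1), (2, 3), (4, 3), (3, 2), (5, 5)]
b0, b1 = 0.0, 0.0
alpha = 0.01   # learning rate

for epoch in range(50):
    for x, y in data:               # one update per training instance
        pred = b0 + b1 * x
        error = pred - y            # prediction error for this instance
        b0 = b0 - alpha * error         # delta for the intercept is the error
        b1 = b1 - alpha * error * x     # delta for the slope is error * input
print(b0, b1)
```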
"The statistical perspective frames data in the context of a hypothetical function (f) that the machine learning algorithm is trying to learn. That is, given some input variables (input), what is the predicted output variable (output)."
I was shocked when I saw the response to a question about the minimum PC specs for running a Kaggle competition:
WE FIRST ASSUME THERE EXISTS AN ALMIGHTY FUNCTION Y=f(x) FOR EVERY (x, y)
Linear least squares is ONE WAY to estimate a Linear function by finding the minimum value of squared-residuals.
The logic is to choose the coefficients that minimize the sum of squared residuals.
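In matrix form (a sketch, assuming XᵀX is invertible), that objective and its closed-form solution are:

```latex
\hat{\beta} \;=\; \arg\min_{\beta} \lVert y - X\beta \rVert^2
           \;=\; (X^{\top}X)^{-1} X^{\top} y
```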
Main Formulations of Linear least squares
Numerical methods for linear least squares
Regularized Linear Regression: a modified version of ordinary linear regression that not only minimizes the cost function but also reduces the complexity of the model.
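For reference, the two most common penalized forms, where λ controls how strongly the coefficients are shrunk:

```latex
\text{Ridge:}\; \min_{\beta}\ \mathrm{RSS} + \lambda \sum_j \beta_j^2
\qquad
\text{Lasso:}\; \min_{\beta}\ \mathrm{RSS} + \lambda \sum_j \lvert \beta_j \rvert
```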
Gradient Descent Methods
Standard Procedures of Gradient Descent
Batch Gradient Descent (BGD) Calculating the derivative from all training data before calculating an update.
Stochastic Gradient Descent (SGD) Calculating the derivative from each training data instance and calculating the update immediately.
Bias-Variance Trade-Off: the prediction error for any machine learning algorithm can be broken down into three parts: Bias Error, Variance Error, and Irreducible Error.
Some examples of statistical hypothesis tests, and the distributions from which their critical values can be calculated, are as follows (a t-test sketch follows the list):
• Z-Test: Gaussian distribution (Normal distribution).
• Student’s t-Test: Student’s t-distribution.
• Chi-Squared Test: Chi-Squared (𝜲²) distribution.
• ANOVA: F-distribution (Fisher–Snedecor distribution).
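A small example of the second item: a two-sample Student's t-test with scipy, on made-up samples:

```python
from scipy import stats

a = [5.1, 4.9, 6.2, 5.7, 5.5]   # made-up sample A
b = [4.2, 4.8, 4.1, 4.9, 4.4]   # made-up sample B
t_stat, p_value = stats.ttest_ind(a, b)
print(t_stat, p_value)   # reject the null hypothesis if p_value < 0.05
```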
Feature selection approaches
❶ Stepwise Regression. Main approaches: forward selection, backward elimination, and bidirectional elimination.
❷ LASSO (least absolute shrinkage & selection operator)
Encodings of Categories
Why NOT Linear regression? Because the probability must fall between 0 and 1, but linear regression is not sensible for this and may produce results below 0 or above 1. To avoid that, we MUST model p(X) using a function that gives outputs between 0 and 1. Many functions meet this description; the logistic function used in Logistic Regression is one of them.
Logistic Regression is to predict the probability of a categorical dependent variable, which is a binary variable.
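The logistic function from ISLR, whose output always lies between 0 and 1:

```latex
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
```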
LDA (linear discriminant analysis) is an alternative to Logistic regression for the following reasons: when the classes are well-separated, the parameter estimates of logistic regression are surprisingly unstable; when n is small and the distribution of the predictors is approximately normal in each class, LDA is more stable; and LDA is popular when we have more than two response classes.
Prerequisites:
• Math
• Computer Science
• Programming (Python)