Machine Learning - Githubissues

Agenda

Data standardization, mean removal, and variance scaling
BLAS, LAPACK / Machine learning workflow / Leaving it to scikit-learn / Types of methods
scikit-learn (Regression, K Nearest Neighbors classifier, Support Vector Machine, K-Means algorithm, Mean Shift)

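Commonly used libraries

Quandl: A search engine for numerical data such as financial and economic data. It searches data gathered from a variety of sources and displays it as graphs and tables. The data can also be downloaded in formats such as JSON and CSV, or imported into services such as Plotly.
scikit-learn: An open-source machine learning library that implements classification, regression, clustering, and more.
LightGBM: A framework developed by Microsoft for gradient boosting. Gradient boosting is an algorithm derived from decision trees; it apparently performs ensemble learning by building multiple decision trees sequentially.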

Basic terminology:

Feature: e.g., we pick a few features to define a person (say "Height", "Weight", and "Foot Size"). These are also called attributes.
Class: The output category of your data. Classes are also called categories.

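Data standardization, mean removal, and variance scaling

Many elements used in the objective function of a machine learning algorithm, such as the RBF kernel in SVMs or the L1 and L2 regularization terms in linear models, assume that all features are centered around 0 and have variances of the same order of magnitude. If one feature's variance is orders of magnitude larger than the others', it will dominate the objective function and can keep the estimator from learning from the other features as expected. http://d0evi1.com/sklearn/preprocessing/

After the data is standardized with the scale function, each feature has mean 0 and variance 1.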
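As a quick check of the statement above, here is a minimal sketch using sklearn.preprocessing.scale. The toy matrix is made up for illustration and is not from the original notes.

```python
import numpy as np
from sklearn.preprocessing import scale

# Toy feature matrix (illustrative values): rows are samples, columns are features
X = np.array([[1.0, -1.0, 2.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, -1.0]])

X_scaled = scale(X)  # standardizes each column independently

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means)  # approximately 0 for every column
print(col_stds)   # approximately 1 for every column
```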

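The preprocessing module also provides a utility class, StandardScaler, which implements the Transformer API: it computes the mean and standard deviation on a training set so that the same transformation can later be reapplied to the test set.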
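A small sketch of how StandardScaler is typically used: fit on the training set, then reuse the learned statistics on new data. The arrays here are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: two features on very different scales
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler().fit(X_train)   # learns mean_ and scale_ from the training set only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # reuses the training mean and standard deviation

print(scaler.mean_)   # per-feature training mean
print(scaler.scale_)  # per-feature training standard deviation
print(X_test_std)
```

Fitting only on the training data keeps information about the test set from leaking into the preprocessing step.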

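Another way to standardize is to scale each feature to lie between a given minimum and maximum value, for example [0, 1], or to scale it so that its maximum absolute value is 1. Use MinMaxScaler for the former and MaxAbsScaler for the latter.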
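The two scalers mentioned above can be compared side by side. This is only a sketch on a made-up matrix.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler

# Illustrative data; the second column contains negative values
X = np.array([[1.0, -2.0],
              [2.0, 0.0],
              [4.0, 1.0]])

# Each column is mapped linearly onto [0, 1]
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Each column is divided by its largest absolute value, so values land in [-1, 1]
X_maxabs = MaxAbsScaler().fit_transform(X)

print(X_minmax)
print(X_maxabs)
```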

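Sparse matrices: centering sparse data destroys its sparsity, so it is rarely a sensible thing to do. Scaling sparse data can still make sense, especially when the features are on different scales. MaxAbsScaler and maxabs_scale are designed specifically for scaling sparse data.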
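To illustrate why MaxAbsScaler suits sparse input, the sketch below scales a small SciPy CSR matrix (values invented for the example) and checks that the zero entries stay zero.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Illustrative sparse matrix with 4 non-zero entries
X_sparse = sparse.csr_matrix(np.array([[0.0, 4.0, 0.0],
                                       [2.0, 0.0, 0.0],
                                       [0.0, -8.0, 5.0]]))

# Scaling only divides by a per-column constant, so no zero becomes non-zero
X_scaled = MaxAbsScaler().fit_transform(X_sparse)

still_sparse = sparse.issparse(X_scaled)
same_nnz = X_scaled.nnz == X_sparse.nnz
print(still_sparse, same_nnz)
print(X_scaled.toarray())
```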

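Definition of normalization: as I see it, normalization means processing the data (with some algorithm) so that it is confined to the range you need. Normalization is done first to make later data processing easier, and second to help the program converge faster at run time. Common methods are:

1. Linear transformation: y = (x - MinValue) / (MaxValue - MinValue), where x and y are the values before and after the transformation, and MaxValue and MinValue are the maximum and minimum of the samples.

2. Logarithmic transformation: y = log10(x), i.e. a base-10 logarithm.

3. Arctangent transformation: y = atan(x) * 2 / PI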
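The three transformations listed above can be sketched in NumPy as follows. The sample values are made up, and positive inputs are assumed so that the logarithm is defined.

```python
import numpy as np

# Illustrative positive samples spanning several orders of magnitude
x = np.array([1.0, 10.0, 100.0, 1000.0])

# 1. Linear (min-max) transformation onto [0, 1]
y_linear = (x - x.min()) / (x.max() - x.min())

# 2. Base-10 logarithmic transformation
y_log = np.log10(x)

# 3. Arctangent transformation; positive inputs map into (0, 1)
y_atan = np.arctan(x) * 2 / np.pi

print(y_linear)
print(y_log)
print(y_atan)
```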

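Equation (1) maps the input values into the interval [-1, 1], and Equation (2) maps them back to their original values at the output layer; the two bounds in these equations are the maximum and minimum load in the training sample set.

In statistics, the role of normalization is to unify the statistical distribution of the samples. Normalizing to the range 0 to 1 gives a statistical probability distribution, and normalizing to the range -1 to +1 gives a statistical coordinate distribution.

When to normalize data: it mainly depends on whether the model is scale-invariant.

For some models, the optimal solution after non-uniform scaling of the dimensions is not equivalent to the original one; SVM is an example. For such models, unless the ranges of the individual dimensions are already fairly close, the data must be standardized so that the model parameters are not dominated by the dimensions with the largest or smallest ranges.

For other models, the optimal solution after non-uniform scaling of the dimensions is equivalent to the original one; logistic regression (without regularization) is an example. For such models, standardization does not change the optimal solution in theory. In practice, however, the solution is usually found with an iterative algorithm, and if the objective function is too "flat" in shape, the algorithm may converge very slowly or not at all. So even for scale-invariant models it is best to standardize the data.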
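One way to see the effect on a scale-sensitive model such as SVM is to train it with and without standardization. The sketch below is only an illustration: the choice of scikit-learn's bundled wine dataset, the 70/30 split, and random_state=0 are assumptions, not part of the original notes, and the printed accuracies depend on them.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Wine features have very different ranges, which makes scaling matter
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Same RBF SVM, once on raw features and once behind a StandardScaler
raw_svm = SVC().fit(X_train, y_train)
scaled_svm = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

raw_acc = raw_svm.score(X_test, y_test)
scaled_acc = scaled_svm.score(X_test, y_test)
print("raw features:   ", raw_acc)
print("standardized:   ", scaled_acc)
```

Putting the scaler inside a pipeline means it is fitted on the training split only and applied consistently at prediction time.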

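The difference between "standardization" and "normalization": standardization generally means adjusting the mean to 0 and the variance to 1. Normalization in the narrow sense means rescaling the minimum and maximum to 0 and 1 (or -1 and 1); in the broad sense it can also refer to standardization.

Normalization scales the data proportionally so that it falls into a small, specific interval. It is often used when comparing or evaluating indicators: it removes the units of the data and converts it into dimensionless pure numbers, so that indicators with different units or magnitudes can be compared and weighted.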

BLAS, LAPACK

Practice datasets: Kaggle: https://www.kaggle.com/ UCI: https://archive.ics.uci.edu/ml/datasets.html

Machine learning workflow

Leaving it to scikit-learn

scikit-learn

Regression

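from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing (X and y are assumed to be prepared already)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
confidence = clf.score(X_test, y_test)  # R^2 on the test set

# X_lately: the most recent rows, which have no label yet
forecast_set = clf.predict(X_lately)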
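The snippet above assumes that X, y, and X_lately already exist. A minimal self-contained sketch follows; the synthetic data, the coefficients 3 and -2, and the choice of the last five rows as X_lately are all invented here purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): y = 3*x0 - 2*x1 + 1 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(0, 0.1, size=200)

# Pretend the last five rows are the newest ones, whose labels are unknown
X_lately = X[-5:]
X_train, X_test, y_train, y_test = train_test_split(
    X[:-5], y[:-5], test_size=0.2, random_state=0
)

clf = LinearRegression()
clf.fit(X_train, y_train)
confidence = clf.score(X_test, y_test)   # R^2 on the held-out rows
forecast_set = clf.predict(X_lately)     # predictions for the unlabeled rows

print(confidence)
print(clf.coef_, clf.intercept_)
print(forecast_set)
```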

K Nearest Neighbors classifier

With scikit-learn, the KNN classifier has a parallel-processing parameter called n_jobs, which sets how many jobs run in parallel for the neighbor search. For example, n_jobs=100 requests 100 parallel jobs, and n_jobs=-1 uses all available processors.

from sklearn import neighbors
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = neighbors.KNeighborsClassifier()  # or pass n_jobs=-1
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
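For a version that runs on its own, the sketch below uses scikit-learn's bundled iris dataset. The dataset, n_neighbors=5, and random_state=0 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# n_jobs=-1 asks scikit-learn to use all available processors for the neighbor search
clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)   # fraction of test samples classified correctly
prediction = clf.predict(X_test[:1])   # predicted class for a single sample
print(accuracy, prediction)
```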

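Support Vector Machine

The objective of the Support Vector Machine is to find the best splitting boundary between data. It aims to generate, in one pass, the "best fit" line (in general a plane, or more specifically a hyperplane) that best divides the data.

Support Vector Machine (SVM) algorithms http://www.cnblogs.com/pinard/p/6117515.html The SVM library in scikit-learn falls into two groups. One is for classification and includes the three classes SVC, NuSVC, and LinearSVC. The other is for regression and includes the three classes SVR, NuSVR, and LinearSVR. All of these classes are in the sklearn.svm module.

scikit-learn has four built-in kernel functions (or only three, if you do not count the linear kernel as a kernel function).

1) Linear kernel: K(x,z) = x∙z, the ordinary inner product. LinearSVC and LinearSVR can only use this kernel.

2) Polynomial kernel: one of the kernels commonly used for linearly non-separable SVMs. K(x,z) = (γ x∙z + r)^d, where γ, r, and d all have to be tuned by hand, which is rather tedious.

3) Gaussian kernel: in SVMs it is also called the radial basis function (RBF) kernel. It is the default kernel in libsvm and therefore also the default in scikit-learn. K(x,z) = exp(-γ ||x-z||^2), where γ is greater than 0 and has to be tuned by hand.

4) Sigmoid kernel: also one of the kernels commonly used for linearly non-separable SVMs. K(x,z) = tanh(γ x∙z + r), where γ and r have to be tuned by hand.

In general, the default Gaussian kernel works fairly well on non-linear data. If you are not an expert at tuning SVMs, the Gaussian kernel is the recommended choice for data analysis.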
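The four built-in kernels can be tried through the kernel parameter of sklearn.svm.SVC. The sketch below is an illustration only: the make_moons toy dataset, its noise level, and random_state=0 are assumptions, and the accuracies it prints depend on those choices.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy non-linear data (assumption): two interleaving half circles
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale")  # other hyperparameters left at their defaults
    clf.fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)

for kernel, acc in scores.items():
    print(kernel, round(acc, 3))

default_kernel = SVC().kernel  # the kernel used when none is specified
print("default kernel:", default_kernel)
```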

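ーーーーーーーーーーーーーーーーーーーーー

Kernels

For kernel methods, the most important ingredient is the kernel function, which corresponds to an inner product. For two data points x and x', the kernel function k is defined as: k(x, x') = φ(x)^T φ(x')

In other words, a kernel function is an inner product.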
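A small numerical check of this idea, as a sketch: for the degree-2 polynomial kernel k(x, z) = (x∙z)^2 in two dimensions, one explicit feature map is φ(v) = (v1^2, sqrt(2)·v1·v2, v2^2). The kernel choice, the function names phi and poly_kernel, and the sample vectors are picked here for illustration.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the 2-D kernel k(x, z) = (x . z)^2."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel with gamma = 1 and r = 0."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_direct = poly_kernel(x, z)          # computed in the original 2-D space
k_feature = np.dot(phi(x), phi(z))    # inner product in the 3-D feature space
print(k_direct, k_feature)
```

Both values agree, which is the point of the kernel trick: the inner product in feature space is obtained without ever constructing φ(x).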

Clustering and Unsupervised machine learning

Flat and Hierarchical

With Flat clustering, the scientist tells the machine how many classes/clusters to find. With Hierarchical clustering, the machine figures out the groups and how many there are; the resulting structure is a tree.

The field of "Big Data Analysis" is generally a prime area for clustering.

K-Means algorithm (Flat clustering)

The idea of K-Means is to attempt to cluster a given dataset into K clusters.
1. Take the entire dataset and randomly set K centroids. Centroids are just the "centers" of your clusters.
2. Calculate the distance of each featureset to the centroids, and classify each featureset as the centroid class closest to it. Centroid classes are arbitrary; you will likely just call the first centroid 0, the second centroid 1, and so on.
3. Once you have classified all the data, take the mean of each group and set the new centroids to the means of their associated groups.
4. Repeat steps 2 and 3 until the centroids stop moving (you are optimized).
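The steps above can be sketched in plain NumPy. This is a minimal illustration rather than scikit-learn's implementation; the function name kmeans, the convergence test, the handling of empty clusters, and the six sample points are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two visually separate groups of points (illustrative values)
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```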

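import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is assumed to be a NumPy array of shape (n_samples, 2)
clf = KMeans(n_clusters=2)
clf.fit(X)

centroids = clf.cluster_centers_
labels = clf.labels_

# One colour per cluster label
colors = ["g.", "r.", "c.", "y."]
for i in range(len(X)):
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)
# Mark the centroids with a large "x"
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()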

Mean Shift (Hierarchical clustering)

Mean Shift is very similar to the K-Means algorithm, except for one very important factor: you do not need to specify the number of groups prior to training. The Mean Shift algorithm finds clusters on its own.

The way Mean Shift works is to go through each featureset (a datapoint on a graph), and proceed to do a hill climb operation.

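Given observation points {x_i | i = 1, ..., n} in a d-dimensional space, we can consider the density function f(x) produced by the distribution of those points. f(x) takes large values where the points are concentrated and small values where they are sparse.

Mean shift considers a circle of radius h and repeats the following two steps to find a local maximum of the density: compute the mean of the points that lie inside the circle around the current point, then move the center of the circle from the current point to that mean.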
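The hill-climbing procedure described above can be sketched in NumPy as follows. This is a simplified illustration with a flat circular window, not scikit-learn's MeanShift; the function name shift_to_mode, the radius h=2.0, the tolerance, and the sample points are assumptions.

```python
import numpy as np

def shift_to_mode(start, X, h, n_iter=100, tol=1e-6):
    """Hill-climb from `start` by repeatedly moving to the mean of the points within radius h."""
    center = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        # Points that fall inside the circle of radius h around the current center
        in_window = np.linalg.norm(X - center, axis=1) <= h
        new_center = X[in_window].mean(axis=0)
        # Stop when the center has effectively stopped moving
        if np.linalg.norm(new_center - center) < tol:
            break
        center = new_center
    return center

# Two small groups of points (illustrative values)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Start one hill climb from every data point
modes = np.array([shift_to_mode(p, X, h=2.0) for p in X])
print(np.round(modes, 3))
```

Points whose climbs end at the same location belong to the same cluster, which is how the number of clusters emerges without being specified in advance.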

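import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

# Three 3-D blob centers used to generate the sample data
centers = [[1, 1, 1], [5, 5, 5], [3, 10, 10]]

X, _ = make_blobs(n_samples=100, centers=centers, cluster_std=1.5)
print(X)

ms = MeanShift()
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_

print(cluster_centers)
# The number of clusters is estimated by the algorithm, not given in advance
n_clusters_ = len(np.unique(labels))
print("Number of estimated clusters:", n_clusters_)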

Notes:

sklearn.datasets.make_blobs: Generate isotropic Gaussian blobs for clustering.

X, y = make_blobs(n_samples=500, centers=4, random_state=8, cluster_std=2.4)
# n_samples: number of samples; centers: number of centers; random_state: seed value; cluster_std: spread (standard deviation) of each cluster

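■ train_test_split: Splits the X and y taken from the dataset further into training data and test data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

train_test_split takes the following arguments:
• First argument: the feature matrix X
• Second argument: the target variable y
• test_size=: the fraction of the data to use for testing; test_size=0.3 sets aside 30% as test data
• random_state=: the random seed used when splitting the data; 0 is specified here so that the same result is returned each time, but this is for study purposes and it is normally left unspecified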
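A tiny sketch of what these arguments do, using a made-up array of 10 samples: test_size=0.3 leaves 7 rows for training and 3 for testing, and a fixed random_state makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)

# Calling it again with the same seed yields the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.3, random_state=0)
same_split = np.array_equal(X_train, X_train2) and np.array_equal(y_test, y_test2)
print(same_split)
```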

To improve accuracy, data preprocessing and feature engineering are very important!

In data preprocessing, we usually do things like:
• Deal with outliers
• Categorical variable encoding (one-hot encoding)
• Text data embedding

In feature engineering, we usually do things like:
• Feature extraction (create some new features)
• Feature selection (by feature correlation)
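As an example of the one-hot encoding step mentioned above, here is a sketch using pandas.get_dummies. The small DataFrame is invented for illustration; the column names merely echo the Titanic example that follows.

```python
import pandas as pd

# Illustrative data with one categorical column
df = pd.DataFrame({
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# Replace the categorical column with one indicator column per category
encoded = pd.get_dummies(df, columns=["Embarked"])
print(encoded)

# Exactly one indicator is set in every row
indicator_cols = ["Embarked_C", "Embarked_Q", "Embarked_S"]
row_sums = encoded[indicator_cols].sum(axis=1)
```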

Example (titanic_traing.ipynb):

Fill the NA values in Fare based on the Embarked data

embarked = ['S', 'C', 'Q']
for port in embarked:
    # Median Fare of the passengers who embarked at this port
    fare_to_impute = df_data.groupby('Embarked')['Fare'].median()[port]
    # Select every row where Fare is null for this port and assign the median to its Fare column
    df_data.loc[(df_data['Fare'].isnull()) & (df_data['Embarked'] == port), 'Fare'] = fare_to_impute
# Fare in df_train and df_test:
df_train["Fare"] = df_data['Fare'][:891]
df_test["Fare"] = df_data['Fare'][891:]

Fill in missing Fare values in the training set based on the mean fare for that Pclass

# Assumes df_train has the default integer index (0, 1, 2, ...)
for x in range(len(df_train["Fare"])):
    if pd.isnull(df_train["Fare"][x]):
        pclass = df_train["Pclass"][x]  # e.g. Pclass = 3
        # Mean Fare of the rows with the same Pclass, rounded to four decimal places
        df_train.loc[x, "Fare"] = round(df_train[df_train["Pclass"] == pclass]["Fare"].mean(), 4)
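The row-by-row loop above can also be written without an explicit loop. The following is a sketch on a small made-up DataFrame (not the Titanic data), using groupby(...).transform("mean") so that every row receives the mean Fare of its own Pclass.

```python
import numpy as np
import pandas as pd

# Illustrative data: one missing Fare in Pclass 3
df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3],
    "Fare": [80.0, 100.0, 8.0, np.nan, 10.0],
})

# Per-row mean Fare of the matching Pclass (missing values are skipped by mean)
class_mean = df.groupby("Pclass")["Fare"].transform("mean")

# Only the missing entries are replaced; existing fares are left untouched
df["Fare"] = df["Fare"].fillna(class_mean.round(4))
print(df)
```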

rainit2006 / Python-room

Machine Learning #5

scikit-learn