jeff1evesque / machine-learning

Web-interface + rest API for classification and regression (https://jeff1evesque.github.io/machine-learning.docs)

Research 'Support Vector Machines' (wiki) #1

Closed jeff1evesque closed 6 years ago

jeff1evesque commented 10 years ago

To gain a better understanding of Support Vector Machines, we will research the following:

SVM Classification

SVM Regression

Loss Functions

jeff1evesque commented 10 years ago

General Definition:

Support Vector Machines (SVMs) are a class of supervised learning algorithms that build a model, or structure, from data during training.

An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. The best hyperplane for an SVM means the one with the largest margin between the two classes. Margin means the maximal width of the slab parallel to the hyperplane that has no interior data points.

image

Note: In a multi-class classification, a single training run produces more than one SVM.

A support vector machine is a method of machine learning that attempts to take input data and classify it into one of two categories. In order for a support vector machine to be effective, it is necessary to first use a set of training input and output data to build the support vector machine model, which can then be used for classifying new data.

A support vector machine develops this model by taking the training inputs, mapping them into multidimensional space, then using optimization to find a hyperplane (a hyperplane is a surface in n-dimensional space that separates the space into two half-spaces) that best separates the two classes of inputs. Once the support vector machine has been trained, it is able to assess new inputs with respect to the separating hyperplane and classify each into one of the two categories.

image

Support Vector Machines are also capable of multi-class classification (instead of only two categories). Specifically, this capability can be accessed within the two SVM engines we are considering using.

Note: The decision function is a function created after training, and modeled as the hyperplane that separates the different classes. It determines whether sample points belong to a specific classification (or class). The decision function depends on a subset of the training sample (feature vectors), specifically the support vectors which are sample points closest to the hyperplane (decision line).
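
Below is a minimal sketch (with made-up data, not taken from this issue) showing that the decision function is determined by the support vectors found during training, using scikit-learn:

from sklearn import svm

X = [[0, 0], [1, 1], [2, 2], [3, 3]]
Y = [0, 0, 1, 1]

clf = svm.SVC(kernel='linear')
clf.fit(X, Y)

print(clf.support_vectors_)             # training samples closest to the hyperplane
print(clf.decision_function([[1, 2]]))  # signed distance of a new sample to the hyperplane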

jeff1evesque commented 10 years ago

Requirements:

1.) Undergo dimensional reduction preprocessing.

2.) Define a training data set. Below, we outline how to proceed with training data: first with a two-class classification, and then with a multi-class classification using the one-against-one approach. One-against-one is the default approach for multi-class classification; however, there is also the one-vs-the-rest approach. For more information on one-vs-the-rest, refer to the scikit-learn documentation.

Prediction: using a two-class classification:

>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, Y)  
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.5, kernel='rbf', probability=False, shrinking=True, tol=0.001,
verbose=False)

X is an array holding training samples, sometimes called feature vectors. Each sample is a vector in the n_features-dimensional space, and we have n_samples of them. Therefore, X is a large array holding n_samples arrays, each of length n_features. We also have an array Y of integer values, of size n_samples, holding the class labels for the training samples. In this case, 0 is the class label for the sample [0, 0], and 1 is the class label for the sample [1, 1].

After being fitted, the model can then be used to predict new values:

>>> clf.predict([[2., 2.]])
array([ 1.])

The method predict() takes data in the same form as the array X we used in fit(), that is, an array of feature vectors. The predict() method predicts the class for each feature vector.

Note: the trailing periods in clf.predict([[2., 2.]]) cause the 2s to be treated as floats, rather than integers.

Note: in a two-class classification, also known as a binary classification, there is usually just one classifier (SVM). If the output passes a threshold (e.g. a decision value greater than 0, or a predicted probability greater than 0.5), then the sample is assigned to class 1. Otherwise, it is assigned to class 0.

Prediction: using a multi-class classification:

>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC()
>>> clf.fit(X, Y) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=1.0, kernel='rbf', probability=False, shrinking=True,
tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6

Note: Our above array X has a dimension of 4 x 1 (each vector has a length of one). However, the array X can have any dimension n x m, such that n, m > 0 (each vector has a length of m).

The line dec = clf.decision_function([[1]]) passes the sample point (i.e. feature vector) 1 to get the margins for it. Though we can pass any continuous value into this function, it is only useful to pass in values (sample points) from the training data set.

Note: to understand why we mentioned the plural term margins (rather than 'margin'), see the explanation below regarding the one-against-one approach.

The line dec.shape[1] returns the number of classifiers (SVMs) the training produced. Specifically, array.shape[0] returns the number of rows, array.shape[1] returns the number of columns, and array.shape returns a tuple with the size of the array. So, for example, array.shape would return (10, 5) for a 10 x 5 array.
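
As a quick illustration of these shape attributes (using an arbitrary array, purely for demonstration):

>>> import numpy as np
>>> arr = np.zeros((10, 5))   # a 10 x 5 array
>>> arr.shape
(10, 5)
>>> arr.shape[0]
10
>>> arr.shape[1]
5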

Alternatively, we can combine the following:

>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6

into one line, in order to return the six decision values for a single sample as follows:

>>> result = clf.decision_function(X)[1]
>>> result
array([-0.26383662, -0.73829377,  0.17362891, -0.63628993,  0.30002938,
        0.83268192])

Note: the six values in the above array() are illustrative placeholders (not actual output).

To determine a prediction, we can implement the command as we did earlier during the two-class classification:

>>> clf.predict([[3]])

The above is an example of multi-class training using the one-against-one approach. It essentially trains a separate SVM for each pair of classes. For example, with the four classes (0, 1, 2, 3) defined above, we would have the following pairs:

0v1, 0v2, 0v3, 1v2, 1v3, 2v3

During prediction, the feature set is run against each SVM, and a vote is tallied for the classification that best fits; a rough sketch of this idea follows below. The decision function returns the raw results of the SVMs, or specifically, the margin results (distance of the samples X to the separating hyperplane) of each of the SVMs that were trained.
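
The following is a rough, hypothetical sketch of the one-against-one voting described above (scikit-learn performs this internally; the snippet only re-creates the idea with separate binary SVMs, using the same X and Y as above):

from itertools import combinations
from collections import Counter
from sklearn import svm

X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
sample = [[3]]

votes = Counter()
for a, b in combinations(set(Y), 2):                   # 4*3/2 = 6 class pairs
    pair_X = [x for x, y in zip(X, Y) if y in (a, b)]  # samples of the two classes
    pair_Y = [y for y in Y if y in (a, b)]
    pair_clf = svm.SVC().fit(pair_X, pair_Y)           # one binary SVM per pair
    votes[pair_clf.predict(sample)[0]] += 1            # tally the winning class

print(votes.most_common(1)[0][0])                      # class with the most votes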

jeff1evesque commented 10 years ago

Other Modeling Techniques:

SVM models have similar functional form to neural networks and radial basis functions, both popular data mining techniques. However, neither of these algorithms has the well-founded theoretical approach to regularization that forms the basis of SVM. The quality of generalization and ease of training of SVM is far beyond the capacities of these more traditional methods.

SVM can model complex, real-world problems such as text and image classification, hand-writing recognition, and bioinformatics and biosequence analysis.

SVM performs well on data sets that have many attributes, even if there are very few cases on which to train the model. There is no upper limit on the number of attributes; the only constraints are those imposed by hardware. Traditional neural nets do not perform well under these circumstances.

jeff1evesque commented 10 years ago

Limitations (curse of dimensionality):

Dimension reduction can often be implemented using principal component analysis (PCA), to mitigate the curse of dimensionality.

IRC #machinelearning (08/14/14 ~ 11:30am EST):

jeffreylevesque: I want to test SVM. If i use weather data - http://w1.weather.gov/xml/current_obs/KTPA.xml, is that even enough information for useful classification?

wabash: Well, first thing is to understand 2 basic concepts (independent of math and theory and algorithms)
wabash: http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/

jeffreylevesque: Wabash, thank you. I will read that

wabash: Good. It's a well done explanation as to why your data must have an average distance, and why dimensionality needs exponentially increasing dataset sizes.

Note: Dimensional reduction is implemented before the classification has been defined within the SVM.

The dimensional reduction step is typically a separate step from SVM:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target
pca = PCA(n_components=2)
pca.fit(X)
X_reduced = pca.transform(X)
print("Reduced dataset shape:", X_reduced.shape)

Note: the above assumes scikit-learn is being used, instead of other packaged SVM frameworks (e.g. PyML).

IRC #scikit-learn (08/14/14 ~ 5:45pm EST):

jeffreylevesque: does SVM apply dimensional reduction?
jeffreylevesque: or, do i have to specify it

gcr: jeffreylevesque: you typically have to have a separate dimensionality reduction step
gcr: eg. with sklearn.decomposition.PCA or similar

jeffreylevesque: so, the reduction step is separate from SVM, and therefore needs to be done before computing the SVM 'classification'?

NelleV: jeffreylevesque: yes

jeffreylevesque: NelleV, I know it's dependent on data, and the projects defining scope. But, is it often one finds the need to complete 'dimensional reduction'?

gcr: jeffreylevesque: Yup! Sklearn also doesn't sphere/whiten/normalize the data before using it, which is often another necessary step for SVMs. See, eg. http://scikit-learn.org/stable/modules/feature_selection.html or http://scikit-learn.org/stable/modules/decomposition.html or http://scikit-learn.org/stable/modules/random_projection.html for more info on dimensionality reduction, and see http://scikit-learn.org/stable/modules/preprocessing
gcr: jeffreylevesque: However, if you prefer you can also build a pipeline that does everything at once (though you must still specify the steps explicitly); see http://scikit-learn.org/stable/modules/pipeline.html

IRC #machinelearning (08/14/14 ~ 6:10pm EST):

jeffreylevesque: pasky_: is it generally preferred to have completed 'dimensional reduction' before using SVM (or another learning model)? Is there any cases where it isn't required, but done so anyways - would that be considered bad practice?

wabash: SVM just has certain data requirements for it to work nicely without overfit.
wabash: It's bad practice if you needed to do it but neglected to do it.
wabash: It's fine practice if the data set is such that you don't need to do it.

jeffreylevesque: wabash, so, it's fine if I implement it for all cases then?

wabash: Yeah, it won't hurt unless you reduce the dim too far. So if you looked up PCA, you'll see that there's a point where you are throwing away too much information.
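
As a follow-up to gcr's suggestion above, the following is a minimal sketch (not from the IRC discussion itself) of a scikit-learn pipeline that chains normalization, PCA, and an SVC into a single estimator:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

pipe = Pipeline([
    ('scale', StandardScaler()),      # normalize features before the SVM
    ('reduce', PCA(n_components=2)),  # dimensionality reduction step
    ('svm', SVC(kernel='rbf')),       # the classifier itself
])
pipe.fit(X, y)
print(pipe.predict(X[:3]))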

jeff1evesque commented 10 years ago

Background: Regression Analysis

Suppose we are given a training set of N observations:

((x_1, y_1), . . . , (x_N, y_N)) with x_i ∈ R^d, y_i ∈ R

The regression problem is to estimate f(x) from this data such that:

y_i = f(x_i)

Regression is a method for fitting a curve (not necessarily a straight line) through a set of points using some goodness-of-fit criterion. The most common type of regression is linear regression.

The process of taking your data points and coming up with an equation is called "regression", and the graph of the "regression equation" is called "the regression line".

Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables. A model of the relationship is hypothesized, and estimates of the parameter values are used to develop an estimated regression equation. Various tests are then employed to determine if the model is satisfactory. If the model is deemed satisfactory, the estimated regression equation can be used to predict the value of the dependent variable given values for the independent variables.

Independent variables are characteristics that can be measured directly (for example, the area of a house). These variables are also called predictor variables (used to predict the dependent variable) or explanatory variables (used to explain the behavior of the dependent variable).

A dependent variable is a characteristic whose value depends on the values of the independent variables.
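
A minimal sketch of this workflow (with made-up numbers, purely illustrative), using scikit-learn's LinearRegression to estimate a regression equation and then predict the dependent variable:

from sklearn.linear_model import LinearRegression

X = [[1.0], [2.0], [3.0], [4.0]]       # independent variable (e.g. area of a house)
y = [10.0, 19.5, 31.0, 39.5]           # dependent variable

model = LinearRegression().fit(X, y)   # estimate the regression equation y ~ w*x + b
print(model.coef_, model.intercept_)   # estimated parameter values
print(model.predict([[5.0]]))          # predict the dependent variable for a new x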

We can never know for sure, but we can construct a range of values such that we are confident, at some definite level of probability, that the true value will fall within this range. This is called a “confidence interval”. A “confidence interval” always has a certain probability attached to it. A 90 percent interval is one for which we can be 90 percent confident that the interval brackets the true value.

Most statistical problems are defined in terms of loss functions in the sense that loss functions define what a “good” estimator or a “good” prediction is.

The choice of a loss function cannot be formalized as a solution of a mathematical decision problem in itself, because such a decision problem would require the specification of another loss function. Therefore, the choice of a loss function requires informal decisions, which necessarily have to be subjective, or at least contain subjective elements.

jeff1evesque commented 10 years ago

SVM: Regression Analysis

SVM uses an epsilon-insensitive loss function to solve regression problems. SVM regression tries to find a continuous function such that the maximum number of data points lie within the epsilon-wide insensitivity tube. Predictions falling within epsilon distance of the true target value are not interpreted as errors.

In the same way as with the classification approach, there is motivation to seek and optimize the generalization bounds given for regression. This relies on defining a loss function that ignores errors situated within a certain distance of the true value. This type of function is often called an epsilon-insensitive loss function. The figure below shows an example of a one-dimensional linear regression function with an epsilon-insensitive band. The slack variables measure the cost of the errors on the training points; these are zero for all points that are inside the band.

image

Another picture shows a similar situation, but for the non-linear regression case.

image

Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences. First of all, because the output is a real number, there are infinitely many possible values, so an exact prediction is unrealistic. In the case of regression, a margin of tolerance (epsilon) is therefore set around the approximating function, and the algorithm is correspondingly more involved. However, the main idea is always the same: minimize error by individualizing the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated.
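
A minimal sketch of SVR in scikit-learn (made-up data; the epsilon parameter sets the width of the insensitivity tube, and points inside it incur no loss):

from sklearn.svm import SVR

X = [[0.0], [1.0], [2.0], [3.0]]       # one-dimensional inputs
y = [0.1, 1.1, 1.9, 3.2]               # continuous targets

reg = SVR(kernel='rbf', epsilon=0.1)   # epsilon-insensitive loss
reg.fit(X, y)
print(reg.predict([[1.5]]))            # a continuous prediction, not a class label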

jeff1evesque commented 10 years ago

Loss Functions: Quick Summary

Loss functions define how prediction errors are measured and penalized, including the boundaries of how individual points are interpreted. For example, reference the ε-insensitive loss function defined below.

Loss Functions: SVM Regression

Suppose the data set:

regression-data-set

Loss functions are very important, and are typically substituted into the regression model of choice. Consider a linear regression:

regression-linear

with an ε-insensitive loss function:

regression-epsilon-loss-function

Short of proof, the ε-insensitive loss function coupled with constraints:

regression-constraints

produces the necessary w and b needed for the original general linear regression equation:

regression-epsilon-loss-function

Note: Non-linear regression utilizes loss functions coupled with constraints in a similar way as the above linear regression.

Loss Functions: SVM Classification

Loss functions (typically defaulting to the hinge loss function) are important components of the SVM optimization / learning algorithm. The derivations follow similarly to the earlier Loss Functions: SVM Regression section.

IRC #machinelearning (08/18/14 ~ 8:00pm EST):

jeffreylevesque: does SVM classification use loss functions as SVM regression does?

deadbeef: machine-head: of course
deadbeef: i mean, both have a loss criterion

Loss Functions: Types of Functions

Hinge loss: Hinge loss works well for its purposes in SVM as a classifier, since the more you violate the margin, the higher the penalty is. However, hinge loss is not well-suited for regression-based problems as a result of its one-sided error.

The hinge loss function is the following:

l(y, f(x)) = max(0, 1 − y · f(x))

Squared loss: is well suited for the purpose of regression problems. However, it suffers from a critical flaw. Outliers are punished very heavily by the squaring of the error. Therefore, it is recommended to filter for outliers first.

The square loss function is the following:

l(y, f(x)) = (y − f(x))^2

Absolute loss: is applicable to regression problems just like squared loss, and it avoids the problem of weighting outliers too strongly by scaling the loss only linearly (instead of quadratically) with the error amount.

The absolute loss function is the following:

l(y, f(x)) = |y − f(x)|

Epsilon-insensitive loss: this loss function is ideal when small amounts of error (for example, in noisy data) are acceptable. It is identical in behavior to the absolute loss function, except that any points within some selected range ε incur no error at all. This error-free margin makes the loss function an ideal candidate for support vector regression.

The epsilon-insensitive loss function is the following:

l(y, f(x)) = max(0, |y − f(x)| − ε)
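
A small Python sketch (not from the original issue) evaluating each of the above losses for a single prediction f(x) against a true value y, with ε chosen arbitrarily:

y, fx, eps = 1.0, 0.4, 0.25

hinge    = max(0.0, 1.0 - y * fx)        # classification: penalizes margin violations
squared  = (y - fx) ** 2                 # regression: punishes outliers heavily
absolute = abs(y - fx)                   # regression: linear in the error
eps_ins  = max(0.0, abs(y - fx) - eps)   # regression: zero inside the epsilon band

print(hinge, squared, absolute, eps_ins)
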
jeff1evesque commented 9 years ago

IRC #scikit-learn (02/08/15 ~ 7:00pm EST):

jeffreylevesque: i have data that i select from my sql database
jeffreylevesque: i need to restructure this data, so that I can pass it into an SVM algorithm

yankov: jeffreylevesque: try converting it to numpy array: np.asarray(your_list)

jeffreylevesque: yankov: i'm trying to understand "As other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array y of class labels (strings or integers), size [n_samples]:"
jeffreylevesque: from http://scikit-learn.org/stable/modules/svm.html

yankov: so you’d pass it a matrix for features, and a vector for labels

jeffreylevesque: is X the training samples. so, basically a list of list? each element, or sublist is a feature?

yankov: if it’s a list of list, then each element of the list is a training sample (another list). and each sample is another list consisting of features (columns).
yankov: it’s really easier to think of it as of matrix (2d dimensional array)
yankov: where each row is a training sample
yankov: and each column is a feature

jeffreylevesque: each sublist in X is equivalent to an observation, or a vector describing an dependent variable. And, each element of the sublist describes independent variables ?

yankov: each sublist in X is an individual observation
yankov: you pass dependent variables as a separate list
yankov: X - your observations
yankov: y - your labels

jeffreylevesque: say i have social security number as a label
jeffreylevesque: but, in Y, ssn would be an integer value

yankov: yeah you would put it in y

yankov: Y represent the dependent variables, something you trying to learn to predict
yankov: Y - is what you trying to predict, X - is feature you are using to predict Y

jeffreylevesque: ah, so i have: jeff, bob, jim

yankov: what are you trying to predict?

jeffreylevesque: in Y: [0, 1, 2], where jeff = 0, bob = 1, jim = 2
jeffreylevesque: is that a correct understanding of Y?

yankov: yes. if jeff, bob and jim is what you trying to predict
yankov: in other words you are trying to learn to map from X to y

jeffreylevesque: then an example of X: [ [23, 5, 2, 56000], [43, 4, 2, 120000], [32, 0, 3, 77000] ]
jeffreylevesque: where each observation: [ age, how many children, cars owned, salary ]
jeffreylevesque: is that a viable example of X?

yankov: yep

jeffreylevesque: when, i extract data, select from the database, it looks like:

(('dep-variable-1', 'indep-variable-1', 23.45), ('dep-variable-1', 'indep-variable-2', 98.01),
 ('dep-variable-1', 'indep-variable-3', 0.432), ('dep-variable-1', 'indep-variable-4', 325.0),
 ('dep-variable-1', 'indep-variable-5', 54.64), ('dep-variable-1', 'indep-variable-6', 0.002),
 ('dep-variable-1', 'indep-variable-7', 25.0), ('dep-variable-2', 'indep-variable-1', 24.32),
 ('dep-variable-2', 'indep-variable-2', 92.22), ('dep-variable-2', 'indep-variable-3', 0.356),
 ('dep-variable-2', 'indep-variable-4', 235.0), ('dep-variable-2', 'indep-variable-5', 64.45),
 ('dep-variable-2', 'indep-variable-6', 0.001), ('dep-variable-2', 'indep-variable-7', 31.0),
 ('dep-variable-3', 'indep-variable-1', 22.67), ('dep-variable-3', 'indep-variable-2', 101.21),
 ('dep-variable-3', 'indep-variable-3', 0.832), ('dep-variable-3', 'indep-variable-4', 427.0),
 ('dep-variable-3', 'indep-variable-5', 75.45), ('dep-variable-3', 'indep-variable-6', 0.002),
 ('dep-variable-3', 'indep-variable-7', 24.0))

jeffreylevesque: so, i need to reformat the data to what we discussed

yankov: you can just pass that thing to np.asarray
yankov: and you’ll get 2d array
yankov: and then you can split out the dep var
yankov: arr = np.asarray(that_list)
yankov: labels = arr[:, 0]
yankov: X = arr[:, 1:]
yankov: so Y = arr[:, 0]

jeffreylevesque: gotcha

yankov: y is just a common name for target (dependent) variable

jeffreylevesque: will i need a simple python structure to assign the numerical representation with the actual textual representation of the dependent / independent variables?
jeffreylevesque: i guess that's up to me

yankov: i believe in sklearn it will automatically encode labels for you, but you probably have to take care of categoricals in X yourself
yankov: you can use LabelEncoder in sklearn
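
The following is a hypothetical sketch tying the conversation above together: each row has the label first and the numeric features after (the long-format database rows shown earlier would first need to be pivoted into this shape), and scikit-learn encodes the string labels internally:

import numpy as np
from sklearn import svm

rows = [
    ('jeff', 23, 5, 2, 56000),    # [age, how many children, cars owned, salary]
    ('bob',  43, 4, 2, 120000),
    ('jim',  32, 0, 3, 77000),
]
arr = np.asarray(rows)
labels = arr[:, 0]                # dependent variable (Y), as strings
X = arr[:, 1:].astype(float)      # independent variables (features)

clf = svm.SVC()
clf.fit(X, labels)
print(clf.predict([[30, 1, 2, 80000]]))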

jeff1evesque commented 9 years ago

The following is an example of using LabelEncoder in sklearn:

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"]) 
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']