j2kun / svm-sequential-minimal-optimization

An implementation and blog post about the Sequential Minimal Optimization algorithm for Support Vector Machines

Grokking the problem representation #1

Closed. j2kun closed this issue 7 years ago.

j2kun commented 7 years ago

So I was reading the paper, and got annoyed on page 2

[screenshot: the margin/hyperplane setup from page 2 of Platt's paper]

It reminded me how much effort it takes to understand the whole margin thing. Does this make sense to you, why they set it up this way?

Since the method is so analytical, maybe a good place to start would be to "grok" (for the reader) the representation and the margin.

Maybe a nice demo (which I started to chip away at) would be to have some linearly separable points, and show the normal vector to the hyperplane, along with the distance to the closest point. Then there would be buttons to scale the normal vector so that the nearest point is "on the plane u=+/-1".

And, of course, the reader could drag the normal vector around and all of this stuff would update.

Let me know if you like the idea (we can skip it if not, and still use the code I wrote to make some diagrams; I won't mind!).

irpap commented 7 years ago

I had a look at your code, it's really cool :) It gives a random example each time. Having looked at this more, I think the demo is a great idea if we expect the reader to understand the formulas and what the symbols mean (or at least to come away with a better picture than the one in the paper).

I've read 3 pages and have a lot of questions! Some of them are probably too basic; I've forgotten a lot of math and have to look everything up.

Do you understand formula (1)? What is b? What is an "input vector"? Any vector? I had to look up the equation of the plane defined by a normal vector (here http://mathworld.wolfram.com/Plane.html) and that looks similar enough that it made some sense, except that the plane in that definition passes through a specific point.

Also I don't understand why "the nearest points lie on the planes u=+-1". What does that mean? That if you solve the equation w*x-b=1 you'll find another plane that goes through the nearest point on one side? How does this work mathematically? I understand that if u=0 it means the dot product is 0, thus the vectors are perpendicular, but what is the meaning of u=1 or -1? Also the nearest point could be anywhere, we have no information about it at this point, so why would this equation work?

About the margin equation: is ||w|| the length of the normal vector? Maybe that's something I'm missing; I didn't realise w had a particular length.


That's not related to this specific issue but I thought I'd type it here:

Reading the first few pages I had some general questions I wanted to look up (I'm not asking them to you necessarily, just noting them)

j2kun commented 7 years ago

Yeah, so formula (1) is the "typical" way to define what's called a hyperplane, which is an (n-1)-dimensional flat surface in n-dimensional space. In particular, let's say b=0 for now. If you are in n-dimensional space and you pick a unit vector w, you can say that w defines an (n-1)-dimensional plane by taking all vectors perpendicular to w (i.e., all x with dot product w*x = 0). In two dimensions it would be a line, in three a 2-dimensional plane, and so on.

For example, in 3 dimensions, if you have a vector w = (2,3,4), then the set of all vectors perpendicular to it are the solutions (x_1, x_2, x_3) to

2 * x_1 + 3 * x_2 + 4 * x_3 = 0

If you pick any two values for x_1 and x_2, then x_3 has to be chosen to make the equation hold, so you get those two "degrees of freedom" (dimensions) that define a plane. This formulation is kind of nice because it doesn't depend on the dimension of the underlying space. You just pick a single vector, and the hyperplane is whatever is perpendicular to it.
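Here's a tiny numpy sketch of exactly that check (the particular values for x_1 and x_2 are made up; any two numbers work):

```python
import numpy as np

w = np.array([2.0, 3.0, 4.0])

# pick any two values for x_1 and x_2, then solve 2*x_1 + 3*x_2 + 4*x_3 = 0 for x_3
x1, x2 = 1.0, -2.0
x3 = -(2 * x1 + 3 * x2) / 4
x = np.array([x1, x2, x3])

print(np.dot(w, x))  # 0.0, so x is perpendicular to w, i.e. it lies on the hyperplane
```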

Now let's call H_0 the hyperplane for w*x=0. If instead of zero you put some other value b (where b is a scalar), the solution to the equation w*x = b gives you a hyperplane that is parallel to H_0, and passes through the point bw (note that if w is a unit vector, w*(bw) = b(w*w) = b).

So the b part of the equation is just saying how you decide to translate the hyperplane away from the origin. If b is positive, you'll be translating it in the direction that w is pointing, otherwise you'll be translating it in the direction -w. In this way, the data needed to specify a hyperplane is a vector (the normal vector w) and a number b, and you can get every hyperplane by picking the appropriate data. In the code I wrote so far, I'm putting the origin in the center of the canvas, and ignoring b.
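If it helps, here's a quick numpy check of that claim (the value of b is arbitrary):

```python
import numpy as np

w = np.array([2.0, 3.0, 4.0])
w = w / np.linalg.norm(w)  # make w a unit vector
b = 1.5

point = b * w              # claim: the hyperplane w*x = b passes through this point
print(np.dot(w, point))    # 1.5, which is b, because w*(bw) = b(w*w) = b
```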

Platt calls the vector x an "input" because (I'm pretty sure) he's thinking of the hyperplane as a hypothesis in the machine learning sense. That is, if you give me a hyperplane H defined by (w, b), I can think of it as a function h mapping an input x to w*x - b. This output number is positive if x is on the same side of H as w is pointing, and the output is negative otherwise (and zero if it's exactly on the hyperplane). A picture helps with this (as does trying the algebra for a simple example):

[photo: hand-drawn sketch of a hyperplane, with w*x - b positive on the side w points toward and negative on the other]

The long and short of it is that if H is a hyperplane, and h(x) = w*x - b, then the classification label for which side of the hyperplane you're on is 1 if h(x) >= 0 else -1.

You can also compute the distance from a point x to the hyperplane by projecting x onto w: it's |x * w - b| / ||w||. (If w is a unit vector, as we've been assuming, the denominator is just 1; keeping the ||w|| makes the formula work for any w, which matters since Platt doesn't assume a unit vector.)
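Both of those are one-liners; here's a rough numpy sketch (the hyperplane and the test point are just made up for illustration):

```python
import numpy as np

def label(w, b, x):
    # +1 if x is on the side of the hyperplane that w points toward, else -1
    return 1 if np.dot(w, x) - b >= 0 else -1

def distance(w, b, x):
    # distance from x to the hyperplane {z : w*z - b = 0}; works for any nonzero w
    return abs(np.dot(w, x) - b) / np.linalg.norm(w)

w, b = np.array([1.0, 1.0]), 0.0
print(label(w, b, np.array([2.0, 1.0])))     # 1
print(distance(w, b, np.array([2.0, 1.0])))  # 2.121..., i.e. 3 / sqrt(2)
```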

However! If you believe me, then when Platt writes,

the margin is defined by the distance of the hyperplane to the nearest of the positive and negative examples

it seems like he then goes on to say that the closest points in the dataset always give h(x) = +/-1. There are two hidden assumptions that make this work. The first is that you can always assume (if you're successfully solving SVM) that your solution hyperplane is exactly halfway in between two points (otherwise it's not maximizing the margin). The second is that the hyperplane H defined by w only depends on the direction of w, so we get to pick the length of w freely. If we make w shorter, those parallel hyperplanes defined by w*x = +/-1 move farther away from H; if we make it longer, they move closer. (Maybe it would be cool to put this into our visualization?)

So if you want to ensure that the hyperplanes defined by w*x - b = +/-1 always include the closest data point on each side, you can just scale w longer or shorter until they hit them, as in the sketch below.
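In code that rescaling might look like this (the data points are hypothetical; note we scale b by the same factor so the hyperplane itself doesn't move):

```python
import numpy as np

def rescale(w, b, X):
    # scale (w, b) so the closest point in X satisfies |w*x - b| = 1;
    # dividing both w and b by the same number leaves the hyperplane unchanged
    closest = np.min(np.abs(X @ w - b))
    return w / closest, b / closest

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0]])  # made-up points
w, b = rescale(np.array([1.0, 1.0]), 0.0, X)
print(np.min(np.abs(X @ w - b)))  # 1.0
```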

This link has some more info.

The obvious question is why bother rescaling things. The answer is that it will make the formulation of the optimization problem easier. Instead of figuring out how to say "the closest point to the separating plane has to be as far away as possible" (you don't know, a priori, which point is the closest, so how can you come up with a formula for it! This statement has a min and a max in it), you just say "maximize a quantity depending only on w." So these tricks are all just to make the math easier.

Was that helpful?


what is QP optimisation?

QP stands for quadratic programming. It's an optimization problem where the objective (the thing you're trying to max/minimize) is a quadratic polynomial, and the constraints are all linear equations or inequalities. You can see in (3) that we're minimizing the square norm of that vector, which is a quadratic polynomial. A simpler version is linear programming (LP), where the objective to maximize/minimize is linear. In general, quadratic programs can't be solved efficiently (unless P = NP), but the SVM's objective is convex, and convex QPs are tractable. So the SVM formulation is quite special.
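Just to make the shape of the problem concrete, here's roughly what it looks like if you feed the hard-margin version straight to an off-the-shelf QP solver. This is only a sketch: it assumes you have cvxopt installed, the data points are made up, and the variable being optimized is z = (w, b).

```python
import numpy as np
from cvxopt import matrix, solvers

# made-up, linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# variable z = (w, b); objective (1/2)||w||^2, so b gets no quadratic term
P = np.zeros((d + 1, d + 1))
P[:d, :d] = np.eye(d)
q = np.zeros(d + 1)

# constraints y_i * (w . x_i - b) >= 1, rewritten in the solver's form G z <= h
G = np.hstack([-y[:, None] * X, y[:, None]])
h = -np.ones(n)

solvers.options['show_progress'] = False
sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))
w = np.array(sol['x'][:d]).flatten()
b = float(sol['x'][d])
print(w, b, np.abs(X @ w - b).min())  # closest points on each side give |w.x - b| = 1
```

(SMO, of course, is all about avoiding a generic QP solver, but a generic solver makes for a useful sanity check.)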

is this algorithm still relevant

I am pretty sure it's the only relevant algorithm! (But I don't know for sure.) It does indeed allow for the kernel trick, which shows up in equations 10 and 11.
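And in code the kernel trick is literally just swapping out the dot product for some other similarity function; e.g., a Gaussian/RBF kernel (the gamma value here is arbitrary):

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    # drop-in replacement for np.dot(x, z) in the dual formulation
    return np.exp(-gamma * np.sum((x - z) ** 2))
```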

I hadn't seen Karpathy's implementation before. Nice find! (Though the code is quite messy; we can do better :) )

j2kun commented 7 years ago

Also, chances are good I'm accidentally flipping my pluses and minuses from time to time.

irpap commented 7 years ago

That was super helpful! Very well explained. I almost understood it; I just need to reread it and try out a couple of examples. Thanks.

I am pretty sure it's the only relevant algorithm!

That makes it pretty motivating to learn!