rasbt / python-machine-learning-book

The "Python Machine Learning (1st edition)" book code repository and info resource
MIT License

Confusion in chapter 2 #6

Closed tmsimont closed 8 years ago

tmsimont commented 8 years ago

In chapter 2 you have some code for a simple perceptron model.

On page 27, you describe the code.

the net_input method simply calculates the vector product wᵀx

However, there is more than a simple vector product in the code:

def net_input(self, X):
    """Calculate net input"""
    return np.dot(X, self.w_[1:]) + self.w_[0]

In addition to the dot product, there is an addition. The text does not mention what this + self.w_[0] term is.

Can you (or anyone) explain why that's there?

thanks, -trevor

rasbt commented 8 years ago

Hi, Trevor, sorry that I went over that so briefly. The self.w_[0] is basically the bias unit. I simply included the bias unit in the weight vector, which makes the math easier, but on the other hand, as you noticed, it may make the code more confusing.

Let's say we have a 3x2 dimensional dataset X (3 training samples with 2 features). Also, let's just assume we have a weight 2 for feature 1 and a weight 3 for feature 2, and we set the bias unit to 4.

>>> import numpy as np
>>> bias = 4.
>>> X = np.array([[2., 3.], 
...              [4., 5.], 
...              [6., 7.]])
>>> w = np.array([bias, 2., 3.])

In order to match the mathematical notation, we would have to add a vector of 1s to compute the dot-product:

>>> ones = np.ones((X.shape[0], 1))
>>> X_with1 = np.hstack((ones, X))
>>> X_with1
array([[ 1.,  2.,  3.],
       [ 1.,  4.,  5.],
       [ 1.,  6.,  7.]])
>>> np.dot(X_with1, w)
array([ 17.,  27.,  37.])

However, I thought that adding a vector of 1s to the training array each time we want to make a prediction would be fairly inefficient. So, instead, we can just "add" the bias unit (w[0]) to the dot product (it's equivalent, since 1.0 * w_0 = w_0):

>>> np.dot(X, w[1:]) + w[0] 
array([ 17.,  27.,  37.])
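
As a quick sanity check (using the arrays defined above), the two formulations agree:

>>> np.allclose(np.dot(X_with1, w), np.dot(X, w[1:]) + w[0])
True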

Maybe it is helpful to walk through the matrix-vector multiplication by hand. E.g.,

| 1  2  3 |   | 4 |   | 1*4 + 2*2 + 3*3 |   | 17 |
| 1  4  5 | x | 2 | = | 1*4 + 4*2 + 5*3 | = | 27 |
| 1  6  7 |   | 3 |   | 1*4 + 6*2 + 7*3 |   | 37 |

which is the same as

| 2  3 |   | 2 |          | 2*2 + 3*3 |          | 13 + bias |   | 17 |
| 4  5 | x | 3 | + bias = | 4*2 + 5*3 | + bias = | 23 + bias | = | 27 |
| 6  7 |                  | 6*2 + 7*3 |          | 33 + bias |   | 37 |

Hope that helps!

Let me add the explanation as an additional note to the notebook and close this issue, but please feel free to add a comment.

PS: Now that you mention it: I wrote the softmax classifier (http://rasbt.github.io/mlxtend/user_guide/classifier/SoftmaxRegression/) with an explicit "bias" term, if that helps; there were definitely some trade-offs I had to make in the book due to the publisher's page limitations.

tmsimont commented 8 years ago

Thanks for the quick response! It's amazing that the author of a popular book can respond to my question within an hour of my asking... What a time to be alive :)

I figured it had to be the bias. I've worked with ANNs in the past, but was surprised to see it in the code in Chapter 2, as the text had not yet discussed biases.

rasbt commented 8 years ago

Glad to hear that it was helpful! Hm, I just looked it up now, and you are right: I didn't use the term "bias" explicitly but used the term "threshold" instead (I think "threshold" may be more intuitive than "bias" for someone who hasn't heard of these concepts yet? -- on the other hand, "bias" is probably more commonly used in the literature ...). In any case, ideally I should have mentioned both :P (however, there was this annoying 20-page limit for that chapter :( ).

[screenshot: the relevant passage from the book]

hdra commented 7 years ago

Hi, sorry for commenting on an old issue, but I'm having trouble understanding the part about "moving the θ to the left side of the equation, and defining weight zero as -θ".

Since w0 gets adjusted with the error in each iteration, does that mean that if we were to keep the θ where it is, it would also get adjusted with each iteration, so that θ = θ - Δw0?

hdra commented 7 years ago

Found the relevant thread in the mailing list: https://groups.google.com/forum/#!topic/python-machine-learning-reader-discussion-board/Yw4_PMc2RY4

rasbt commented 7 years ago

No worries, @hdra

a)

If you don't bring the threshold to the left side, you have the following decision rule:

if net_input_val >= threshold -> classify as 1
else -> classify as -1

If you trained the classifier via this scheme, you would do the classification as

w1x1 + w2x2 + ... >= w0 -> classify as 1

b)

Now, if you bring the threshold to the left side:

if net_input_val-threshold >= 0 -> classify as 1
else -> classify as -1

and if your classifier learned it that way, you would do:

if w0 + w1x1 + w2x2 + ... >= 0 -> classify as 1

(Note that in this case, the classifier would have learned a negated w0, i.e., (-1)*w0, if you were to look at the weight vectors and compare a) and b).)

In practice, it wouldn't make any difference and I'd say that it's mostly a convention.
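
To make the equivalence concrete, here is a minimal sketch of both decision rules; the function names predict_a/predict_b and the numbers are made up for illustration, reusing the earlier toy example with bias 4 (i.e., theta = -4):

import numpy as np

X = np.array([[2., 3.],
              [4., 5.],
              [6., 7.]])
w = np.array([2., 3.])   # feature weights
theta = -4.              # threshold, scheme a)
w0 = -theta              # equivalent bias unit, scheme b)

def predict_a(X, w, theta):
    # a) compare the net input against the threshold directly
    return np.where(np.dot(X, w) >= theta, 1, -1)

def predict_b(X, w, w0):
    # b) fold the negated threshold into the weights as w0
    return np.where(np.dot(X, w) + w0 >= 0, 1, -1)

print(predict_a(X, w, theta))  # [1 1 1]
print(predict_b(X, w, w0))     # [1 1 1]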

Hope that helps!

hdra commented 7 years ago

So, with the a) approach, the threshold would get adjusted on each iteration?

rasbt commented 7 years ago

With both approaches, you would want to adjust the threshold / w0. It's just a matter of convenience/convention whether it's put on the left or right side of the equation.
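
For concreteness, here is a minimal sketch of one perceptron update step under both conventions (eta, xi, target, and output are the usual perceptron quantities; the values are made up for illustration):

import numpy as np

eta = 0.1                    # learning rate (assumed value)
xi = np.array([2., 3.])      # one training example
target, output = 1, -1       # a misclassification, so an update occurs

w = np.array([4., 2., 3.])   # scheme b): w[0] is the bias unit (w0 = -theta)
theta = -w[0]                # scheme a): explicit threshold

update = eta * (target - output)

w[1:] += update * xi         # b): the feature weights move as usual ...
w[0] += update               # ... and w0 moves with every error

theta -= update              # a): the same learning step adjusts the threshold

print(w[0], -theta)          # 4.2 4.2 -- w0 and -theta stay in sync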

hdra commented 7 years ago

Got it. Thanks for the clarification.

rasbt commented 7 years ago

Glad to hear that it makes more sense now. I agree that it's a bit confusing, and I thought I should mention the "threshold" to give a feeling for where the "w_0" parameter comes from :).

hdra commented 7 years ago

That, I actually get. The explanation on page 20 (of the PDF book) is pretty clear 👍.

My confusion was that for some reason I wasn't able to connect the adjustment of w_0 to adjusting the threshold on each iteration. I thought the threshold/w_0 was a constant, and that the adjustment of w_0 was just part of the general weight update, hence the confusion.

rasbt commented 7 years ago

Btw., another good context to think about would be linear regression. If you have no bias unit or threshold, the regression line will always pass through the coordinate origin (at x=0 & y=0; think of simple linear regression with an x and y axis). So, if your data is not centered at 0, you probably won't fit it well. Here, the threshold basically determines where the regression line crosses the y-axis at x=0.
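
A small numpy sketch of that point (the data and numbers here are made up for illustration):

import numpy as np

# Toy data that is not centered at the origin: y = 2x + 5 plus noise
rng = np.random.RandomState(0)
x = np.linspace(0., 10., 50)
y = 2. * x + 5. + rng.normal(scale=0.5, size=x.shape)

# Fit WITHOUT an intercept: the line is forced through the origin,
# so the slope gets distorted to compensate (~2.7 instead of 2)
print(np.linalg.lstsq(x[:, np.newaxis], y, rcond=None)[0])

# Fit WITH an intercept: prepend a column of 1s, just like the
# bias trick in the perceptron example above (gives ~[5., 2.])
X1 = np.column_stack([np.ones_like(x), x])
print(np.linalg.lstsq(X1, y, rcond=None)[0])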