Closed by tmsimont 8 years ago
Hi, Trevor,
sorry that I went over that so briefly. The self.w_[0] is basically the bias unit. I simply included the bias unit in the weight vector, which makes the math easier, but on the other hand, it may make the code more confusing, as you mentioned.
Let's say we have a 3x2 dimensional dataset X (3 training samples with 2 features). Also, let's assume we have a weight of 2 for feature 1 and a weight of 3 for feature 2, and we set the bias unit to 4.
>>> import numpy as np
>>> bias = 4.
>>> X = np.array([[2., 3.],
...               [4., 5.],
...               [6., 7.]])
>>> w = np.array([bias, 2., 3.])
In order to match the mathematical notation, we would have to add a column of 1s to compute the dot product:
>>> ones = np.ones((X.shape[0], 1))
>>> X_with1 = np.hstack((ones, X))
>>> X_with1
array([[ 1.,  2.,  3.],
       [ 1.,  4.,  5.],
       [ 1.,  6.,  7.]])
>>> np.dot(X_with1, w)
array([ 17.,  27.,  37.])
However, I thought that adding a vector of 1s to the training array each time we want to make a prediction would be fairly inefficient. So, instead, we can just "add" the bias unit (w[0]) to the dot product (it's equivalent, since 1.0 * w_0 = w_0):
>>> np.dot(X, w[1:]) + w[0]
array([ 17., 27., 37.])
Maybe it is helpful to walk through the matrix-vector multiplication by hand. E.g.,
| 1 2 3 | | 4 | | 1*4 + 2*2 + 3*3 | | 17 |
| 1 4 5 | x | 2 | = | 1*4 + 4*2 + 5*3 | = | 27 |
| 1 6 7 | | 3 | | 1*4 + 6*2 + 7*3 | | 37 |
which is the same as
| 2 3 | | 4 | | 2*2 + 3*3 | | 13 + bias | | 17 |
| 4 5 | x | 2 | + bias = | 4*2 + 5*3 | + bias = | 23 + bias | = | 27 |
| 6 7 | | 3 | | 6*2 + 7*3 | | 33 + bias | | 37 |
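The equivalence of the two computations above can be verified quickly in NumPy, using the same toy numbers:

```python
import numpy as np

# Same toy data as above: 3 samples, 2 features, bias = 4
X = np.array([[2., 3.],
              [4., 5.],
              [6., 7.]])
w = np.array([4., 2., 3.])  # w[0] is the bias unit

# Approach 1: prepend a column of 1s and take one dot product
X_with1 = np.hstack((np.ones((X.shape[0], 1)), X))
out1 = np.dot(X_with1, w)

# Approach 2: dot product with w[1:], then add the bias w[0]
out2 = np.dot(X, w[1:]) + w[0]

print(out1)                  # [17. 27. 37.]
print(np.allclose(out1, out2))  # True
```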
Hope that helps!
Let me add the explanation as an additional note to the notebook and close this issue, but please feel free to add a comment.
PS: Now that you mention it: I wrote the softmax classifier (http://rasbt.github.io/mlxtend/user_guide/classifier/SoftmaxRegression/) with an explicit "bias", if that helps; there was definitely some trade-off I had to make in the book due to the publisher's page limitations.
Thanks for the quick response! It's amazing that the author of a popular book can respond to my question within an hour of my asking... What a time to be alive :)
I figured it had to be the bias. I've worked with ANNs in the past, but was surprised to see it in the code in Chapter 2, as the text had not yet discussed biases.
Glad to hear that it was helpful! Hm, I just looked it up now, and you are right: I didn't use the term "bias" explicitly but used the term "threshold" instead (I think "threshold" may be more intuitive than "bias" for someone who hasn't heard of these concepts, yet? -- on the other hand, I think "bias" may be more commonly used in literature ...). In any case, I think in the ideal case, I should have mentioned both :P (however, there was this annoying 20-page limit for that chapter :( ).
Hi, sorry for commenting on an old issue, but I'm having trouble understanding the part about "moving the θ to the left side of the equation, and defining weight-zero as -θ".
Since w0 gets adjusted with the error in each iteration, does that mean that, if we were to keep the θ where it is, it would also get adjusted with each iteration, so that θ = θ - Δw0?
Found the relevant thread in the mailing list: https://groups.google.com/forum/#!topic/python-machine-learning-reader-discussion-board/Yw4_PMc2RY4
No worries, @hdra
a)
If you don't bring the threshold to the left side, you have the following decision rule:
if net_input_val >= threshold -> classify as 1
else -> classify as -1
If you trained the classifier via this scheme, you would do the classification as
w1x1 + w2x2 + ... >= w0 -> classify as 1
b)
Now, if you bring the threshold to the left side:
if net_input_val-threshold >= 0 -> classify as 1
else -> classify as -1
and if your classifier learned it that way, you would do:
if w0 + w1x1 + w2x2... >= 0 -> classify as 1
(note that in this case, the classifier would have learned a negated w0, i.e., (-1)*w0, if you look at the weight vectors and compare a) and b).)
In practice, it wouldn't make any difference and I'd say that it's mostly a convention.
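As a quick sanity check, here is a minimal sketch (with made-up weights and a made-up threshold, not values from the book) showing that both conventions produce identical predictions:

```python
import numpy as np

X = np.array([[2., 3.],
              [4., 5.],
              [6., 7.]])
w = np.array([2., 3.])   # feature weights (illustrative values)
theta = 20.0             # threshold for scheme a)
w0 = -theta              # in scheme b), w0 is the negated threshold

# a) compare the net input against the threshold directly
pred_a = np.where(np.dot(X, w) >= theta, 1, -1)

# b) fold the threshold into the weight vector as w0 = -theta
pred_b = np.where(w0 + np.dot(X, w) >= 0, 1, -1)

print(pred_a)                        # [-1  1  1]
print(np.array_equal(pred_a, pred_b))  # True
```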
Hope that helps!
So, with the a) approach, the threshold would get adjusted on each iteration?
With both approaches, you would want to adjust the threshold / w0. It's just a matter of convenience/convention whether it's put on the left or right side of the equation.
Got it. Thanks for the clarification.
Glad to hear that it makes more sense now. I agree that it's a bit confusing, and I thought I should mention the "threshold" to get a feeling where the "w_0" parameter is coming from :).
That, I actually get. The explanation on page 20 (of the PDF book) is pretty clear 👍.
My confusion was because, for some reason, I wasn't able to connect the adjustment of w_0 to adjusting the threshold on each iteration. I thought the threshold/w_0 was a constant, and that the adjustment of w_0 was the result of the weight adjustments in general, hence the confusion.
Btw., another good context to think about would be linear regression. If you have no bias unit or threshold, the regression fit (line) will always pass through the coordinate origin (at x=0 & y=0; think of simple linear regression with x & y axes). So, if your data is not centered at 0, you probably won't fit the data well. Here, the threshold basically determines where the linear regression line cuts the y-axis.
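A minimal sketch of that point, using made-up toy data: forcing the regression line through the origin distorts the slope when the data isn't centered, while adding a bias/intercept term recovers the true line.

```python
import numpy as np

# Toy 1-D data that is clearly not centered at the origin
x = np.array([1., 2., 3., 4.])
y = 2. * x + 10.   # true slope 2, true intercept 10

# Without a bias/intercept: fit y = w*x (line forced through the origin)
w_no_bias = np.dot(x, y) / np.dot(x, x)

# With a bias: fit y = w*x + b via least squares on the design matrix [x, 1]
A = np.column_stack((x, np.ones_like(x)))
w_bias, b = np.linalg.lstsq(A, y, rcond=None)[0]

print(round(w_no_bias, 2))             # 5.33 -- slope distorted to compensate
print(round(w_bias, 2), round(b, 2))   # 2.0 10.0 -- true line recovered
```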
In chapter 2 you have some code for a simple perceptron model.
On page 27, you describe the code.
However, there is more than a simple vector product in the code: in addition to the dot product, there is an addition. The text does not mention anything about what this is:
+ self.w_[0]
Can you (or anyone) explain why that's there?
thanks, -trevor