rasbt / python-machine-learning-book

The "Python Machine Learning (1st edition)" book code repository and info resource
MIT License

Chapter 2: confusion between perceptron code and SGD code #14

Closed · VedAustin closed this issue 8 years ago

VedAustin commented 8 years ago

In the perceptron part of the code, I see:

for xi, target in zip(X, y):
  # the error term is a difference of two class labels, so it is -2, 0, or 2
  update = self.eta * (target - self.predict(xi))
  self.w_[1:] += update * xi  # update the feature weights
  self.w_[0] += update        # update the bias unit

In the SGD part I see something similar, except that the data is shuffled at the start of each epoch, before the weights are updated sample by sample:

X, y = self._shuffle(X, y)  # reshuffle the training data each epoch
for xi, target in zip(X, y):
  cost.append(self._update_weights(xi, target))

def _update_weights(self, xi, target):
  """Apply the Adaline learning rule to update the weights"""
  output = self.net_input(xi)
  error = (target - output)  # real-valued error, not a label difference
  self.w_[1:] += self.eta * xi.dot(error)
  self.w_[0] += self.eta * error

I do not see any difference between the two except for the shuffling part, and the fact that the perceptron's error is a difference of binary class labels while the SGD version's error is a real value. Did I misunderstand how fundamentally the weights are calculated for SGD versus the simple perceptron model? Of course, if there were a mini-batch implementation, the code would have looked a lot more like the adaptive linear neuron (Adaline) code. But since both take one sample at a time, they are implemented similarly?
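
For concreteness, here is a minimal standalone sketch of the two update steps being compared (NumPy is assumed; net_input, predict, eta, and the weight layout mirror the book's naming, but the weights and sample values here are made up for illustration):

import numpy as np

eta = 0.01                      # learning rate, as in the book's examples
w = np.array([0.0, 0.1, -0.2])  # w[0] is the bias unit; values made up

def net_input(xi):
    """Continuous output w^T x + bias (what Adaline's error uses)."""
    return np.dot(xi, w[1:]) + w[0]

def predict(xi):
    """Thresholded class label in {-1, 1} (what the perceptron's error uses)."""
    return np.where(net_input(xi) >= 0.0, 1, -1)

xi, target = np.array([1.0, 2.0]), 1

# Perceptron step: error is a difference of class labels, so it is -2, 0, or 2
perceptron_update = eta * (target - predict(xi))

# Adaline-SGD step: error is a difference of a label and a real number
adaline_update = eta * (target - net_input(xi))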

rasbt commented 8 years ago

@VedAustin You are absolutely correct: the differences between the "classic" perceptron algorithm and Adaline-SGD are a) the shuffling and b) the fact that in the perceptron, you take the difference between the true and predicted class labels (both 1 or -1), whereas in Adaline, you take the difference between the class label and a continuous output. (As a side note, I'd also add shuffling to the perceptron to prevent cycles.)

In any case, the idea looks very similar, but taking the difference between the class label (desired output) and the Adaline activation (actual output) makes all the difference. In the latter case, you have "more information": the continuous output tells you by how much your prediction is off, whereas in the perceptron, you only "ask" whether it is correct or not. That's one of the downsides of the perceptron: it stops updating as soon as everything is classified correctly. In practice, to achieve better generalization performance, you'd maybe want to "center" the decision boundary between two or more classes (e.g., the logistic regression or SVM algorithms in Chapter 3 help with that).
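
To make the "more information" point concrete, here is a tiny sketch with made-up numbers (eta and the weight layout follow the book's code, but the values are only for illustration):

import numpy as np

eta = 0.01
w = np.array([0.0, 0.5, 0.5])    # made-up weights; w[0] is the bias unit
xi, target = np.array([2.0, 3.0]), 1

net = np.dot(xi, w[1:]) + w[0]   # continuous output: 2.5

# Perceptron: the sample is already on the correct side, so the update is zero
perceptron_update = eta * (target - np.where(net >= 0.0, 1, -1))  # 0.0

# Adaline: the continuous error still pulls the output toward the label
adaline_update = eta * (target - net)  # 0.01 * (1 - 2.5) = -0.015

Even though the perceptron considers this sample "done," Adaline keeps nudging the weights so that the continuous output moves closer to the label, which is what lets it settle on a less arbitrary decision boundary.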

Hope that answers your question :)

VedAustin commented 8 years ago

@rasbt Thank you so much for the quick reply and for making it so clear. You have a gift for explaining things well!

rasbt commented 8 years ago

Thanks, I am glad to hear that it was helpful!