rasbt / python-machine-learning-book

The "Python Machine Learning (1st edition)" book code repository and info resource
MIT License

Implementation of AdalineSGD #60

Closed idavollen closed 6 years ago

idavollen commented 6 years ago

Hi,

First of all, thanks for your nice book, Python Machine Learning

I have just started reading it, and I am wondering about one thing in the implementation of AdalineSGD described in the book:

    def fit(self, X, y):

        self._initialize_weights(X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            if self.shuffle:
                X, y = self._shuffle(X, y)
            cost = []
            for xi, target in zip(X, y):
                cost.append(self._update_weights(xi, target))
            avg_cost = sum(cost) / len(y)
            self.cost_.append(avg_cost)
        return self

    def _update_weights(self, xi, target):
        """Apply Adaline learning rule to update the weights"""
        output = self.net_input(xi)
        error = (target - output)
        self.w_[1:] += self.eta * xi.dot(error)
        self.w_[0] += self.eta * error
        cost = 0.5 * error**2
        return cost        

I think the way self.w_[1:] is updated in AdalineSGD is in fact the same as in the batch AdalineGD implementation, just written differently:

            output = self.activation(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)

IMO, self.eta * X.T.dot(errors) operates on the entire matrix X in AdalineGD, whereas AdalineSGD operates row by row via a for-loop (for xi, target in zip(X, y)) over the same X. This doesn't reflect the essential difference between AdalineGD and AdalineSGD that you describe in the book.

rasbt commented 6 years ago

Thanks for the note. In the AdalineSGD version, the weights are updated after each training sample; in contrast, as you mentioned, the weights in AdalineGD are updated once per training epoch (looking at the full training set). This looks correct to me :P. In other words, the SGD variant computes the cost from a single training sample each time and updates the weights based on its (negative) cost gradient, whereas the GD version computes the cost from the complete dataset; thus, the cost gradient differs from the SGD variant's.

This doesn't reflect the essential difference between AdalineGD and AdalineSGD that you describe in the book.

Would you mind highlighting which other points you are referring to? :)

EDIT:

I think I know what you are referring to

IMO, self.eta * X.T.dot(errors) operates on the entire matrix X in AdalineGD, whereas AdalineSGD operates row by row via a for-loop (for xi, target in zip(X, y)) over the same X

It's not merely replacing the matrix multiplication with a for-loop, but doing the update after each training sample. If you were to use the for-loop but only update the weights after a complete pass over the training set, that would indeed be equivalent to self.w_[1:] += self.eta * X.T.dot(errors). Would that answer your question?
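
To make that concrete, here is a minimal standalone sketch (my illustration, not code from the book) contrasting the two schemes on a tiny dataset; note how the SGD variant makes its second prediction with an already-updated weight vector:

    import numpy as np

    X = np.array([[1.0, 2.0],
                  [2.0, 1.0]])
    y = np.array([1.0, -1.0])
    eta = 0.01

    # GD-style: accumulate the per-sample updates, apply once after the pass
    w_gd = np.zeros(1 + X.shape[1])
    delta = np.zeros_like(w_gd)
    for xi, target in zip(X, y):
        error = target - (np.dot(xi, w_gd[1:]) + w_gd[0])  # uses the *old* weights
        delta[1:] += eta * xi * error
        delta[0] += eta * error
    w_gd += delta  # single update per epoch

    # SGD-style: apply each update immediately
    w_sgd = np.zeros(1 + X.shape[1])
    for xi, target in zip(X, y):
        error = target - (np.dot(xi, w_sgd[1:]) + w_sgd[0])  # uses the *current* weights
        w_sgd[1:] += eta * xi * error
        w_sgd[0] += eta * error

    print(w_gd, w_sgd)  # the two weight vectors differ after one epoch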

idavollen commented 6 years ago

Thanks for the explanation! I was referring to the text at the bottom of page 42 of your book:

Instead of updating the weights based on the sum of the accumulated errors over all samples (batch GD, whose disadvantage becomes obvious with millions of training samples), we update the weights incrementally for each training sample

With those same millions of training samples, how would the advantage of Stochastic Gradient Descent manifest itself?

In the implementation of AdalineSGD on page 45, for xi, target in zip(X, y) also goes through all of those millions of samples. Therefore I am confused and can't see the advantage of SGD over GD.

I am also wondering whether it would be possible to pick a small random subset, e.g., 100 rows of those millions of samples, to update the weights, instead of looping over the entire dataset in AdalineSGD.

rasbt commented 6 years ago

Instead of updating the weights based on the sum of the accumulated errors over all samples (batch GD, whose disadvantage becomes obvious with millions of training samples) [...] we update the weights incrementally for each training sample

Basically, there are two advantages. One is that we update after each training sample (i.e., more updates), and the other is that the matrix in X.T.dot(errors) may not fit into memory, so incremental updates become necessary (or picture a scenario where you update your classifier as new training examples arrive at certain time intervals). While stochastic gradient descent is noisier, it can also help generalization performance (this is especially true for non-convex optimization problems, e.g., in multi-layer neural nets).

In the implementation of AdalineSGD on page 45, for xi, target in zip(X, y) also goes through all of those millions of samples. Therefore I am confused and can't see the advantage of SGD over GD.

You are right in the sense that it goes through the same training examples. However, the weights are updated after each training example instead of accumulating them:

        for xi, target in zip(X, y):
            cost.append(self._update_weights(xi, target))

If you update the weights after each training sample, the updates differ from the accumulated ones, because each prediction is made with the most recently updated weights.

I am also wondering whether it would be possible to pick a small random subset, e.g., 100 rows of those millions of samples, to update the weights, instead of looping over the entire dataset in AdalineSGD.

Oh yeah, you can certainly update the weights in larger steps -- that's called mini-batch learning. This is actually the standard approach in multilayer (deep) neural network training. (For instance, the MLP implementation in Ch12 updates the weights in steps of ~1000 training samples.)
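
For illustration, one epoch of mini-batch updates could look something like the sketch below (my own example; minibatch_epoch and the batch size of 100 are hypothetical, not part of the book's code):

    import numpy as np

    def minibatch_epoch(w, X, y, eta, batch_size=100):
        """One epoch of Adaline-style mini-batch updates (hypothetical helper)."""
        indices = np.random.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = indices[start:start + batch_size]
            X_b, y_b = X[batch], y[batch]
            errors = y_b - (np.dot(X_b, w[1:]) + w[0])
            w[1:] += eta * X_b.T.dot(errors)  # one vectorized update per mini-batch
            w[0] += eta * errors.sum()
        return w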

idavollen commented 6 years ago

Thanks again for spending your time answering my questions!

I am not totally convinced yet by your statements about the advantages of SGD. Both SGD and batch GD use the same fit API signature:

def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object

        """

From the point of view of the caller of fit(), no matter whether it is SGD or GD, the entire training dataset has to be loaded into memory before fit() is called, even in the scenario you pictured or with millions of training samples, although SGD then updates the weights incrementally with each sample.

If you update the weights after each training sample, the updates differ from the accumulated ones, because each prediction is made with the most recently updated weights.

I tried making some small changes to your AdalineGD implementation, computing both the accumulated per-sample update (incrementalWeight) and the vectorized update (deltaw) in the same place in fit(); incrementalWeight actually has the same value as deltaw in the console:

    self.w_ = np.zeros(1 + X.shape[1])
    self.cost_ = []

    for i in range(self.n_iter):
        net_input = self.net_input(X)
        # Please note that the "activation" method has no effect
        # in the code since it is simply an identity function. We
        # could write `output = self.net_input(X)` directly instead.
        # The purpose of the activation is more conceptual, i.e.,
        # in the case of logistic regression, we could change it to
        # a sigmoid function to implement a logistic regression classifier.
        incrementalWeight = np.zeros(X.shape[1])
        for xi, target in zip(X, y):
            output = self.net_input(xi)
            error = (target - output)
            incrementalWeight += self.eta * xi.dot(error)
        output = self.activation(X)
        errors = (y - output)
        deltaw = self.eta * X.T.dot(errors)
        print("sgd incrementalWeight = ", incrementalWeight, " and deltaw = ", deltaw)
        self.w_[1:] += deltaw
        self.w_[0] += self.eta * errors.sum()
        cost = (errors**2).sum() / 2.0
        self.cost_.append(cost)
    return self

I have just finished the first three chapters of your book, so I am probably making some stupid statements :)

rasbt commented 6 years ago

From the point of view of the caller of fit(), no matter whether it is SGD or GD, the entire training dataset has to be loaded into memory before fit() is called, even in the scenario you pictured or with millions of training samples, although SGD then updates the weights incrementally with each sample.

That's a good point. Here, the fit method is meant more for running SGD and comparing it to GD. However, the AdalineSGD version should also have a partial_fit method that doesn't require having the entire dataset in X. Basically, this would change the workflow to something like this:

for i in chunks:
    X_chunk, y_chunk = load_chunk(i)  # load some data from disk or so (load_chunk is a placeholder)
    adaline_sgd.partial_fit(X_chunk, y_chunk)
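
For reference, a partial_fit along those lines could look like the following sketch, reusing the _initialize_weights and _update_weights helpers from the AdalineSGD code quoted above (it assumes a w_initialized flag, set by _initialize_weights, to avoid resetting the weights between calls):

    def partial_fit(self, X, y):
        """Fit training data without reinitializing the weights"""
        if not self.w_initialized:
            self._initialize_weights(X.shape[1])
        if y.ravel().shape[0] > 1:
            # multiple samples in this chunk: one update per sample
            for xi, target in zip(X, y):
                self._update_weights(xi, target)
        else:
            # a single sample: update once
            self._update_weights(X, y)
        return self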

Regarding your implementation: it is expected that accumulating the weight updates produces the same results as the GD version that was implemented; it's essentially the same procedure, just not vectorized via matrix multiplications. Sorry, but I think the part you are missing is that SGD is not supposed to accumulate the weight updates but to apply them after each sample. Imagine the following scenario with a single weight, w=2.0.

Now, say you make a prediction on sample 1 and get some value val_1 for the weight update. Then you make a prediction on sample 2 and get some value val_2 for the next weight update. In GD, you would accumulate those weight updates and apply them once you finish the pass through the training dataset, so both predictions are made with w=2.0. In SGD, however, the weight is updated to w = 2.0 + val_1 immediately after sample 1, so the prediction on sample 2 (and hence val_2) is computed with the already-updated weight.

Hope that helps!

idavollen commented 6 years ago

I see, thank you!

I have another question concerning the implementation of GD and SGD (sorry for bothering you with so many questions): the constructor requires an n_iter together with eta, the learning rate. n_iter is essential to the learning result for a given eta; e.g., with a large learning rate eta=0.01, n_iter=1000 gives a worse result than n_iter=10. In practice, it's not easy to come up with a good value for n_iter. I am just wondering whether the learning algorithm could intelligently decide how many iterations to carry out, until a certain condition is met, instead of iterating the given n_iter times. I have a rough idea:

self.w_minus_ = self.w_ = np.zeros(1 + X.shape[1])
new_w = self.w_ + self.eta * X.T.dot(errors)
if (self.w_ - self.w_minus_) and (new_w - self.w_) point in the same direction:
    self.w_minus_ = self.w_
    self.w_ = new_w
    # continue iterating
else:
    # stop iterating the learning algorithm

Question 2, about the cost function: the Machine Learning course by Andrew Ng (at 4:21) defines a cost function that is slightly different from the one in your book, where you don't divide by m.

I am wondering whether dividing by m has any effect on the learning algorithm in some sense?

rasbt commented 6 years ago

I am just wondering whether the learning algorithm could intelligently decide how many iterations to carry out, until a certain condition is met, instead of iterating the given n_iter times. I have a rough idea:

Oh yeah, setting a fixed n_iter is actually a pretty naive thing to do. In practice, it is common to set a tolerance threshold and stop if, e.g., the cost doesn't change by more than this threshold between iterations or epochs. Another common approach (at least in deep learning applications) is to monitor the training and validation loss and stop when the validation performance gets worse (even though the training cost keeps going down) due to overfitting.
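
As a sketch of the tolerance idea (tol is a hypothetical parameter, and n_iter becomes an upper bound rather than a fixed count; this is not the book's implementation):

    import numpy as np

    def fit_until_converged(model, X, y, tol=1e-4):
        """Run per-sample updates until the average cost stops improving (sketch).

        Assumes an AdalineSGD-like model exposing _initialize_weights,
        _shuffle, _update_weights, shuffle, and n_iter, as in the code above.
        """
        model._initialize_weights(X.shape[1])
        model.cost_ = []
        prev_cost = np.inf
        for _ in range(model.n_iter):  # n_iter is now just an upper bound
            if model.shuffle:
                X, y = model._shuffle(X, y)
            cost = [model._update_weights(xi, target) for xi, target in zip(X, y)]
            avg_cost = sum(cost) / len(y)
            model.cost_.append(avg_cost)
            if abs(prev_cost - avg_cost) < tol:
                break  # cost change fell below the threshold: stop early
            prev_cost = avg_cost
        return model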

Question 2, about the cost function: the Machine Learning course by Andrew Ng (at 4:21) defines a cost function that is slightly different from the one in your book, where you don't divide by m.

Good point. Note that it shouldn't affect the overall results, since 1/n_samples is just a scaling factor applied to both the cost and the gradient. You can regard it as an extra computational step; however, I think using 1/n_samples (i.e., the MSE instead of the SSE) as the cost function has the nice property that results are more comparable between differently sized datasets (or let's say the cost values are less extreme).
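
A quick numerical illustration of that scaling (my own toy example, not from the book):

    import numpy as np

    errors = np.array([0.5, -1.0, 2.0])
    sse = (errors**2).sum() / 2.0   # cost as defined in the book (SSE)
    mse = sse / len(errors)         # extra 1/n_samples factor, as in Ng's course
    # The gradient picks up the same 1/n_samples factor, so using the MSE is
    # effectively the same as shrinking the learning rate by 1/n_samples.
    print(sse, mse)  # 2.625 0.875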

idavollen commented 6 years ago

1/n_samples also affects the effective learning rate; i.e., if 1/n_samples were used in AdalineGD, eta=0.001 would not have caused an overshoot.

rasbt commented 6 years ago

Oh yeah, of course -- you can think of your learning rate as being scaled by 1/n_samples, since the update becomes

    new_weight = old_weight - learning_rate * 1/n_samples * (z - y) * x

instead of

    new_weight = old_weight - learning_rate * (z - y) * x

idavollen commented 6 years ago

thanks a lot for your nice help!