iamtrask / Grokking-Deep-Learning

this repository accompanies the book "Grokking Deep Learning"

Chap8: the way of mini-batch gradient descent updating weights #31

Open · shenxiangzhuang opened this issue 4 years ago

shenxiangzhuang commented 4 years ago

In Chapter 8, the batch gradient descent code is confusing.

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end = (i * batch_size), ((i + 1) * batch_size)
        #...
        for k in range(batch_size):
            # ...
            # the weight updates run inside the per-sample loop, i.e. batch_size times per batch
            weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
            weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

In short, I think the code should update the weights only x times per iteration, where x equals the number of batches, rather than n times, where n equals the number of training samples.
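
A minimal sketch of what the corrected loop could look like (this is my own sketch, not the book's exact code; it assumes the Chapter 8 setup already defines images, labels, weights_0_1, weights_1_2, relu, relu2deriv, alpha, batch_size, and iterations). The inner per-sample loop only counts correct predictions, and the weights are updated once per batch:

import numpy as np

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end = (i * batch_size), ((i + 1) * batch_size)

        # forward pass over the whole batch at once
        layer_0 = images[batch_start:batch_end]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = np.dot(layer_1, weights_1_2)

        error += np.sum((labels[batch_start:batch_end] - layer_2) ** 2)

        # inner loop only does per-sample accuracy bookkeeping
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) ==
                               np.argmax(labels[batch_start+k:batch_start+k+1]))

        # one weight update per batch, outside the per-sample loop
        layer_2_delta = (labels[batch_start:batch_end] - layer_2) / batch_size
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        layer_1_delta *= dropout_mask
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)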

mikulatomas commented 4 years ago

Agree, the inner for loop should only be for calculating accuracy, not for updating weights. The way it is written is definitely not faster.

This is my version: https://github.com/mikulatomas/grokking-deep-learning/blob/master/mnist/mnist_batch_dropout_multi_layer_network.ipynb

Baltazar-Ortega commented 4 years ago

Yes, the for loop is only for calculating accuracy. Check out the last code example in Chapter 9; I think it's implemented well there.

DawnEve commented 4 years ago

Agree. When those five lines are moved out of the inner loop, it runs much faster than the previous version.

Before batch:
alpha=0.005  I:349  Train-Error:0.1502  Train-Correct:0.982  Test-Error:0.296  Test-Acc:0.8721  Time: 209.26

After batch:
alpha=0.1  I:349  Train-Error:0.2124  Train-Correct:0.953  Test-Error:0.285  Test-Acc:0.8777  Time: 46.89
alpha=0.5  I:349  Train-Error:0.1837  Train-Correct:0.962  Test-Error:0.301  Test-Acc:0.8675  Time: 46.24

jorgekoronis commented 3 years ago

Hello

With reference to this code snippet, why do they divide by batch_size on the following line?

    layer_2_delta = (labels[batch_start:batch_end] - layer_2) / batch_size

I do not really see the need for that division. Could anyone explain why it takes place at this point?

for k in range(batch_size):
    correct_cnt += int(np.argmax(layer_2[k:k+1]) ==
                       np.argmax(labels[batch_start+k:batch_start+k+1]))

    layer_2_delta = (labels[batch_start:batch_end] - layer_2) / batch_size
    layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
    layer_1_delta *= dropout_mask
    weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
    weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

ValeriiKim commented 3 years ago

> With reference to this code snippet, why do they divide by batch_size on the following line: layer_2_delta = (labels[batch_start:batch_end] - layer_2) / batch_size? I do not really see the need for that division. Could anyone explain why it takes place at this point?

Hello, maybe they divide by batch_size because the lines that compute the deltas and weight updates sit inside the inner loop, so roughly the same batch update gets applied batch_size times; the division keeps the total step comparable to a single averaged update. As was mentioned above, those five lines should be outside the inner loop.
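
One way to see it: layer_2 is cached from the batch forward pass, so (ignoring the small drift of weights_1_2 between the k iterations) essentially the same delta is applied batch_size times. A toy sketch with made-up numbers, just to illustrate that the /batch_size division roughly cancels the repeated application:

import numpy as np

batch_size = 8
alpha = 0.1
grad = np.array([[2.0, -1.0],
                 [0.5,  3.0]])   # stand-in for layer_1.T.dot(labels - layer_2)

# update applied batch_size times (inside the per-sample loop), with the division
w_inner = np.zeros_like(grad)
for _ in range(batch_size):
    w_inner += alpha * (grad / batch_size)

# update applied once per batch, without the division
w_once = alpha * grad

print(np.allclose(w_inner, w_once))   # True: both end up with the same total step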

AshishPandagre commented 3 years ago

> Yes, the for loop is only for calculating accuracy. Check out the last code example in Chapter 9; I think it's implemented well there.

Thanks a lot.