q-viper / q-viper.github.io

A repo for blogging about ML and related stuff.

2020/06/05/convolutional-neural-networks-from-scratch-on-python/ #27

Open utterances-bot opened 2 years ago

utterances-bot commented 2 years ago

Convolutional Neural Networks From Scratch on Python - Quassarian Viper

Contents

https://q-viper.github.io/2020/06/05/convolutional-neural-networks-from-scratch-on-python/

ashesh2512 commented 2 years ago

Hello, in the last line of the backpropagation algorithm in section 3.1.2.6, shouldn't the last line be layer.delta = layer.delta * layer.activation_dfn(layer.delta)? I think it is missing a multiplication sign.

q-viper commented 2 years ago

Hello @ashesh2512, I wrote this long ago, so I need to take some time to give the best answer. I think you are partially right. My sole goal was to make it work just as I did for the FFNN. But unlike an FFNN, it is quite hard to find the exact contribution of each input pixel to the output pixels. Now I think it should be something like:

self.error = np.dot(nx_layer.weights, nx_layer.delta)   # pull the error back through the next layer's weights
self.delta = self.error * self.activation_dfn(self.out) # scale by this layer's activation derivative
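For context, here is a minimal shape-check sketch of what these two lines compute for a dense layer (the sizes and the tanh choice are illustrative assumptions, not the blog's exact setup):

    import numpy as np

    # assume this layer has 4 units and the next layer has 3
    nx_weights = np.random.randn(4, 3)    # next layer's weights, shape (n_this, n_next)
    nx_delta = np.random.randn(3)         # next layer's delta, shape (n_next,)
    out = np.tanh(np.random.randn(4))     # this layer's activated output

    error = np.dot(nx_weights, nx_delta)  # error pulled back through the weights, shape (4,)
    delta = error * (1 - out ** 2)        # times tanh' written via the activated output
    print(delta.shape)                    # (4,)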

Please try your idea and observe the result, and let me know the outcome. I will reply here if I find a better explanation. Thank you.

ashesh2512 commented 2 years ago

Hi, I am trying to understand how the derivative of the activation function in the dropout layer evaluates to x. Shouldn't it be 1 for the values that were not dropped and 0 for the others? Additionally, if the dropout layer type is conv2d, why is self.delta = nx_layer.delta?

q-viper commented 2 years ago

Hello @ashesh2512,

  1. I returned x as the derivative of the activation in the sense that the dropout layer does not actually have an activation function; instead it has a mask that is responsible for forgetting certain inputs and error terms.
  2. The answer to your second question is similar to the first one: if the dropout layer sits before the Conv2D layer, it receives error values of the same size and values as that Conv2D layer. To find the error in this dropout layer, I used the mask instead of the whole chain equation, as sketched below. I do not think this is the right way to do it, but it works. I recommend you to follow this paper.
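A rough illustration of the mask idea (the names mask and prob here are assumptions, not necessarily the blog's exact attributes):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 3))              # incoming activations
    prob = 0.5                                   # dropout probability

    # forward pass: zero out the dropped activations with a binary mask
    mask = (rng.random(x.shape) >= prob).astype(x.dtype)
    out = x * mask

    # backward pass: the same mask zeroes the incoming error terms,
    # i.e. derivative 1 for kept units and 0 for dropped ones
    nx_delta = rng.standard_normal(x.shape)      # error from the next layer
    delta = nx_delta * mask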

ashesh2512 commented 2 years ago

@q-viper The paper link doesn't seem to open.

q-viper commented 2 years ago

@ashesh2512 check again please.

ashesh2512 commented 2 years ago

@q-viper The derivative of the activation function should be taken w.r.t. x, and hence should be 0 for masked values and 1 for unmasked values. Or am I missing something?

q-viper commented 2 years ago

@ashesh2512 Yes, you are right about the derivative; it seems I made a logical error. As part of backpropagation, I set all the delta terms to 0 wherever the output was also 0. I suggest you experiment a little with this part, but it does work neatly.

Aythami123 commented 1 year ago

Hello :)

Thanks for your code, which is really interesting and insightful. I think programming from scratch some of the classes and methods that one typically encounters in TensorFlow is a good thing to do (at least once) for anyone who actively works with neural nets. I wanted to point out a couple of things, and I will keep doing so as I read more and more of the code.

Here goes something significant: in your FFL class at the beginning, there are things missing in the activation methods. In particular, in the method activation_dfn, you are missing the linear activation, and, more importantly, you are calculating the derivative of tanh wrong. It should be:

if self.activation == 'tanh':
    r = self.activation_fn(r)
    return 1 - r ** 2

You are missing the second line! Also:

if self.activation == 'relu':
    # assign the 1s first so the zeroed negatives are not overwritten
    r[r >= 0] = 1
    r[r < 0] = 0

Your original code only zeroes the negative entries and never sets the rest to 1! Having these mistakes will completely ruin backprop.
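A quick finite-difference check (a standalone sketch, independent of your classes) confirms the corrected derivatives:

    import numpy as np

    def max_error(fn, dfn, x, eps=1e-6):
        # compare the analytic derivative against a central finite difference
        numeric = (fn(x + eps) - fn(x - eps)) / (2 * eps)
        return np.max(np.abs(dfn(x) - numeric))

    x = np.linspace(-3, 3, 100)  # grid chosen to avoid the ReLU kink at 0
    print(max_error(np.tanh, lambda x: 1 - np.tanh(x) ** 2, x))   # tiny (~1e-10)
    print(max_error(lambda x: np.maximum(x, 0),
                    lambda x: (x > 0).astype(float), x))          # tiny (~1e-10)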

Also, section 3.1.2.4 seems to be full of errors; I am not sure whether you included some old version by mistake(?). You are calculating the derivatives of tanh and sigmoid wrong (you forget to write the formulas in terms of self.activation_fn, as before), and you also forget to include relu. I have checked many of your methods carefully, and this is the only major thing I have found for now, but these are pretty important mistakes. With so many mistakes in the derivatives, backpropagation is going to be a mess.

Sorry for not leaving spacing after the ifs, but I don't really know how to do it. I rarely post here.

Aythami B.

q-viper commented 1 year ago

Hello @Aythami123, thank you for finding this blog. I admit that this is not a perfect 'from scratch' implementation.

For the first part about tanh, activation_dfn does not have to recompute the activation because the values passed into activation_dfn are already outputs of activation_fn itself.

  1. Inside apply_activation, the output of this layer is calculated and stored in self.out.

    def apply_activation(self, x):
        soma = np.dot(x, self.weights) + self.biases  # pre-activation (weighted sum plus bias)
        self.out = self.activation_fn(soma)           # activated output, cached for backprop
        return self.out
  2. While doing backpropagate, this stored output is passed to activation_dfn, which fulfills the purpose of using the activation function.

    def backpropagate(self, loss, update=True):
        # walk the layers from output to input
        for i in reversed(range(len(self.layers))):
            layer = self.layers[i]
            if layer == self.layers[-1]:
                # output layer: delta comes straight from the loss, with
                # activation_dfn applied to the cached activated output
                layer.error = loss
                layer.delta = layer.error * layer.activation_dfn(layer.out)
                layer.delta_weights += layer.delta * np.atleast_2d(layer.input).T
                layer.delta_biases += layer.delta
            else:
                # hidden layer: pull the error back from the layer after it
                nx_layer = self.layers[i + 1]
                layer.backpropagate(nx_layer)
            if update:
                # average the accumulated gradients over the batch
                layer.delta_weights /= self.batch_size
                layer.delta_biases /= self.batch_size
        if update:
            self.optimizer(self.layers)
            self.zerograd()
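To illustrate the convention in a standalone sketch (not the exact class from the blog): since self.out already holds activation_fn(soma), the derivative can be written directly in terms of the activated value:

    import numpy as np

    def activation_dfn(out, activation):
        # `out` is assumed to already be the activated output
        if activation == 'tanh':
            return 1 - out ** 2          # tanh'(z) = 1 - tanh(z)**2
        if activation == 'sigmoid':
            return out * (1 - out)       # sigmoid'(z) = s(z) * (1 - s(z))
        return np.ones_like(out)         # linear

    z = np.array([-1.0, 0.0, 2.0])
    out = np.tanh(z)
    print(np.allclose(activation_dfn(out, 'tanh'), 1 / np.cosh(z) ** 2))  # True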

For the derivative of relu, you are correct.

Thank you.

Aythami123 commented 1 year ago

@q-viper. Thanks a lot for your answer!

Regarding 1., what I found weird is that in the activation_dfn method of FFL, you passed r activated for sigmoid and softmax, but not for tanh. I see that you have now commented out the activation of r in sigmoid. For softmax, I could understand that you do it differently, since it only appears as the activation of the output layer and you can treat it separately. But in theory, sigmoid could be chosen as the activation of intermediate layers (even if it is typically not very good), so if you are not activating r for tanh, then you should not activate it for sigmoid either. Now that it is commented out in your code, it is fine, but I think yesterday it was there. Also, look at your formulas for the derivatives in 3.1.2.4: (iii) and (iv) are wrong. What does (iii) even have to do with the derivative of tanh?

Best regards,

Aythami B.

Aythami123 commented 1 year ago

Another thing I suggest you change: in the apply_activation method inside the Dropout class, you generate the elements that you are going to drop with random_indices = np.random.randint(0, len(flat), int(self.prob * len(flat))). Note that np.random.randint(low, high, size) generates n=size integers between low and high-1 with repetition, so there will typically be repeated integers. However, you do not want the number of distinct dropped elements to depend on the random draw; you want it to be fixed at exactly int(self.prob * len(flat)). Hence, make sure you generate the integers without repetition, as in the sketch below.
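A minimal sketch of the fix using np.random.choice with replace=False (flat and self.prob follow your naming; the array here is just a stand-in):

    import numpy as np

    flat = np.arange(1, 11, dtype=float)   # stand-in for the flattened activations
    prob = 0.3                             # dropout probability

    # sample exactly int(prob * len(flat)) distinct indices, without repetition
    random_indices = np.random.choice(len(flat), int(prob * len(flat)), replace=False)
    flat[random_indices] = 0
    print(np.count_nonzero(flat == 0))     # always exactly 3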

Best regards,

Aythami B.

q-viper commented 1 year ago

Hello @Aythami123, I updated the code today after reading your comments. Thank you for pointing out the errors I made; I have pushed a possible fix. You are welcome to contribute if you find more mistakes and want to fix them yourself. :)

Aythami123 commented 1 year ago

Hello again :).

I have been working a lot on the code and adapting certain things to my taste. It has been super instructive, so thanks again for this. I have several remarks to make, especially regarding backprop, but also other little things; I will probably write you an email at some point, since it can also be useful to you. I wanted to ask: did you use the code at https://github.com/ShivamShrirao/dnn_from_scratch as a reference, or did you code everything yourself, deriving everything by hand and using your own knowledge?

Anyway, I have one important question: why are you not initialising the optimiser variables, e.g. l.weights_momentum, l.weights_adam, etc., anywhere in your main code? I am guessing that at the beginning you need to set them to zero in all the trainable layers (at least once you specify your optimiser), then update them through the Optimizer class after backprop on a batch, and then reset them to zero again before you start backprop on the next batch? Could you also comment on the variables l.t, l.weights_adam1, and l.weights_adam2 in Adam? I am guessing again that they have to be initialised to zero. The optimisers part is the only one I haven't really read properly, so I will do some research on optimisers. But you never define those variables, so something is missing.

q-viper commented 1 year ago

Hello @Aythami123, I started by writing backpropagation all by myself, but I used the code from the repo you mentioned for the optimization techniques only.

About the initialization of those variables, please refer to the Optimizer class. On the last line of the following snippet, we call the chosen optimizer with training=False, which initializes those attributes. For resetting them, there is the zerograd method.

    def __init__(self, layers, name=None, learning_rate=0.01, mr=0.001):
        self.name = name
        self.learning_rate = learning_rate
        self.mr = mr
        keys = ["sgd", "iterative", "momentum", "rmsprop", "adagrad", "adam", "adamax", "adadelta"]
        values = [self.sgd, self.iterative, self.momentum, self.rmsprop, self.adagrad, self.adam, self.adamax, self.adadelta]
        self.opt_dict = {keys[i]: values[i] for i in range(len(keys))}
        if name is not None and name in keys:
            # calling with training=False only initializes the optimizer state
            self.opt_dict[name](layers=layers, training=False)
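As a rough sketch of the pattern that last call relies on (hypothetical names; the actual optimizer methods in the blog may differ), each optimizer method initializes its per-layer state when training=False and applies an update otherwise:

    import numpy as np

    # hypothetical sketch, not the blog's exact implementation
    def momentum(layers, learning_rate=0.01, beta=0.9, training=True):
        for l in layers:
            if not training:
                # first call (training=False): zeroed velocity buffers per layer
                l.weights_momentum = np.zeros_like(l.weights)
                l.biases_momentum = np.zeros_like(l.biases)
            else:
                # later calls: blend the running velocity with the new gradient
                l.weights_momentum = beta * l.weights_momentum + learning_rate * l.delta_weights
                l.weights -= l.weights_momentum
                l.biases_momentum = beta * l.biases_momentum + learning_rate * l.delta_biases
                l.biases -= l.biases_momentum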

Aythami123 commented 1 year ago

Thanks a lot for your explanation @q-viper. Now I get it and it makes sense.