That's a great question! It took me a while to wrap my head around it too. Basically, think of a neuron as an abstract thing in between two PyTorch linear layers. In the diagram below, your linear layers can be seen as defining the edges.
It's very unintuitive considering how a network is actually implemented :)
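If it helps, here is the same idea in terms of PyTorch tensors (a toy example with made-up sizes; `fc1`/`fc2` are just illustrative names):

```python
import torch.nn as nn

# Two linear layers with 4 hidden neurons "in between" them.
fc1 = nn.Linear(3, 4)  # its rows define the incoming edges of the hidden neurons
fc2 = nn.Linear(4, 2)  # its columns define the outgoing edges of the hidden neurons

neuron = 1  # pick one hidden neuron

# Incoming weights of that neuron: one ROW of fc1.weight (shape [4, 3]).
incoming = fc1.weight[neuron, :]  # 3 weights, one per input feature

# Outgoing weights of that neuron: one COLUMN of fc2.weight (shape [2, 4]).
outgoing = fc2.weight[:, neuron]  # 2 weights, one per neuron in the next layer

print(incoming.shape, outgoing.shape)  # torch.Size([3]) torch.Size([2])
```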
Does this mean, for Layer 1 in this picture the incoming weights are w1 and outgoing weights are w2? If so, will w2 also be considered as the incoming weights for Layer N? Does this mean w1 will be reinitialized because they are the incoming weights of Layer 1, and w2 will also be reinitialized because they are the incoming weights of Layer N?
Does this mean, for Layer 1 in this picture the incoming weights are w1 and outgoing weights are w2?
Yes.
If so, will w2 also be considered as the incoming weights for Layer N?
Yes, $W_2$ will be the incoming weights for layer 2 and so on.
Does this mean w1 will be reinitialized because they are the incoming weights of Layer 1, and w2 will also be reinitialized because they are the incoming weights of Layer N?
As I understood and implemented it: Yes. Since the outgoing weights of layer N will be set to 0 in any case, it should not be an issue.
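For concreteness, this is roughly how I would write the reset for a single dormant neuron (a sketch based on my understanding, not necessarily the exact code in the repo; the function name, the init choice, and the bias handling are my own assumptions):

```python
import torch
import torch.nn as nn

def reset_dormant_neuron(layer_in: nn.Linear, layer_out: nn.Linear, idx: int) -> None:
    """Reinitialize the incoming weights of hidden neuron `idx` and zero its outgoing weights."""
    with torch.no_grad():
        # Incoming weights: row `idx` of the preceding layer's weight matrix.
        fresh = torch.empty_like(layer_in.weight)
        nn.init.kaiming_uniform_(fresh, a=5 ** 0.5)  # nn.Linear's default init scheme
        layer_in.weight[idx] = fresh[idx]
        if layer_in.bias is not None:
            layer_in.bias[idx] = 0.0  # assumption: reset the bias as well
        # Outgoing weights: column `idx` of the next layer's weight matrix.
        # Setting them to 0 keeps the network's output unchanged right after the reset.
        layer_out.weight[:, idx] = 0.0
```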
Will it affect learning if the weights are manually set to 0? Since the weights are updated by backpropagation, will they still be updated if they are masked, or will the gradients change because the values are set to 0? This confuses me a lot...
I get that the idea is to reinitialize the weights so they can be updated and activated again, while guaranteeing the output will not change due to the reinitialization. But I don't get the full picture of how to achieve it.
Thanks again.
Will it affect learning if the weights are manually set to 0?
It will affect learning, but not necessarily in a negative way.
Since the weights are updated by backpropagation, will they still be updated if they are masked, or will the gradients change because the values are set to 0?
Yes, they will still be updated because not all the weights in a layer are set to 0. There will still be some gradient flow through the layer, even after a reset. With each gradient step after the reset, more and more neurons will move away from 0 again.
At least, that's my understanding from debugging the code. I hope it helps. These are all really good questions, feel free to ask more!
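If you want to see it for yourself, here's a tiny toy check (my own example, not the repo's code): mask one neuron's outgoing weights and watch ordinary SGD move them away from 0 again.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.1)

with torch.no_grad():
    net[2].weight[:, 1] = 0.0  # "reset": zero the outgoing weights of hidden neuron 1

x, target = torch.randn(16, 3), torch.randn(16, 2)
for _ in range(5):
    opt.zero_grad()
    loss = (net(x) - target).pow(2).mean()
    loss.backward()
    opt.step()

print(net[2].weight[:, 1])  # generally non-zero again after a few updates
```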
Thanks for the reply! I have one more question. I'm curious if the incoming weights will be updated. For example, in the picture below, assume the neuron at the bottom of the third layer is dormant, i.e. its value is zero. So the 3 incoming weights are reinitialized and the outgoing weights are masked/set to 0 (if I understand correctly). I'm curious whether there are non-zero gradients backpropagated to those 3 incoming weights, since the outgoing weights are 0?
Hi @initial-h, I'm slow to reply at the moment due to the ICML rebuttals and some personal issues. If you have answered your question in the meantime, feel free to close the issue. Otherwise, I'll come back to it in a couple of weeks.
Sure, take your time and no hurry about it.
Let's look at it in terms of the last layer and the loss because that's what you highlighted.
The DQN loss is basically

$$\mathcal L(\mathbf w) = \Big(r + \gamma \max_{a'} Q(s', a'; \mathbf w^-) - Q(s, a; \mathbf w)\Big)^2.$$
Calculating the gradient w.r.t. $\mathbf w$ yields (writing the linear output layer as $Q(s, a; \mathbf w) = \mathbf w_a^\top \boldsymbol\phi(s)$, with $\boldsymbol\phi(s)$ the activations of the last hidden layer)

$$\nabla_{\mathbf w_a}\,\mathcal L = -2\,\Big(r + \gamma \max_{a'} Q(s', a'; \mathbf w^-) - Q(s, a; \mathbf w)\Big)\,\boldsymbol\phi(s),$$

i.e. the gradient is proportional to the activations $\boldsymbol\phi(s)$, not to the weights themselves.
So even if some of the weights are 0, the gradient of the last layer shouldn't be 0. Does that seem right?
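A quick numerical sanity check of that formula (toy numbers, single state-action pair, target treated as a constant):

```python
import torch

torch.manual_seed(0)
phi = torch.randn(8)                    # activations feeding the last layer
w = torch.zeros(8, requires_grad=True)  # last-layer weights for the chosen action, all 0
target = torch.tensor(1.5)              # stands in for r + gamma * max_a' Q(s', a'; w^-)

q = w @ phi
loss = (target - q) ** 2
loss.backward()

# Gradient matches -2 * (TD error) * phi and is non-zero even though w is all zeros.
print(torch.allclose(w.grad, -2 * (target - q).detach() * phi))  # True
```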
Yeah, that makes sense. Thank you!
In the paper, it says to reinitialize their incoming weights and zero out the outgoing weights. I'm confused, since in my mind each layer of the network is just a matrix. I'm wondering what the incoming weights and outgoing weights are. Could you give me some hints? Thanks a lot!