jiecaoyu / XNOR-Net-PyTorch

PyTorch Implementation of XNOR-Net

Question about 1e+9 #37

Closed GrimReaperSam closed 6 years ago

GrimReaperSam commented 6 years ago

Hi,

I found that in some networks the gradients are multiplied by 1e+9 and in others they are not. Do you have any idea why that is done? I assume the reason must be tiny gradients in the earlier layers. However, I don't understand how they arrived at the value 1e+9, and as it stands all the layers are multiplied by this value. Do you have any intuition about what is actually happening and how this number is obtained?

Best, Fayez

jiecaoyu commented 6 years ago

@GrimReaperSam 1e+9 is used for two reasons:

  1. Adam can suffer from a numerical-accuracy problem. (I am not sure whether this affects XNOR-Net. From my experience, I don't see any problem with Adam's accuracy, but it may need more experiments.)

  2. It mitigates the effect of weight decay: the weight-decay term is not scaled along with the gradient, so it is effectively divided by 1e+9. (I think this is the main reason why 1e+9 is necessary.) As I remember, values around 1e+3 or 1e+4 can also achieve the same effect; 1e+9 is just an extreme choice. A rough sketch of this interaction is below.
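Here is a minimal sketch (not the repo's exact code) of the idea. With Adam, the update is largely invariant to a uniform scaling of the gradient, but the weight-decay term that `torch.optim.Adam` adds to the gradient is not scaled, so its relative contribution shrinks by the same factor:

```python
import torch
import torch.nn.functional as F

GRAD_SCALE = 1e9  # the value used here; ~1e3-1e4 reportedly gives a similar effect

model = torch.nn.Linear(128, 10)  # stand-in for a binarized layer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

x = torch.randn(32, 128)
target = torch.randint(0, 10, (32,))

optimizer.zero_grad()
loss = F.cross_entropy(model(x), target)
loss.backward()

# Scale the gradients of the (binarized) weights before the optimizer step.
# The weight-decay term added inside Adam is now ~1e+9 times smaller relative
# to the gradient, i.e. weight decay is effectively divided by 1e+9.
for p in model.parameters():
    p.grad.data.mul_(GRAD_SCALE)

optimizer.step()
```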

GrimReaperSam commented 6 years ago

@jiecaoyu Does this change from network to network? What if I don't use weight decay? I tried without it and I get almost the same accuracy; there is a small drop, but I'm not sure how to figure out what the best value for this factor is. I'm trying to make some changes to the binarization scheme, which is why it's important for me to understand what is actually happening there.

Also, I don't think Adam should suffer here, because Adam adapts its step size when the weights aren't changing, so giving it this boost doesn't make much sense to me. Am I wrong here?

jiecaoyu commented 6 years ago

@GrimReaperSam We are using 32-bit floating-point numbers for computation. So, for Adam, when the weight updates become too small, the running averages of those updates cannot be stored accurately as 32-bit floats. That is what I meant by the "accuracy problem". Sorry for the confusion.
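A toy illustration of what this can look like (not from the repo): Adam's second-moment accumulator stores squared gradients, and squaring a tiny float32 gradient can underflow entirely, whereas the scaled gradient stays representable.

```python
import torch

g = torch.tensor(1e-25, dtype=torch.float32)   # a tiny per-step gradient
print((g * g).item())                          # 0.0 -- 1e-50 underflows in float32

g_scaled = g * 1e9                             # same gradient after the 1e+9 scaling
print((g_scaled * g_scaled).item())            # ~1e-32, still representable
```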

I think not using weight decay should give similar accuracy. You could also try setting the weight decay of the binarized layers to 0, for example with parameter groups (sketch below).
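A hedged sketch of that suggestion using PyTorch parameter groups; the module names here are hypothetical and only for illustration:

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({
    'binarized':      nn.Linear(128, 128),  # stands in for the binarized layers
    'full_precision': nn.Linear(128, 10),   # e.g. the first/last layers
})

# Zero weight decay for the binarized layers only.
optimizer = torch.optim.Adam([
    {'params': model['binarized'].parameters(),      'weight_decay': 0.0},
    {'params': model['full_precision'].parameters(), 'weight_decay': 1e-5},
], lr=1e-3)
```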

GrimReaperSam commented 6 years ago

Yep, I understand what you mean now! However, the gradients should be fine since there is a batch-norm layer everywhere, and the backward pass estimates the sign function with a HardTanh, so there should be neither exploding nor vanishing gradients. As of now, on CIFAR-100, I see only a 1% difference after 150 epochs (with vs. without 1e+9), and that could simply be noise. I'm using a VGG16 architecture without binarizing the first and last layers. It's very surprising.
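For reference, the straight-through estimator I mean is roughly the following (a minimal sketch, not necessarily this repo's exact implementation): the forward pass applies sign(), and the backward pass uses the HardTanh derivative, passing the gradient through unchanged where |input| <= 1 and zeroing it elsewhere.

```python
import torch

class BinActive(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.sign()

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input.abs() > 1] = 0   # HardTanh-style clipping
        return grad_input

x = torch.randn(4, requires_grad=True)
BinActive.apply(x).sum().backward()
print(x.grad)   # ones where |x| <= 1, zeros elsewhere
```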

Yeah, I'm training without weight decay; I've currently stripped the network down to a simple form.

I think this factor, as it stands, might be hiding some problem that existed at some point but is no longer needed ...