loshchil / SGDR


[L2 Regularization] Queries #4

Closed swghosh closed 4 years ago

swghosh commented 4 years ago

Hi,

The implementation shared in this repo uses lasagne's L2 regularization with a factor of 0.0005 for all the experiments. I assume this is inspired by the original WRN paper by Zagoruyko et al. Your SGDR paper (https://arxiv.org/abs/1608.03983) also states that the weight decay value should be 0.0005.

https://github.com/loshchil/SGDR/blob/5269a615448b93d6ab5926b4402eaaf1dafca230/SGDR_WRNs.py#L251-L252

As per your follow-up paper on Decoupled Weight Decay Regularization (https://arxiv.org/abs/1711.05101), I found that for SGD the L2 regularization factor should be rescaled by the learning rate.
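To make the distinction concrete, here is a minimal single-step sketch of the two update rules being compared (function names `sgd_l2_step` / `sgdw_step` are mine, not from the repo); for plain SGD the two coincide exactly when the decoupled decay equals `lr * l2`:

```python
def sgd_l2_step(w, grad, lr, l2=5e-4):
    # L2 regularization in the loss: its gradient component l2 * w is
    # scaled by the learning rate like every other gradient component.
    return [wi - lr * (gi + l2 * wi) for wi, gi in zip(w, grad)]

def sgdw_step(w, grad, lr, wd=5e-4):
    # Decoupled weight decay (SGDW): the decay is applied directly to
    # the weights, independently of the loss gradient.
    return [wi - lr * gi - wd * wi for wi, gi in zip(w, grad)]

# For plain SGD the two updates coincide exactly when wd == lr * l2:
w, g, lr = [1.0, -2.0], [0.1, 0.3], 0.05
a = sgd_l2_step(w, g, lr, l2=5e-4)
b = sgdw_step(w, g, lr, wd=lr * 5e-4)
assert all(abs(x - y) < 1e-12 for x, y in zip(a, b))
```

So for vanilla SGD the difference is only a reparameterization of the decay factor; it matters once the learning rate changes over time or an adaptive method is used.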

Can you please clarify something (though my question is framed in terms of a TensorFlow-based implementation):

Let me know if I can help you with any additional details on the internal TF implementations.

Thanks in advance.

loshchil commented 4 years ago

Hi,

The SGDR code you mention predates the decoupled weight decay paper and its SGDW and AdamW algorithms; the latter paper also includes SGDWR and AdamWR.
Sorry, I am not familiar with TensorFlow. My best guess is that the two functions you mention will not be equivalent, simply because the default L2 function (assuming they use some default implementation) has no notion that we would like to decouple the weight decay from the gradient.
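This non-equivalence is easiest to see with an adaptive method: with L2 in the loss, the decay term passes through the moment estimates and gets rescaled by the adaptive denominator, while AdamW applies it directly to the weights. A scalar, single-step sketch under common AdamW conventions (the decay applied as `lr * wd * w`; function names are mine):

```python
import math

def adam_l2_step(w, grad, m, v, lr=1e-3, l2=5e-4,
                 b1=0.9, b2=0.999, eps=1e-8, t=1):
    # Adam with L2 in the loss: the decay term l2 * w enters the
    # gradient and is rescaled by the adaptive denominator.
    g = grad + l2 * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, lr=1e-3, wd=5e-4,
               b1=0.9, b2=0.999, eps=1e-8, t=1):
    # AdamW: the decay bypasses the moment estimates entirely.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * wd * w, m, v

# Starting from the same state, the two updates already differ after
# one step; no rescaling of wd makes them coincide in general.
w1, _, _ = adam_l2_step(1.0, 0.1, 0.0, 0.0)
w2, _, _ = adamw_step(1.0, 0.1, 0.0, 0.0)
assert w1 != w2
```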

swghosh commented 4 years ago

Thanks for the quick reply.

Cutting out that similarity part! (my bad)

My actual question, for context: if I am trying to exactly replicate your SGDR paper only (for CIFAR-10 using WRN-28-10 with t_0=10, t_mul=2, a setting I chose myself), does adding only L2 regularization to all layers of the model suffice for the weight decay?

As per the lasagne implementation in this repo (and after reading the lasagne documentation), my understanding is that:

loshchil commented 4 years ago

Yes. Since the SGDR paper used the usual L2 regularization, you don't need to deal with the L2 vs. weight decay story; using L2 regularization should be fine.
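For completeness, the warm-restart schedule from the SGDR paper with the t_0=10, t_mul=2 setting discussed above can be sketched as follows (the eta_max / eta_min values are illustrative placeholders, not the paper's exact hyperparameters):

```python
import math

def sgdr_lr(epoch, eta_max=0.05, eta_min=0.0, t0=10, t_mult=2):
    # Locate the current restart cycle: cycle lengths grow as
    # t0, t0 * t_mult, t0 * t_mult**2, ...
    t_i, t_cur = t0, epoch
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    # Cosine annealing within the current cycle.
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

# Warm restarts occur at epochs 10, 30, 70, ...: the rate jumps back
# to eta_max and then anneals down over a cycle twice as long.
assert abs(sgdr_lr(0) - 0.05) < 1e-12
assert abs(sgdr_lr(10) - 0.05) < 1e-12
assert abs(sgdr_lr(30) - 0.05) < 1e-12
```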

swghosh commented 4 years ago

Thank you so much for the clarification. Much help! :)

loshchil commented 4 years ago

Thanks for your interest in the approach. Best, Ilya