liuzhuang13 / DenseNetCaffe

Caffe code for Densely Connected Convolutional Networks (DenseNets)

Batch normalization with or without learned offset #2

Closed (f0k closed this issue 8 years ago)

f0k commented 8 years ago

Nice paper! I just have a minor detail question for reimplementing it.

In https://github.com/liuzhuang13/DenseNetCaffe/blob/master/make_densenet.py#L8, you use:

scale = L.Scale(batch_norm, bias_term=False, ...)

This would correspond to batch normalization with learned gamma, but without beta. In https://github.com/liuzhuang13/DenseNet/blob/master/densenet.lua#L28, you use:

convFactory:add(cudnn.SpatialBatchNormalization(nChannels))

This includes a learnable beta. So I think the Caffe code needs to be adapted to match the Torch implementation.

On a side note, the convolutions (both in Caffe and Torch, if I see correctly) all have a bias term, but that will be rendered meaningless by the following batch normalization.
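Roughly, I'd expect the corrected composite to look something like this in the Caffe net spec (only an illustration of the two fixes; fillers and other parameters omitted, and the function name is mine):

from caffe import layers as L

def bn_relu_conv(bottom, num_output, kernel_size, pad):
    # BatchNorm collects the statistics; the Scale layer provides the learnable
    # gamma *and* beta (bias_term=True), matching Torch's SpatialBatchNormalization
    bn = L.BatchNorm(bottom)
    scale = L.Scale(bn, bias_term=True, in_place=True)
    relu = L.ReLU(scale, in_place=True)
    # no bias in the convolution: the batch norm of the next composite would cancel it
    conv = L.Convolution(relu, num_output=num_output, kernel_size=kernel_size,
                         pad=pad, bias_term=False)
    return conv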

liuzhuang13 commented 8 years ago

Yeah, the learnable beta should be there. I'll correct it, thanks for pointing it out.

In the Torch code, we followed fb.resnet.torch's initialization. If the cuDNN version is > 4.0, I think there's no bias; check this line: https://github.com/liuzhuang13/DenseNet/blob/master/densenet.lua#L87

f0k commented 8 years ago

https://github.com/liuzhuang13/DenseNet/blob/master/densenet.lua#L87

:+1: I didn't see this (I don't know Torch and didn't expect there to be a way to change this globally). You should probably still add bias_term=False in the Caffe model then.

I've written a Lasagne implementation for the CIFAR-10 experiments: https://github.com/Lasagne/Recipes/pull/84 I noticed that your 40-layer network has 41 layers with trainable weights (40 convolutions, one dense layer). And I'm getting more parameters than you (1.2M instead of 1.0M for the smallest network) -- maybe I'm doing something wrong. Let's see if I can reproduce the results. Cheers!

liuzhuang13 commented 8 years ago

Yeah, I already changed the bias_term in both conv and scale layer, thank you for pointing out bugs :)

Also, thank you for reimplementing it in Lasagne, it's exciting! But it's unclear to me why our 40-layer densenet has 41 layers. Each dense block has 12 layers, and there is one conv layer before entering the first dense block, two transition conv layers, and one final dense layer, so in total it's 12 * 3 + 1 + 2 + 1 = 40 layers.

I took a look at your code, probably your line 74 should be changed to (depth-4)//num_blocks. Note that transition layers have weight layers too!

Cheers!

f0k commented 8 years ago

Each dense block has 12 layers, and there is one conv layer before entering the first dense block, two transition conv layers, and one final dense layer

Ah, thanks, that was it! Mine had a transition layer after the third dense block. With this setup (and keeping 5000 examples for validation), I got 8.99% test error without data augmentation, and 5.25% with augmentation. This matches your result of 5.24% with augmentation, but not your 7% without augmentation. Will fix the architecture to remove the superfluous transition layer and re-run.

I took a look at your code, probably your line 74 should be changed to (depth-4)//num_blocks.

No, that 4 in your code is not actually a constant, it's 1 + num_blocks. That's why I changed it. My n = (depth - 1) // num_blocks is the number of layers of a dense block including the transition layer. I had missed the detail about not adding a transition after the last dense block.
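For reference, the corrected bookkeeping for depth 40 with three dense blocks (just the arithmetic; variable names are mine):

depth, num_blocks = 40, 3
# layers outside the dense blocks: 1 initial conv + 2 transitions + 1 dense layer = 4
layers_per_block = (depth - 4) // num_blocks                    # 12
total = 1 + num_blocks * layers_per_block + (num_blocks - 1) + 1
assert total == depth                                           # 1 + 36 + 2 + 1 = 40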

f0k commented 8 years ago

I already changed the bias_term in both conv and scale layer

Okay, I'll close this issue then. We can keep discussing here.

liuzhuang13 commented 8 years ago

Did you add dropout=0.2 when there's no data augmentation? 8.99% is similar to the result we got without dropout using L=40, k=12

f0k commented 8 years ago

Did you add dropout=0.2 when there's no data augmentation?

Yes.

/edit: No. I had planned for it, but it was missing in the bn_relu_conv() macro.

liuzhuang13 commented 8 years ago

Well, we'll see. BTW, how long does it take to train in your environment?

f0k commented 8 years ago

Without the extra transition layer, I've got 40 layers with weights and 1019722 trainable parameters (this does not include the running average statistics of the batch normalization layers, as they're not trained by backpropagation).

BTW, how long does it take to train in your environment?

It took 28h 44min on a Tesla K40c. When I allow Theano to immediately free buffers (allow_gc=1, this is the default anyway, I had disabled it for performance), the network also fits on a GTX 970 and will take about 17h 45min (with 2.75 GiB memory usage). We currently don't have any beefier cards around, unfortunately. So I can post new results tomorrow.

liuzhuang13 commented 8 years ago

The number of parameters is exactly the same as ours, so now the networks should be the same. Have you trained a ResNet with a similar accuracy under the same environment? If so, how's the training time? Because in our environment densenet(L=40, k=12) takes 7h only, and is comparable with ResNet. I'd like to know its relative training time in other platforms against ResNet. Thanks!

f0k commented 8 years ago

The number of parameters is exactly the same as ours, so now the networks should be the same

Great. The only remaining possible differences are the initialization and L2 decay, but it seems we get similar results, so either we're doing the same or it's not crucial.

Have you trained a ResNet with a similar accuracy under the same environment?

No, sorry.

Because in our environment densenet(L=40, k=12) takes 7h only

Which GPU is this?

I'd like to know its relative training time in other platforms against ResNet.

If you think about updating the paper, have a look at this nice talk: https://www.youtube.com/watch?v=xAoljeRJ3lU This is relevant for Figure 4 :)

liuzhuang13 commented 8 years ago

It's a TITAN X, with cuDNN v5.1. Using cuDNN gives a several-fold (3x-5x) speedup. Thanks for sharing the video, we'll look into it.

f0k commented 8 years ago

With the corrected architecture, I got the following results. With augmentation and no dropout:

Epoch 300/300, Batch 703/703 (100.00%) (took 3:32)
  training loss:        0.003841
  validation loss:      0.205178
  validation error:     5.35%
Final results:
  test loss:            0.222515
  test error:           5.54%

Without augmentation, with 20% dropout:

Epoch 300/300, Batch 703/703 (100.00%) (took 5:32) 
  training loss:        0.001126
  validation loss:      0.316466
  validation error:     8.65%
Final results:
  test loss:            0.333086
  test error:           9.43%

Both results are worse than before. Maybe the extra transition layer was helpful, although you didn't use it -- you might want to give it a try. The question is why the results are behind yours -- is it random fluctuation, do the missing 5000 training examples hurt (the ones I used for validation) or are there any important differences in our implementations? How much did your performance suffer when you used a validation set? Also note that the training loss is much lower than the validation loss. These are the pure categorical cross-entropies, without L2 norms. Was it the same for you?

liuzhuang13 commented 8 years ago

Both results are worse than before. Maybe the extra transition layer was helpful, although you didn't use it -- you might want to give it a try.

Thanks for your run! The purpose of a transition layer after the last dense block didn't seem very clear, so we didn't use it. I guess the performance gain from using it is due to the extra parameters, which could instead be turned into extra depth or width, so we'll leave the model as before. Thanks anyway!

How much did your performance suffer when you used a validation set?

Right, I guess the missing 5000 training examples may hurt. When we used a validation set, on C10+ we got about 0.3%-0.5% worse results. On C10 that should be more since C10 is much smaller than C10+ in terms of data size, so 5000 is not a negligible part.

Was it the same for you?

Yes, in our setting training loss (or error) usually goes near to 0, but the validation or test error doesn't. We didn't monitor the validation loss though, sorry.

f0k commented 8 years ago

On C10 that should be more since C10 is much smaller than C10+ in terms of data size, so 5000 is not a negligible part.

Without the validation set, on C10 I got:

Final results:
  test loss:            0.312979
  test error:           8.73%

That's still a bit far from your 7.00%. My first hunch would be to compare the initialization. I see that for the convolutions, you use:

local n = v.kW*v.kH*v.nOutputPlane
v.weight:normal(0,math.sqrt(2/n))

I have lasagne.init.HeNormal(gain='relu'), which does sqrt(2/fan_in), where fan_in is input_channels * kW * kH. This is also what you do in the Caffe implementation. So your Torch implementation will have quite a bit larger initial values (since there are far fewer output than input channels in the convolutions), and also similar ones for all layers in a dense block (since the number of output channels is constant, but the number of input channels increases). This could make a difference.
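To make the scale difference concrete, here's a small sketch with illustrative channel counts (growth rate 12 and 100 input channels; not the actual layer sizes):

import numpy as np

kW, kH, n_in, n_out = 3, 3, 100, 12
std_fan_in = np.sqrt(2.0 / (kW * kH * n_in))    # HeNormal(gain='relu') / Caffe "msra"
std_fan_out = np.sqrt(2.0 / (kW * kH * n_out))  # fb.resnet.torch style (nOutputPlane)
W_fan_in = np.random.normal(0, std_fan_in, (n_out, n_in, kH, kW))
W_fan_out = np.random.normal(0, std_fan_out, (n_out, n_in, kH, kW))
print(std_fan_in, std_fan_out)  # the fan-out std is considerably larger here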

I can fix that (you may want to fix that as well for the Caffe version), but I'd like to check the dense layer initialization as well. Can you guide me where that is defined in Torch?


By the way, you may also want to compare results to https://arxiv.org/abs/1608.02908. They achieve about the same results as your best model, with half the number of parameters, using the same data augmentation.

f0k commented 8 years ago

I'd like to check the dense layer initialization as well. Can you guide me where that is defined in Torch?

It seems this would be the one: https://github.com/torch/nn/blob/651103f/Linear.lua#L21-L43 It initializes weights and biases uniformly from -stdv to stdv (not Gaussian!), with stdv = 1./math.sqrt(self.weight:size(2)), but you later override the biases to be zero: https://github.com/liuzhuang13/DenseNet/blob/cbb6bff/densenet.lua#L108

According to https://github.com/torch/nn/blob/651103f/Linear.lua#L6, the weight shape is outputSize x inputSize, so self.weights:size(2) is the fan-in.
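In other words, roughly (a numpy sketch with an illustrative fan-in, not the actual Torch code):

import numpy as np

fan_in, n_classes = 448, 10                    # fan_in only illustrative
stdv = 1.0 / np.sqrt(fan_in)
W = np.random.uniform(-stdv, stdv, (n_classes, fan_in))  # outputSize x inputSize
b = np.zeros(n_classes)                        # biases later overridden to zero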

I will adapt my implementation to reproduce your initialization and run again.

liuzhuang13 commented 8 years ago

So your Torch implementation will have quite a bit larger initial values

Yes, thanks for pointing that out; in fact, we didn't notice the difference between nInputPlane and nOutputPlane. For initialization we simply copied the piece of code from here: https://github.com/facebook/fb.resnet.torch/blob/master/models/resnet.lua It's a bit strange for them to use nOutputPlane, since in the original paper (the "he"/"msra" init scheme) I believe they used nInputPlane, and most deep learning packages followed that.

Can you guide me where that is defined in Torch?

For the dense layer, since we didn't do anything explicitly for the weights (except setting the biases to zero), it's Torch's default init scheme, which seems to be the "xavier" scheme. You can take a look here: https://github.com/torch/nn/blob/master/Linear.lua

I saw your new comment, but I'll keep the above paragraph.

That's still a bit far from your 7.00%. My first hunch would be to compare the initialization.

But I guess initialization shouldn't hurt that much. There are two other implementations which used the "he"/"msra" init scheme (if my understanding is right) and got around 7% error on C10: https://github.com/t-hanya/chainer-DenseNet https://github.com/tdeboissiere/DeepLearningImplementations/tree/master/DenseNet

Thanks for your effort in this, can I see your training curve?

By the way, you may also want to compare results to https://arxiv.org/abs/1608.02908.

We only compared our architecture with some influential architectures. The purpose was to investigate the effect of the architecture while isolating other factors, not to "fine tune" basic network architectures. The paper you referred to is an improvement on ResNet. Through some optimization of DenseNet (e.g., using bottleneck structures and reducing the number of channels in transition layers, which we did after the publication on arXiv), we also achieved better results with the same number of parameters, but that's not our purpose. Thanks anyway!

f0k commented 8 years ago

But I guess initialization shouldn't hurt that much.

Before batch normalization, it did, but maybe not so much any more.

Thanks for your effort in this, Can I see your training curve?

Hmm, that's interesting, I hadn't visualized it before (training-curve plot attached). There was a bump a bit before the second learning rate decrease:

Epoch 205/300, Batch 781/781 (100.00%) (took 3:55)
  training loss:        0.002623
Epoch 206/300, Batch 781/781 (100.00%) (took 3:55)
  training loss:        0.002659
Epoch 207/300, Batch 781/781 (100.00%) (took 3:55)
  training loss:        0.002213
Epoch 208/300, Batch 781/781 (100.00%) (took 3:55)
  training loss:        0.005006
Epoch 209/300, Batch 781/781 (100.00%) (took 3:55)
  training loss:        0.033068
Epoch 210/300, Batch 781/781 (100.00%) (took 3:55)
  training loss:        0.024484
Epoch 211/300, Batch 781/781 (100.00%) (took 3:55)
  training loss:        0.018788
Epoch 212/300, Batch 781/781 (100.00%) (took 3:55)
  training loss:        0.012629
Epoch 213/300, Batch 781/781 (100.00%) (took 3:55)
  training loss:        0.009062
Epoch 214/300, Batch 781/781 (100.00%) (took 3:55)
  training loss:        0.010042
Epoch 215/300, Batch 781/781 (100.00%) (took 3:55)
  training loss:        0.006990

Could have been bad luck, and unimportant, but who knows.


https://github.com/tdeboissiere/DeepLearningImplementations/tree/master/DenseNet

Hmm, they specifically added L2 decay on the biases (I don't), and they don't do dropout after the very first convolution (I do). Comparing further, you also don't have a dropout layer after the initial convolution in your Torch implementation, but you do in Caffe. The chainer implementation also doesn't have this dropout layer. So this could be the culprit. I'll remove it. You should remove it from the Caffe implementation, too.

By the way, the paper says "we add a Dropout (Srivastava et al., 2014) layer after each convolutional layer".

/edit: Did you reproduce your results with your Caffe implementation?

liuzhuang13 commented 8 years ago

Could have been bad luck, and unimportant, but who knows.

It's normal for training loss to increase at the end of the 2nd stage, but such a sudden bump may be due to bad luck, and unimportant.

Hmm, they specifically added L2 decay on the biases (I don't)

By default, Torch treats biases like weights too, so there was bias decay in our original implementation.

By the way, the paper says "we add a Dropout (Srivastava et al., 2014) layer after each convolutional layer".

Thanks for pointing that out; we'll correct this in the next update of the paper.

Did you reproduce your results with your Caffe implementation?

Unfortunately not; I got 10% error in an initial try. I'll remove the first dropout too. Thanks

f0k commented 8 years ago

With the latest fixes, I get 9.33% on C10 and 7.13% on C10+. Mysterious.

f0k commented 8 years ago

Adding L2 decay for biases, I get 9.99% on C10 and 6.24% on C10+. (I don't necessarily attribute this to adding the L2 decay, maybe fluctuations are that large.)

Then I tried the Keras implementation: It is quite a bit slower (4:40 per epoch instead of 3:55), but eventually reaches 7.01% on C10. It has exactly the same number of trainable parameters, so the architecture should match.

Now the question is what the crucial difference in my Lasagne implementation is. I know Keras clips predictions with (1e-7, 1 - 1e-7) before applying the cross-entropy loss. It probably uses yet another initialization scheme. And its batch normalization is a little different. And it implements the dense blocks differently in how it does the concatenations (this shouldn't affect learning at all, though). None of this really strikes me as the culprit.
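For reference, the clipping I mean is roughly this (a plain numpy sketch, not Keras's actual code):

import numpy as np

def clipped_crossentropy(probs, targets, eps=1e-7):
    # probs: (batch, classes) softmax outputs; targets: integer class labels
    probs = np.clip(probs, eps, 1.0 - eps)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))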

liuzhuang13 commented 8 years ago

Yes, it seems he used a uniform weight distribution instead of a normal one. It is still the "he"/"msra" init scheme, so it uses nInputPlane instead of nOutputPlane. Strange for facebook to use nOutputPlane; I opened an issue to ask them about it: https://github.com/facebook/fb.resnet.torch/issues/106

BTW, what's the difference in batch norm layer between yours and his?

f0k commented 8 years ago

Strange for facebook to use nOutputPlane

But it seems this doesn't make a difference in practice.

BTW, what's the difference in batch norm layer between yours and his?

For inference, in Lasagne, we compute the exponential moving averages of the batch mean and batch inverse standard deviation, for compatibility with cuDNN. In Keras, they compute the exponential moving averages of the batch mean and batch variance. As far as I can see, it should be the same otherwise (i.e., we both share statistics across spatial locations for convolutional layers). I think this difference is not important.
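Roughly, the two schemes track these running statistics (a plain numpy sketch with an illustrative smoothing factor, not the actual library code):

import numpy as np

alpha, eps = 0.1, 1e-5                       # illustrative smoothing factor
x = np.random.randn(64, 16)                  # a batch of 64 samples, 16 features
batch_mean, batch_var = x.mean(axis=0), x.var(axis=0)

# Lasagne-style: moving averages of the mean and the inverse standard deviation
run_mean, run_inv_std = np.zeros(16), np.ones(16)
run_mean = (1 - alpha) * run_mean + alpha * batch_mean
run_inv_std = (1 - alpha) * run_inv_std + alpha / np.sqrt(batch_var + eps)

# Keras-style: moving averages of the mean and the variance
run_mean2, run_var = np.zeros(16), np.ones(16)
run_mean2 = (1 - alpha) * run_mean2 + alpha * batch_mean
run_var = (1 - alpha) * run_var + alpha * batch_var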

liuzhuang13 commented 8 years ago

I'll write down details I can think of in our Torch implementation; you can check them one by one, although some might seem trivial. Given that the parameter counts are exactly the same, I'll omit some details related to that.

  1. SGD with Nesterov momentum 0.9 and 0 dampening; learning rate 0.1 (epochs 0-150), 0.01 (epochs 150-225), 0.001 (epochs 225-300); no gradual learning rate decay (see the sketch below).
  2. Weight decay 1e-4, applied to all trainable parameters including batch norm's weight and bias.
  3. Batch norm has mean, std, weight and bias. Init scheme: weight (gamma) = 1, bias (beta) = 0.
  4. Dropout with drop rate 0.2 (only when there is no augmentation) after each convolution except the first.
  5. Preprocessing using channel-wise means and stds.
  6. Convolution layers don't have biases; the dense layer does.
  7. Batch size is 64.
  8. Batch norm layers aren't in-place operations, while ReLUs are.
  9. 2x2 non-overlapping average pooling in transition layers, 8x8 global average pooling before the dense layer.

I'll add more if I can think of anything. Good luck!
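A compact sketch of the schedule in point 1 (Python pseudocode; the function name is only illustrative):

def learning_rate(epoch, base_lr=0.1):
    # step schedule: 0.1 for epochs 0-150, 0.01 for 150-225, 0.001 for 225-300
    if epoch < 150:
        return base_lr
    elif epoch < 225:
        return base_lr * 0.1
    else:
        return base_lr * 0.01

# other settings from the list: Nesterov momentum 0.9, weight decay 1e-4 on all
# trainable parameters, batch size 64, 300 epochs in total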

liuzhuang13 commented 8 years ago

Do you have visualization tools that can draw the model like in the keras implementation? If so, would it be convenient for you to send me a diagram of the model you built? The same number of parameters might not indicate the models are the same.

Thanks

f0k commented 8 years ago

I'll write down details I can think of in our Torch implementation

Thank you, I've got all these details correct. So does your Caffe implementation. Did you try re-running it after fixing what we've found here?

The same number of parameters might not indicate the models are the same.

Right. Yes, I can try to visualize a two-block, two-layers-per-block network as in the Keras version.

liuzhuang13 commented 8 years ago

Did you try re-running it after fixing what we've found here?

No, but I can try training one now. It would be slow, maybe taking 2 days, since our Caffe is not built with cuDNN.

liuzhuang13 commented 8 years ago

If my understanding is right, the Caffe code I wrote doesn't normalize the input using channel-wise means and stds; it only uses pixel-wise means. I followed Caffe's official examples in doing this. But that shouldn't be a big difference.

f0k commented 8 years ago

If so, would it be convenient for you to send me a diagram of the model you built?

Here you go (model diagram attached):

That's with two dense blocks of two convolutional layers each, totalling a depth of 7. It should be the same as https://github.com/tdeboissiere/DeepLearningImplementations/blob/master/DenseNet/figures/densenet_archi.png, except that I have true global average pooling at the end and not fixed 8x8 pooling. Oh yes, and I concatenate incrementally, while the Keras implementation re-concatenates all previous layers of the current dense block before each convolution.

liuzhuang13 commented 8 years ago

The networks seem the same to me. I'm starting to train using the Caffe code; it takes about 2 days.

Another thing maybe worth trying now is to plot the test error as a function of epochs. We have a record of that too (attached; format: test error | training error | training loss, on C10). You can compare yours with ours or with the Keras implementation, to at least see where it starts to go wrong.

cifar10_300_1019722.txt

liuzhuang13 commented 8 years ago

The Caffe run is finished. It doesn't converge as fast as Torch, but the final accuracy is approximately the same (on C10, a 7.09% error rate). Here is the result from the log:

I0923 18:34:40.541993 10308 solver.cpp:244]     Train net output #0: Accuracy1 = 1
I0923 18:34:40.542006 10308 solver.cpp:244]     Train net output #1: SoftmaxWithLoss1 = 0.00586424 (* 1 = 0.00586424 loss)
I0923 18:34:40.542013 10308 sgd_solver.cpp:106] Iteration 229599, lr = 0.001
I0923 18:34:40.578860 10308 solver.cpp:337] Iteration 229600, Testing net (#0)
I0923 18:35:26.080590 10308 solver.cpp:404]     Test net output #0: Accuracy1 = 0.9291
I0923 18:35:26.081670 10308 solver.cpp:404]     Test net output #1: SoftmaxWithLoss1 = 0.301182 (* 1 = 0.301182 loss)

So indeed our Caffe version should be correct now; thank you for pointing out the mistakes.

f0k commented 8 years ago

So indeed our Caffe version should be correct now; thank you for pointing out the mistakes.

You're welcome! Unfortunately, the Lasagne implementation is still not there yet.

Another thing maybe worth trying now is to plot the test error as a function of epochs.

I ran again over the weekend, changing the initialization of the initial convolution from HeNormal to the Facebook one (I had missed this at first), and clipping predictions as done in Keras. I also monitored test set performance now (although it feels like cheating). These are the loss and error curves for the Keras implementation compared to the Lasagne implementation (plot attached). I didn't monitor accuracy on the training set, so that line is missing on the bottom right. I've got a separate line for the L2 loss, though (the red dashed line on the top right, scaled by 1e-4, as done in training). I think Keras monitors both at once, so I've added up cross-entropy and L2 loss for the blue and green curves in the top right plot.

Some interesting observations: Keras' curve is much more noisy, and seems to overfit less, when comparing the top two plots. Also, Lasagne's curve starts out much higher in the initial few epochs (it's cut off at the top). It looks a bit like Lasagne's learning rate was lower. (One of my initial guesses was that Keras scales the learning rate by (1-momentum), as some implementations do, but this would have had the opposite effect.) It's interesting that the loss is completely dominated by the L2 loss after the first learning rate drop. Do you have some other log to compare this to?

Train net output #1: SoftmaxWithLoss1 = 0.00586424

Do you know if this includes the L2 penalty?

liuzhuang13 commented 8 years ago

Do you have some other log to compare this to?

Did you compare with our Torch log? I uploaded it at this page

Do you know if this includes the L2 penalty?

I think this doesn't include the L2 penalty, since it says "SoftmaxWithLoss" and in Caffe weight decay is not a loss term. The loss shown is only the loss on a mini-batch. I'll read your post in more detail later.

f0k commented 8 years ago

Did you compare with our Torch log? I uploaded it at this page

Ah, I hadn't seen it. It's a bit difficult to compare, since we're missing some lines on my side and on yours (plot attached). About the only thing we can clearly see is that my training cross-entropy is noticeably lower.

liuzhuang13 commented 8 years ago

Using nOutputPlane was a mistake made by fb.resnet.torch, so you may use he_normal instead.

I feel the sudden increase at around epoch 225 is not a good thing. And after epoch 150 the learning seems to get stuck; my first instinct is to increase the initial learning rate so that after epoch 150 the net can still learn something. It is true that learning rate is computed differently in different packages; our Caffe training curve is more like the Keras one you showed (it converged much more slowly, with a higher test error rate before epoch 150, than our Torch training curve).

BTW, what's the weight decay you are using?

f0k commented 8 years ago

Using nOutputPlane was a mistake made by fb.resnet.torch, so you may use he_normal instead.

Yes, I've seen that. Good that you asked!

I feel the sudden increase at around epoch 225 is not a good thing.

Yes, that's weird. I wonder why this happened both in this log and in an earlier one (https://github.com/liuzhuang13/DenseNetCaffe/issues/2#issuecomment-247648993). I've started from different random initializations.

It is true that learning rate is computed differently in different packages

Although I'm quite sure it's the same between Keras and Lasagne.

BTW, what's the weight decay you are using?

All parameters of the network are squared, summed up, multiplied by 1e-4 and added to the cross-entropy loss. Theano then computes the gradient of the total loss (cross-entropy and L2 penalty) with respect to the parameters, and the gradient expressions are used to construct the updates with the given learning rate and Nesterov momentum.
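In rough Lasagne terms it's something like this (a sketch with a tiny stand-in network, not my actual training code):

import lasagne
import theano.tensor as T

input_var = T.tensor4('inputs')
targets = T.ivector('targets')

# a tiny stand-in network; the real DenseNet would be built here instead
network = lasagne.layers.InputLayer((None, 3, 32, 32), input_var)
network = lasagne.layers.DenseLayer(network, num_units=10,
                                    nonlinearity=lasagne.nonlinearities.softmax)

predictions = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(predictions, targets).mean()
# squared weights, summed up, scaled by 1e-4 and added to the loss
# (biases are not regularized by default here)
l2_penalty = 1e-4 * lasagne.regularization.regularize_network_params(
    network, lasagne.regularization.l2)
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(
    loss + l2_penalty, params, learning_rate=0.1, momentum=0.9)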

This has some implications for how the learning rate and momentum interact with the decay.

Do you know how it's implemented in Torch and Caffe? I'll check the Keras version.

f0k commented 8 years ago

Keras also adds the regularization loss to the loss minimized by whatever optimization method was chosen (https://github.com/fchollet/keras/blob/f2aa89f/keras/engine/training.py#L631-L633). The L2 loss is defined the same way as in Lasagne (except for one day in July where it was defined as the mean of squared parameters, https://github.com/fchollet/keras/commit/b7edcf6#diff-6e3349d3f432641cc969b702bcd3edafR82). So it seems this is not the explanation either.

robertomest commented 8 years ago

Hi, I've implemented DenseNet in Keras myself and have achieved results comparable to the paper (93.58% without data augmentation and 94.72% with it). My training is similar to what you showed in the sense that it is quite noisy and, during the second part of training (with 0.01 lr), it appears to overfit, with validation accuracy decreasing. When the learning rate is decreased again, though, I do get the improved accuracy and the apparent overfitting does not occur. I'm also using He normal initialization. My training was done using the TensorFlow backend (I'm guessing you are using Theano). For data augmentation I'm using Keras ImageGenerator and I'm shuffling samples after every epoch.

Quick question about Lasagne (since I'm not very familiar with it): when you change the learning rate, are you reinstantiating the optimizer? In Keras I use the backend to change the learning rate variable instead of instantiating a new optimizer, since doing so would reset the momentum. Maybe that could be a reason for the sudden increase in the loss around epoch 225 (when you changed the learning rate)?

You can see my implementation here if you'd like.

liuzhuang13 commented 8 years ago

Maybe that could be a reason for the sudden increase in the loss around epoch 225 (when you changed the learning rate)?

@robertomest If you take a closer look, the sudden increase is before the lr change, not after. So it is not due to the reinstantiation issue.

Also, could you please put your results (preferably training curves) into a README file and add some text, so I can add a link to your repo in my README?

f0k commented 8 years ago

Quick question about Lasagne (since I'm not very familiar with it): when you change the learning rate, are you reinstantiating the optimizer?

No, the learning rate is changed without affecting the momentum history.

Maybe that could be a reason for the sudden increase in the loss around epoch 225 (when you changed the learning rate)?

If you look closely, the bump actually appeared directly before decreasing the learning rate. Anyway, thanks for your input! I actually had the same suspicion at first.

liuzhuang13 commented 8 years ago

It is normal for the curve to go up at the end of the 2nd learning rate stage, but it never increased suddenly in our training. You can take a look at the training curve figure in our paper.

liuzhuang13 commented 8 years ago

@f0k

Here is the code for weight decay in Torch: https://github.com/torch/optim/blob/master/sgd.lua It seems to me that the learning rate and momentum also affect the weight decay.

f0k commented 8 years ago

It seems to me that the learning rate and momentum also affect the weight decay.

Yes, you're right: https://github.com/torch/optim/blob/63994c7/sgd.lua#L48 They directly add the weights times the decay factor to the gradient. This is equivalent to a loss of decay*0.5*weight^2. That's half the amount I have, but it doesn't seem to be crucial either (since the Keras implementation is identical to mine in that respect).
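A tiny numeric sketch of why the two formulations line up (illustrative numbers, a single weight):

import numpy as np

w, grad_data, decay = 0.5, 0.1, 1e-4         # a single weight and its data gradient

# Torch-style (optim/sgd.lua): add decay * w directly to the gradient
grad_torch = grad_data + decay * w

# loss-style with decay * 0.5 * w^2 added to the loss: its gradient is also decay * w
grad_half_l2 = grad_data + decay * w
assert np.isclose(grad_torch, grad_half_l2)

# adding decay * w^2 (no 0.5) to the loss instead gives a gradient of 2 * decay * w,
# i.e. twice the decay strength -- the factor-of-two difference mentioned above
grad_full_l2 = grad_data + 2 * decay * w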

Well, I'm slowly running out of ideas. Thanks a lot for all your patience, though! Since the Caffe implementation is now fixed and there are a lot of working reproductions already, I guess you can't really help much more. I'll still let you know if I make any progress.

liuzhuang13 commented 8 years ago

OK! Another possible debugging method: substitute the DenseNet you built with a well-tested ResNet model from Lasagne Recipes, and see whether the results are comparable with what they reported.

Or you can substitute the ResNet model in the well-tested ResNet code with the DenseNet model you built and see the results.

That way you can see whether the problem lies in the model or outside the model (e.g., data, training settings). Or even better, the DenseNet may just work with well-tested training code.

Maybe you already did that.

robertomest commented 8 years ago

@liuzhuang13 I added some info to the README.

@f0k I'm looking at your code and I can't see anything wrong in it. If you do make progress I'll be interested in hearing about it.

This discussion was insightful to me anyways (nice catch on the different implementation of weight decay on Keras/Lasagne vs Torch), so I thank you guys for that.

f0k commented 8 years ago

This discussion was insightful to me anyways (nice catch on the different implementation of weight decay on Keras/Lasagne vs Torch)

I also learned more about the details of the different packages than I had intended :) The reason I spent so much energy on that is that I'm afraid I missed some minor detail that has a large impact, and could be important to consider in other experiments.

f0k commented 8 years ago

At another attempt on C10+, I got 5.33%. The learning curves look like they should, with modest overfitting in the second learning phase and a drop in the third. Comparing to C10 (plot attached):

Continuing on C10, I switched to HeNormal initialization for all layers and ran again. And again, there's the bump shortly before the second learning rate drop. It's mysterious. Something seems to overshoot some limit. Maybe some weights are scaled down to exactly zero by L2 decay so they cannot be countered by batch normalization any longer, something like that. (When there's nothing else to learn, the loss will improve by simply downscaling all weights equally. Batch normalization will take care that the function computed by the network doesn't change. That's a free gain! Or am I missing something?) I'm curious to hear other theories.
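A quick numerical check of that argument (illustrative sizes, not my training code):

import numpy as np

def batchnorm(z, eps=1e-12):
    # per-feature batch normalization without gamma/beta
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.RandomState(0)
x = rng.randn(128, 32)                    # a batch of inputs
W = rng.randn(32, 16)                     # layer weights, no bias
out_full = batchnorm(x @ W)
out_scaled = batchnorm(x @ (0.01 * W))    # same weights, uniformly downscaled
print(np.allclose(out_full, out_scaled))  # True: the computed function is unchanged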

I then set the weight decay to 5e-5 to be compatible with Torch, and dumped the model every epoch so I could investigate. Lo and behold, the bump disappeared. But results are still around 8% (7.91% in the run with decay 1e-4, 8.74% in the run with decay 5e-5); plot attached.

And after epoch 150 the learning seems to get stuck; my first instinct is to increase the initial learning rate so that after epoch 150 the net can still learn something.

The training loss in my implementation is much lower than in the other implementations, all of the time. After the first learning rate drop, it's basically at zero (about 1e-3, compared to 4e-2 in Torch, and 2e-1 in Keras including the L2 penalty), and cannot improve further (it might already hit the prediction clipping to [1e-7, 1-1e-7] that I added to copy Keras, which would mean there's not even a gradient to follow). That's why it's stuck. I'm not sure increasing the learning rate would help -- only if it makes it more difficult for the network to learn the training data.

liuzhuang13 commented 8 years ago

Yes, the bump was mysterious, and it's hard to guess whether it's harmful to the final results.

From my experience, when the learning rate is too low the network will have a low loss at first, but its potential is much lower and the curve will basically be flat at the end. I think it's worth a try to increase the learning rate, say, to like 0.2 or even 0.5 at first.

Anyway 8% to 7% is not that large, compared to 9%; @robertomest even got 6.5%. It's expected to get different results with different packages. Hope your pull request will be accepted!

robertomest commented 8 years ago

That is very weird indeed. I find it interesting that the noise level is so much lower when compared to Keras curves. I think increasing the learning rate could give us some new information (maybe?). Maybe we could try and mix some Keras and Lasagne code to try to decouple the model definition and the training (use the Lasagne model with Keras optimizers or vice versa); maybe that would help us find out where the main difference is?

f0k commented 8 years ago

Anyway 8% to 7% is not that large

Sure, but it worries me that we reliably get 7% with other implementations (even one that is based on Theano as well). I want to know what's different.

I think it's worth a try to increase the learning rate, say, to like 0.2 or even 0.5 at first.

I will try. Or maybe there's something about the dropout or batch normalization that makes it easier for the network to overfit. But all implementations seem to use fully independent dropout, not spatial dropout; I've checked that already.

Hope your pull request will be accepted!

Oh, that's not the problem, I can merge it myself whenever I want :)

I find it interesting that the noise level is so much lower when compared to Keras curves.

Yes -- but compared to the Torch curves, they're quite similar (https://github.com/liuzhuang13/DenseNetCaffe/issues/2#issuecomment-249544221).

Maybe we could try and mix some Keras and Lasagne code

That's difficult because Keras abstracts away everything. I don't see how I could take the Theano expression created by Lasagne and train it with Keras, or vice versa. (Besides, the optimizer implementations look like they're the same.) I can just try to compare the expression graphs at the point they're getting compiled into Theano functions. Closely investigating the Keras version is probably the best way to solve this.