vlfeat / matconvnet

MatConvNet: CNNs for MATLAB

Training AlexNet from scratch #306

Closed SergeyMilyaev closed 8 years ago

SergeyMilyaev commented 9 years ago

Hi,

Has anyone tried training AlexNet from scratch using MatConvNet? Using the cnn_imagenet script from the examples with batch normalization enabled, my results are far below the performance of imagenet-caffe-ref (the plot is attached).

Regards, Sergey

[Attachment: training/validation error plot]

vedaldi commented 9 years ago

Hi, there seems to be a very significant amount of overfitting, which is odd.

From the plot, the final top-1 error seems to be slightly above 45%, and I would guess you should be getting about 42-43%. All our machines are busy with CVPR at the moment, but I will try rerunning the training next week and see if I can reproduce this. There were some changes to batch normalization recently; while they should not have changed the behaviour of the function, it is possible that something went wrong.

AruniRC commented 9 years ago

Hi, I had tried training AlexNet from scratch using an older version of MatConvNet (beta-9) without batch normalization. I needed the 'f25' data augmentation setting to reach a 20% val-5 error in 68 epochs, which is close to the caffe-alex top-5 error of 19.8%. Perhaps you are not using enough data jittering at training time. Tuning the learning rates over more epochs should also help.
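For reference, here is roughly where the jittering mode is set in the cnn_imagenet example (a sketch assuming the beta-16 layout; the exact field names vary a little between releases):

```matlab
% In cnn_imagenet.m the batch options are derived from the model's
% normalization metadata; switching the default 'stretch' jittering
% to 'f25' (crops at 5x5 positions, plus horizontal flips):
bopts = net.meta.normalization ;
bopts.transformation = 'f25' ;    % 'none' | 'f5' | 'f25' | 'stretch'
bopts.numThreads = 12 ;

% cnn_imagenet_get_batch applies the chosen jittering when loading
% each training batch:
im = cnn_imagenet_get_batch(imagePaths, bopts) ;
```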

SergeyMilyaev commented 9 years ago

Thanks, @vedaldi. By the way, I used version 13; I will check the latest version 16 today. I also tried using dropout and batch normalization to reduce overfitting, but it didn't improve validation accuracy. @AruniRC, thanks for the details of your experiment. By default the training uses the 'stretch' type of data augmentation. In my opinion the learning rates seem fine, since the training error decreases well.

korkinof commented 9 years ago

I had a similar problem reproducing GoogLeNet with batch normalisation. I tried to follow our existing implementation in Torch, with the same data augmentation. To do so I had to fix several catastrophic bugs in dagnn (the Concat and BatchNorm layers, etc.). The only thing I changed was the learning rate, moving to a more gradual decay. I was consistently getting around 5% worse validation error than in Torch. After spending a full week on this I did not have time to investigate further and had to go back to our Torch development.

lenck commented 9 years ago

Hi @korkinof, sorry for the issues. Do you think you could share the bugs? It would really help, as the DAG interface is still quite young...

vedaldi commented 9 years ago

Hi, we definitely fixed a few bugs after v13. The first release of the DAG had a bug in backprop through the concat layers.

vedaldi commented 9 years ago

Hi, OK, I am running this again on our machines now (I found a spare GPU).

In earlier experiments I was getting 41.9/19.3 validation error with AlexNet with batch normalisation, so something seems to be off here.

SergeyMilyaev commented 8 years ago

Some updates from my side: I ran experiments with v16 and the curves didn't change (first plot attached). Following @AruniRC I tried the 'f25' data augmentation, but it didn't improve my results (second plot attached).

[Attachments: two training/validation error plots]

vedaldi commented 8 years ago

Hi, are you running with batch normalisation?

I just finished rerunning our network (alexnet) and I can confirm that everything works fine on our side. This is my final performance:

[20 epochs, batch of 512 for evaluation, centre crop] 524.4 Hz, objective: 1.797, top-1 error: 0.415

However, I have realised that there might be some confusion here. The validation error computed during training will in general be higher than the normal centre-crop validation error, because (1) the validation images are jittered with the same augmentation used for training instead of being reduced to a single centre crop, and (2) batch normalisation uses per-batch statistics rather than fixed moments.

We will likely address these two issues in a future release in order to avoid potential confusion. For the time being, however, I suggest validating your network using the cnn_imagenet_evaluate script (or similar). I updated it to work on the DAG in the `devel` branch.
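A rough usage sketch (the option names and paths here are assumptions following the usual opts/vl_argparse pattern of the example scripts; check the script header for the exact names):

```matlab
% Evaluate a trained model with the stand-alone evaluation script
% rather than relying on the training-time validation pass:
run matlab/vl_setupnn ;
cnn_imagenet_evaluate(...
  'dataDir', 'data/ILSVRC2012', ...                      % hypothetical paths
  'modelPath', 'data/alexnet-bnorm/net-epoch-20.mat') ;
```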

Note also that at present batch normalisation is not automatically removed for testing. So make sure to use large batches for evaluation, or do the extra step of removing bnorm from the network (this requires some legwork to compute the appropriate scaling factors). We are currently working on doing all this automatically.
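The "legwork" amounts to folding the batch-normalisation statistics into the preceding convolution. A minimal sketch for a simplenn model, assuming the per-channel moments mu and sigma have already been estimated from training batches (fold_bnorm is a hypothetical helper, not part of MatConvNet):

```matlab
% bnorm computes y = g .* (x - mu) ./ sigma + b, so the preceding
% convolution's filters and biases can absorb the affine transform:
function net = fold_bnorm(net, convIdx, bnormIdx, mu, sigma)
  g = net.layers{bnormIdx}.weights{1}(:) ;   % per-channel multipliers
  b = net.layers{bnormIdx}.weights{2}(:) ;   % per-channel offsets
  s = g ./ sigma(:) ;                        % combined scaling factors

  w = net.layers{convIdx}.weights{1} ;       % H x W x C x K filter bank
  c = net.layers{convIdx}.weights{2}(:) ;    % K biases
  for k = 1:numel(s)
    w(:,:,:,k) = w(:,:,:,k) * s(k) ;         % scale each output channel
  end
  net.layers{convIdx}.weights{1} = w ;
  net.layers{convIdx}.weights{2} = reshape((c - mu(:)) .* s + b, ...
    size(net.layers{convIdx}.weights{2})) ;

  net.layers(bnormIdx) = [] ;                % drop the bnorm layer
end
```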

I am also attaching the latest graph I obtained during training, for comparison. Notice that the validation error appears 3-4% worse than it really is, for the reasons explained above.

AruniRC commented 8 years ago

Hi @SergeyMilyaev ,

here's the attached training plot for AlexNet with 'f25' on MatConvNet beta-9 (without batch normalization). Further training would have reduced the val-5 error to below 20%, but I stopped here because it was close enough to the results on the MatConvNet webpage.

Have you tried training for a few more epochs, reducing the learning rate, the usual things? Your val-5 error is below 25%, so that 5% gap might be closed with some more training.

[Attachment: training plot]

nicjac commented 8 years ago

@vedaldi I am curious why the evaluation script calls the training functions rather than just doing a simple evaluation (e.g. through eval() on a DagNN). Is this related to bnorm?

SergeyMilyaev commented 8 years ago

Hi @AruniRC, thanks for the plots; I see that you have less overfitting than in my evaluation. @vedaldi, thanks. Yes, I'm using batch normalization. Could you share your error plots during training? Did you get "[20 epochs, batch of 512 for evaluation, centre crop] 524.4 Hz, objective: 1.797, top-1 error: 0.415" with cnn_imagenet_evaluate? Did you replace the bnorm blocks with the appropriate scaling, or just use a batch size of 512?

nicjac commented 8 years ago

@vedaldi @SergeyMilyaev I would also be interested to know how best to handle the bnorm blocks. I have been experiencing some weird issues when inferring class labels for unseen images.

kleinsound commented 8 years ago

Is it published anywhere what the output of cnn_imagenet should be out of the box, with no modifications? That would be helpful in diagnosing possible problems with the dataset itself, since the ImageNet dataset is taken as a given (but it must be obtained and preprocessed separately).

SergeyMilyaev commented 8 years ago

An update from my experiments: I started training AlexNet without batch normalization. I observe much less overfitting compared with my previous experiments (the plot is attached). The gap between the training and validation errors is similar to that in the plots @AruniRC presented in this discussion, so I expect to end up with similar performance.

[Attachment: training plot]

vedaldi commented 8 years ago

Hi, yes, overfitting is reduced, but so, it seems, is the final accuracy. AlexNet + batch normalisation should stabilise around 45% error on the validation data (before bypassing batch normalisation and using only centre crops, which should lower this further to 41-42%).

I have now been training SVMs and CNNs for years, and my conclusion is that even huge overfitting is not necessarily to be avoided at all costs. In fact, all that matters is that the validation/test error goes down, and the best results are often obtained when the training data is overfitted.

kleinsound commented 8 years ago

Here are my results with bnorm, GPU, 'f25', on version 16. I am getting these blow-ups with bnorm off as well.

[Attachment: net-train.pdf]

SergeyMilyaev commented 8 years ago

Hi @vedaldi, regarding "(before bypassing batch normalisation and using only centre crops, which should further lower this down to 41-42%)": as I wrote on November 6th, with cnn_imagenet_evaluate (which uses a single centred crop, as I understand from the code; I didn't change the bnorm blocks and used a batch size of 512) I did not observe the validation error decreasing to 41-42% compared with the random-crop validation results during training shown in my plots. For the caffe-ref net this script reports the same error as the model performance table at http://www.vlfeat.org/matconvnet/pretrained/, and that error is also close to the final error in the validation curves during training provided by @AruniRC. @vedaldi, how did you bypass batch normalisation in your evaluation?

AruniRC commented 8 years ago

Hi, as an addendum, I am attaching the training curve for VGG-16 trained from scratch on ImageNet, obtained just by running the sample script provided in MatConvNet beta-16 out of the box.

I used batch normalization, 'f25' data augmentation, and the 'xavierimproved' initialization (no RGB jittering: it gives only about a 1% improvement, as noted in the original AlexNet ImageNet paper).

Update: training completed to 20 epochs. My validation results are top-1: 32.92, top-5: 12.59.

These are worse than the reported very-deep-16 results of top-1: 28.8, top-5: 10.1.


Older post: currently at the 18th epoch (out of the total of 20 set in the sample code), my val error is about 3-4% worse than reported on the MatConvNet webpage (val-5 of 10.1%): http://www.vlfeat.org/matconvnet/pretrained/#imagenet-ilsvrc-classification. This also seems to be the case with training AlexNet from scratch, which @SergeyMilyaev posted about.

The small difference could be a result of the validation error being computed with the 'f25' cropping and with batch normalization applied (as @vedaldi mentioned with regard to the AlexNet results). I will run it again with centre crops and test-time batch normalization and update the results.

[Attachment: training plot]

vedaldi commented 8 years ago

Hi, do you mean “higher”?

Unfortunately the VD models are quite difficult to train. The original models, trained by Simonyan for the ImageNet challenge, were obtained in several steps, starting from shallower models and gradually adding intermediate layers, and took weeks to produce. By contrast, the example in MatConvNet just applies the standard procedure used for all the other models.

We are now running a series of experiments to see whether we can train these models in "one go", just like the others.

vedaldi commented 8 years ago

Hi, I did not, in that evaluation. I simply set a fairly large batch size (512 images) so that the effect of the random fluctuations induced by batch normalization would be small enough to still give good performance.

At any rate, we have just finished coding the new version of MatConvNet, which solves this issue by learning the batch-normalisation moments together with the other parameters and applying them at test time. We are testing it by training a few standard models and should be able to release it soon. There is a preview in the development branch if you are adventurous.
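A minimal sketch of what test-mode evaluation then looks like, assuming the devel-branch behaviour described above; the variable names 'input' and 'prob' are model-dependent assumptions:

```matlab
% Load a trained DagNN model and run it in test mode, so that stored
% bnorm moments are used instead of per-batch statistics (and dropout
% is disabled):
net = dagnn.DagNN.loadobj(load('net-deployed.mat')) ;
net.mode = 'test' ;

im = single(imread('peppers.png')) ;
im = imresize(im, net.meta.normalization.imageSize(1:2)) ;
% averageImage may be 1x1x3 or a full image of the input size:
im = bsxfun(@minus, im, net.meta.normalization.averageImage) ;

net.eval({'input', im}) ;
scores = squeeze(net.vars(net.getVarIndex('prob')).value) ;
[bestScore, best] = max(scores) ;
```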

AruniRC commented 8 years ago

Ah, I see. Thanks for the explanations, @vedaldi.

I just wrote some code for test-time batch normalization in the current version of MatConvNet. The results for VD-16 on ImageNet are now: top-1 error: 32.3, top-5 error: 12.1.
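(For anyone attempting the same before the new release lands, the idea is to estimate per-channel moments from a representative training batch and substitute them for the batch statistics at test time. A rough sketch, where x is the H x W x C x N input to a bnorm layer and g, b are its multipliers and biases reshaped to 1 x 1 x C:)

```matlab
% Estimate fixed per-channel moments from a representative batch:
mu = mean(mean(mean(x, 1), 2), 4) ;                  % 1 x 1 x C means
sigma = sqrt(mean(mean(mean( ...
  bsxfun(@minus, x, mu).^2, 1), 2), 4) + 1e-4) ;     % 1 x 1 x C std devs

% At test time, apply the fixed moments instead of batch statistics:
y = bsxfun(@plus, bsxfun(@times, g, ...
  bsxfun(@rdivide, bsxfun(@minus, x, mu), sigma)), b) ;
```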

Just to make sure: the VD-16 results on the MatConvNet webpage are not from a network trained by running that example script? In that case my current results need not necessarily match them.

nicjac commented 8 years ago

@vedaldi quick question: which implementation did you choose for batch normalization at test time? In the current (non-devel) release I wrote a script to save statistics for a batch of training images and then fixed those at test time. Will the new release use something similar? I am just wondering whether I should expect to have to re-train some of my networks.

SergeyMilyaev commented 8 years ago

Hi, it seems that I have finally figured out the validation-performance problem. I trained my models on pre-computed 256x256 ImageNet images; when pre-computing them I used the imread function, followed by the resizing procedures of the cnn_imagenet_get_batch function. The evaluation with a single centred crop, however, was performed on the original ImageNet images. The validation results on the pre-computed 256x256 ImageNet images are better than on a random crop, as mentioned by @vedaldi.

At first I thought the difference was caused by resizing to 256x256 on uint8 values in my pre-computation, but after I modified cnn_imagenet_get_batch to resize the original ImageNet images on uint8 values, it did not improve validation accuracy. I then compared the image statistics of the original ImageNet images with my pre-computed ones and found that the maximum difference between the average images is 4. So I suppose that imread and imreadjpeg do not produce identical images, which is why the results on the original images differ from those on the pre-computed ones.

I also found some strange behaviour of the training error after stopping and resuming the training procedure; the plot is attached for AlexNet without batch normalization. I had to resume training once after a disk failure (around iteration 36, the first big drop in training error), and I also reduced the learning rate after iteration 60 (the second drop in training error).

[Attachment: training plot]
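(The decoder discrepancy is easy to check directly; a quick sketch, assuming MatConvNet's vl_imreadjpeg is compiled and on the path:)

```matlab
% Compare MATLAB's imread against MatConvNet's vl_imreadjpeg on the
% same file: the two JPEG decoders can disagree by a few intensity
% levels, which is enough to shift the average-image statistics.
f = 'ILSVRC2012_val_00000001.JPEG' ;     % any ImageNet JPEG
a = single(imread(f)) ;
b = vl_imreadjpeg({f}) ; b = b{1} ;      % returns a cell of single arrays
fprintf('max abs pixel difference: %g\n', max(abs(a(:) - b(:)))) ;
```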

vedaldi commented 8 years ago

Hi, yes, the new release will accumulate the statistics as it trains and then apply them at test time (just evaluate the network in test mode to use them).

I am also including a sample script that removes batch normalisation permanently by incorporating the statistics into the convolutional layers. Such 'deployed' networks are a little faster.

DeepestNet commented 8 years ago

Could anyone please help me by answering my questions? I am new to deep learning.

  1. I am training the VGG-F network on my image data. Is there any relation between the per-layer learning rate and the overall learning rate in cnn_train()? How can I adjust them efficiently? (See the sketch after this list.)
  2. I have 224x224x3x60000 image data, which does not fit in MATLAB memory. Please suggest how I can handle this.
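A rough sketch of both points, assuming the simplenn conventions used by the example scripts (field names may differ between releases, and getBatchFromDisk is a hypothetical helper):

```matlab
% 1) Per-layer learning rates: each simplenn layer can carry a
% learningRate field with one multiplier per parameter tensor; the
% effective rate is this multiplier times the global opts.learningRate
% schedule passed to cnn_train:
net.layers{l}.learningRate = [1 2] ;   % filters x1, biases x2
% (for DagNN models the multiplier lives on net.params(p).learningRate)

% 2) Data that do not fit in memory: store only file names in the imdb
% structure and let getBatch read the images from disk on demand:
function [im, labels] = getBatchFromDisk(imdb, batch)
  im = zeros(224, 224, 3, numel(batch), 'single') ;
  for i = 1:numel(batch)
    im(:,:,:,i) = single(imread(imdb.images.name{batch(i)})) ;
  end
  labels = imdb.images.label(batch) ;
end
```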
lyltencent commented 8 years ago

@vedaldi Hi Dr. Vedaldi, when you mention 'test' mode, do you mean using the vl_simplenn function to test a new image with the trained network, or using the cnn_train function at the validation stage?

Thanks,

masaff commented 7 years ago

Hi, I'm totally new to deep learning, so my questions may be silly :) I wanted to fine-tune the pre-trained AlexNet on my dataset. I followed all the preparation steps, started fine-tuning, and got a shape-mismatch error: source 96x3x11x11 versus target 96x1x11x11. I think the error is caused by my input data being single-channel (grayscale); I used --gray with convert_imageset when creating the .lmdb files. I then changed the shape in the deploy.prototxt file from 10 3 227 227 to 10 1 227 227, and added force_gray: true to the train_val.prototxt file, but I still get the same error. Can you please help me? Is it possible to use a Caffe pre-trained model on single-channel input data?
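(The question is about Caffe, but the same channel mismatch comes up in MatConvNet; a sketch of two common workarounds, assuming the usual simplenn layout of a converted AlexNet:)

```matlab
% (a) Collapse the pretrained first-layer filters over the colour
%     dimension (MatConvNet stores them as 11 x 11 x 3 x 96):
w = net.layers{1}.weights{1} ;
net.layers{1}.weights{1} = sum(w, 3) ;   % now 11 x 11 x 1 x 96

% (b) Or leave the network unchanged and replicate each grayscale
%     image to three channels before feeding it in:
im3 = repmat(im, [1 1 3]) ;
```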

OluwoleOyetoke commented 7 years ago

@vedaldi I'm trying to train AlexNet from scratch on a database of 39,000 images (227x227x3) divided into 43 categories of road traffic signs. I'm using MatConvNet v23.

I get the error 'Index exceeds matrix dimensions' at line 230, in the dagnn.Loss layer where we have case 'log': t = -log(x(ci)). I'm wondering what the problem could be?
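(Not a confirmed diagnosis, but x(ci) indexing out of range in the log loss usually means the labels fall outside 1..numClasses, e.g. zero-based labels; a quick sanity check, assuming an imdb structure like the examples':)

```matlab
% The log loss indexes the prediction volume by class label, so an
% out-of-range label (e.g. 0-based 0..42 instead of 1-based 1..43)
% triggers "Index exceeds matrix dimensions". Check before training:
numClasses = 43 ;
labels = imdb.images.label ;
assert(all(labels >= 1 & labels <= numClasses), ...
  'labels must lie in 1..%d (found range %d..%d)', ...
  numClasses, min(labels), max(labels)) ;
```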