liangfu / mxnet-mobilenet-v2

Reproduction of MobileNetV2 using MXNet

Questions regarding training parameters #4

Open hetong007 opened 6 years ago

hetong007 commented 6 years ago

First of all, thank you for providing the training script and parameters for MobileNetV2 (the first repo I've seen that does).

I'm reproducing it for GluonCV and thus have a couple of questions about the training:

  1. How did you decide to set the number of epochs to 480 and the batch size to 160?
  2. Have you tried training other MobileNetV2 width multipliers, e.g. 0.75 or 0.5?
  3. Have you found a significant difference between training with and without your PR for nnvm?

I appreciate your help with my questions.

liangfu commented 6 years ago

Regarding your questions:

  1. I referred to resnet, which has been reproduced in this repo, for data augmentation. I use 480xN images because resnet training can be reproduced that way, and a batch size of 160 is the maximum that two GTX-1080 cards can handle.
  2. Training with multiplier=1.4 is in progress, and I plan to release results for 0.75 and 0.5 as well. I don't have the precision results at the moment, but I think you can reproduce the results listed in the link as long as you handle the data augmentation and learning rate correctly.
  3. The PR is just for inference with nnvm, whose mxnet frontend lacks the clip operator (see the sketch after this list). It has no effect on prediction accuracy.
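
For context on item 3: MobileNetV2's ReLU6 activations are written with MXNet's clip operator, so a frontend that cannot translate clip cannot run the network at all. A minimal sketch of the activation (illustration only, not code from the PR):

```python
import mxnet as mx

# ReLU6, the activation used throughout MobileNetV2, is expressed in MXNet
# as a clip to the range [0, 6]; this is the operator the nnvm mxnet
# frontend was missing.
def relu6(data):
    return mx.sym.clip(data=data, a_min=0.0, a_max=6.0)

x = mx.sym.Variable('data')
y = relu6(x)
out = y.eval(ctx=mx.cpu(), data=mx.nd.array([-3.0, 2.0, 9.0]))
print(out[0].asnumpy())  # [0. 2. 6.]
```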

Hope the above notes help with your adventures with MobileNetV2.

hetong007 commented 6 years ago

Thank you so much for the quick reply!

I am curious about num_epochs = 480 at https://github.com/liangfu/mxnet-mobilenet-v2/blob/master/train_imagenet.py#L56. What made you decide to train the model for 480 epochs?

Using your settings, I can reach 71.7% with multiplier=1.0. With the same settings for multiplier=0.75, however, I'm not getting close to the claimed 69.8% (68.7% so far, at epoch 200). I'll double-check my settings, and look forward to your result for 1.4!

liangfu commented 6 years ago

Good question; your success is just around the corner! At around epoch 200, I raised the augmentation level to 3 and set the random scale range to between 0.533 and 0.6; this step fine-tunes the network to focus on the specific region and prevents overfitting. After another 30 to 40 epochs, I turned aug_level back down to 1 and set the random scale range to between 0.533 and 0.535. Then you should be able to reproduce the result.
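
Roughly, the two stages correspond to rebuilding the training iterator as in the sketch below (simplified, not the exact script code, and assuming the mx.io.ImageRecordIter pipeline used by the resnet training scripts; the record path is illustrative):

```python
import mxnet as mx

def make_train_iter(min_scale, max_scale, batch_size=160):
    # Only the knobs relevant to the two fine-tuning stages are shown.
    return mx.io.ImageRecordIter(
        path_imgrec='data/train_480.rec',  # illustrative path to the 480xN records
        data_shape=(3, 224, 224),
        batch_size=batch_size,
        shuffle=True,
        rand_crop=True,
        rand_mirror=True,
        min_random_scale=min_scale,
        max_random_scale=max_scale,
    )

# Around epoch 200 (aug_level 3): widen the random scale range.
stage2 = make_train_iter(0.533, 0.6)
# Last 30-40 epochs (aug_level 1): nearly fixed scale; 0.533 ~= 256/480.
stage3 = make_train_iter(0.533, 0.535)
```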

You can ignore num_epoch=480; I was just trying to set an effectively infinite value without letting the server run excessively long. I think I might upload the training log, which would illustrate the argument settings more intuitively.
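
The pattern is simply a large num_epoch plus per-epoch checkpoints, so the job can be killed by hand once accuracy plateaus. A toy sketch with stock MXNet APIs (not the repo's actual training code):

```python
import mxnet as mx

# A toy network stands in for MobileNetV2; the point is the fit() pattern.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=10, name='fc')
net = mx.sym.SoftmaxOutput(net, name='softmax')

X = mx.nd.random.uniform(shape=(160, 20))
y = mx.nd.array([i % 10 for i in range(160)])
train_iter = mx.io.NDArrayIter(X, y, batch_size=16)

mod = mx.mod.Module(net, context=mx.cpu())
mod.fit(train_iter,
        num_epoch=480,  # effectively "run forever"; stop the job manually
        # save params every epoch so the best checkpoint survives a manual stop
        epoch_end_callback=mx.callback.do_checkpoint('toy'),
        optimizer_params={'learning_rate': 0.1})
```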

hetong007 commented 6 years ago

For multiplier=1.0, I didn't change the augmentation and still got to 71.7%.

But since I failed with 0.75, I'll try your augmentation approach. Thanks again for sharing!

liangfu commented 6 years ago

That sounds great, but I guess training that way takes a long time. How many epochs did it take to converge to 71.7%?

hetong007 commented 6 years ago

With the 80*2 batch size (80 per GPU on two GPUs), this script hit 71.8% at epoch 261.

I'll let it run through the entire 480 epochs and publish the model and training logs to GluonCV.

liangfu commented 6 years ago

Thank you for sharing. I will change my training strategy and try again later.

Even though you converged to 71.8% without changing aug_level, I still suggest trying the augmentation level and random scale range changes I described earlier; they are really effective at the very end of the training stage.

hetong007 commented 6 years ago

Yes, I'm quite interested in seeing its effect, and will definitely resume and try that out after my training with 0.75.

AIROBOTAI commented 6 years ago

@liangfu Great work! Could you please share the training log?

liangfu commented 6 years ago

Training logs have been uploaded; please look in the log folder.

AIROBOTAI commented 6 years ago

@liangfu Thanks for sharing! I checked the logs: for multiplier=1.0 it achieves 71.7, and for multiplier=1.4 it achieves 73.0. The reported numbers in the original paper are 72.0 and 74.7, respectively. Any idea how to match the reported numbers? Thanks!