@pichuang1984 Unfortunately, nobody has shared successful h-params with the larger models and I haven't had a chance to try. If anyone does, I'll start a collection of hparams on the README (with attribution) for any shared successes.
A few things to try:
- --bn-tf to enable both (the TF BatchNorm epsilon and momentum defaults). I have trained with the PyTorch defaults for other models, but perhaps it has more impact for larger models? The downside is that you always need to set the epsilon to non-default for inference as well (something to remember for deployment).
- Have you enabled drop_connect in the model entrypoint as has been mentioned in other issues? kwargs['drop_connect_rate'] = 0.2 (see the sketch below)
- --aa v0 for a setup that's similar to Google's EfficientNet AA, or --aa original for one closer to the original AA paper. The color jitter flag is ignored when aa is set.

I added a memory efficient Swish impl last week, so you should see some memory usage improvements while training...
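For the drop_connect suggestion above, a minimal sketch of what setting that kwarg in a model entrypoint might look like (the function and builder names here are placeholders, not the repo's exact code):

def efficientnet_b2(pretrained=False, **kwargs):
    # hypothetical entrypoint; only the kwarg line reflects the suggestion above
    kwargs['drop_connect_rate'] = 0.2
    return _gen_efficientnet('efficientnet_b2', pretrained=pretrained, **kwargs)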
Also, with an 8 * 256 effective batch size, you might want to try bumping up the EMA decay a notch... maybe --model-ema-decay 0.9999 (the default is 0.9998).
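For context, the model EMA keeps a shadow copy of the weights that trails the live weights. A simplified sketch of the general update rule (not the repo's exact ModelEma code; real implementations also copy buffers):

import torch

def update_ema(ema_model, model, decay=0.9999):
    # shadow weights move slowly toward the live weights; a higher decay means slower tracking
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)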
Thanks for the detailed feedback. I am in the process of trying a couple of settings and will get back to you on whether any of these runs are able to reproduce the reported accuracy.
I am getting a pickle error when trying to use the AutoAugment. It seems that since it is a lambda function, it is not picklable during multiprocessing (distributed training). I am curious if you have encountered this issue before?
@pichuang1984 I'm aware of pickle concerns with lambdas but haven't run into the problem. I can run distributed training w/ AA enabled (on one Linux machine with multiple GPUs) and don't have any problems.
I'm not sure it's the distributed processes though, since there shouldn't be any sharing of dataset / transforms for that. Seems more likely to be the main process sharing the dataset with the worker processes... I know people have had issues with lambdas + transforms on Windows with PyTorch... I'll make a note to replace them with normal functions sometime next week but you can probably do it yourself quicker.
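If anyone wants to work around this before the lambdas are replaced, the usual fix is to move the transform into a module-level callable, which pickles cleanly. A minimal sketch (the class name and the my_policy callable are just examples):

# a module-level class pickles by reference, unlike a lambda defined inside a function
class AAWrapper:
    def __init__(self, policy):
        self.policy = policy

    def __call__(self, img):
        return self.policy(img)

# transform = AAWrapper(my_policy)   # instead of: transform = lambda img: my_policy(img)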
@rwightman I am actually using a customized training script so that might be why. Not sure exactly what the problem is yet but your point makes sense. I will continue debugging as well as working on replacing them. Also looking forward to your implementation.
I am able to closely reproduce the B1 results (top1/top5: 78.74/94.4) based on the params from https://arxiv.org/pdf/1908.06022.pdf. Here are some settings I use/derive: decay_epoch: 2.4, decay_rate: 0.99, color_jitter: 0.1, base_lr: 0.016, ema_decay: 0.9998. The actual learning rate is (total # of images per iteration) / 256 * 0.016.
I also tried my implementation of the auto augment but didn't see an accuracy increase.
@pichuang1984 Thanks for the update. Did you switch the LR scheduler setup so that it adjusts on updates rather than epochs?
@rwightman No, I didn't; I'm still using your implementation based on epochs. I am also using the PyTorch BN defaults instead of TF. Using the TF BN params actually drops the accuracy by 0.8% for B1.
The same recipe doesn't seem to work for B2 though. I am still ~1% short of the paper. Any recommendations for B2/B3 and so on would be appreciated.
@pichuang1984 I just finished training B2 to 80.4%. Weights added and hparams posted in the 'what's new' section. I used my new RandAugment implementation.
Something to watch out for with distributed training is that the distributed validation run during training (and thus checkpoint selection) doesn't necessarily match up exactly with the actual validation of the saved weights. This can sometimes be improved or mitigated by averaging a few checkpoints after the fact, or by keeping a larger number of checkpoints and validating against all of them afterwards to find the best.
@rwightman how long did it take to train B2 from scratch? How many GPUs did you use, and which ones?
@rwightman This is very cool! I have not been able to reproduce results for B2/B3. So far my best results are B2: 79.7% and B3: 81.1%. Will try your hparams.
Is this caused by the difference between running validation on a single GPU vs distributed? Do you think enforcing validation to run on only a single GPU would mitigate this problem?
@michaelklachko not super quick, 13 days on two Titan RTX running in FP16 with AMP at near full tilt (95% utilization).. I got impatient and killed training at epoch 420 or so and went through the results. Things had levelled off for a while by that point.
@pichuang1984 I think the biggest reason is the batchnorm running stats; they remain on each distributed node separately and aren't synced. When validation is performed, each GPU uses slightly different BN stats, and the stats from rank=0 end up being saved in the checkpoint. One can use sync-bn, but it really slows things down since the BN calcs are synced each batch, and it can actually hurt training with larger batch sizes. I have looked for an implementation that syncs the stats once per epoch, after the training iterations and before validation and model saving, but I didn't find anything like that. I may try it myself someday and see if it helps...
But I must say, the addition of the JIT + memory optimized Swish implementation has made this process quicker and allowed slightly larger batch sizes; it probably would have taken a number of days more without that change.
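For reference, the memory saving in that kind of Swish comes from recomputing the sigmoid in the backward pass instead of keeping extra intermediate activations alive. A rough sketch of such a custom autograd implementation (not necessarily identical to what the repo does):

import torch

class SwishFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)   # only the input is stored for backward
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        sig = torch.sigmoid(x)     # recomputed here rather than cached
        return grad_output * (sig * (1.0 + x * (1.0 - sig)))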
@rwightman Ah I see. I think there is an all_reduce call in Pytorch (https://pytorch.org/docs/stable/distributed.html) that we can utilize? We can just sync up the BN parameters once before validation instead of every mini-batch.
@pichuang1984 yup, I'm experimenting with that at this very moment... will let it run for a day and add an argument if it seems good
# excerpt from the training loop; assumes torch.distributed has been initialized
train_metrics = train_epoch(
    epoch, model, loader_train, optimizer, train_loss_fn, args,
    lr_scheduler=lr_scheduler, saver=saver, output_dir=output_dir,
    use_amp=use_amp, model_ema=model_ema)

if args.distributed:
    if args.local_rank == 0:
        print("Averaging bn running means and vars")
    # ensure every node has the same running bn stats before eval/save
    for bn_name, bn_buf in unwrap_model(model).named_buffers(recurse=True):
        if ('running_mean' in bn_name) or ('running_var' in bn_name):
            # sum the buffer across all processes, then divide to get the mean
            torch.distributed.all_reduce(bn_buf, op=torch.distributed.ReduceOp.SUM)
            bn_buf /= float(args.world_size)

eval_metrics = validate(model, loader_eval, validate_loss_fn, args)
Nice, looking forward to hearing your good news :)
@rwightman, I see. I'm going to train on 2x V100, so hopefully it will be a little faster. I'm thinking of trying to train B0 first. To reproduce your 76.9% B0 result, do I need to change any params in the command you provided for B2?
Also, to clarify:
These results (B0 and B2) are done without BN stats syncing, correct?
Did you validate by picking the best checkpoint, or by averaging across multiple ones (per your comment regarding distributed validation)?
@michaelklachko yeah, for B0 you'd want to reduce the drop rate to 0.2. Make sure you scale the LR to your global batch size: (0.016 * global_batch) / 256, where global_batch is the value of -b * the number of distributed nodes.
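Spelled out as a small worked example (a hypothetical 2-GPU setup):

base_lr = 0.016                              # reference LR for a global batch of 256
per_gpu_batch = 128                          # the -b value
num_nodes = 2                                # number of distributed processes
global_batch = per_gpu_batch * num_nodes     # 256
lr = base_lr * global_batch / 256            # 0.016 here; 0.048 for 4 x 192, etc.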
All the training so far was done without BN stats syncing, yes. I'm currently experimenting with my 'once per epoch' BN stats syncing, and it does seem to provide more consistent correspondence between the in-training validation and the saved checkpoints, but it's not clear it's impacting the end result (for good or bad). Still running my experiments.
It's probably quickest just to average the best checkpoints you end up with in the output folder for your run; it keeps the 'best' 10 by default.
A few more questions if you don't mind:
The BN sync you're trying currently - it's the same as BN sync done in AMP, only you want to do it once per epoch, correct?
Would you recommend increasing batch size to the maximum that fits in v100 memory (32GB)?
If you had to train multiple B0 models, and you had two V100 cards, would you prefer to train them one at a time using 2x v100, or two different models in parallel, one per V100?
How are the O1 and O2 AMP modes different from a practical perspective? Which one is faster and which one is more stable?
I just finished training Mobilenet v2, using my own implementation, but I only got 71.6% accuracy using the hyperparams from here which are supposed to get 72.2%. Which accuracy did you get on Mobilenet v2, and what params did you use?
I appreciate your help!
@michaelklachko
It's simpler than the BN sync done in AMP. The AMP or native PyTorch SyncBN keeps the running mean/var updated on a per-batch basis as you train, by calculating the batch statistics across all GPUs. My solution just averages the running mean/var across all nodes at the end of each training epoch; technically the average of variances isn't correct, but I feel it's 'good enough' and better than nothing.
Yes. Within this range your batch sizes shouldn't be big enough to hurt training; just scale the LR properly.
Depends. If you want to try more hparams, do multiple runs in parallel
I do not notice much difference between O1 and O2 with these models.
I haven't trained MobileNetV2. I have trained V3 and MnasNet from scratch with earlier versions of this repo (see #11), with very similar hparams to B0; probably better with the new RandAugment, but I'm unsure.
Ok, I submitted two B0 jobs to the cluster (256 and 384 batch sizes). I also tested it on a 4-GPU (Titan Xp) dev server, and it's most likely being bottlenecked by the CPU. The CPU on that server has 6 cores, and all 12 logical cores are 99-100% utilized, while the GPUs are in the 70-80% range. This is with -j 4 and -b 128.
How do you choose an optimal value for num_workers?
@michaelklachko yes, sounds like the CPU is the bottleneck; exceeding the logical core count in workers will just make things worse. I usually use 4-6 workers, but try to keep the total number of workers across all training sessions or distributed nodes < logical cores. Pillow-SIMD can be a big help because Pillow image decompression and processing is slow. OpenCV is much faster but requires setting up a whole other pipeline of augmentations, dataset, etc. DALI can be used for that but is a bit of a pain to get set up.
@rwightman Any chance you can share the unwrap_model function call? Trying to implement the same thing and would like to avoid duplicating the work.
Thanks!
@pichuang1984 it's on the reduce-bn branch
@rwightman Any chance you can share the training log for your latest EfficientNet (B2) run? Have been trying to reproduce your result (top-1 80.4) without much success. Best I can get is around 80 at ~epoch 400.
Thanks, and Happy Holiday
@pichuang1984 Summary attached, the best single checkpoint was a little over 80.2, top 8-10 averaged hit 80.4. If you run post train validation, remember to run with --use-ema if you trained with --use-ema
Hi @rwightman, thanks for your great work. Recently I have been trying to reproduce the B4 results. When I use your RandAA implementation, it does improve performance. However, I see you set the max magnitude value to 10, whereas in the Google TPU code the magnitude hyper-parameter is 17 for B5 and 28 for B7. This confuses me. Are there differences between your implementation and the TPU code, or is this just a bug?
@JoinWei-PKU this issue has come up before, see #63... there are also two issues on the TPU repository about this. I believe it is a misunderstanding and a difference between the paper and the TPU impl, as going above 10 doesn't make sense for many of the augmentations. The whole setup is a bit troubling too, because some augmentations increase in intensity with magnitude, but others actually decrease (solarize, posterize), and others have a low point in the middle (color, saturation, etc.).
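To illustrate that last point with purely made-up level mappings (not the actual timm or TPU functions): a rotation op gets stronger as the magnitude rises, while a posterize-style op can get weaker, because its magnitude maps to the number of bits kept.

# illustrative only -- not the real level functions from either implementation
def rotate_degrees(magnitude, max_mag=10):
    return 30.0 * magnitude / max_mag      # more degrees = stronger effect

def posterize_bits(magnitude, max_mag=10):
    return int(4 * magnitude / max_mag)    # more bits kept = weaker effect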
@rwightman Happy new year!
I just finished B0 training - results are consistent and better than I expected. I ran 8 training runs: single GPU with batch size 384, two GPUs with batch size 384, two GPUs with batch size 128, and four GPUs with batch size 128 (in all cases the batch size is per GPU), with each config run twice. The learning rate was scaled appropriately. The command is identical to your B2 command, with --drop 0.2 instead of 0.3, using the Dec 4 commit code.
The results range from 77.55 to 77.70 - this is the EMA value for the best epoch per training run, not averaged across multiple checkpoints. Non-EMA was usually ~0.2-0.4% worse for the corresponding epoch.
I haven't noticed any significant difference between training on a single GPU with batch size 384 vs training on four GPUs (batch size 128), so it seems BN sync is not that important, at least not for B0 with these params.
@michaelklachko thanks for the update! I'll add your result to the README training section. Those are good results, I have not tried training B0 or B1 with those newer hparams that leverage my RandAugment impl.
I think you may get a bit more of a bump by averaging a few checkpoints. It usually helps in training sessions where EMA also helps, so with the EfficientNet-style training there are usually some gains (+.1-.3%), but hardly any gain with the SGD + cosine setup that's my go-to for ResNets. I'll push some averaging code to this repository today so you can try; it just needs a quick cleanup :)
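Conceptually the averaging is just an element-wise mean over the saved state dicts, roughly along these lines (a simplified sketch, not the actual script; the 'state_dict_ema' checkpoint key is an assumption about the checkpoint layout):

import torch

def average_checkpoints(paths, key='state_dict_ema'):
    # assumes every checkpoint stores compatible tensors under the same key
    avg = None
    for path in paths:
        sd = torch.load(path, map_location='cpu')[key]
        if avg is None:
            avg = {k: v.clone().float() for k, v in sd.items()}
        else:
            for k, v in sd.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}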
I'd like to give AdvProp a try. I was planning to come up with a mechanism for doing the Aux BatchNorm (a separate batch norm per sub-batch / set of samples) first; I had some old adversarial training code to dust off. It is going to be much slower though (or require 2-4x the # of GPUs) since it runs PGD attacks on the fly to generate training examples.
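A rough sketch of the aux BatchNorm idea (a hypothetical module, not code from this repo): keep two sets of BN statistics and route clean vs. adversarial sub-batches to the appropriate one.

import torch.nn as nn

class AuxBatchNorm2d(nn.Module):
    # one BN for clean samples, a separate one for adversarial samples (AdvProp-style)
    def __init__(self, num_features):
        super().__init__()
        self.bn_clean = nn.BatchNorm2d(num_features)
        self.bn_adv = nn.BatchNorm2d(num_features)

    def forward(self, x, adversarial=False):
        return self.bn_adv(x) if adversarial else self.bn_clean(x)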
I can't say that my once-per-epoch BN stats sync (I'm calling it dist-bn as an arg, to differentiate it from sync-bn) improves the end result. However, it does make the validation results during training more consistent, so when training is done and I'm left with my N best checkpoints, they are more likely to actually be the N best, and the numbers recorded match post-training validation on the saved checkpoints.
@rwightman can you please elaborate on what you meant by averaging a few checkpoints? In the current setup the saver keeps the top-K best checkpoints; are you saying that if we average the EMA weights of these top-K checkpoints as a post-processing step, we will get an additional small gain in accuracy?
I am still trying to reproduce your B2 result. I have been trying to use distributed training (32/64 GPUs) for this experiment, scaling the learning rate linearly based on your setup. For instance, if your setup is batch size 128 on 1 GPU with a base_lr of 0.016, then I scale the learning rate as follows:
new_lr = base_lr * total_number_of_gpus (i.e., 32) * my_batch_size / 128
However, I notice that sometimes with AMP this will not converge; the loss scale drops to 10^-50, eventually resulting in a NaN loss. Turning AMP off typically resolves this issue. I think this is because the learning rate becomes too large. I have tried increasing the warmup epochs to 10 but it doesn't seem to help. Has anyone encountered a similar issue?
@pichuang1984 the divisor should be 256; otherwise no issues with your calculations (my 0.016 was for batch 128 with 2 GPUs, so 256 total batch size).
Using 32-64 GPUs is obviously something I have not tried, so I can't really comment. I've seen the warmup ramp cause stability issues with rmsprop; you could try disabling it by setting --warmup-epochs 0... otherwise, you could try bumping the rmsprop eps even higher.
You are definitely using rmsproptf, right? The PyTorch default rmsprop goes unstable much more easily than my modified one with these hparams.
Failing all that, if you get me some time on a 32-64 GPU cluster I'll run some experiments ;)
@pichuang1984 @michaelklachko I just cleaned up and pushed my checkpoint averaging script, give it a try. Typically I just run ./avg_checkpoints.py --input output/train/mylasttraining -n ? and try a few different values of n.
Michael, it sounds like your B0 results are better than my old ones. I wouldn't mind hosting yours if you're willing to share. Maybe you can hit 78 with the averaging :)
@michaelklachko thanks, just added them... the averaging was counterproductive for these weights, 77.7 it is.
@rwightman Thanks so much for sharing the training log. Just want to double check: in your summary.csv, are the eval_prec1 and eval_prec5 the validation with or without the EMA weights?
EDIT: My bad, the eval_prec1 and eval_prec5 in your summary.csv should be with the EMA weights.
@pichuang1984 yeah, I actually wanted to see both just recently for an experiment, so will probably add separate ema columns soon...
@rwightman Thanks for confirming. It seems that I can reproduce your 2-GPU experiment results, though it's only at epoch 16 now, which suggests the problem arises when trying to scale up the number of GPUs...
I'm going to close this issue now, I think there are enough hparams to get great results, at least for B0-B2 and somewhat B3.
I made a change to the weight init based on a discovery by another user of this code that it didn't 100% match the TF TPU impl init. It impacts all of these models, specifically the depthwise convs, and may prove helpful in getting that last bit of performance on the larger models. Please let me know if anyone tries training B3+ from scratch with the changed init.
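For anyone curious what kind of init difference is meant, here is a hedged sketch of the two fan-out conventions; which one the TF TPU impl actually uses is exactly the subtlety that was discovered, so treat the parameter below as illustrative rather than a statement of what the repo now does.

import math
import torch.nn as nn

def init_conv_goog_style(conv: nn.Conv2d, divide_by_groups: bool):
    # N(0, sqrt(2 / fan_out)) init; the two conventions differ only in how fan_out
    # treats grouped (e.g. depthwise) convolutions
    fan_out = conv.kernel_size[0] * conv.kernel_size[1] * conv.out_channels
    if divide_by_groups:
        fan_out //= conv.groups
    conv.weight.data.normal_(0, math.sqrt(2.0 / fan_out))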
First of all thanks for the fantastic code!
I am wondering if anyone has successfully reproduced (or come close to) the results for EfficientNet B1-B7? I am able to reproduce B0 with jiefengpeng's setting:
./distributed_train.sh 8 ../ImageNet/ --model efficientnet_b0 -b 256 --sched step --epochs 500 --decay-epochs 3 --decay-rate 0.963 --opt rmsproptf --opt-eps .001 -j 8 --warmup-epochs 5 --weight-decay 1e-5 --drop 0.2 --color-jitter .06 --model-ema --lr .128
The same setting (with an adjusted drop rate) for B1 came in at only 78.11 (with EMA enabled), compared to the 78.8% reported in the paper.