@pichuang1984 Unfortunately, nobody has shared successful h-params with the larger models and I haven't had a chance to try. If anyone does, I'll start a collection of hparams on the README (with attribution) for any shared successes.
A few things to try:
- --bn-tf to enable both (the TF BatchNorm epsilon and momentum defaults). I have trained with the PyTorch defaults for other models, but perhaps it has more impact for larger models? The downside is that you always need to set the epsilon to non-default for inference as well (something to remember for deployment).
- Have you enabled drop_connect in the model entrypoint as has been mentioned in other issues? kwargs['drop_connect_rate'] = 0.2 (see the sketch below)
- --aa v0 for a setup that's similar to Google's EfficientNet AA, or --aa original for one closer to the original AA paper. The color jitter flag is ignored when aa is set.

I added a memory efficient Swish impl last week, so you should see some memory usage improvements while training...
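For the drop_connect suggestion above, a minimal sketch of what setting that kwarg in a model entrypoint might look like (the function and builder names here are placeholders, not the repo's exact code):

def efficientnet_b2(pretrained=False, **kwargs):
    # hypothetical entrypoint; only the kwarg line reflects the suggestion above
    kwargs['drop_connect_rate'] = 0.2
    return _gen_efficientnet('efficientnet_b2', pretrained=pretrained, **kwargs)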
Also, with an 8 * 256 effective batch size, you might want to try bumping up the EMA decay a notch... maybe --model-ema-decay 0.9999 (the default is 0.9998).
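For context, the model EMA keeps a shadow copy of the weights that trails the live weights. A simplified sketch of the general update rule (not the repo's exact ModelEma code; real implementations also copy buffers):

import torch

def update_ema(ema_model, model, decay=0.9999):
    # shadow weights move slowly toward the live weights; a higher decay means slower tracking
    with torch.no_grad():
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)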
Thanks for the detailed feedback. I am in the process of trying a couple of settings and will get back to you on whether any of these runs are able to reproduce the reported accuracy.
I am getting a pickle error when trying to use the AutoAugment. It seems that since it is a lambda function, it is not picklable during multiprocessing (distributed training). I am curious if you have encountered this issue before?
@pichuang1984 I'm aware of pickle concerns with lambdas but haven't run into the problem. I can run distributed training w/ AA enabled (on one Linux machine with multiple GPUs) and don't have any problems.
I'm not sure it's the distributed processes though, since there shouldn't be any sharing of dataset / transforms for that. Seems more likely to be the main process sharing the dataset with the worker processes... I know people have had issues with lambdas + transforms on Windows with PyTorch... I'll make a note to replace them with normal functions sometime next week but you can probably do it yourself quicker.
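If anyone wants to work around this before the lambdas are replaced, the usual fix is to move the transform into a module-level callable, which pickles cleanly. A minimal sketch (the class name and the my_policy callable are just examples):

# a module-level class pickles by reference, unlike a lambda defined inside a function
class AAWrapper:
    def __init__(self, policy):
        self.policy = policy

    def __call__(self, img):
        return self.policy(img)

# transform = AAWrapper(my_policy)   # instead of: transform = lambda img: my_policy(img)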
@rwightman I am actually using a customized training script so that might be why. Not sure exactly what the problem is yet but your point makes sense. I will continue debugging as well as working on replacing them. Also looking forward to your implementation.
I am able to closely reproduce the B1 results (top1/top5: 78.74/94.4) based on the params from https://arxiv.org/pdf/1908.06022.pdf. Here are some settings I use/derive: decay_epoch: 2.4, decay_rate: 0.99, color_jitter: 0.1, base_lr: 0.016, ema_decay: 0.9998. The actual learning rate is (total # of images per iteration) / 256 * 0.016.
I also tried my implementation of the auto augment but didn't see an accuracy increase.
@pichuang1984 Thanks for the update. Did you switch the LR scheduler setup so that it adjusts on updates rather than epochs?
@rwightman No, I didn't; I'm still using your implementation based on epochs. I am also using the PyTorch BN defaults instead of TF. Using the TF BN params actually drops the accuracy by 0.8% for B1.
The same recipe doesn't seem to work for B2 though. I am still ~1% short of the paper. Any recommendations for B2/B3 and so on would be appreciated.
@pichuang1984 I just finished training B2 to 80.4%. Weights added and hparams posted in the 'what's new' section. I used my new RandAugment implementation.
Something to watch out for with distributed training is that the distributed validation run during training (and thus checkpoint selection) doesn't necessarily match up exactly with the actual validation of the saved weights. This can sometimes be improved or mitigated by averaging a few checkpoints after the fact, or by keeping a larger number of checkpoints and validating against all of them afterwards to find the best.
@rwightman how long did it take to train B2 from scratch? How many GPUs did you use, and which ones?
@rwightman This is very cool! I have not been able to reproduce results for B2/B3. So far my best results are B2: 79.7% and B3: 81.1%. Will try your hparams.
Is this caused by the difference between running validation on a single GPU vs distributed? Do you think enforcing validation to run on only a single GPU would mitigate this problem?
@michaelklachko not super quick, 13 days on two Titan RTX running in FP16 with AMP at near full tilt (95% utilization).. I got impatient and killed training at epoch 420 or so and went through the results. Things had levelled off for a while by that point.
@pichuang1984 I think the biggest reason is the batchnorm running stats; they remain on each distributed node separately and aren't synced. When validation is performed, each GPU uses slightly different BN stats, and the stats from rank=0 end up being saved in the checkpoint. One can use sync-bn, but it really slows things down since the BN calcs are synced each batch, and it can actually hurt training with larger batch sizes. I have looked for an implementation that syncs the stats once per epoch, after the training iterations and before validation and model saving, but I didn't find anything like that. I may try it myself someday and see if it helps...
But I must say, the addition of the JIT + memory optimized Swish implementation has made this process quicker and allowed slightly larger batch sizes; it probably would have taken a number of days more without that change.
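For reference, the memory saving in that kind of Swish comes from recomputing the sigmoid in the backward pass instead of keeping extra intermediate activations alive. A rough sketch of such a custom autograd implementation (not necessarily identical to what the repo does):

import torch

class SwishFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)   # only the input is stored for backward
        return x * torch.sigmoid(x)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        sig = torch.sigmoid(x)     # recomputed here rather than cached
        return grad_output * (sig * (1.0 + x * (1.0 - sig)))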
@rwightman Ah I see. I think there is an all_reduce call in Pytorch (https://pytorch.org/docs/stable/distributed.html) that we can utilize? We can just sync up the BN parameters once before validation instead of every mini-batch.
@pichuang1984 yup, I'm experimenting with that at this very moment... will let it run for a day and add an argument if it seems good
# excerpt from the training loop; assumes torch.distributed has been initialized
train_metrics = train_epoch(
    epoch, model, loader_train, optimizer, train_loss_fn, args,
    lr_scheduler=lr_scheduler, saver=saver, output_dir=output_dir,
    use_amp=use_amp, model_ema=model_ema)

if args.distributed:
    if args.local_rank == 0:
        print("Averaging bn running means and vars")
    # ensure every node has the same running bn stats before eval/save
    for bn_name, bn_buf in unwrap_model(model).named_buffers(recurse=True):
        if ('running_mean' in bn_name) or ('running_var' in bn_name):
            # sum the buffer across all processes, then divide to get the mean
            torch.distributed.all_reduce(bn_buf, op=torch.distributed.ReduceOp.SUM)
            bn_buf /= float(args.world_size)

eval_metrics = validate(model, loader_eval, validate_loss_fn, args)
Nice, looking forward to hearing your good news :)
@rwightman, I see. I'm going to train on 2x V100, so hopefully it will be a little faster. I'm thinking of trying to train B0 first. To reproduce your 76.9% B0 result, do I need to change any params in the command you provided for B2?
Also, to clarify:
These results (B0 and B2) are done without BN stats syncing, correct?
Did you validate by picking the best checkpoint, or by averaging across multiple ones (per your comment regarding distributed validation)?
@michaelklachko yeah, for B0 you'd want to reduce the drop rate to 0.2. Make sure you scale the LR to your global batch size: (0.016 * global_batch) / 256, where global_batch is the value of -b * the number of distributed nodes.
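Spelled out as a small worked example (a hypothetical 2-GPU setup):

base_lr = 0.016                              # reference LR for a global batch of 256
per_gpu_batch = 128                          # the -b value
num_nodes = 2                                # number of distributed processes
global_batch = per_gpu_batch * num_nodes     # 256
lr = base_lr * global_batch / 256            # 0.016 here; 0.048 for 4 x 192, etc.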
All the training so far was done without BN stats syncing, yes. I'm currently experimenting with my 'once per epoch' BN stats syncing, and it does seem to provide more consistent correspondence between the in-training validation and the saved checkpoints, but it's not clear it's impacting the end result (for good or bad). Still running my experiments.
It's probably quickest just to average the best checkpoints you end up with in the output folder for your run; it keeps the 'best' 10 by default.
A few more questions if you don't mind:
The BN sync you're trying currently - it's the same as BN sync done in AMP, only you want to do it once per epoch, correct?
Would you recommend increasing batch size to the maximum that fits in v100 memory (32GB)?
If you had to train multiple B0 models, and you had two V100 cards, would you prefer to train them one at a time using 2x v100, or two different models in parallel, one per V100?
How are the O1 and O2 AMP modes different from a practical perspective? Which one is faster and which one is more stable?
I just finished training Mobilenet v2, using my own implementation, but I only got 71.6% accuracy using the hyperparams from here which are supposed to get 72.2%. Which accuracy did you get on Mobilenet v2, and what params did you use?
I appreciate your help!
@michaelklachko
It's simpler than the BN sync done in AMP. The AMP or native PyTorch SyncBN keeps the running mean/var updated on a per-batch basis as you train, by calculating the batch statistics across all GPUs. My solution just averages the running mean/var across all nodes at the end of each training epoch; technically the average of variances isn't correct, but I feel it's 'good enough' and better than nothing.
Yes. Within this range your batch sizes shouldn't be big enough to hurt training; just scale the LR properly.
Depends. If you want to try more hparams, do multiple runs in parallel
I do not notice much difference between O1 and O2 with these models.
I haven't trained MobileNetV2. I have trained V3 and MnasNet from scratch with earlier versions of this repo (see #11), with very similar hparams to B0; probably better with the new RandAugment, but I'm unsure.
Ok, I submitted two B0 jobs to the cluster (256 and 384 batch sizes). I also tested it on a 4-GPU (Titan Xp) dev server, and it's most likely being bottlenecked by the CPU. The CPU on that server has 6 cores, and all 12 logical cores are 99-100% utilized, while the GPUs are in the 70-80% range. This is with -j 4 and -b 128.
How do you choose an optimal value for num_workers?
@michaelklachko yes, sounds like the CPU is the bottleneck; exceeding the logical core count in workers will just make things worse. I usually use 4-6 workers, but try to keep the total number of workers across all training sessions or distributed nodes < logical cores. Pillow-SIMD can be a big help because Pillow image decompression and processing is slow. OpenCV is much faster but requires setting up a whole other pipeline of augmentations, dataset, etc. DALI can be used for that but is a bit of a pain to get set up.
@rwightman Any chance you can share the unwrap_model function call? Trying to implement the same thing and would like to avoid duplicating the work.
Thanks!
@pichuang1984 it's on the reduce-bn branch
@rwightman Any chance you can share the training log for your latest EfficientNet (B2) run? Have been trying to reproduce your result (top-1 80.4) without much success. Best I can get is around 80 at ~epoch 400.
Thanks, and Happy Holiday
@pichuang1984 Summary attached, the best single checkpoint was a little over 80.2, top 8-10 averaged hit 80.4. If you run post train validation, remember to run with --use-ema if you trained with --use-ema
Hi @rwightman, thanks for your great work. Recently I have been trying to reproduce the B4 results. When I use your RandAA implementation, it does improve performance. However, I see you set the max magnitude value to 10, whereas in the Google TPU code the magnitude hyper-parameter is 17 for B5 and 28 for B7. This confuses me. Are there differences between your implementation and the TPU code, or is this just a bug?
@JoinWei-PKU this issue has come up before, see #63... there are also two issues on the TPU repository about this. I believe it is a misunderstanding and a difference between the paper and the TPU impl, as going above 10 doesn't make sense for many of the augmentations. The whole setup is a bit troubling too, because some augmentations increase in intensity with magnitude, but others actually decrease (solarize, posterize), and others have a low point in the middle (color, saturation, etc.).
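To illustrate that last point with purely made-up level mappings (not the actual timm or TPU functions): a rotation op gets stronger as the magnitude rises, while a posterize-style op can get weaker, because its magnitude maps to the number of bits kept.

# illustrative only -- not the real level functions from either implementation
def rotate_degrees(magnitude, max_mag=10):
    return 30.0 * magnitude / max_mag      # more degrees = stronger effect

def posterize_bits(magnitude, max_mag=10):
    return int(4 * magnitude / max_mag)    # more bits kept = weaker effect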
@rwightman Happy new year!
I just finished B0 training - results are consistent and better than I expected. I ran 8 training runs: single GPU with batch size 384, two GPUs with batch size 384, two GPUs with batch size 128, and four GPUs with batch size 128 (in all cases the batch size is per GPU), with each config run twice. The learning rate was scaled appropriately. The command is identical to your B2 command, with --drop 0.2 instead of 0.3, using the Dec 4 commit code.
The results range from 77.55 to 77.70 - this is the EMA value for the best epoch per training run, not averaged across multiple checkpoints. Non-EMA was usually ~0.2-0.4% worse for the corresponding epoch.
I haven't noticed any significant difference between training on a single GPU with batch size 384 vs training on four GPUs (batch size 128), so it seems BN sync is not that important, at least not for B0 with these params.
@michaelklachko thanks for the update! I'll add your result to the README training section. Those are good results, I have not tried training B0 or B1 with those newer hparams that leverage my RandAugment impl.
I think you may get a bit more of a bump by averaging a few checkpoints. It usually helps in training sessions where EMA also helps, so with the EfficientNet-style training there are usually some gains (+.1-.3%), but hardly any gain with the SGD + cosine setup that's my go-to for ResNets. I'll push some averaging code to this repository today so you can try; it just needs a quick cleanup :)
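Conceptually the averaging is just an element-wise mean over the saved state dicts, roughly along these lines (a simplified sketch, not the actual script; the 'state_dict_ema' checkpoint key is an assumption about the checkpoint layout):

import torch

def average_checkpoints(paths, key='state_dict_ema'):
    # assumes every checkpoint stores compatible tensors under the same key
    avg = None
    for path in paths:
        sd = torch.load(path, map_location='cpu')[key]
        if avg is None:
            avg = {k: v.clone().float() for k, v in sd.items()}
        else:
            for k, v in sd.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}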
I'd like to give AdvProp a try. I was planning to come up with a mechanism for doing the Aux BatchNorm (a separate batch norm per sub-batch / set of samples) first; I had some old adversarial training code to dust off. It is going to be much slower though (or require 2-4x the # of GPUs) since it runs PGD attacks on the fly to generate training examples.
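A rough sketch of the aux BatchNorm idea (a hypothetical module, not code from this repo): keep two sets of BN statistics and route clean vs. adversarial sub-batches to the appropriate one.

import torch.nn as nn

class AuxBatchNorm2d(nn.Module):
    # one BN for clean samples, a separate one for adversarial samples (AdvProp-style)
    def __init__(self, num_features):
        super().__init__()
        self.bn_clean = nn.BatchNorm2d(num_features)
        self.bn_adv = nn.BatchNorm2d(num_features)

    def forward(self, x, adversarial=False):
        return self.bn_adv(x) if adversarial else self.bn_clean(x)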
I can't say that my once-per-epoch BN stats sync (I'm calling it dist-bn as an arg, to differentiate it from sync-bn) improves the end result. However, it does make the validation results during training more consistent, so when training is done and I'm left with my N best checkpoints, they are more likely to actually be the N best, and the numbers recorded match post-training validation on the saved checkpoints.
@rwightman can you please elaborate on what you meant by averaging a few checkpoints? In the current setup the saver keeps the top-K best checkpoints; are you saying that if we average the EMA weights of these top-K checkpoints as a post-processing step, we will get an additional small gain in accuracy?
I am still trying to reproduce your B2 result. I have been trying to use distributed training (32/64 GPUs) for this experiment, scaling the learning rate linearly based on your setup. For instance, if your setup is batch size 128 on 1 GPU with a base_lr of 0.016, then I scale the learning rate as follows:
new_lr = base_lr * total_number_of_gpus (i.e., 32) * my_batch_size / 128
However, I notice that sometimes with AMP this will not converge; the loss scale drops to 10^-50, eventually resulting in a NaN loss. Turning AMP off typically resolves this issue. I think this is because the learning rate becomes too large. I have tried increasing the warmup epochs to 10 but it doesn't seem to help. Has anyone encountered a similar issue?
@pichuang1984 the divisor should be 256; otherwise no issues with your calculations (my 0.016 was for batch 128 with 2 GPUs, so 256 total batch size).
Using 32-64 GPUs is obviously something I have not tried, so I can't really comment. I've seen the warmup ramp cause stability issues with rmsprop; you could try disabling it by setting --warmup-epochs 0... otherwise, you could try bumping the rmsprop eps even higher.
You are definitely using rmsproptf, right? The PyTorch default rmsprop goes unstable much more easily than my modified one with these hparams.
Failing all that, if you get me some time on a 32-64 GPU cluster I'll run some experiments ;)
@pichuang1984 @michaelklachko I just cleaned up and pushed my checkpoint averaging script, give it a try. Typically I just run ./avg_checkpoints.py --input output/train/mylasttraining -n ? and try a few different values of n.
Michael, it sounds like your B0 results are better than my old ones. I wouldn't mind hosting yours if you're willing to share. Maybe you can hit 78 with the averaging :)
@michaelklachko thanks, just added them... the averaging was counterproductive for these weights, 77.7 it is.
@rwightman Thanks so much for sharing the training log. Just want to double check: in your summary.csv, are the eval_prec1 and eval_prec5 the validation with or without the EMA weights?
EDIT: My bad, the eval_prec1 and eval_prec5 in your summary.csv should be with the EMA weights.
@pichuang1984 yeah, I actually wanted to see both just recently for an experiment, so will probably add separate ema columns soon...
@rwightman Thanks for confirming. It seems that I can reproduce your 2-GPU experiment results, though it's only at epoch 16 now, which suggests the problem arises when trying to scale up the number of GPUs...
I'm going to close this issue now, I think there are enough hparams to get great results, at least for B0-B2 and somewhat B3.
I made a change to the weight init based on a discovery by another user of this code that it didn't 100% match the TF TPU impl init. It impacts all of these models, specifically the depthwise convs, and may prove helpful in getting that last bit of performance on the larger models. Please let me know if anyone tries training B3+ from scratch with the changed init.
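For anyone curious what kind of init difference is meant, here is a hedged sketch of the two fan-out conventions; which one the TF TPU impl actually uses is exactly the subtlety that was discovered, so treat the parameter below as illustrative rather than a statement of what the repo now does.

import math
import torch.nn as nn

def init_conv_goog_style(conv: nn.Conv2d, divide_by_groups: bool):
    # N(0, sqrt(2 / fan_out)) init; the two conventions differ only in how fan_out
    # treats grouped (e.g. depthwise) convolutions
    fan_out = conv.kernel_size[0] * conv.kernel_size[1] * conv.out_channels
    if divide_by_groups:
        fan_out //= conv.groups
    conv.weight.data.normal_(0, math.sqrt(2.0 / fan_out))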
First of all thanks for the fantastic code!
I am wondering if anyone has successfully reproduced (or come close to) the results for EfficientNet B1-B7? I am able to reproduce B0 with jiefengpeng's setting:
./distributed_train.sh 8 ../ImageNet/ --model efficientnet_b0 -b 256 --sched step --epochs 500 --decay-epochs 3 --decay-rate 0.963 --opt rmsproptf --opt-eps .001 -j 8 --warmup-epochs 5 --weight-decay 1e-5 --drop 0.2 --color-jitter .06 --model-ema --lr .128
The same setting (with an adjusted drop rate) for B1 came in at only 78.11 (with EMA enabled), compared to the 78.8% reported in the paper.