huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
https://huggingface.co/docs/timm
Apache License 2.0

How to tune hyperparameters #119

Closed liqi0126 closed 3 years ago

liqi0126 commented 4 years ago

Thanks for your great work! This repo includes many state-of-the-art methods and makes them easy to reproduce. I would like to ask for some advice and insight into tuning hyperparameters. The training script exposes many hyperparameters, and I'm wondering how you arrived at their values. How should I adjust them if I am training a model of my own?

Kshitij09 commented 4 years ago

@rwightman I also stumbled upon the same question. How can I know which hyperparameters I'm able to tune? I agree it varies depending on the model, but I'm looking for some documentation or a place to dig into. In particular, I'm interested in customizing the activation layer, attention layer, and anti-aliasing layer.
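
For reference, a minimal sketch of how layer overrides can be passed for architectures that expose them as constructor kwargs (e.g. ResNet accepts `act_layer` and `aa_layer`). This is an illustration, not an official answer, and the `BlurPool2d` import path may differ across timm versions:

```python
import torch.nn as nn
import timm
from timm.layers import BlurPool2d  # older releases: timm.models.layers

# Sketch: extra kwargs to create_model are forwarded to the model constructor,
# so architectures that define act_layer / aa_layer parameters (e.g. ResNet)
# can be customized this way. Check the model's __init__ for supported kwargs.
model = timm.create_model(
    'resnet50',
    pretrained=False,
    act_layer=nn.SiLU,    # swap the default ReLU for SiLU/Swish
    aa_layer=BlurPool2d,  # anti-aliased (blur-pool) downsampling
)
```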

rwightman commented 3 years ago

@liqi17thu @Kshitij09 this reply has been a long time coming, but I'm going to move this to discussions as it seems a good thread to leave open.

There's no magic to hparam tuning, just persistence and spending enough time with your models and datasets to get a feel for what works and what doesn't. With more experience you'll get better at finding starting points but will still have to search for the optimal values. If you've got enough compute, just doing exhaustive sweeps can be a good strategy, especially if your dataset isn't large.
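
As one way to run such a sweep, here is a sketch that drives the repo's `train.py` over a small grid. The flag names and positional data argument are assumptions; check `python train.py --help` for the exact arguments in your timm version:

```python
import itertools
import subprocess

# Illustrative exhaustive sweep over a few key hparams by launching train.py
# once per grid point. Values and flag names are placeholders.
grid = {
    'lr': [0.05, 0.1, 0.5],
    'weight-decay': [1e-5, 1e-4],
    'drop-path': [0.0, 0.1],
}
for values in itertools.product(*grid.values()):
    args = [f'--{k}={v}' for k, v in zip(grid.keys(), values)]
    subprocess.run(
        ['python', 'train.py', '/data/my-dataset',
         '--model', 'resnet50', '--epochs', '100', *args],
        check=True,
    )
```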

The starting points, and the large improvement jumps in hparams, often come from ideas put forward in various papers and their combinations. My current 'best' ImageNet hparams are based on a blend of ideas in the EfficientNet, RandAugment, and 'Bag-of-Tricks' papers + experience-based additions, which include turning up the augmentation and regularization significantly. This won't necessarily work as well on a different dataset, especially smaller ones, or for fine-tuning instead of training from scratch, etc.
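
As an illustration of what "turned up" augmentation and regularization can look like in timm terms (a sketch only, with placeholder values, not the exact hparams referred to above):

```python
import timm
from timm.data import create_transform, Mixup

# Sketch: heavier augmentation + regularization for from-scratch training.
model = timm.create_model('resnet50', pretrained=False, drop_path_rate=0.1)

train_transform = create_transform(
    input_size=224,
    is_training=True,
    auto_augment='rand-m9-mstd0.5',  # RandAugment policy
    re_prob=0.25,                    # random erasing probability
)

mixup_fn = Mixup(
    mixup_alpha=0.2,
    cutmix_alpha=1.0,
    label_smoothing=0.1,
    num_classes=1000,
)
# In the training loop: images, targets = mixup_fn(images, targets)
```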

For fine-tuning, I often find that a simple combo of Adam + plateau LR schedule, or SGD with momentum/Nesterov and cosine decay, with a lower LR than you'd use for training from scratch, is a good starting point. Less augmentation is also usually a good starting point.
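
A minimal sketch of those two fine-tuning starting points in plain PyTorch (the model name, LR values, and epoch count are placeholders):

```python
import timm
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau, CosineAnnealingLR

epochs = 30  # placeholder
model = timm.create_model('resnet50', pretrained=True, num_classes=10)

# Option A: Adam + plateau schedule driven by the validation metric.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=3)
# per epoch: scheduler.step(val_loss)

# Option B: SGD w/ Nesterov momentum + cosine decay, lower LR than from scratch.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
# scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
# per epoch: scheduler.step()
```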