Are very local features enough or do we need global context?
How much variation is there and what form does it take?
What variation is spurious and could be preprocessed out?
Does spatial position matter or do we want to average pool it out?
How much does detail matter and how far could we afford to downsample the images?
How noisy are the labels?
Visualize the statistics and the outliers along any axis.
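As a minimal sketch of scanning one axis for outliers (the function name and threshold are illustrative, not from any library), a robust median/MAD-based check might look like:

```python
import numpy as np

def find_outliers(values, thresh=3.5):
    """Flag indices whose modified z-score (median/MAD-based, robust to
    extreme values) exceeds thresh. A simple heuristic, not a library API."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    if mad == 0:
        return np.array([], dtype=int)  # no spread -> nothing to flag
    z = 0.6745 * np.abs(values - med) / mad
    return np.nonzero(z > thresh)[0]

# Example: image widths in a dataset, with one corrupted entry.
widths = [224, 226, 223, 225, 224, 9999, 224, 225]
print(find_outliers(widths))  # -> [5], the corrupted width
```

The same scan can be run per axis (widths, heights, label counts, pixel means) to surface bad examples before training.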
Set Up Training/Evaluation and Start from a Simple Model
Establish baselines and visualize the train/eval metrics.
Fix the random seed and run the code twice to verify you get identical results.
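A minimal sketch of seeding for reproducibility, using only the stdlib and NumPy (for PyTorch you would additionally seed torch, e.g. with torch.manual_seed, and set the cuDNN determinism flags; fake_run below is a hypothetical stand-in for a training run):

```python
import random
import numpy as np

def seed_everything(seed=42):
    """Seed the stdlib and NumPy RNGs so two runs are bit-identical."""
    random.seed(seed)
    np.random.seed(seed)

def fake_run():
    # Stand-in for one training run; its result depends on both RNGs.
    return random.random() + float(np.random.rand())

seed_everything(123)
a = fake_run()
seed_everything(123)
b = fake_run()
assert a == b  # the two runs produce the same result
```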
Disable any unnecessary fanciness, e.g., data augmentation.
Plot the test loss over the full test set, not just over individual batches.
Ensure the loss starts at the right value, e.g., -log(1/n_classes) for a softmax classifier.
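A quick numeric version of this check, assuming a 10-class softmax classifier: with uniform (e.g., all-zero) logits, the cross-entropy should equal -log(1/10) = log(10) ≈ 2.3026.

```python
import math
import numpy as np

n_classes = 10
logits = np.zeros(n_classes)                    # a well-initialized net outputs near-uniform logits
probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> uniform distribution
loss = -math.log(probs[0])                      # cross-entropy for any true class
print(loss, math.log(n_classes))                # both ~2.3026
```

If the very first logged loss is far from this value, the output layer's initialization or the loss wiring is likely wrong.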
Visualize predictions on a fixed test batch to see the "dynamics" of how the model learns (this also reveals very low or very high learning rates).
Be aware of the difference between view and transpose/permute: view reinterprets contiguous memory, while transpose/permute reorder axes, so they generally produce different element orderings.
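A NumPy sketch of the pitfall (reshape behaves like torch's .view on a contiguous tensor; .T like torch's .transpose/.permute):

```python
import numpy as np

x = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# reshape reinterprets the flat memory order:
r = x.reshape(3, 2)              # [[0, 1], [2, 3], [4, 5]]

# transpose reorders the axes:
t = x.T                          # [[0, 3], [1, 4], [2, 5]]

assert r.shape == t.shape == (3, 2)
assert not np.array_equal(r, t)  # same shape, different elements -> easy silent bug
```

Because both results have the same shape, mixing the two up produces no error, only silently scrambled data.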
Write simple code that works first; refactor it into a more general version later.
Overfit
Follow the most related paper and try their simplest architecture that achieves good performance. Do not start customizing things at an early stage.
Using the Adam optimizer with a learning rate of 3e-4 is a safe default.
If you have multiple input signals, plug them into the model one by one to confirm each gives the performance boost you'd expect.
Be careful with learning rate decay: different dataset sizes/problems require different decay schedules. Disable learning rate decay first and tune it later.
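One way to keep decay out of the way at first is to route the learning rate through a schedule function that defaults to constant; a minimal sketch, with illustrative names not taken from any library:

```python
def make_lr_schedule(base_lr=3e-4, decay_every=None, decay_factor=0.1):
    """Return lr(step). With decay_every=None the rate is constant
    (the recommended starting point); enable decay when tuning."""
    def lr_at(step):
        if decay_every is None:
            return base_lr
        return base_lr * (decay_factor ** (step // decay_every))
    return lr_at

constant = make_lr_schedule()                   # decay disabled
decayed = make_lr_schedule(decay_every=1000)    # enabled later, when tuning
print(constant(0), constant(5000))              # 0.0003 at every step
print(decayed(0), decayed(2500))                # 0.0003, then 3e-06 after two decays
```

In PyTorch the same idea corresponds to simply not attaching an lr_scheduler until the constant-rate baseline works.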
Regularize
Don't spend a lot of engineering effort squeezing juice out of a small dataset when you could instead be collecting more data.
Data augmentation.
Creative augmentation: domain randomization, use of simulation, clever hybrids such as inserting (potentially simulated) data into scenes, or even GANs.
Pretraining.
Stick with supervised learning.
Smaller input dimensionality (e.g., try smaller input images).
Decrease the batch size (small batch = stronger regularization).
Add dropout (dropout2d for CNNs).
Weight decay.
Try a larger model (its early-stopped performance may be better than a smaller model's).
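To make the dropout item above concrete, here is a minimal NumPy sketch of inverted dropout (the scaling convention used by frameworks such as PyTorch; this is an illustration, not any library's actual implementation):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero units with probability p and scale survivors
    by 1/(1-p) so the expected activation is unchanged; identity at eval time."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

x = np.ones((4, 8))
y = dropout(x, p=0.5)
assert np.all((y == 0) | (y == 2.0))                   # kept units are scaled up
assert np.array_equal(dropout(x, training=False), x)   # no-op at eval time
```

For CNNs, dropout2d applies the same idea per channel rather than per unit, which regularizes spatially correlated feature maps more effectively.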