Are very local features enough or do we need global context?
How much variation is there and what form does it take?
What variation is spurious and could be preprocessed out?
Does spatial position matter or do we want to average pool it out?
How much does detail matter and how far could we afford to downsample the images?
How noisy are the labels?
Visualize the statistics and the outliers along any axis.
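As a minimal sketch of scanning one axis for outliers (the function name and threshold are illustrative, not from any library), a robust median/MAD-based check might look like:

```python
import numpy as np

def find_outliers(values, thresh=3.5):
    """Flag indices whose modified z-score (median/MAD-based, robust to
    extreme values) exceeds thresh. A simple heuristic, not a library API."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    if mad == 0:
        return np.array([], dtype=int)  # no spread -> nothing to flag
    z = 0.6745 * np.abs(values - med) / mad
    return np.nonzero(z > thresh)[0]

# Example: image widths in a dataset, with one corrupted entry.
widths = [224, 226, 223, 225, 224, 9999, 224, 225]
print(find_outliers(widths))  # -> [5], the corrupted width
```

The same scan can be run per axis (widths, heights, label counts, pixel means) to surface bad examples before training.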
Set Up Training/Evaluation and Start from a Simple Model
Establish baselines and visualize the train/eval metrics.
Fix the random seed and run the code twice to verify you get identical results.
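A minimal sketch of seeding for reproducibility, using only the stdlib and NumPy (for PyTorch you would additionally seed torch, e.g. with torch.manual_seed, and set the cuDNN determinism flags; fake_run below is a hypothetical stand-in for a training run):

```python
import random
import numpy as np

def seed_everything(seed=42):
    """Seed the stdlib and NumPy RNGs so two runs are bit-identical."""
    random.seed(seed)
    np.random.seed(seed)

def fake_run():
    # Stand-in for one training run; its result depends on both RNGs.
    return random.random() + float(np.random.rand())

seed_everything(123)
a = fake_run()
seed_everything(123)
b = fake_run()
assert a == b  # the two runs produce the same result
```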
Disable any unnecessary fanciness, e.g., data augmentation.
Plot the test loss over the full test set, not just over individual batches.
Ensure the loss starts at the right value, e.g., -log(1/n_classes) for a softmax classifier.
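A quick numeric version of this check, assuming a 10-class softmax classifier: with uniform (e.g., all-zero) logits, the cross-entropy should equal -log(1/10) = log(10) ≈ 2.3026.

```python
import math
import numpy as np

n_classes = 10
logits = np.zeros(n_classes)                    # a well-initialized net outputs near-uniform logits
probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> uniform distribution
loss = -math.log(probs[0])                      # cross-entropy for any true class
print(loss, math.log(n_classes))                # both ~2.3026
```

If the very first logged loss is far from this value, the output layer's initialization or the loss wiring is likely wrong.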
Visualize predictions on a fixed test batch to see the "dynamics" of how the model learns (this also reveals very low or very high learning rates).
Be aware of the difference between view and transpose/permute: view reinterprets contiguous memory, while transpose/permute reorder axes, so they generally produce different element orderings.
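A NumPy sketch of the pitfall (reshape behaves like torch's .view on a contiguous tensor; .T like torch's .transpose/.permute):

```python
import numpy as np

x = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# reshape reinterprets the flat memory order:
r = x.reshape(3, 2)              # [[0, 1], [2, 3], [4, 5]]

# transpose reorders the axes:
t = x.T                          # [[0, 3], [1, 4], [2, 5]]

assert r.shape == t.shape == (3, 2)
assert not np.array_equal(r, t)  # same shape, different elements -> easy silent bug
```

Because both results have the same shape, mixing the two up produces no error, only silently scrambled data.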
Write simple code that works first; refactor it into a more general version later.
Overfit
Follow the most related paper and try their simplest architecture that achieves good performance. Do not start customizing things at an early stage.
Using the Adam optimizer with a learning rate of 3e-4 is a safe default.
If you have multiple input signals, plug them into the model one by one to confirm each gives the performance boost you'd expect.
Be careful with learning rate decay: different dataset sizes/problems require different decay schedules. Disable learning rate decay first and tune it later.
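One way to keep decay out of the way at first is to route the learning rate through a schedule function that defaults to constant; a minimal sketch, with illustrative names not taken from any library:

```python
def make_lr_schedule(base_lr=3e-4, decay_every=None, decay_factor=0.1):
    """Return lr(step). With decay_every=None the rate is constant
    (the recommended starting point); enable decay when tuning."""
    def lr_at(step):
        if decay_every is None:
            return base_lr
        return base_lr * (decay_factor ** (step // decay_every))
    return lr_at

constant = make_lr_schedule()                   # decay disabled
decayed = make_lr_schedule(decay_every=1000)    # enabled later, when tuning
print(constant(0), constant(5000))              # 0.0003 at every step
print(decayed(0), decayed(2500))                # 0.0003, then 3e-06 after two decays
```

In PyTorch the same idea corresponds to simply not attaching an lr_scheduler until the constant-rate baseline works.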
Regularize
Don't spend a lot of engineering effort squeezing juice out of a small dataset when you could instead be collecting more data.
Data augmentation.
Creative augmentation: domain randomization, use of simulation, clever hybrids such as inserting (potentially simulated) data into scenes, or even GANs.
Pretraining.
Stick with supervised learning.
Smaller input dimensionality (e.g., try smaller input images).
Decrease the batch size (small batch = stronger regularization).
Add dropout (dropout2d for CNNs).
Weight decay.
Try a larger model (its early-stopped performance may be better than a smaller model's).
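To make the dropout item above concrete, here is a minimal NumPy sketch of inverted dropout (the scaling convention used by frameworks such as PyTorch; this is an illustration, not any library's actual implementation):

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: zero units with probability p and scale survivors
    by 1/(1-p) so the expected activation is unchanged; identity at eval time."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1-p
    return x * mask / (1.0 - p)

x = np.ones((4, 8))
y = dropout(x, p=0.5)
assert np.all((y == 0) | (y == 2.0))                   # kept units are scaled up
assert np.array_equal(dropout(x, training=False), x)   # no-op at eval time
```

For CNNs, dropout2d applies the same idea per channel rather than per unit, which regularizes spatially correlated feature maps more effectively.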