kabouzeid / point2vec

Self-Supervised Representation Learning on Point Clouds (GCPR 2023 | T4V Workshop @ CVPR 2023)
https://point2vec.ka.codes
MIT License

About pre-training loss #5

Closed filaPro closed 4 months ago

filaPro commented 4 months ago

Hi @kabouzeid ,

Thanks for open-sourcing your code. I have two quick questions about pre-training; maybe you can help.

1) What is the intuition behind the training loss first decreasing and then increasing?
2) If I don't have annotations for my data, is there any intuition for when to stop pre-training? E.g., is the validation loss meaningful, or can it also increase at some point?

kabouzeid commented 4 months ago

Hi :)

In general, the pre-training loss is rather uninformative about final downstream-task performance, and it is not really comparable across different learning rates and EMA decay rates. The loss goes up because we have shifting targets: as the model evolves, the targets become more meaningful and information-dense, and thus harder to predict. The shape of the loss curve and the overall learning dynamics depend heavily on the EMA decay rate and its warm-up period; it's crucial to find the right balance between the EMA decay and the learning rate here. The loss could also go up much more rapidly, or even decrease slightly; none of this is necessarily a bad sign. For comparison, you can also refer to the data2vec losses: https://github.com/facebookresearch/fairseq/issues/4177#issuecomment-1041020937
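For illustration, here is a minimal sketch of how an EMA teacher produces shifting targets (the module, shapes, and tau value are placeholders, not the code in this repo):

```python
import copy

import torch

# Minimal sketch of a data2vec-style student/teacher pair (placeholder modules and
# shapes, not this repository's code). The teacher is an exponential moving average
# (EMA) of the student, and the regression targets come from the teacher, so the
# targets themselves shift as training progresses: that is why the loss can rise
# even while the representations improve.


def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module, tau: float) -> None:
    """Blend the teacher's weights towards the student's with EMA decay rate `tau`."""
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(tau).add_(p_s, alpha=1.0 - tau)


student = torch.nn.Linear(384, 384)  # stand-in for the real point-cloud encoder
teacher = copy.deepcopy(student)     # teacher starts as a frozen copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

x = torch.randn(8, 384)              # stand-in for a batch of token embeddings
with torch.no_grad():
    targets = teacher(x)             # shifting targets: they change whenever the teacher changes

loss = torch.nn.functional.smooth_l1_loss(student(x), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
update_teacher(student, teacher, tau=0.999)  # tau itself is typically warmed up over training
```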

In addition to tracking the loss, we also log the standard deviations of the targets and predictions, as well as a linear SVM evaluation. While these metrics were helpful for detecting diverging or broken runs, they didn't correlate strongly with the performance of the fine-tuned model on a downstream task. You don't need annotations for validation during pre-training; however, in our case the validation loss was almost identical to the training loss, so I don't think validation is very important here.
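A rough sketch of that kind of monitoring could look like this (dummy tensors and a placeholder feature dimension; not the code we actually use):

```python
import torch
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Sketch only (dummy data, assumed shapes). A per-dimension standard deviation that
# drops towards zero hints at representation collapse, and a linear SVM on frozen
# global features of a small labelled proxy set gives a cheap sanity check.


def mean_std(features: torch.Tensor) -> float:
    """Mean per-dimension standard deviation over the batch; near zero suggests collapse."""
    return features.float().std(dim=0).mean().item()


predictions = torch.randn(256, 384)  # (batch, feature_dim), dummy values
targets = torch.randn(256, 384)
print("prediction std:", mean_std(predictions), "target std:", mean_std(targets))

# Linear SVM probe on frozen features (random placeholders instead of real encodings).
train_feats = torch.randn(1024, 384).numpy()
train_labels = torch.randint(0, 40, (1024,)).numpy()
val_feats = torch.randn(256, 384).numpy()
val_labels = torch.randint(0, 40, (256,)).numpy()

probe = make_pipeline(StandardScaler(), LinearSVC(C=0.01))
probe.fit(train_feats, train_labels)
print("linear SVM accuracy:", probe.score(val_feats, val_labels))
```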

To proceed, I recommend trying a few learning rates (`--model.learning_rate`) and numbers of epochs (`--trainer.max_epochs`), then evaluating on your target task. A good starting point for the LR warm-up is around 10% of the total epochs (`--model.lr_scheduler_linear_warmup_epochs`), while the EMA warm-up can be set to around 25% of the total epochs (`--model.d2v_ema_tau_epochs`).
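Purely as an illustration of that arithmetic (the epoch count and learning rates below are example values, not recommendations for your dataset):

```python
# Illustrative only: derive the suggested warm-up lengths from the chosen number of
# pre-training epochs and enumerate a small learning-rate sweep.

max_epochs = 800                             # value passed to --trainer.max_epochs (example)
lr_warmup_epochs = round(0.10 * max_epochs)  # --model.lr_scheduler_linear_warmup_epochs (~10%)
ema_tau_epochs = round(0.25 * max_epochs)    # --model.d2v_ema_tau_epochs (~25%)

for lr in (5e-4, 1e-3, 2e-3):                # candidate values for --model.learning_rate
    print(f"lr={lr}: LR warm-up {lr_warmup_epochs} epochs, EMA warm-up {ema_tau_epochs} epochs")
```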