lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
MIT License
10.97k stars 1.07k forks

Evaluate on test more often than once per epoch #185

Open rom1504 opened 2 years ago

rom1504 commented 2 years ago

An epoch can last a long time, possibly even the whole training duration.

https://github.com/lucidrains/DALLE2-pytorch/blob/main/train_decoder.py#L328

Introducing an eval_every_n_steps param would be great.
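A minimal sketch of what a step-based trigger could look like, independent of epoch boundaries. This is not the code in train_decoder.py: the function name and the `train_step` and `evaluate` callables are hypothetical stand-ins, and only the `eval_every_n_steps` name comes from the proposal above.

```python
from typing import Callable, Iterable, List


def train_with_periodic_eval(
    batches: Iterable,
    train_step: Callable[[object], None],
    evaluate: Callable[[int], None],
    eval_every_n_steps: int,
) -> List[int]:
    """Train on every batch and evaluate every `eval_every_n_steps` steps.

    `train_step` and `evaluate` are hypothetical callables supplied by the
    caller. Returns the step numbers at which evaluation ran.
    """
    if eval_every_n_steps <= 0:
        raise ValueError("eval_every_n_steps must be positive")

    eval_steps: List[int] = []
    # Steps are counted from 1 so the first evaluation happens after
    # exactly `eval_every_n_steps` training steps.
    for step, batch in enumerate(batches, start=1):
        train_step(batch)
        if step % eval_every_n_steps == 0:
            evaluate(step)
            eval_steps.append(step)
    return eval_steps
```

With 10 batches and `eval_every_n_steps=3`, evaluation would run after steps 3, 6 and 9, regardless of where the epoch ends.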

Veldrovive commented 2 years ago

In parallel with this, I think we also need a parameter for quantifying how much time is spent on evaluation versus training. Evaluation is an incredibly heavy operation because it involves sampling hundreds of images from the decoder, and as such it may take as much time as an entire training loop if the parameters are set up naively. In that case, doing multiple evaluations per epoch would be very wasteful. Evaluating several times per epoch is useful for large-scale distributed runs where the training loop covers many millions of samples, but it should not be encouraged for training loops below around 5 million samples.
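One way such a parameter could work is to track wall-clock time spent in each phase and skip an evaluation when running it would push the evaluation share above a cap. The sketch below is only an illustration of that idea: the `EvalBudget` class and the `max_eval_fraction` parameter are hypothetical names, not part of the repository, and the cost of the next evaluation is assumed to match the previous one.

```python
class EvalBudget:
    """Tracks time spent training vs. evaluating (hypothetical helper)."""

    def __init__(self, max_eval_fraction: float):
        if not 0.0 < max_eval_fraction < 1.0:
            raise ValueError("max_eval_fraction must be in (0, 1)")
        self.max_eval_fraction = max_eval_fraction
        self.train_seconds = 0.0
        self.eval_seconds = 0.0
        # Duration of the most recent evaluation, used as the estimate
        # for how long the next one will take (an assumption).
        self.last_eval_seconds = 0.0

    def record_train(self, seconds: float) -> None:
        self.train_seconds += seconds

    def record_eval(self, seconds: float) -> None:
        self.eval_seconds += seconds
        self.last_eval_seconds = seconds

    def eval_fraction(self) -> float:
        """Share of total recorded time that went to evaluation."""
        total = self.train_seconds + self.eval_seconds
        return self.eval_seconds / total if total > 0 else 0.0

    def should_evaluate(self) -> bool:
        """True if one more evaluation keeps the share within the cap."""
        projected_eval = self.eval_seconds + self.last_eval_seconds
        total = self.train_seconds + projected_eval
        if total == 0:
            return False  # nothing has been trained yet
        return projected_eval / total <= self.max_eval_fraction
```

The training script would call `record_train` and `record_eval` with measured durations and check `should_evaluate()` before each candidate evaluation. The reported `eval_fraction()` also gives the quantity asked for above: how much time goes to evaluation versus training.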

rom1504 commented 2 years ago

"save_every_n_samples": 2000000 and "epoch_samples": 10000000 are not super convenient. What would be much better is "sample every N minutes" and "evaluate every N minutes". Something like every hour is good, and the sample count that corresponds to it varies wildly depending on samples/s.
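A time-based trigger like the one described here could be sketched as follows. The `EveryNMinutes` class is a hypothetical helper, not an existing option of the training script. It uses `time.monotonic` so that system clock adjustments do not affect the interval, and the clock is injectable so the behaviour can be checked without waiting.

```python
import time
from typing import Callable


class EveryNMinutes:
    """Fires at most once per `minutes` of wall-clock time (hypothetical)."""

    def __init__(self, minutes: float,
                 clock: Callable[[], float] = time.monotonic):
        if minutes <= 0:
            raise ValueError("minutes must be positive")
        self.interval = minutes * 60.0  # interval in seconds
        self.clock = clock
        self.last_fired = clock()  # the interval starts at construction

    def ready(self) -> bool:
        """Return True and reset the timer once the interval has elapsed."""
        now = self.clock()
        if now - self.last_fired >= self.interval:
            self.last_fired = now
            return True
        return False
```

In a training loop, one instance per action (for example `EveryNMinutes(60)` for evaluation and another for saving) could be polled after each step. In a multi-process run the processes would also need to agree on the decision, which this sketch does not handle.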