belsten closed this 6 months ago
Thanks Alex! Just to confirm I understand what the change is: previously, each epoch would just use the next batch from the dataset, and losses were computed per batch, so during training the model only saw each example once. The change is that each epoch uses the entire dataset, and losses are computed over the whole dataset at each epoch.
It makes sense that the old method would finish very fast, because it only processes `len(dataset)` samples rather than `n_epoch*len(dataset)` samples. I'm guessing the loss and filters learned were also not as good?
Almost. The previous incorrect method saw `batch_size*n_epoch` samples while the new method sees `n_epoch*len(dataset)` samples (like you said). And yes, the losses returned previously were only over a single batch and now they are over the whole dataset. The previous method would make sense if we called `n_epochs` `n_batch_updates` instead, but alas we did not.
In my experience, the loss and filters could be fine with the old method; you would just have to make `n_epochs` large.
Compute epoch energy as the average of the batch energies.
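For anyone landing here later, here is a minimal sketch of what the corrected loop looks like. The `compute_energy` stand-in and the PyTorch `DataLoader` setup are illustrative assumptions, not the library's actual API; the point is just that each epoch now runs over every batch in the dataset, and the epoch energy is the mean of the per-batch energies rather than the energy of a single batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical data and energy function; the real model/dataset come from the library.
dataset = TensorDataset(torch.randn(1000, 64))
loader = DataLoader(dataset, batch_size=100, shuffle=True)

def compute_energy(batch):
    # Stand-in for the model's per-batch energy (e.g. reconstruction + sparsity terms).
    return batch.pow(2).mean()

n_epochs = 10
epoch_energies = []
for epoch in range(n_epochs):
    batch_energies = []
    for (batch,) in loader:  # new behavior: every batch of the dataset, every epoch
        energy = compute_energy(batch)
        # ... gradient / dictionary update on `energy` would go here ...
        batch_energies.append(energy.item())
    # Epoch energy = average of the batch energies over the full dataset,
    # instead of the energy of a single batch as in the old loop.
    epoch_energies.append(sum(batch_energies) / len(batch_energies))
```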