l4fl4m3 / nowcasting-dgmr


Training insights #1

Open VisionVoyagerX opened 1 year ago

VisionVoyagerX commented 1 year ago

Hey, great implementation! Could you please share some insights on the training? How many generator steps did it take in total to fully converge? Did you manage to reach DeepMind-level performance?

l4fl4m3 commented 1 year ago

It's been quite a while since I looked at this, but here is what I wrote about the training in a blog post:

"I trained/am training it distributed across 8 TPU (v3) cores, kindly provided by Google’s TPU Research Cloud Program. Due to the model size, I used a (global) batch size of 8 per step, and 3 samples per input during the generation step, prior to grid cell regularization, as opposed to the 6 used in the paper. All the other hyperparameters align with the paper. I experimented around with a bunch of GCP VM and GPU configurations, but after refactoring the training loop to run on a TPU, I realized it would work best, and Google kindly provided the service for free through their program. The training is occasionally check-pointed to cloud, and since the dataset is so large, some refactoring needs to be done to incorporate some sort of epoch structure during training."

Here's the blog post (it probably won't add much to what I've copy-pasted above): https://hassaann.medium.com/skillful-precipitation-nowcasting-an-implementation-of-deepminds-dgmr-aa68764f2ae2
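For reference, the grid cell regularization mentioned above averages the generator's samples for a given input and then takes an intensity-weighted L1 error against the observed frames, so using 3 samples instead of the paper's 6 mainly means a noisier estimate of that sample mean. A rough sketch of the term (my reading of the paper, not code lifted from this repo):

```python
import numpy as np

def grid_cell_regularizer(generated_samples, targets, cap=24.0):
    """Intensity-weighted L1 term between the sample mean and the target.

    generated_samples: array [num_samples, T, H, W], generator outputs for
                       one input sequence (3 samples here, 6 in the paper).
    targets:           array [T, H, W], the observed radar frames.
    """
    sample_mean = generated_samples.mean(axis=0)   # average over the samples
    weights = np.clip(targets, 0.0, cap)           # weight by rain rate, capped (~24 per the paper)
    return np.mean(np.abs(sample_mean - targets) * weights)
```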

From what I recall, it was not converging as one would hope; the loss curve was just sort of plateauing after a while, and that was after running it for a few days straight on the TPUs. I did, however, feed in some test frames, and the predictions looked alright visually, though I never checked them for numerical accuracy.
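If anyone wants to put a number on that rather than eyeballing frames, CSI at a rain-rate threshold (one of the verification metrics used in the DGMR paper) is quick to compute. A minimal sketch:

```python
import numpy as np

def csi(pred, target, threshold=1.0):
    """Critical Success Index at a rain-rate threshold (mm/h)."""
    hits = np.sum((pred >= threshold) & (target >= threshold))
    misses = np.sum((pred < threshold) & (target >= threshold))
    false_alarms = np.sum((pred >= threshold) & (target < threshold))
    return hits / max(hits + misses + false_alarms, 1)
```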

Best of luck.