l4fl4m3 / nowcasting-dgmr


Training insights #1

Open VisionVoyagerX opened 1 year ago

VisionVoyagerX commented 1 year ago

Hey, great implementation! Could you please share some insights on the training? How many generator steps did it take in total to fully converge? Did you manage to reach DeepMind-level performance?

l4fl4m3 commented 1 year ago

It's been quite a while since I looked at this, but here is what I wrote about the training in a blog post:

"I trained/am training it distributed across 8 TPU (v3) cores, kindly provided by Google’s TPU Research Cloud Program. Due to the model size, I used a (global) batch size of 8 per step, and 3 samples per input during the generation step, prior to grid cell regularization, as opposed to the 6 used in the paper. All the other hyperparameters align with the paper. I experimented around with a bunch of GCP VM and GPU configurations, but after refactoring the training loop to run on a TPU, I realized it would work best, and Google kindly provided the service for free through their program. The training is occasionally check-pointed to cloud, and since the dataset is so large, some refactoring needs to be done to incorporate some sort of epoch structure during training."

Here's the blog post (it probably won't add much to what I've copy-pasted above): https://hassaann.medium.com/skillful-precipitation-nowcasting-an-implementation-of-deepminds-dgmr-aa68764f2ae2
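For reference, the grid cell regularization mentioned above averages the generator's samples for a given input and then takes an intensity-weighted L1 error against the observed frames, so using 3 samples instead of the paper's 6 mainly means a noisier estimate of that sample mean. A rough sketch of the term (my reading of the paper, not code lifted from this repo):

```python
import numpy as np

def grid_cell_regularizer(generated_samples, targets, cap=24.0):
    """Intensity-weighted L1 term between the sample mean and the target.

    generated_samples: array [num_samples, T, H, W], generator outputs for
                       one input sequence (3 samples here, 6 in the paper).
    targets:           array [T, H, W], the observed radar frames.
    """
    sample_mean = generated_samples.mean(axis=0)   # average over the samples
    weights = np.clip(targets, 0.0, cap)           # weight by rain rate, capped (~24 per the paper)
    return np.mean(np.abs(sample_mean - targets) * weights)
```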

From what I recall, it was not converging as one would hope; the loss curve was just sort of plateauing after a while, and that was after running it for a few days straight on the TPUs. I did, however, feed in some test frames, and the predictions looked alright visually, though I never checked them for numerical accuracy.
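If anyone wants to put a number on that rather than eyeballing frames, CSI at a rain-rate threshold (one of the verification metrics used in the DGMR paper) is quick to compute. A minimal sketch:

```python
import numpy as np

def csi(pred, target, threshold=1.0):
    """Critical Success Index at a rain-rate threshold (mm/h)."""
    hits = np.sum((pred >= threshold) & (target >= threshold))
    misses = np.sum((pred < threshold) & (target >= threshold))
    false_alarms = np.sum((pred >= threshold) & (target < threshold))
    return hits / max(hits + misses + false_alarms, 1)
```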

Best of luck.