google-research / neuralgcm

Hybrid ML + physics model of the Earth's atmosphere
https://neuralgcm.readthedocs.io
Apache License 2.0

GH Artifacts (Horizontal lines) when running on L4 GPU #124

Open leanderloew opened 2 weeks ago

leanderloew commented 2 weeks ago

L4 GPU:

[Image: Screenshot 2024-09-13 at 16 25 27]

T4 GPU:

[Image: Screenshot 2024-09-13 at 16 26 10]

Other variables look fine. I ran this with the neural_gcm_dynamic_forcing_stochastic_1_4_deg model.

shoyer commented 1 week ago

Thanks for the report!

Since this artifact only shows up in the latitudinal direction, my guess is that it is somehow related to errors in the spherical harmonic transform. Perhaps the L4 and T4 GPUs have different Tensor Cores, with slightly different numerical precision?

Can you try setting the precision for all spherical harmonic transforms to full float32? https://github.com/google-research/neuralgcm/issues/56#issuecomment-2091121455
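
For readers who want to see what a precision setting of this kind looks like, here is a minimal JAX sketch. It is not the code from the linked comment, which is not reproduced in this thread, and it does not touch NeuralGCM itself. It only uses JAX's general matmul precision controls on a toy matrix product. Whether NeuralGCM's spherical harmonic transforms pick up the global default, rather than setting their own precision, is an assumption you would need to check against the linked comment.

```python
import numpy as np
import jax
import jax.numpy as jnp

# Toy float32 inputs; the float64 NumPy product serves as the reference.
rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)
reference = a.astype(np.float64) @ b.astype(np.float64)


def max_abs_error(precision):
    """Max abs error of a float32 matmul run under the given JAX precision."""
    # The context manager changes the default precision for matmuls inside it.
    with jax.default_matmul_precision(precision):
        out = jnp.matmul(a, b)
    return float(np.max(np.abs(np.asarray(out, dtype=np.float64) - reference)))


# On accelerators with reduced-precision matmul units the two values can
# differ; on CPU the precision hint is typically ignored and they match.
err_fast = max_abs_error("bfloat16")
err_full = max_abs_error("float32")
print(f"bfloat16 passes: {err_fast:.3e}   float32: {err_full:.3e}")

# Global alternative: request full float32 for every subsequent matmul.
# (Assumption: the model code does not override this default internally.)
jax.config.update("jax_default_matmul_precision", "float32")
```

If the two errors differ noticeably on the L4 but not on the T4, that would be consistent with the precision hypothesis above, though it would not by itself prove that the spherical harmonic transform is the source of the artifact.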