**Closed.** SiLiKhon closed this issue 1 year ago.
Turns out I overlooked this: https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_tensor_float_32_execution

Apparently, the NVIDIA A40 supports TensorFloat-32 execution (which TensorFlow enables by default on supporting hardware), and that is what causes this slight inconsistency; the Tesla T4, being pre-Ampere, does not support TF32, which explains why Colab gave consistent results. Setting this option to `False` makes everything consistent. Sorry for the noise.
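For reference, the workaround is a one-line global setting (the call itself is the documented TF API; placing it before model construction is my assumption):

```python
import tensorflow as tf

# Disable TensorFloat-32 so that float32 matmuls/convolutions on
# Ampere GPUs (such as the A40) run at full float32 precision.
tf.config.experimental.enable_tensor_float_32_execution(False)
```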
Hi! I was trying to train m3gnet on a specific set of crystals and noticed that evaluating the trained model gave me an RMSE that differed by roughly 3x depending on whether the evaluation ran on GPU or CPU.
Digging deeper, I was able to spot that, when run on GPU with batched inputs, m3gnet predicts somewhat biased energies compared to what it gives for single-structure (batch size = 1) inputs or when running on CPU. I was able to reproduce this bias even with the pre-trained m3gnet. For the pretrained model the bias is not too large, but it is certainly larger than 32-bit floating-point round-off. Whether or not `tf.function` is used (as controlled globally by `tf.config.run_functions_eagerly(...)`; see the snippet below) also affects the result.

Here are some details about my environment:

- tensorflow 2.9.2
- Driver Version: 515.48.07
- CUDA Version: 11.7
- GPU: NVIDIA A40
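The toggle itself is one call (a minimal sketch; `True` forces eager execution, `False` restores `tf.function` graph compilation):

```python
import tensorflow as tf

# Run all tf.function-decorated code eagerly, to compare
# against the compiled-graph results.
tf.config.run_functions_eagerly(True)
```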
I was not able to reproduce the issue on a different machine (with a different GPU and CUDA version).
Here's the code to reproduce: *(snippet not preserved)*
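In its place, here is a minimal standalone sketch that isolates the same effect (my illustration, not the original m3gnet code; it assumes an Ampere-class GPU such as the A40 is visible as `/GPU:0`, and the matrix sizes and seed are arbitrary):

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
a = tf.constant(rng.standard_normal((1024, 1024)), dtype=tf.float32)
b = tf.constant(rng.standard_normal((1024, 1024)), dtype=tf.float32)

# Same float32 matmul with TensorFloat-32 enabled vs disabled.
tf.config.experimental.enable_tensor_float_32_execution(True)
with tf.device("/GPU:0"):
    out_tf32 = tf.linalg.matmul(a, b)

tf.config.experimental.enable_tensor_float_32_execution(False)
with tf.device("/GPU:0"):
    out_fp32 = tf.linalg.matmul(a, b)

# On an Ampere GPU the max difference is well above float32 round-off;
# on pre-Ampere GPUs (e.g. the T4) the two results are identical.
print(tf.reduce_max(tf.abs(out_tf32 - out_fp32)).numpy())
```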
Here's what I see on the plot (energy vs batch size): *(plot not preserved)*

Printout (note how `gpu v3` differs from the rest): *(output not preserved)*
When run on Google Colab (CUDA 11.6, Tesla T4 GPU), the same code gives the following, much more consistent, result: *(plot not preserved)*

Printout (again, much more consistent): *(output not preserved)*