Inconsistent results when predicting on batches with GPU

Hi! I was training m3gnet on a specific set of crystals and noticed that evaluating the trained model gave RMSE values that differed by about a factor of 3 depending on whether the evaluation ran on GPU or CPU.

Digging deeper, I found that when run on GPU with batched inputs, m3gnet predicts somewhat biased energies compared to what it gives for single-structure inputs (batch size = 1) or when running on CPU. I was able to reproduce this bias even with the pre-trained m3gnet. For the pre-trained model the bias is not too large, but it is clearly larger than 32-bit floating-point precision. Whether or not tf.function is used (controlled globally by tf.config.run_functions_eagerly(...)) also affects the result.

Here are some details about my environment:
tensorflow 2.9.2
Driver Version: 515.48.07
CUDA Version: 11.7
GPU: NVIDIA A40

I was not able to reproduce it on a different machine (with different GPU and CUDA).

Here's the code to reproduce:

import tensorflow as tf
import numpy as np
from ase import Atoms
from tqdm import tqdm
import matplotlib.pyplot as plt

from m3gnet.models import M3GNet, Potential
from m3gnet.graph import MaterialGraphBatchEnergyForceStress

for d in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(d, True)

batch_size = 128

structure = Atoms(
    "Cl4Ag4", pbc=True,
    cell=np.diag([5.5956, 5.5956, 5.5956]),
    positions=np.array([
       [0.    , 0.    , 0.    ],
       [2.7978, 2.7978, 0.    ],
       [0.    , 2.7978, 2.7978],
       [2.7978, 0.    , 2.7978],
       [0.    , 2.7978, 0.    ],
       [2.7978, 0.    , 0.    ],
       [0.    , 0.    , 2.7978],
       [5.2459, 1.3989, 4.3715],
    ])
)

# v1: energy of a single structure via Potential.get_energies
def eval_structure_v1(struct):
    m3gnet = M3GNet.load()
    potential = Potential(m3gnet)

    structure_graph = m3gnet.graph_converter(struct)
    return potential.get_energies(structure_graph.as_tf().as_list()).numpy().squeeze()

# v2: energy of a single structure via Potential.get_ef_tensor (energy and forces)
def eval_structure_v2(struct):
    m3gnet = M3GNet.load()
    potential = Potential(m3gnet)

    graph = m3gnet.graph_converter(struct)
    pred_e, _ = potential.get_ef_tensor(graph.as_tf().as_list())
    return pred_e.numpy().squeeze()

# v3: energies for a batch of `batch_size` identical copies of the structure
def eval_structure_v3(struct, batch_size=batch_size):
    m3gnet = M3GNet.load()
    potential = Potential(m3gnet)

    mgb = MaterialGraphBatchEnergyForceStress(
        [m3gnet.graph_converter(struct) for _ in range(batch_size)],
        energies=[0.0] * batch_size,
        forces=[np.zeros((8, 3)) for _ in range(batch_size)],
        stresses=None,
        batch_size=batch_size,
        shuffle=False,
    )

    graph, _ = next(iter(mgb))
    pred_e, _ = potential.get_ef_tensor(graph.as_tf().as_list())
    return pred_e.numpy().squeeze()

print(structure)

results_mean = {}
results_std = {}

bsizes = np.unique(np.round(np.logspace(0, 7, 30, base=2)).astype(int))
output = ""
for device in ["gpu:0", "cpu:0"]:
    for use_tf_func in [True, False]:
        key = f"{device}--useTfFunc:{use_tf_func}"
        results_mean[key] = []
        results_std[key] = []

        output += f"{device}, tf.function {use_tf_func}\n"
        tf.config.run_functions_eagerly(not use_tf_func)
        with tf.device(device):
            output += f"  v1: {eval_structure_v1(structure.copy())}\n"
            output += f"  v2: {eval_structure_v2(structure.copy())}\n"
            e_v3 = eval_structure_v3(structure.copy())
            output += f"  v3: {e_v3.min()}, {e_v3.max()}, {e_v3.mean()}\n"

            for bs in tqdm(bsizes):
                e_v3 = eval_structure_v3(structure.copy(), batch_size=bs)
                results_mean[key].append(e_v3.mean())
                results_std[key].append(e_v3.std())

        output += "\n"

print("", flush=True)
print(output)

for key in results_mean:
    plt.errorbar(x=bsizes, y=results_mean[key], yerr=results_std[key], label=key)
plt.legend()
plt.xlabel("batch size")
plt.ylabel("predicted energy")
plt.savefig("m3gnet_bug.png")

Here's what I see on the plot (energy vs. batch size):

[Figure: predicted energy vs. batch size for each device and tf.function setting]

Printout (note how the gpu v3 values differ from the rest):

gpu:0, tf.function True
  v1: -10.224565505981445
  v2: -10.224565505981445
  v3: -10.216326713562012, -10.216205596923828, -10.216231346130371

gpu:0, tf.function False
  v1: -10.224565505981445
  v2: -10.224565505981445
  v3: -10.223982810974121, -10.223982810974121, -10.223982810974121

cpu:0, tf.function True
  v1: -10.224571228027344
  v2: -10.224571228027344
  v3: -10.22457218170166, -10.22457218170166, -10.224573135375977

cpu:0, tf.function False
  v1: -10.224571228027344
  v2: -10.224571228027344
  v3: -10.22457218170166, -10.224571228027344, -10.224573135375977
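
As a rough sanity check on the "larger than 32-bit floating-point precision" claim, the sketch below expresses the differences in units of float32 spacing (ULPs) near the predicted energy. The three energies are copied from the A40 printout above; the helper name `gap_in_ulps` is made up for illustration and is not part of m3gnet.

```python
import numpy as np

# Values copied from the A40 printout above.
e_gpu_single = np.float32(-10.224565505981445)  # gpu:0, v1/v2 (batch size 1)
e_gpu_batch = np.float32(-10.216231346130371)   # gpu:0, tf.function True, v3 mean (batch size 128)
e_cpu_single = np.float32(-10.224571228027344)  # cpu:0, v1/v2 (batch size 1)

# Spacing between adjacent float32 values near the predicted energy (one ULP).
ulp = float(np.spacing(np.abs(e_gpu_single)))


def gap_in_ulps(a, b):
    """Absolute difference between two energies, in float32 ULPs (hypothetical helper)."""
    return abs(float(a) - float(b)) / ulp


ulps_cpu_vs_gpu = gap_in_ulps(e_cpu_single, e_gpu_single)
ulps_batch_vs_single = gap_in_ulps(e_gpu_batch, e_gpu_single)

print(f"float32 ULP near the energy: {ulp:.3e}")
print(f"cpu vs gpu, single structure: {ulps_cpu_vs_gpu:.0f} ULPs")
print(f"gpu batch vs gpu single:      {ulps_batch_vs_single:.0f} ULPs")
```

With these numbers, the CPU and GPU single-structure results differ by only a handful of ULPs, which is ordinary rounding noise, while the batched GPU mean is off by several thousand ULPs.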

When run on Google Colab (CUDA 11.6, Tesla T4 GPU), the same code gives the following (much more consistent) result:

[Figure: predicted energy vs. batch size on Colab]

Printout (again, much more consistent):

gpu:0, tf.function True
  v1: -10.224571228027344
  v2: -10.224571228027344
  v3: -10.224573135375977, -10.224570274353027, -10.224571228027344

gpu:0, tf.function False
  v1: -10.224571228027344
  v2: -10.224571228027344
  v3: -10.224573135375977, -10.224569320678711, -10.224571228027344

cpu:0, tf.function True
  v1: -10.22457218170166
  v2: -10.22457218170166
  v3: -10.22457218170166, -10.22457218170166, -10.22457218170166

cpu:0, tf.function False
  v1: -10.22457218170166
  v2: -10.22457218170166
  v3: -10.22457218170166, -10.22457218170166, -10.22457218170166

materialsvirtuallab/m3gnet, issue #54