Closed tcmoore3 closed 2 days ago
I've gotten a lot of questions about this bugfix. The short answer is that any simulation previously run with ALJ on any GPU is incorrect. Similarly, any simulation run with Dipole on any GPU since v3.0 is incorrect. How incorrect? That is difficult to say, as this bug is a race condition and it may be triggered more or less often depending on a variety of conditions. In some tests, it seems that forces are incorrect at a rate of 1 out of every ten million pair force computations. The ALJ kernel would trigger an out-of-bounds memory access (typically triggering an integer divide by 0 exception) at a rate of 1 out of hundreds of millions of pair force computations. Your mileage may vary.
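For a rough sense of scale, the rates quoted above can be turned into an expected count of bad force evaluations. This is only back-of-the-envelope arithmetic. The function names, the example system size, and the assumption that each pair computation fails independently at a fixed rate are all illustrative; a real race condition need not behave that way.

```python
import math


def expected_bad_events(pairs_per_step, steps, rate):
    """Expected number of incorrect pair force computations.

    Assumes (hypothetically) a fixed, independent failure rate per computation.
    """
    return pairs_per_step * steps * rate


def prob_at_least_one(pairs_per_step, steps, rate):
    """Probability of at least one bad computation, 1 - (1 - rate)**n."""
    n = pairs_per_step * steps
    # expm1/log1p keep the result accurate when rate is tiny and n is huge.
    return -math.expm1(n * math.log1p(-rate))


# Hypothetical run: one million pair computations per step for one thousand steps,
# using the rough 1-in-ten-million rate mentioned in the comment.
bad = expected_bad_events(1_000_000, 1_000, 1e-7)
p_any = prob_at_least_one(1_000_000, 1_000, 1e-7)
print(bad, p_any)
```

Under these assumptions even a modest run would be expected to contain on the order of a hundred incorrect force evaluations, which is why rerunning is the safe choice.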
Your best option is to rerun ALL simulations that used ALJ or Dipole on the GPU. When you do so, you are likely to see better momentum conservation. In thermostatted simulations, the energy spikes from the incorrect forces may have been absorbed without qualitatively changing results. You should still rerun the simulations to verify.
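One way to compare an old run with a rerun is to track the total linear momentum over time. The sketch below is plain Python and does not call any HOOMD-blue API. The function names are hypothetical, and how you obtain per-particle masses and velocities from your own trajectory is left to you.

```python
def total_momentum(masses, velocities):
    """Sum m * v over all particles; velocities is a sequence of (vx, vy, vz)."""
    px = py = pz = 0.0
    for m, (vx, vy, vz) in zip(masses, velocities):
        px += m * vx
        py += m * vy
        pz += m * vz
    return (px, py, pz)


def momentum_drift(p_start, p_end):
    """Largest absolute change in any momentum component between two frames."""
    return max(abs(b - a) for a, b in zip(p_start, p_end))


# Hypothetical two-particle frame whose momenta cancel exactly.
masses = [1.0, 2.0]
velocities = [(1.0, 0.0, 0.0), (-0.5, 0.0, 0.0)]
print(total_momentum(masses, velocities))
```

In a run where total momentum should be conserved, a drift that grows in the old trajectory but stays near round-off in the rerun would be consistent with the incorrect forces described above.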
@cbkerr This bug would cause problems with the patchy potentials on the GPU as well.
Description
This PR removes a race condition on the shared memory in the anisotropic pair potential GPU kernel launcher by adding a call to __syncthreads() after the per-pair and per-type parameters are loaded.
Motivation and context
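As context for the kind of race involved, here is a CPU-side analogy in Python. It is not the HOOMD-blue kernel. Worker threads cooperatively fill a shared buffer, and threading.Barrier stands in for __syncthreads() so that no thread reads the buffer until every thread has finished loading. The function name and data are hypothetical.

```python
import threading


def load_then_read(params, n_threads):
    """Cooperatively copy params into a shared buffer, then have every thread read it."""
    shared = [None] * len(params)
    results = [None] * n_threads
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        # Cooperative load: thread tid fills entries tid, tid + n_threads, ...
        for i in range(tid, len(params), n_threads):
            shared[i] = params[i]
        # Analogue of __syncthreads(): wait until all threads have finished loading.
        barrier.wait()
        # Safe to read entries written by other threads only after the barrier.
        results[tid] = sum(shared)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(load_then_read([1.0, 2.0, 3.0, 4.0, 5.0], 4))
```

Without the barrier, a thread could reach the read step while another thread's entries are still unset, which is the same class of error as reading shared memory before the parameter load has completed.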
Many users have reported arbitrary crashes when using hoomd.md.pair.aniso.ALJ on the GPU, often reported as a divide by zero floating point exception. It turns out that was only a symptom of the bug; running with the GPU error checking reported an illegal memory access instead of the floating point exception. The illegal access was triggered by the race condition that is fixed in this PR.
How has this been tested?
compute-sanitizer --tool racecheck previously reported a race condition when running a simulation that uses hoomd.md.pair.aniso.ALJ. With this fix, it runs cleanly.
Checklist:
sphinx-doc/credits.rst in the pull request source branch.
CHANGELOG.rst following the established format.