glotzerlab / hoomd-blue

Molecular dynamics and Monte Carlo soft matter simulation on GPUs.
http://glotzerlab.engin.umich.edu/hoomd-blue
BSD 3-Clause "New" or "Revised" License

Fix read after write hazard in the GPU implementations of `ALJ`, `Dipole`, and `Patchy*` potentials. #1944

Closed · tcmoore3 closed this 2 days ago

tcmoore3 commented 1 week ago

Description

This PR removes a race condition on shared memory in the anisotropic pair potential GPU kernel launcher by adding a call to `__syncthreads()` after the per-pair and per-type parameters are loaded.
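For context, here is a minimal CUDA sketch of the general pattern this kind of fix addresses. It is not HOOMD's actual kernel; the kernel, parameter names, and types are hypothetical. Each block cooperatively stages parameters in shared memory and must hit a barrier before any thread reads them.

```cuda
// Hypothetical illustration of the shared-memory staging pattern and the
// barrier that prevents the read-after-write hazard. Not HOOMD source code.
__global__ void pair_force_kernel(const float4* d_params,
                                  unsigned int n_param_entries /*, ... */)
    {
    extern __shared__ float4 s_params[];

    // Cooperative load: threads in the block stride over the parameter array.
    for (unsigned int i = threadIdx.x; i < n_param_entries; i += blockDim.x)
        {
        s_params[i] = d_params[i];
        }

    // Without this barrier, a thread may read s_params[k] in the force loop
    // below before the thread responsible for writing s_params[k] has stored
    // it: a read-after-write hazard across threads in the block.
    __syncthreads();

    // ... per-particle force loop that reads entries of s_params ...
    }
```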

Motivation and context

Many users have reported sporadic crashes when using `hoomd.md.pair.aniso.ALJ` on the GPU, often reported as a divide by zero floating point exception. It turns out that the exception was only a symptom of the bug: running with GPU error checking enabled reported an illegal memory access instead of the floating point exception. The illegal access was triggered by the race condition that is fixed in this PR.

How has this been tested?

`compute-sanitizer --tool racecheck` previously reported a race condition when running a simulation that uses `hoomd.md.pair.aniso.ALJ`; the same simulation runs cleanly with this fix.


joaander commented 3 days ago

I've gotten a lot of questions about this bugfix. The short answer is that any simulation previously run with ALJ on any GPU is incorrect. Similarly, any simulation run with Dipole on any GPU since v3.0 is incorrect. How incorrect? That is difficult to say, as this bug is a race condition and it may be triggered more or less often depending on a variety of conditions. In some tests, forces appear to be incorrect at a rate of roughly 1 in every 10 million pair force computations. The ALJ kernel would trigger an out-of-bounds memory access (typically surfacing as an integer divide by zero exception) at a rate of roughly 1 in hundreds of millions of pair force computations. Your mileage may vary.

Your best option is to rerun ALL simulations that used ALJ or Dipole on the GPU. When you do so, you are likely to see better momentum conservation. In thermostatted simulations, the energy spikes from the incorrect forces may have been absorbed without qualitatively changing results. You should still rerun the simulations to verify.

joaander commented 2 days ago

@cbkerr This bug would cause problems with the patchy potentials on the GPU as well.