Apply switching fxn - Githubissues

Apply a switching function to erfc reaction field, so that energies (and a couple derivatives) go to zero at cutoff.

Main considerations:

Introduce a minimal change to the current f(x) = erfc(2*x) damping function.
Respect boundary conditions f(cutoff) = 0, f'(cutoff) = 0, f''(cutoff) = 0

Implementation approach:

Make minimal modifications to the (unused) heuristic switch function cos(0.5*pi*dij^8)^2 that's been in nonbonded.py for a few years, so that it has desired boundary conditions:
- replace dij with dij/cutoff
- replace cos^2 with cos^3
Modify nonbonded.py potentials and basis functions to use switch_fn.
Modify timemachine/cpp/src/kernels/k_nonbonded_common.cuh and timemachine/cpp/src/kernels/k_nonbonded_precomputed.cuh accordingly.
- This PR increases the amount of arithmetic performed by the function real_es_factor, which is called for every interacting particle pair and is under significant performance pressure.
- This PR already reflects several performance suggestions from @proteneer, esp. a helpful suggestion to use the __sincosf intrinsic.

Current state:

Not performance-neutral
- Slows down MD by ~5-10% across the range of conditions in test_benchmark.py
Not tech-debt-neutral
- Currently hard-codes cutoff=1.2. TBD: decide if we want to hard-code cutoff=1.2 as a project-wide constant, or modify real_es_factor etc. to accept cutoff as a parameter. (Note: the DHFR benchmark runs silently with cutoff=1.0.)
- Increases the number of device function definitions in timemachine/cpp/src/kernels/k_nonbonded_common.cuh.
Tests not passing
- du_dp failing (but maybe a false positive, close to test threshold)
- bitwise tests have yet to be updated

Significance:

The size of the cutoff artifact was approximately 0.05 kJ/mol for a typical particle pair (assuming TIP3P oxygen charges, beta=2.0/nm, cutoff=1.2nm). When summed over a large number of pairs, or when the charges are larger, energy discontinuity at cutoff may not be negligible.
VelocityVerlet energy conservation was poor. (Fluctuations of 10s of kJ/mol on master when cutoff=1.2nm, dt=0.1fs. If cutoff is doubled to 2.4nm or if this switch fn is applied, energy fluctuations are on the order of 0.1 kJ/mol.)
Nonphysical RDF artifacts were measurable at cutoff when bulk water is sampled with MD. (And very different RDF artifacts appeared when sampled with MCMC.)
I believe the associated numerical artifacts may have led to inaccurate impressions about feasibility of certain classes of methods, e.g. that HMC (and possibly other MD/MC combinations) were infeasible in TM. For example, because energy fluctuations with VelocityVerlet were large even when the timestep is small, HMC appears to have an infeasibly low acceptance rate in TM, but this is resolved when the cutoff is increased or when a switch function is applied.

Other directions considered but pruned:

Adopting a different reaction field. For example, this reaction field seems well-motivated and is also computationally simpler and cheaper, but it would require more significant validation since it is a larger change from erfc(2*x). This direction was suggested by JDC a few years ago, I previously investigated a little with Joe last year, and I pursued implementation again briefly in this branch.
Modifying the cutoff implementation so that the cutoff artifact always occurs at the cutoff distance (rather than anywhere between 0 and cutoff depending on lambda). This wouldn't fix cutoff artifacts present even in non-alchemical simulations, but may rule out a possible class of idiosyncratic bugs / behaviors in TM alchemical simulations. Implemented a little in this branch.

Future directions:

Continue validating erfc(2*dij) * cos(0.5*pi*(dij/1.2)^8)^3 reaction field.
Approximate switch_fn(x) * erfc(2*x) more cheaply. But, since the branch is currently within ~10% of the speed of master, opening it up for review and testing. (Maybe we'll do some more testing and find we want to tweak the functional form of switch_fn a little.)
- Performance experiment: If we replace the current fast erfc implementation with the cheapest placeholder "return 0.0;", that appears to speed up NVT MD benchmarks by ~10%. So, maybe, if we had an even faster erfc approximation, we could offset the cost of cos(0.5*pi*(dij/1.2)^8)^3.
- I briefly attempted to swap in faster erfc approximations based on Norbert Juffa's forum posts https://forums.developer.nvidia.com/t/calling-all-juffas-whats-up-with-erfcf-nowadays/262973 , https://forums.developer.nvidia.com/t/a-very-fast-low-precision-implementation-of-erff-for-compute-capability-7-5/275503 , but these didn't render the change performance-neutral overall. (See also https://github.com/proteneer/timemachine/pull/1080 , which reported a 1% speedup.)
- I briefly attempted to swap in polynomial approximations earlier in this branch, and I have some pointers from @proteneer , but doing this properly will require more attention.
Possibly consider applying a switching function to LJ, but payoff will likely be minimal.

Timings

Benchmark timings, on an RTX 3080. Grain of salt 1: other stuff running in background. Grain of salt 2: proposed implementation isn't compatible with the cutoff used in DHFR benchmark.

current

```text dhfr-apo: N=23558 speed: 702.66ns/day dt: 2.5fs (ran 100000 steps in 30.74s) dhfr-apo-barostat-interval-25: N=23558 speed: 632.82ns/day dt: 2.5fs (ran 100000 steps in 34.14s) hif2a-apo: N=8805 speed: 1260.11ns/day dt: 2.5fs (ran 100000 steps in 17.14s) hif2a-apo-barostat-interval-25: N=8805 speed: 1042.06ns/day dt: 2.5fs (ran 100000 steps in 20.73s) hif2a-rbfe-barostat-interval-25: N=8840 speed: 833.62ns/day dt: 2.5fs (ran 100000 steps in 25.91s) hif2a-rbfe-local: N=8840 speed: 1431.88ns/day dt: 2.5fs (ran 100000 steps in 15.09s) solvent-apo: N=6282 speed: 1584.65ns/day dt: 2.5fs (ran 100000 steps in 13.63s) solvent-apo-barostat-interval-25: N=6282 speed: 1388.41ns/day dt: 2.5fs (ran 100000 steps in 15.56s) solvent-rbfe-barostat-interval-25: N=6317 speed: 1179.57ns/day dt: 2.5fs (ran 100000 steps in 18.32s) solvent-rbfe-local: N=6317 speed: 1672.97ns/day dt: 2.5fs (ran 100000 steps in 12.92s) vacuum-rbfe: N=35 speed: 11138.93ns/day dt: 2.5fs (ran 100000 steps in 1.94s) ```

proposed (5-10% slower)

```text dhfr-apo: N=23558 speed: 661.35ns/day dt: 2.5fs (ran 100000 steps in 32.66s) dhfr-apo-barostat-interval-25: N=23558 speed: 593.45ns/day dt: 2.5fs (ran 100000 steps in 36.40s) hif2a-apo: N=8805 speed: 1174.16ns/day dt: 2.5fs (ran 100000 steps in 18.40s) hif2a-apo-barostat-interval-25: N=8805 speed: 979.97ns/day dt: 2.5fs (ran 100000 steps in 22.04s) hif2a-rbfe-barostat-interval-25: N=8840 speed: 793.70ns/day dt: 2.5fs (ran 100000 steps in 27.22s) hif2a-rbfe-local: N=8840 speed: 1399.90ns/day dt: 2.5fs (ran 100000 steps in 15.43s) solvent-apo: N=6282 speed: 1500.61ns/day dt: 2.5fs (ran 100000 steps in 14.40s) solvent-apo-barostat-interval-25: N=6282 speed: 1312.82ns/day dt: 2.5fs (ran 100000 steps in 16.46s) solvent-rbfe-barostat-interval-25: N=6317 speed: 1128.24ns/day dt: 2.5fs (ran 100000 steps in 19.15s) solvent-rbfe-local: N=6317 speed: 1643.52ns/day dt: 2.5fs (ran 100000 steps in 13.15s) vacuum-rbfe: N=35 speed: 10314.47ns/day dt: 2.5fs (ran 100000 steps in 2.10s) ```

__sincosf is courtesy of @dmclark17

Currently hard-codes cutoff=1.2. TBD: decide if we want to hard-code cutoff=1.2 as a project-wide constant, or modify real_es_factor etc. to accept cutoff as a parameter. (Note: the DHFR benchmark runs silently with cutoff=1.0.)

Note that the DHFR benchmark runs with 1.0 to allow us to more directly compare against OpenMM. I would advocate for accepting it as a parameter for that reason

Note that the DHFR benchmark runs with 1.0 to allow us to more directly compare against OpenMM. I would advocate for accepting it as a parameter for that reason

I believe the timing comparison to OpenMM is the only current use case for cutoff != 1.2nm in the project?

I suppose a direct performance comparison to OpenMM may also involve matching OpenMM's reaction field. (Currently running with cutoff = 1.0 does lead to a larger truncation artifact, which could impact densities and therefore timings.)

Leaning towards exposing the cutoff parameter, mostly for reasons of flexibility.

Note that the DHFR benchmark runs with 1.0 to allow us to more directly compare against OpenMM. I would advocate for accepting it as a parameter for that reason

I believe the timing comparison to OpenMM is the only current use case for cutoff != 1.2nm in the project?

I suppose a direct performance comparison to OpenMM may also involve matching OpenMM's reaction field. (Currently running with cutoff = 1.0 does lead to a larger truncation artifact, which could impact densities and therefore timings.)

Leaning towards exposing the cutoff parameter, mostly for reasons of flexibility.

Not sure if this is a good idea, but I suppose we could make cutoff a template parameter to allow for compiler optimizations while still maintaining flexibility to test with a different cutoff.

Timings as of https://github.com/proteneer/timemachine/pull/1290/commits/92c16263e3706cde4244e136370fb16bcdc040fb, maybe a few ns/day slower than first report, but probably within noise

dhfr-apo: N=23558 speed: 656.10ns/day dt: 2.5fs (ran 100000 steps in 32.93s)
dhfr-apo-barostat-interval-25: N=23558 speed: 589.55ns/day dt: 2.5fs (ran 100000 steps in 36.64s)

hif2a-apo: N=8805 speed: 1157.51ns/day dt: 2.5fs (ran 100000 steps in 18.66s)
hif2a-apo-barostat-interval-25: N=8805 speed: 977.32ns/day dt: 2.5fs (ran 100000 steps in 22.10s)

hif2a-rbfe-barostat-interval-25: N=8840 speed: 786.30ns/day dt: 2.5fs (ran 100000 steps in 27.47s)
hif2a-rbfe-local: N=8840 speed: 1399.82ns/day dt: 2.5fs (ran 100000 steps in 15.44s)

solvent-apo: N=6282 speed: 1487.89ns/day dt: 2.5fs (ran 100000 steps in 14.52s)
solvent-apo-barostat-interval-25: N=6282 speed: 1304.80ns/day dt: 2.5fs (ran 100000 steps in 16.56s)

solvent-rbfe-barostat-interval-25: N=6317 speed: 1112.19ns/day dt: 2.5fs (ran 100000 steps in 19.42s)
solvent-rbfe-local: N=6317 speed: 1646.61ns/day dt: 2.5fs (ran 100000 steps in 13.12s)

vacuum-rbfe: N=35 speed: 10269.27ns/day dt: 2.5fs (ran 100000 steps in 2.11s)

Looking at the performance between master and this PR as of https://github.com/proteneer/timemachine/pull/1290/commits/15221493bc52f1782ef0ac80e6635e168af96bb7 on an A10 (a Tesla card) that had nothing else running:

Looks to be about 5% performance difference, though it goes as low as ~1% for vacuum and as high as ~8% for hif2a-apo. On a side-note: Interesting to see that vacuum is so much faster on a RTX 3080 than the A10

Master

dhfr-apo: N=23558 speed: 742.84ns/day dt: 2.5fs (ran 100000 steps in 29.08s)
dhfr-apo-barostat-interval-25: N=23558 speed: 666.11ns/day dt: 2.5fs (ran 100000 steps in 32.43s)

hif2a-apo: N=8805 speed: 1289.54ns/day dt: 2.5fs (ran 100000 steps in 16.75s)
hif2a-apo-barostat-interval-25: N=8805 speed: 1076.49ns/day dt: 2.5fs (ran 100000 steps in 20.07s)
hif2a-rbfe-barostat-interval-25: N=8840 speed: 825.48ns/day dt: 2.5fs (ran 100000 steps in 26.17s)
hif2a-rbfe-local: N=8840 speed: 1333.75ns/day dt: 2.5fs (ran 100000 steps in 16.20s)

solvent-apo: N=6282 speed: 1536.51ns/day dt: 2.5fs (ran 100000 steps in 14.06s)
solvent-apo-barostat-interval-25: N=6282 speed: 1372.01ns/day dt: 2.5fs (ran 100000 steps in 15.75s)
solvent-rbfe-barostat-interval-25: N=6317 speed: 1135.66ns/day dt: 2.5fs (ran 100000 steps in 19.02s)
solvent-rbfe-local: N=6317 speed: 1485.72ns/day dt: 2.5fs (ran 100000 steps in 14.54s)

vacuum-rbfe: N=35 speed: 9055.34ns/day dt: 2.5fs (ran 100000 steps in 2.39s)

NonbondedInteractionGroup_f32: N=8840 Frames=1000 Params=5 speed: 1178.51 executions/seconds (ran 10000 potentials in 8.49s) du_dp=True, du_dx=True, u=True
NonbondedInteractionGroup_f64: N=8840 Frames=1000 Params=5 speed: 620.84 executions/seconds (ran 10000 potentials in 16.11s) du_dp=True, du_dx=True, u=True
HarmonicBond_f32: N=8840 Frames=1000 Params=5 speed: 1501.78 executions/seconds (ran 10000 potentials in 6.66s) du_dp=True, du_dx=True, u=True
HarmonicBond_f64: N=8840 Frames=1000 Params=5 speed: 1499.59 executions/seconds (ran 10000 potentials in 6.67s) du_dp=True, du_dx=True, u=True
HarmonicAngleStable_f32: N=8840 Frames=1000 Params=5 speed: 1461.24 executions/seconds (ran 10000 potentials in 6.84s) du_dp=True, du_dx=True, u=True
HarmonicAngleStable_f64: N=8840 Frames=1000 Params=5 speed: 1437.29 executions/seconds (ran 10000 potentials in 6.96s) du_dp=True, du_dx=True, u=True
PeriodicTorsion_f32: N=8840 Frames=1000 Params=5 speed: 1463.22 executions/seconds (ran 10000 potentials in 6.83s) du_dp=True, du_dx=True, u=True
PeriodicTorsion_f64: N=8840 Frames=1000 Params=5 speed: 1438.76 executions/seconds (ran 10000 potentials in 6.95s) du_dp=True, du_dx=True, u=True
ChiralAtomRestraint_f32: N=8840 Frames=1000 Params=5 speed: 1676.79 executions/seconds (ran 10000 potentials in 5.96s) du_dp=True, du_dx=True, u=True
ChiralAtomRestraint_f64: N=8840 Frames=1000 Params=5 speed: 1651.38 executions/seconds (ran 10000 potentials in 6.06s) du_dp=True, du_dx=True, u=True
NonbondedPairListPrecomputed_f32: N=8840 Frames=1000 Params=5 speed: 1605.44 executions/seconds (ran 10000 potentials in 6.23s) du_dp=True, du_dx=True, u=True
NonbondedPairListPrecomputed_f64: N=8840 Frames=1000 Params=5 speed: 1583.63 executions/seconds (ran 10000 potentials in 6.31s) du_dp=True, du_dx=True, u=True
Nonbonded_f32: N=8840 Frames=1000 Params=5 speed: 928.56 executions/seconds (ran 10000 potentials in 10.77s) du_dp=True, du_dx=True, u=True
Nonbonded_f64: N=8840 Frames=1000 Params=5 speed: 131.34 executions/seconds (ran 10000 potentials in 76.14s) du_dp=True, du_dx=True, u=True
SummedPotential(NonbondedInteractionGroup, NonbondedInteractionGroup)_f32: N=8840 Frames=1000 Params=5 speed: 926.95 executions/seconds (ran 10000 potentials in 10.79s) du_dp=True, du_dx=True, u=True
SummedPotential(NonbondedInteractionGroup, NonbondedInteractionGroup)_f64: N=8840 Frames=1000 Params=5 speed: 470.93 executions/seconds (ran 10000 potentials in 21.23s) du_dp=True, du_dx=True, u=True

PR

dhfr-apo: N=23558 speed: 697.24ns/day dt: 2.5fs (ran 100000 steps in 30.98s)
dhfr-apo-barostat-interval-25: N=23558 speed: 625.15ns/day dt: 2.5fs (ran 100000 steps in 34.56s)

hif2a-apo: N=8805 speed: 1194.36ns/day dt: 2.5fs (ran 100000 steps in 18.09s)
hif2a-apo-barostat-interval-25: N=8805 speed: 1016.03ns/day dt: 2.5fs (ran 100000 steps in 21.26s)
hif2a-rbfe-barostat-interval-25: N=8840 speed: 786.31ns/day dt: 2.5fs (ran 100000 steps in 27.47s)
hif2a-rbfe-local: N=8840 speed: 1295.05ns/day dt: 2.5fs (ran 100000 steps in 16.68s)

solvent-apo: N=6282 speed: 1447.15ns/day dt: 2.5fs (ran 100000 steps in 14.93s)
solvent-apo-barostat-interval-25: N=6282 speed: 1288.75ns/day dt: 2.5fs (ran 100000 steps in 16.76s)
solvent-rbfe-barostat-interval-25: N=6317 speed: 1079.71ns/day dt: 2.5fs (ran 100000 steps in 20.01s)
solvent-rbfe-local: N=6317 speed: 1434.23ns/day dt: 2.5fs (ran 100000 steps in 15.06s)

vacuum-rbfe: N=35 speed: 9115.40ns/day dt: 2.5fs (ran 100000 steps in 2.37s)

NonbondedInteractionGroup_f32: N=8840 Frames=1000 Params=5 speed: 1081.57 executions/seconds (ran 10000 potentials in 9.25s) du_dp=True, du_dx=True, u=True
NonbondedInteractionGroup_f64: N=8840 Frames=1000 Params=5 speed: 498.56 executions/seconds (ran 10000 potentials in 20.06s) du_dp=True, du_dx=True, u=True
HarmonicBond_f32: N=8840 Frames=1000 Params=5 speed: 1388.07 executions/seconds (ran 10000 potentials in 7.20s) du_dp=True, du_dx=True, u=True
HarmonicBond_f64: N=8840 Frames=1000 Params=5 speed: 1351.33 executions/seconds (ran 10000 potentials in 7.40s) du_dp=True, du_dx=True, u=True
HarmonicAngleStable_f32: N=8840 Frames=1000 Params=5 speed: 1327.24 executions/seconds (ran 10000 potentials in 7.53s) du_dp=True, du_dx=True, u=True
HarmonicAngleStable_f64: N=8840 Frames=1000 Params=5 speed: 1293.81 executions/seconds (ran 10000 potentials in 7.73s) du_dp=True, du_dx=True, u=True
PeriodicTorsion_f32: N=8840 Frames=1000 Params=5 speed: 1333.99 executions/seconds (ran 10000 potentials in 7.50s) du_dp=True, du_dx=True, u=True
PeriodicTorsion_f64: N=8840 Frames=1000 Params=5 speed: 1296.51 executions/seconds (ran 10000 potentials in 7.71s) du_dp=True, du_dx=True, u=True
ChiralAtomRestraint_f32: N=8840 Frames=1000 Params=5 speed: 1500.43 executions/seconds (ran 10000 potentials in 6.66s) du_dp=True, du_dx=True, u=True
ChiralAtomRestraint_f64: N=8840 Frames=1000 Params=5 speed: 1499.99 executions/seconds (ran 10000 potentials in 6.67s) du_dp=True, du_dx=True, u=True
NonbondedPairListPrecomputed_f32: N=8840 Frames=1000 Params=5 speed: 1476.21 executions/seconds (ran 10000 potentials in 6.77s) du_dp=True, du_dx=True, u=True
NonbondedPairListPrecomputed_f64: N=8840 Frames=1000 Params=5 speed: 1432.26 executions/seconds (ran 10000 potentials in 6.98s) du_dp=True, du_dx=True, u=True
Nonbonded_f32: N=8840 Frames=1000 Params=5 speed: 862.63 executions/seconds (ran 10000 potentials in 11.59s) du_dp=True, du_dx=True, u=True
Nonbonded_f64: N=8840 Frames=1000 Params=5 speed: 98.31 executions/seconds (ran 10000 potentials in 101.72s) du_dp=True, du_dx=True, u=True
SummedPotential(NonbondedInteractionGroup, NonbondedInteractionGroup)_f32: N=8840 Frames=1000 Params=5 speed: 883.34 executions/seconds (ran 10000 potentials in 11.32s) du_dp=True, du_dx=True, u=True
SummedPotential(NonbondedInteractionGroup, NonbondedInteractionGroup)_f64: N=8840 Frames=1000 Params=5 speed: 409.61 executions/seconds (ran 10000 potentials in 24.41s) du_dp=True, du_dx=True, u=True

Currently stuck on this test failure:

FAILED tests/nonbonded/test_nonbonded_mol_energy.py::test_nonbonded_mol_energy_matches_exchange_mover_batch_U[float32-0.0001-0.002-4085] - AssertionError: 

E       AssertionError: 
E       Not equal to tolerance rtol=0.002, atol=0.0001
E       
E       Mismatched elements: 2 / 4085 (0.049%)
E       Max absolute difference: 0.03594835
E       Max relative difference: 0.00594114
E        x: array([-67.136335, -64.208156, -78.747204, ..., -77.397931, -36.04158 , -92.411841])
E        y: array([-67.136256, -64.208254, -78.747074, ..., -77.397854, -36.041125, -92.411754])

Note that the variant of this test that exercises custom_ops.NonbondedMolEnergyPotential_f64 passes.

This variant, which exercises custom_ops.NonbondedMolEnergyPotential_f32, fails when the jax reference batch_U is run in x64 mode, and still fails when I revert usage of all f32 intrinsics in nonbonded_common.cuh https://github.com/proteneer/timemachine/commit/fa185e2cc07f0b8be00014c379ea81114fe2777a (indicating that the issue is not due to accuracy loss associated with these optimizations).

Not obvious to me what might be causing the discrepancy. Seems much larger than I assume is explainable by differences between fixed-point and floating-point summation, but maybe it's tolerable and the test thresholds should just be relaxed a bit further?

Since the double precision tests pass, does overloading the double precision implementation into the naive single precision implementation cause the tests to pass? (i.e. just take the double precision real_es_factor function, and replace double inputs with floats)

(This is one simple way to test another way of computation to see if it has similar loss of precision)

Good suggestion. However, this particular test still fails with approximately the same magnitude discrepancy when using the double precision real_es_factor function with float inputs https://github.com/proteneer/timemachine/commit/9b3f75d3a3bf6703ef6f32a6fca51cadb3c1ac89

E       AssertionError: 
E       Not equal to tolerance rtol=0.002, atol=0.0001
E       
E       Mismatched elements: 2 / 4085 (0.049%)
E       Max absolute difference: 0.03589625
E       Max relative difference: 0.00673641
E        x: array([-67.136453, -64.208127, -78.747047, ..., -77.39786 , -36.041394, -92.411671])
E        y: array([-67.136256, -64.208254, -78.747074, ..., -77.397854, -36.041125, -92.411754])

tests/nonbonded/test_nonbonded_mol_energy.py:111: AssertionError

Note that tests of u and du_dx in test_nonbonded.py are passing for both f32 and f64 -- the main current issue is in this test of NonbondedMolEnergyPotential , which exercises a slightly different code path. Compares k_atom_by_atom_energies to nonbonded_block_unsummed.

( Separately, there's a du_dp assertion failing in tests/test_nonbonded.py::TestNonbondedWater::test_nblist_box_resize, which seems much closer to test threshold -- seems clearer that the test threshold can be loosened for this case.)

E               AssertionError: 
E               Not equal to tolerance rtol=1e-08, atol=1e-10
E               
E               Mismatched elements: 1 / 10644 (0.00939%)
E               Max absolute difference: 3.823341643283129e-10
E               Max relative difference: 3.73142991131214e-07

Thanks for investigating.

( Separately, there's a du_dp assertion failing in tests/test_nonbonded.py::TestNonbondedWater::test_nblist_box_resize, which seems much closer to test threshold -- seems clearer that the test threshold can be loosened for this case.)

Agreed. Not worried about this, feel free to loosen the tolerances.

the main current issue is in this test of NonbondedMolEnergyPotential , which exercises a slightly different code path. Compares k_atom_by_atom_energies to nonbonded_block_unsummed.

Hm - @badisa any insight here? AFAIK The mol energies are used by the water sampling code.

Note that I don't think this is a fixed precision issue (since double passes, and double / single code paths use the same exponents for fixed point conversion). There's just a fundamental loss of precision occurring somewhere. The original tolerances were pretty loose to begin with, so this PR might just be the straw that broke the camel's back. I'd like to figure out where the loss of precision is coming from on a more fundamental level.

After some more digging, the test test_nonbonded_mol_energy_matches_exchange_mover_batch_U is using a non-minimized water with clashes near the PBC. Using commit ee8e5c47ec3e3d3200cf034de8ed1f43d8bc2e91

We can plot the reference per-molecule water energy histogram using:

import matplotlib.pyplot as plt
plt.hist(u_ref(conf, box, params), bins=20)
plt.show()

There is also a clear correlation between the abs_error and large positive molecular energies. Note that a large positive energy is indicative of clashes.

Adding a one line box size adjustment:

box += np.eye(3) * 0.1

Edit: tests pass with atol 1e-4 and rtol=1e-3 (before I had atol accidentally increased to 1e-3 as well)

tests/nonbonded/test_nonbonded_mol_energy.py::test_nonbonded_mol_energy_matches_exchange_mover_batch_U[float32-0.0001-0.001-4085] PASSED

Hmm, memory-tests are all passing now, but a previously passing test started failing... Will need to look more closely in the morning:

FAILED tests/test_cuda_targeted_insertion_mover.py::test_targeted_moves_with_complex_and_ligand_in_brd4[2023-float32-0.0001-0.002-10000-1-200000-0.95] - AssertionError: No moves were made, nothing was tested assert 0 > 0

was passing prior to https://github.com/proteneer/timemachine/pull/1290/commits/f371865dea0fa6997fb4a2bf9df4a5f709e5f8d6 , which was expected to have no effect when cutoff==1.2

proteneer / timemachine

Apply switching fxn #1290