proteneer / timemachine

Differentiate all the things!
Other
138 stars 17 forks source link

Apply switching fxn #1290

Closed maxentile closed 4 months ago

maxentile commented 4 months ago

Apply a switching function to erfc reaction field, so that energies (and a couple derivatives) go to zero at cutoff.

Main considerations:

Implementation approach:

Current state:

Significance:

Other directions considered but pruned:

Future directions:

Timings

Benchmark timings, on an RTX 3080. Grain of salt 1: other stuff running in background. Grain of salt 2: proposed implementation isn't compatible with the cutoff used in DHFR benchmark.

current

```text dhfr-apo: N=23558 speed: 702.66ns/day dt: 2.5fs (ran 100000 steps in 30.74s) dhfr-apo-barostat-interval-25: N=23558 speed: 632.82ns/day dt: 2.5fs (ran 100000 steps in 34.14s) hif2a-apo: N=8805 speed: 1260.11ns/day dt: 2.5fs (ran 100000 steps in 17.14s) hif2a-apo-barostat-interval-25: N=8805 speed: 1042.06ns/day dt: 2.5fs (ran 100000 steps in 20.73s) hif2a-rbfe-barostat-interval-25: N=8840 speed: 833.62ns/day dt: 2.5fs (ran 100000 steps in 25.91s) hif2a-rbfe-local: N=8840 speed: 1431.88ns/day dt: 2.5fs (ran 100000 steps in 15.09s) solvent-apo: N=6282 speed: 1584.65ns/day dt: 2.5fs (ran 100000 steps in 13.63s) solvent-apo-barostat-interval-25: N=6282 speed: 1388.41ns/day dt: 2.5fs (ran 100000 steps in 15.56s) solvent-rbfe-barostat-interval-25: N=6317 speed: 1179.57ns/day dt: 2.5fs (ran 100000 steps in 18.32s) solvent-rbfe-local: N=6317 speed: 1672.97ns/day dt: 2.5fs (ran 100000 steps in 12.92s) vacuum-rbfe: N=35 speed: 11138.93ns/day dt: 2.5fs (ran 100000 steps in 1.94s) ```

proposed (5-10% slower)

```text dhfr-apo: N=23558 speed: 661.35ns/day dt: 2.5fs (ran 100000 steps in 32.66s) dhfr-apo-barostat-interval-25: N=23558 speed: 593.45ns/day dt: 2.5fs (ran 100000 steps in 36.40s) hif2a-apo: N=8805 speed: 1174.16ns/day dt: 2.5fs (ran 100000 steps in 18.40s) hif2a-apo-barostat-interval-25: N=8805 speed: 979.97ns/day dt: 2.5fs (ran 100000 steps in 22.04s) hif2a-rbfe-barostat-interval-25: N=8840 speed: 793.70ns/day dt: 2.5fs (ran 100000 steps in 27.22s) hif2a-rbfe-local: N=8840 speed: 1399.90ns/day dt: 2.5fs (ran 100000 steps in 15.43s) solvent-apo: N=6282 speed: 1500.61ns/day dt: 2.5fs (ran 100000 steps in 14.40s) solvent-apo-barostat-interval-25: N=6282 speed: 1312.82ns/day dt: 2.5fs (ran 100000 steps in 16.46s) solvent-rbfe-barostat-interval-25: N=6317 speed: 1128.24ns/day dt: 2.5fs (ran 100000 steps in 19.15s) solvent-rbfe-local: N=6317 speed: 1643.52ns/day dt: 2.5fs (ran 100000 steps in 13.15s) vacuum-rbfe: N=35 speed: 10314.47ns/day dt: 2.5fs (ran 100000 steps in 2.10s) ```
proteneer commented 4 months ago

__sincosf is courtesy of @dmclark17

badisa commented 4 months ago

Currently hard-codes cutoff=1.2. TBD: decide if we want to hard-code cutoff=1.2 as a project-wide constant, or modify real_es_factor etc. to accept cutoff as a parameter. (Note: the DHFR benchmark runs silently with cutoff=1.0.)

Note that the DHFR benchmark runs with 1.0 to allow us to more directly compare against OpenMM. I would advocate for accepting it as a parameter for that reason

maxentile commented 4 months ago

Note that the DHFR benchmark runs with 1.0 to allow us to more directly compare against OpenMM. I would advocate for accepting it as a parameter for that reason

I believe the timing comparison to OpenMM is the only current use case for cutoff != 1.2nm in the project?

I suppose a direct performance comparison to OpenMM may also involve matching OpenMM's reaction field. (Currently running with cutoff = 1.0 does lead to a larger truncation artifact, which could impact densities and therefore timings.)

Leaning towards exposing the cutoff parameter, mostly for reasons of flexibility.

mcwitt commented 4 months ago

Note that the DHFR benchmark runs with 1.0 to allow us to more directly compare against OpenMM. I would advocate for accepting it as a parameter for that reason

I believe the timing comparison to OpenMM is the only current use case for cutoff != 1.2nm in the project?

I suppose a direct performance comparison to OpenMM may also involve matching OpenMM's reaction field. (Currently running with cutoff = 1.0 does lead to a larger truncation artifact, which could impact densities and therefore timings.)

Leaning towards exposing the cutoff parameter, mostly for reasons of flexibility.

Not sure if this is a good idea, but I suppose we could make cutoff a template parameter to allow for compiler optimizations while still maintaining flexibility to test with a different cutoff.

maxentile commented 4 months ago

Timings as of https://github.com/proteneer/timemachine/pull/1290/commits/92c16263e3706cde4244e136370fb16bcdc040fb, maybe a few ns/day slower than first report, but probably within noise

dhfr-apo: N=23558 speed: 656.10ns/day dt: 2.5fs (ran 100000 steps in 32.93s)
dhfr-apo-barostat-interval-25: N=23558 speed: 589.55ns/day dt: 2.5fs (ran 100000 steps in 36.64s)

hif2a-apo: N=8805 speed: 1157.51ns/day dt: 2.5fs (ran 100000 steps in 18.66s)
hif2a-apo-barostat-interval-25: N=8805 speed: 977.32ns/day dt: 2.5fs (ran 100000 steps in 22.10s)

hif2a-rbfe-barostat-interval-25: N=8840 speed: 786.30ns/day dt: 2.5fs (ran 100000 steps in 27.47s)
hif2a-rbfe-local: N=8840 speed: 1399.82ns/day dt: 2.5fs (ran 100000 steps in 15.44s)

solvent-apo: N=6282 speed: 1487.89ns/day dt: 2.5fs (ran 100000 steps in 14.52s)
solvent-apo-barostat-interval-25: N=6282 speed: 1304.80ns/day dt: 2.5fs (ran 100000 steps in 16.56s)

solvent-rbfe-barostat-interval-25: N=6317 speed: 1112.19ns/day dt: 2.5fs (ran 100000 steps in 19.42s)
solvent-rbfe-local: N=6317 speed: 1646.61ns/day dt: 2.5fs (ran 100000 steps in 13.12s)

vacuum-rbfe: N=35 speed: 10269.27ns/day dt: 2.5fs (ran 100000 steps in 2.11s)
badisa commented 4 months ago

Looking at the performance between master and this PR as of https://github.com/proteneer/timemachine/pull/1290/commits/15221493bc52f1782ef0ac80e6635e168af96bb7 on an A10 (a Tesla card) that had nothing else running:

Looks to be about 5% performance difference, though it goes as low as ~1% for vacuum and as high as ~8% for hif2a-apo. On a side-note: Interesting to see that vacuum is so much faster on a RTX 3080 than the A10

Master

dhfr-apo: N=23558 speed: 742.84ns/day dt: 2.5fs (ran 100000 steps in 29.08s)
dhfr-apo-barostat-interval-25: N=23558 speed: 666.11ns/day dt: 2.5fs (ran 100000 steps in 32.43s)

hif2a-apo: N=8805 speed: 1289.54ns/day dt: 2.5fs (ran 100000 steps in 16.75s)
hif2a-apo-barostat-interval-25: N=8805 speed: 1076.49ns/day dt: 2.5fs (ran 100000 steps in 20.07s)
hif2a-rbfe-barostat-interval-25: N=8840 speed: 825.48ns/day dt: 2.5fs (ran 100000 steps in 26.17s)
hif2a-rbfe-local: N=8840 speed: 1333.75ns/day dt: 2.5fs (ran 100000 steps in 16.20s)

solvent-apo: N=6282 speed: 1536.51ns/day dt: 2.5fs (ran 100000 steps in 14.06s)
solvent-apo-barostat-interval-25: N=6282 speed: 1372.01ns/day dt: 2.5fs (ran 100000 steps in 15.75s)
solvent-rbfe-barostat-interval-25: N=6317 speed: 1135.66ns/day dt: 2.5fs (ran 100000 steps in 19.02s)
solvent-rbfe-local: N=6317 speed: 1485.72ns/day dt: 2.5fs (ran 100000 steps in 14.54s)

vacuum-rbfe: N=35 speed: 9055.34ns/day dt: 2.5fs (ran 100000 steps in 2.39s)

NonbondedInteractionGroup_f32: N=8840 Frames=1000 Params=5 speed: 1178.51 executions/seconds (ran 10000 potentials in 8.49s) du_dp=True, du_dx=True, u=True
NonbondedInteractionGroup_f64: N=8840 Frames=1000 Params=5 speed: 620.84 executions/seconds (ran 10000 potentials in 16.11s) du_dp=True, du_dx=True, u=True
HarmonicBond_f32: N=8840 Frames=1000 Params=5 speed: 1501.78 executions/seconds (ran 10000 potentials in 6.66s) du_dp=True, du_dx=True, u=True
HarmonicBond_f64: N=8840 Frames=1000 Params=5 speed: 1499.59 executions/seconds (ran 10000 potentials in 6.67s) du_dp=True, du_dx=True, u=True
HarmonicAngleStable_f32: N=8840 Frames=1000 Params=5 speed: 1461.24 executions/seconds (ran 10000 potentials in 6.84s) du_dp=True, du_dx=True, u=True
HarmonicAngleStable_f64: N=8840 Frames=1000 Params=5 speed: 1437.29 executions/seconds (ran 10000 potentials in 6.96s) du_dp=True, du_dx=True, u=True
PeriodicTorsion_f32: N=8840 Frames=1000 Params=5 speed: 1463.22 executions/seconds (ran 10000 potentials in 6.83s) du_dp=True, du_dx=True, u=True
PeriodicTorsion_f64: N=8840 Frames=1000 Params=5 speed: 1438.76 executions/seconds (ran 10000 potentials in 6.95s) du_dp=True, du_dx=True, u=True
ChiralAtomRestraint_f32: N=8840 Frames=1000 Params=5 speed: 1676.79 executions/seconds (ran 10000 potentials in 5.96s) du_dp=True, du_dx=True, u=True
ChiralAtomRestraint_f64: N=8840 Frames=1000 Params=5 speed: 1651.38 executions/seconds (ran 10000 potentials in 6.06s) du_dp=True, du_dx=True, u=True
NonbondedPairListPrecomputed_f32: N=8840 Frames=1000 Params=5 speed: 1605.44 executions/seconds (ran 10000 potentials in 6.23s) du_dp=True, du_dx=True, u=True
NonbondedPairListPrecomputed_f64: N=8840 Frames=1000 Params=5 speed: 1583.63 executions/seconds (ran 10000 potentials in 6.31s) du_dp=True, du_dx=True, u=True
Nonbonded_f32: N=8840 Frames=1000 Params=5 speed: 928.56 executions/seconds (ran 10000 potentials in 10.77s) du_dp=True, du_dx=True, u=True
Nonbonded_f64: N=8840 Frames=1000 Params=5 speed: 131.34 executions/seconds (ran 10000 potentials in 76.14s) du_dp=True, du_dx=True, u=True
SummedPotential(NonbondedInteractionGroup, NonbondedInteractionGroup)_f32: N=8840 Frames=1000 Params=5 speed: 926.95 executions/seconds (ran 10000 potentials in 10.79s) du_dp=True, du_dx=True, u=True
SummedPotential(NonbondedInteractionGroup, NonbondedInteractionGroup)_f64: N=8840 Frames=1000 Params=5 speed: 470.93 executions/seconds (ran 10000 potentials in 21.23s) du_dp=True, du_dx=True, u=True

PR

dhfr-apo: N=23558 speed: 697.24ns/day dt: 2.5fs (ran 100000 steps in 30.98s)
dhfr-apo-barostat-interval-25: N=23558 speed: 625.15ns/day dt: 2.5fs (ran 100000 steps in 34.56s)

hif2a-apo: N=8805 speed: 1194.36ns/day dt: 2.5fs (ran 100000 steps in 18.09s)
hif2a-apo-barostat-interval-25: N=8805 speed: 1016.03ns/day dt: 2.5fs (ran 100000 steps in 21.26s)
hif2a-rbfe-barostat-interval-25: N=8840 speed: 786.31ns/day dt: 2.5fs (ran 100000 steps in 27.47s)
hif2a-rbfe-local: N=8840 speed: 1295.05ns/day dt: 2.5fs (ran 100000 steps in 16.68s)

solvent-apo: N=6282 speed: 1447.15ns/day dt: 2.5fs (ran 100000 steps in 14.93s)
solvent-apo-barostat-interval-25: N=6282 speed: 1288.75ns/day dt: 2.5fs (ran 100000 steps in 16.76s)
solvent-rbfe-barostat-interval-25: N=6317 speed: 1079.71ns/day dt: 2.5fs (ran 100000 steps in 20.01s)
solvent-rbfe-local: N=6317 speed: 1434.23ns/day dt: 2.5fs (ran 100000 steps in 15.06s)

vacuum-rbfe: N=35 speed: 9115.40ns/day dt: 2.5fs (ran 100000 steps in 2.37s)

NonbondedInteractionGroup_f32: N=8840 Frames=1000 Params=5 speed: 1081.57 executions/seconds (ran 10000 potentials in 9.25s) du_dp=True, du_dx=True, u=True
NonbondedInteractionGroup_f64: N=8840 Frames=1000 Params=5 speed: 498.56 executions/seconds (ran 10000 potentials in 20.06s) du_dp=True, du_dx=True, u=True
HarmonicBond_f32: N=8840 Frames=1000 Params=5 speed: 1388.07 executions/seconds (ran 10000 potentials in 7.20s) du_dp=True, du_dx=True, u=True
HarmonicBond_f64: N=8840 Frames=1000 Params=5 speed: 1351.33 executions/seconds (ran 10000 potentials in 7.40s) du_dp=True, du_dx=True, u=True
HarmonicAngleStable_f32: N=8840 Frames=1000 Params=5 speed: 1327.24 executions/seconds (ran 10000 potentials in 7.53s) du_dp=True, du_dx=True, u=True
HarmonicAngleStable_f64: N=8840 Frames=1000 Params=5 speed: 1293.81 executions/seconds (ran 10000 potentials in 7.73s) du_dp=True, du_dx=True, u=True
PeriodicTorsion_f32: N=8840 Frames=1000 Params=5 speed: 1333.99 executions/seconds (ran 10000 potentials in 7.50s) du_dp=True, du_dx=True, u=True
PeriodicTorsion_f64: N=8840 Frames=1000 Params=5 speed: 1296.51 executions/seconds (ran 10000 potentials in 7.71s) du_dp=True, du_dx=True, u=True
ChiralAtomRestraint_f32: N=8840 Frames=1000 Params=5 speed: 1500.43 executions/seconds (ran 10000 potentials in 6.66s) du_dp=True, du_dx=True, u=True
ChiralAtomRestraint_f64: N=8840 Frames=1000 Params=5 speed: 1499.99 executions/seconds (ran 10000 potentials in 6.67s) du_dp=True, du_dx=True, u=True
NonbondedPairListPrecomputed_f32: N=8840 Frames=1000 Params=5 speed: 1476.21 executions/seconds (ran 10000 potentials in 6.77s) du_dp=True, du_dx=True, u=True
NonbondedPairListPrecomputed_f64: N=8840 Frames=1000 Params=5 speed: 1432.26 executions/seconds (ran 10000 potentials in 6.98s) du_dp=True, du_dx=True, u=True
Nonbonded_f32: N=8840 Frames=1000 Params=5 speed: 862.63 executions/seconds (ran 10000 potentials in 11.59s) du_dp=True, du_dx=True, u=True
Nonbonded_f64: N=8840 Frames=1000 Params=5 speed: 98.31 executions/seconds (ran 10000 potentials in 101.72s) du_dp=True, du_dx=True, u=True
SummedPotential(NonbondedInteractionGroup, NonbondedInteractionGroup)_f32: N=8840 Frames=1000 Params=5 speed: 883.34 executions/seconds (ran 10000 potentials in 11.32s) du_dp=True, du_dx=True, u=True
SummedPotential(NonbondedInteractionGroup, NonbondedInteractionGroup)_f64: N=8840 Frames=1000 Params=5 speed: 409.61 executions/seconds (ran 10000 potentials in 24.41s) du_dp=True, du_dx=True, u=True
maxentile commented 4 months ago

Currently stuck on this test failure:

FAILED tests/nonbonded/test_nonbonded_mol_energy.py::test_nonbonded_mol_energy_matches_exchange_mover_batch_U[float32-0.0001-0.002-4085] - AssertionError: 

E       AssertionError: 
E       Not equal to tolerance rtol=0.002, atol=0.0001
E       
E       Mismatched elements: 2 / 4085 (0.049%)
E       Max absolute difference: 0.03594835
E       Max relative difference: 0.00594114
E        x: array([-67.136335, -64.208156, -78.747204, ..., -77.397931, -36.04158 , -92.411841])
E        y: array([-67.136256, -64.208254, -78.747074, ..., -77.397854, -36.041125, -92.411754])

Note that the variant of this test that exercises custom_ops.NonbondedMolEnergyPotential_f64 passes.

This variant, which exercises custom_ops.NonbondedMolEnergyPotential_f32, fails when the jax reference batch_U is run in x64 mode, and still fails when I revert usage of all f32 intrinsics in nonbonded_common.cuh https://github.com/proteneer/timemachine/commit/fa185e2cc07f0b8be00014c379ea81114fe2777a (indicating that the issue is not due to accuracy loss associated with these optimizations).

Not obvious to me what might be causing the discrepancy. Seems much larger than I assume is explainable by differences between fixed-point and floating-point summation, but maybe it's tolerable and the test thresholds should just be relaxed a bit further?

proteneer commented 4 months ago

Since the double precision tests pass, does overloading the double precision implementation into the naive single precision implementation cause the tests to pass? (i.e. just take the double precision real_es_factor function, and replace double inputs with floats)

(This is one simple way to test another way of computation to see if it has similar loss of precision)

maxentile commented 4 months ago

Good suggestion. However, this particular test still fails with approximately the same magnitude discrepancy when using the double precision real_es_factor function with float inputs https://github.com/proteneer/timemachine/commit/9b3f75d3a3bf6703ef6f32a6fca51cadb3c1ac89

E       AssertionError: 
E       Not equal to tolerance rtol=0.002, atol=0.0001
E       
E       Mismatched elements: 2 / 4085 (0.049%)
E       Max absolute difference: 0.03589625
E       Max relative difference: 0.00673641
E        x: array([-67.136453, -64.208127, -78.747047, ..., -77.39786 , -36.041394, -92.411671])
E        y: array([-67.136256, -64.208254, -78.747074, ..., -77.397854, -36.041125, -92.411754])

tests/nonbonded/test_nonbonded_mol_energy.py:111: AssertionError

Note that tests of u and du_dx in test_nonbonded.py are passing for both f32 and f64 -- the main current issue is in this test of NonbondedMolEnergyPotential , which exercises a slightly different code path. Compares k_atom_by_atom_energies to nonbonded_block_unsummed.


( Separately, there's a du_dp assertion failing in tests/test_nonbonded.py::TestNonbondedWater::test_nblist_box_resize, which seems much closer to test threshold -- seems clearer that the test threshold can be loosened for this case.)

E               AssertionError: 
E               Not equal to tolerance rtol=1e-08, atol=1e-10
E               
E               Mismatched elements: 1 / 10644 (0.00939%)
E               Max absolute difference: 3.823341643283129e-10
E               Max relative difference: 3.73142991131214e-07
proteneer commented 4 months ago

Thanks for investigating.

( Separately, there's a du_dp assertion failing in tests/test_nonbonded.py::TestNonbondedWater::test_nblist_box_resize, which seems much closer to test threshold -- seems clearer that the test threshold can be loosened for this case.)

Agreed. Not worried about this, feel free to loosen the tolerances.

the main current issue is in this test of NonbondedMolEnergyPotential , which exercises a slightly different code path. Compares k_atom_by_atom_energies to nonbonded_block_unsummed.

Hm - @badisa any insight here? AFAIK The mol energies are used by the water sampling code.

Note that I don't think this is a fixed precision issue (since double passes, and double / single code paths use the same exponents for fixed point conversion). There's just a fundamental loss of precision occurring somewhere. The original tolerances were pretty loose to begin with, so this PR might just be the straw that broke the camel's back. I'd like to figure out where the loss of precision is coming from on a more fundamental level.

proteneer commented 4 months ago

After some more digging, the test test_nonbonded_mol_energy_matches_exchange_mover_batch_U is using a non-minimized water with clashes near the PBC. Using commit ee8e5c47ec3e3d3200cf034de8ed1f43d8bc2e91

We can plot the reference per-molecule water energy histogram using:

import matplotlib.pyplot as plt
plt.hist(u_ref(conf, box, params), bins=20)
plt.show()

image

There is also a clear correlation between the abs_error and large positive molecular energies. Note that a large positive energy is indicative of clashes.

image

Adding a one line box size adjustment:

box += np.eye(3) * 0.1

image

Edit: tests pass with atol 1e-4 and rtol=1e-3 (before I had atol accidentally increased to 1e-3 as well)

tests/nonbonded/test_nonbonded_mol_energy.py::test_nonbonded_mol_energy_matches_exchange_mover_batch_U[float32-0.0001-0.001-4085] PASSED

maxentile commented 4 months ago

Hmm, memory-tests are all passing now, but a previously passing test started failing... Will need to look more closely in the morning:

FAILED tests/test_cuda_targeted_insertion_mover.py::test_targeted_moves_with_complex_and_ligand_in_brd4[2023-float32-0.0001-0.002-10000-1-200000-0.95] - AssertionError: No moves were made, nothing was tested assert 0 > 0

was passing prior to https://github.com/proteneer/timemachine/pull/1290/commits/f371865dea0fa6997fb4a2bf9df4a5f709e5f8d6 , which was expected to have no effect when cutoff==1.2