
OpenMM is a toolkit for molecular simulation using high performance GPU code.

Long-range correction computation times #1832

Closed: jchodera closed this issue 3 years ago

jchodera commented 7 years ago

We're running into an odd problem with a CustomIntegrator and an alchemical system containing Custom*Force terms where changing a context parameter seems to result in a sudden increase in the per-timestep time by 60x.

On a GTX-TITAN, for example, the per-timestep time remains low until the context parameter lambda_sterics is changed from 1.0, at which point it rockets from ~10 ms to ~600 ms:

      90 : -319121.711 kJ/mol :   11.909 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.09000000
      91 : -318910.208 kJ/mol :   11.130 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.08000000
      92 : -318920.677 kJ/mol :   11.298 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.07000000
      93 : -319024.438 kJ/mol :    8.587 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.06000000
      94 : -319066.106 kJ/mol :   11.298 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.05000000
      95 : -319027.394 kJ/mol :    8.534 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.04000000
      96 : -318954.743 kJ/mol :   12.134 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.03000000
      97 : -318848.571 kJ/mol :    9.846 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.02000000
      98 : -318685.161 kJ/mol :   11.206 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.01000000
      99 : -318529.539 kJ/mol :   10.777 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.00000000
     100 : -318504.142 kJ/mol :  620.636 ms : lambda_sterics   0.99333333 lambda_electrostatics   0.00000000
     101 : -318609.827 kJ/mol :  617.799 ms : lambda_sterics   0.98666667 lambda_electrostatics   0.00000000
     102 : -318698.663 kJ/mol :  624.226 ms : lambda_sterics   0.98000000 lambda_electrostatics   0.00000000
     103 : -318668.663 kJ/mol :  618.002 ms : lambda_sterics   0.97333333 lambda_electrostatics   0.00000000
     104 : -318606.746 kJ/mol :  616.902 ms : lambda_sterics   0.96666667 lambda_electrostatics   0.00000000
     105 : -318637.709 kJ/mol :  620.136 ms : lambda_sterics   0.96000000 lambda_electrostatics   0.00000000
     106 : -318712.495 kJ/mol :  622.895 ms : lambda_sterics   0.95333333 lambda_electrostatics   0.00000000
     107 : -318692.099 kJ/mol :  618.168 ms : lambda_sterics   0.94666667 lambda_electrostatics   0.00000000
     108 : -318563.470 kJ/mol :  618.041 ms : lambda_sterics   0.94000000 lambda_electrostatics   0.00000000
     109 : -318455.241 kJ/mol :  624.268 ms : lambda_sterics   0.93333333 lambda_electrostatics   0.00000000

I've attached a simple script that illustrates this using a serialized system, integrator, and state.

I've checked the "usual suspects":

Any thoughts on what might be going on here?

sudden-slowdown.zip

jchodera commented 7 years ago

For reference, the CustomNonbondedForce terms look like this, and encode a softcore Lennard-Jones form:

<Force cutoff="1" energy="U_sterics;U_sterics = (lambda_sterics^softcore_a)*4*epsilon*x*(x-1.0); x = (sigma/reff_sterics)^6;reff_sterics = sigma*((softcore_alpha*(1.0-lambda_sterics)^softcore_b + (r/sigma)^softcore_c))^(1/softcore_c);epsilon = sqrt(epsilon1*epsilon2);sigma = 0.5*(sigma1 + sigma2);" forceGroup="0" method="2" switchingDistance="-1" type="CustomNonbondedForce" useLongRangeCorrection="1" useSwitchingFunction="0" version="2">

with parameters

                                <Parameter default=".5" name="softcore_alpha"/>
                                <Parameter default="0" name="softcore_beta"/>
                                <Parameter default="1" name="softcore_a"/>
                                <Parameter default="1" name="softcore_b"/>
                                <Parameter default="6" name="softcore_c"/>
                                <Parameter default="1" name="softcore_d"/>
                                <Parameter default="1" name="softcore_e"/>
                                <Parameter default="2" name="softcore_f"/>

cc: https://github.com/choderalab/openmmtools/issues/201

jchodera commented 7 years ago

Oh, I wonder if it's triggering the recomputation of the long-range correction since we have

useLongRangeCorrection="1"

jchodera commented 7 years ago

Yep, that was it. If I disable the long-range correction, the slowdown disappears:

      95 : -319322.814 kJ/mol :    6.023 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.04000000
      96 : -319333.910 kJ/mol :    5.260 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.03000000
      97 : -319281.300 kJ/mol :    5.613 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.02000000
      98 : -319159.651 kJ/mol :    5.294 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.01000000
      99 : -319073.257 kJ/mol :    5.943 ms : lambda_sterics   1.00000000 lambda_electrostatics   0.00000000
     100 : -319113.388 kJ/mol :    5.278 ms : lambda_sterics   0.99333333 lambda_electrostatics   0.00000000
     101 : -319241.998 kJ/mol :    5.103 ms : lambda_sterics   0.98666667 lambda_electrostatics   0.00000000
     102 : -319294.918 kJ/mol :    5.242 ms : lambda_sterics   0.98000000 lambda_electrostatics   0.00000000
     103 : -319195.520 kJ/mol :    5.151 ms : lambda_sterics   0.97333333 lambda_electrostatics   0.00000000
     104 : -319047.499 kJ/mol :    5.550 ms : lambda_sterics   0.96666667 lambda_electrostatics   0.00000000
     105 : -318959.133 kJ/mol :    5.091 ms : lambda_sterics   0.96000000 lambda_electrostatics   0.00000000
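
The workaround above amounts to something like the following minimal sketch (assuming a system object is already in hand; this is not the actual alchemical-factory code):

from simtk import openmm

# Turn off the long-range correction on every CustomNonbondedForce before the
# Context is created, so that changing lambda_sterics no longer triggers the
# numerical recomputation.
for force in system.getForces():
    if isinstance(force, openmm.CustomNonbondedForce):
        force.setUseLongRangeCorrection(False)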
jchodera commented 7 years ago

@peastman : Under what conditions are dispersion coefficients recomputed for NonbondedForce? Is it only if the Lennard-Jones parameters are updated, or is it every time updateParametersInContext() is called?

It looks like this line might mean that this is done every time updateParametersInContext() is called for NonbondedForce.

I realize we may be running into this issue in our protons constant-pH and saltswap codes, where we make heavy use of updateParametersInContext() to change protonation states or mutate waters into ions, and I'm wondering how much of a slowdown this causes and what strategies we might use there to get around it.

jchodera commented 7 years ago

It looks like NonbondedForce updates the long-range dispersion correction every time updateParametersInContext() is called, using a CPU implementation, and CustomNonbondedForce uses a similar CPU implementation that involves numerical integration.

NonbondedForce.calcDispersionCorrection() looks quite fast, however. Using the latest OpenMM, calling NonbondedForce.updateParametersInContext() on an explicitly solvated T4 lysozyme system (22K atoms) without changing any parameters increases the base timestep time from 10 ms to 80 ms with the dispersion correction disabled, and to 90 ms with it enabled. That suggests ~70 ms of the overhead comes from the copy/update operations and ~10 ms from recomputing the Lennard-Jones dispersion correction: a 7x overhead from the copy/update operations alone, rising to 8x with the dispersion recomputation. (This is on a GTX-680.)
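
For reference, the timing pattern behind those numbers is roughly the sketch below (the context, integrator, and step count are assumptions; only the structure matters):

import time

def ms_per_step(context, integrator, nonbonded_force=None, n_steps=100):
    # Querying the state forces any pending GPU work to complete before timing.
    context.getState(getEnergy=True)
    start = time.time()
    for _ in range(n_steps):
        if nonbonded_force is not None:
            # Re-upload unchanged parameters to isolate the update overhead.
            nonbonded_force.updateParametersInContext(context)
        integrator.step(1)
    context.getState(getEnergy=True)
    return 1000.0 * (time.time() - start) / n_steps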

peastman commented 7 years ago

I'm a little unclear on what you're doing. If you modify anything that could change the dispersion correction, it will have to recalculate it. But that's a one time calculation. The next time step will be slow, and then it should go back to being fast again. Or are you calling updateParametersInContext() on every time step?

jchodera commented 7 years ago

Or are you calling updateParametersInContext() on every time step?

In the constant-pH code (recall #1661), we use updateParametersInContext() to change only the charges of a few atoms every timestep over nonequilibrium switching trajectories of 20-40 ps. This appears to incur the overhead of recomputing the long-range dispersion correction every timestep. The same thing happens with our saltswap code, in which a few charges and LJ parameters are changed to transmute waters into ions.
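
Schematically, the per-timestep update looks something like the sketch below (the atom indices, charge schedule, and function name are hypothetical; the point is only that updateParametersInContext() runs every step):

def switch_charges(nonbonded_force, context, integrator, atom_indices, charge_schedule):
    # charge_schedule[t][k] is the (hypothetical) charge of atom_indices[k] at
    # nonequilibrium switching step t.
    for charges in charge_schedule:
        for index, charge in zip(atom_indices, charges):
            # Keep the existing LJ parameters; only the charge changes.
            _, sigma, epsilon = nonbonded_force.getParticleParameters(index)
            nonbonded_force.setParticleParameters(index, charge, sigma, epsilon)
        # Re-uploads parameters and, when the dispersion correction is enabled,
        # recomputes it on the CPU.
        nonbonded_force.updateParametersInContext(context)
        integrator.step(1)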

In the original issue above, we and collaborators in the Mobley lab have an alchemically modified system where a CustomNonbondedForce is used to compute interactions between the alchemically modified part of the system (usually just a few atoms) and the rest of the system. We had previously been retaining the long-range correction in CustomNonbondedForce, but have discovered that this incurs the overhead of recomputing the dispersion correction every time the lambda_sterics global parameter is changed. In a nonequilibrium switching trajectory (e.g. in NCMC), we change this nearly every timestep over a 20 ps - 1 ns switching simulation, incurring a huge amount of overhead. I think we can work around this (by dropping the correction) for cases where only a few atoms are in the alchemical region, but it may be useful to think about whether there is some way to speed up this computation, since omitting the correction does decrease the phase-space overlap, resulting in higher rejection rates for NCMC.

peastman commented 7 years ago

it may be useful to think about whether there is some way to speed up this computation

Do you mean in NonbondedForce, CustomNonbondedForce, or both?

To compute the coefficient, we need to integrate the interaction for every pair of atom types. In the case of NonbondedForce that's pretty fast because we can compute it analytically. But for CustomNonbondedForce we need to do it numerically, which is a lot slower. I can look at the code to see if there are any ways to speed it up, though. One possibility would be to parallelize it.
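
To make the cost concrete, here is an illustrative sketch of the per-pair tail integral that has to be evaluated numerically when the energy expression is arbitrary; it is not OpenMM's actual implementation, and the cutoff, parameters, and quadrature are assumptions:

import numpy as np

def tail_integral(u, r_cutoff, r_max=50.0, n=200000):
    # Approximate the integral of u(r)*r^2 from r_cutoff out to (effectively)
    # infinity with the trapezoid rule; this is repeated for every pair of
    # atom "types".
    r = np.linspace(r_cutoff, r_max, n)
    return np.trapz(u(r) * r**2, r)

def softcore_lj(r, sigma=0.34, epsilon=0.5, lam=0.99, alpha=0.5):
    # The softcore LJ form from the energy expression above, specialized to
    # softcore_a = softcore_b = 1 and softcore_c = 6.
    reff6 = alpha * (1.0 - lam) * sigma**6 + r**6
    x = sigma**6 / reff6
    return lam * 4.0 * epsilon * x * (x - 1.0)

print(tail_integral(softcore_lj, r_cutoff=1.0))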

jchodera commented 7 years ago

Do you mean in NonbondedForce, CustomNonbondedForce, or both?

Possibly both. I'd want to have @bas-rustenburg do some profiling first to see how much the NonbondedForce speedup would help in the constant-pH case. We know the numerical integration in CustomNonbondedForce is much slower, though, causing a 100x slowdown in practical cases.

To compute the coefficient, we need to integrate the interaction for every pair of atom types. In the case of NonbondedForce that's pretty fast because we can compute it analytically. But for CustomNonbondedForce we need to do it numerically, which is a lot slower. I can look at the code to see if there are any ways to speed it up, though. One possibility would be to parallelize it.

Parallelization would likely help, as would encoding it as a GPU kernel.

bas-rustenburg commented 7 years ago

Possibly both. I'd want to have @bas-rustenburg do some profiling first to see how much the NonbondedForce speedup would help in the constant-pH case.

I can take a look into this next week. Will keep you posted.

jchodera commented 7 years ago

Thanks to @gregoryross, we have some additional data suggesting that the overhead of recomputing the NonbondedForce dispersion correction is relatively small for 20 ps NCMC switching trajectories that include a call to NonbondedForce.updateParametersInContext():

Splitting = V R O R V , disp. correction = True  , time in seconds =  66.13 +/- 0.09
Splitting = V R O R V , disp. correction = False , time in seconds =  63.3  +/- 0.3

This suggests the numerical integration to recompute the long-range correction in CustomNonbondedForce is really the only long-range correction calculation that could benefit greatly from further optimization.

@gregoryross has also found that it would be very helpful to see if there's a way to further speed up NonbondedForce.updateParametersInContext() itself. We call this every timestep for 20-40 ps NCMC switching trajectories in our constant-pH and variable salt concentration implementations. Calling NonbondedForce.updateParametersInContext() every timestep makes the whole NCMC integration process take roughly three times as long (about 200% longer), regardless of whether the dispersion correction is in use:

With    updateParametersInContext() each step: time in seconds = 63.3  +/- 0.3
Without updateParametersInContext() each step: time in seconds = 21.55 +/- 0.06

I'm not sure whether the dominant contribution there is the formation of the parameter vectors, their upload to the GPU, or the call to cu.invalidateMolecules(). We're only changing parameters for a few atoms, so I'm not sure if there's a way to exploit that to further speed this up. I'll attach this information to the relevant constant-pH issue as well.

peastman commented 7 years ago

Thanks, that's good information to have.

peastman commented 7 years ago

I'm working on trying to speed this up. There are some minor optimizations I can make, but nothing dramatic.

Do you really need to be changing it every time step? If you're making the change over 20 ps, that's about 10,000 time steps. What if you made the change in 100 increments, updating the parameters every 100 time steps?

jchodera commented 7 years ago

I'm working on trying to speed this up. There are some minor optimizations I can make, but nothing dramatic.

Performing the numerical integration for all pair types in parallel on the GPU wouldn't help, would it? Or would that just be a mess?

Do you really need to be changing it every time step? If you're making the change over 20 ps, that's about 10,000 time steps. What if you made the change in 100 increments, updating the parameters every 100 time steps?

Our temporary workaround is just to disable the long-range correction for the alchemical region. That avoids the overhead entirely.

Updating every 100 steps (or at least every barostat update) could work, I think. I'm wondering if we can do that entirely with the current API by computing the initial long-range correction, taking 25-100 steps with dispersion correction and barostat disabled, re-enabling the dispersion correction and barostat for a single barostat step, taking another 25-100 steps with the dispersion correction and barostat disabled, and so on until the end of the 10,000-step switching. We can explore this a bit and see if that works out.

peastman commented 7 years ago

What I'm suggesting is much simpler than that. Assuming your current code looks something like

# setLambda() stands in for however the context parameter is updated,
# e.g. context.setParameter('lambda_sterics', value)
for i in range(10000):
  setLambda((i+1)/10000.0)
  integrator.step(1)

change that to

for i in range(100):
  setLambda((i+1)/100.0)
  integrator.step(100)

That way the dispersion correction only gets recalculated 100 times instead of 10,000 times.

jchodera commented 7 years ago

Theory and experiment both indicate that it is optimal to break the nonequilibrium switching into many small increments, rather than fewer larger increments. This is because the optimal protocol is a geodesic in thermodynamic space, where the "diagonal" protocol has shorter thermodynamic length than the "city block" protocol. However, now that we're aware that there's a significant overhead penalty in updating the CustomNonbondedForce long-range correction every timestep, we should probably try to find the optimal balance here, likely somewhere between 20-200 steps/update.

peastman commented 7 years ago

Even if you updated it every 10 time steps, that would reduce the cost by a factor of 10. And it's currently three times (i.e. 200%) slower with the updating than without, so then it would only be about 20% slower.

jchodera commented 7 years ago

I think we might be talking about different things---were you referring to the CustomNonbondedForce computation of long-range corrections via numerical integration? We found that this is ~100x slower, so updating it every 10 steps would still make it 10x slower.
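
A back-of-envelope amortization using the timings from the first post (roughly 6 ms per base step, roughly 600 ms extra whenever the CustomNonbondedForce correction is recomputed; both numbers are approximate) illustrates the trade-off:

base_ms = 6.0          # approximate base step time from the timings above
correction_ms = 600.0  # approximate cost of recomputing the correction
for steps_per_update in (1, 10, 100):
    per_step = base_ms + correction_ms / steps_per_update
    print(steps_per_update, per_step / base_ms)
# every step      -> ~100x slower
# every 10 steps  -> ~11x slower
# every 100 steps -> ~2x slower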

I think we're happy that the NonbondedForce dispersion correction calculation is not a dominant part of the updateParametersInContext() call:

Splitting = V R O R V , disp. correction = True  , time in seconds =  66.13 +/- 0.09
Splitting = V R O R V , disp. correction = False , time in seconds =  63.3  +/- 0.3

but we're not sure what the most time-consuming part of the updateParametersInContext() call actually is. There, updating every 10 steps would amortize most of the overhead away, but it also reduces the NCMC acceptance rates.

peastman commented 7 years ago

I find it hard to believe it really has much effect on the acceptance rate whether you do the update in 100 pieces, 1000 pieces, or 10,000 pieces. As long as the individual changes are small enough, you'll stay very close to the ideal path. But try it and see.

jchodera commented 7 years ago

For example, going from 4096 perturbation steps x 32 MD steps/perturbation to 2048 perturbation steps x 64 MD steps/perturbation visibly decreases the acceptance rate without decreasing the time per switch by a significant amount:

[attached plot: NCMC acceptance rate as a function of perturbation steps and MD steps per perturbation]

I find it hard to believe it really has much effect on the acceptance rate whether you do the update in 100 pieces, 1000 pieces, or 10,000 pieces. As long as the individual changes are small enough, you'll stay very close to the ideal path. But try it and see.

We can explore this further in more detail now that we're able to access high acceptance rates consistently. (The plot above was from a while back when we were not able to do this due to multiple technical issues.)

peastman commented 7 years ago

From your graph, it looks like the main dependence is on the total number of time steps (the product of the two axes). There's a big change when you move from one diagonal to another, but little change when you move along a diagonal.

jchodera commented 7 years ago

Just making a note that we've observed a major slowdown in the CustomNonbondedForce dispersion correction computation when going from OpenMM 7.1.1 to the OpenMM 7.2 dev conda builds in the thread below, and we're trying to track down the source:

https://github.com/choderalab/yank/issues/705#issuecomment-313899251

It could be due to the switch to the new omnia-linux-anvil docker build image, to differences in build.sh, or to changes made in https://github.com/pandegroup/openmm/pull/1841.