For reference, the CustomNonbondedForce terms look like this, and encode a softcore Lennard-Jones form:
<Force cutoff="1" energy="U_sterics;U_sterics = (lambda_sterics^softcore_a)*4*epsilon*x*(x-1.0); x = (sigma/reff_sterics)^6;reff_sterics = sigma*((softcore_alpha*(1.0-lambda_sterics)^softcore_b + (r/sigma)^softcore_c))^(1/softcore_c);epsilon = sqrt(epsilon1*epsilon2);sigma = 0.5*(sigma1 + sigma2);" forceGroup="0" method="2" switchingDistance="-1" type="CustomNonbondedForce" useLongRangeCorrection="1" useSwitchingFunction="0" version="2">
with parameters
<Parameter default=".5" name="softcore_alpha"/>
<Parameter default="0" name="softcore_beta"/>
<Parameter default="1" name="softcore_a"/>
<Parameter default="1" name="softcore_b"/>
<Parameter default="6" name="softcore_c"/>
<Parameter default="1" name="softcore_d"/>
<Parameter default="1" name="softcore_e"/>
<Parameter default="2" name="softcore_f"/>
Oh, I wonder if it's triggering the recomputation of the long-range correction since we have useLongRangeCorrection="1".
Yep, that was it. If I disable the long-range correction, the slowdown disappears:
95 : -319322.814 kJ/mol : 6.023 ms : lambda_sterics 1.00000000 lambda_electrostatics 0.04000000
96 : -319333.910 kJ/mol : 5.260 ms : lambda_sterics 1.00000000 lambda_electrostatics 0.03000000
97 : -319281.300 kJ/mol : 5.613 ms : lambda_sterics 1.00000000 lambda_electrostatics 0.02000000
98 : -319159.651 kJ/mol : 5.294 ms : lambda_sterics 1.00000000 lambda_electrostatics 0.01000000
99 : -319073.257 kJ/mol : 5.943 ms : lambda_sterics 1.00000000 lambda_electrostatics 0.00000000
100 : -319113.388 kJ/mol : 5.278 ms : lambda_sterics 0.99333333 lambda_electrostatics 0.00000000
101 : -319241.998 kJ/mol : 5.103 ms : lambda_sterics 0.98666667 lambda_electrostatics 0.00000000
102 : -319294.918 kJ/mol : 5.242 ms : lambda_sterics 0.98000000 lambda_electrostatics 0.00000000
103 : -319195.520 kJ/mol : 5.151 ms : lambda_sterics 0.97333333 lambda_electrostatics 0.00000000
104 : -319047.499 kJ/mol : 5.550 ms : lambda_sterics 0.96666667 lambda_electrostatics 0.00000000
105 : -318959.133 kJ/mol : 5.091 ms : lambda_sterics 0.96000000 lambda_electrostatics 0.00000000
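For reference, disabling that correction amounts to a single call on the alchemical force. A minimal sketch, assuming the force lives in an existing openmm.System named `system` and the Context has not yet been created:

```python
import openmm

# Turn off the analytic long-range (dispersion) correction on any
# CustomNonbondedForce in the system so it is never recomputed when
# lambda_sterics changes. Do this before creating the Context.
for force in system.getForces():
    if isinstance(force, openmm.CustomNonbondedForce):
        force.setUseLongRangeCorrection(False)
```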
@peastman: Under what conditions are dispersion coefficients recomputed for NonbondedForce? Is it only when the Lennard-Jones parameters are updated, or every time updateParametersInContext() is called?
It looks like this line might mean that this is done every time updateParametersInContext() is called for NonbondedForce.
I realize we may be running into this issue in our protons constant-pH and saltswap codes, where we make heavy use of updateParametersInContext() to change protonation states or mutate waters into ions, and I'm wondering how much of a slowdown this causes and what strategies we might use there to get around it.
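For context, here is a minimal sketch of the update pattern those codes rely on; the atom indices, new charges, and variable names (`nonbonded`, `context`, `titratable_atom_indices`) are illustrative, not taken from protons or saltswap:

```python
# Change the charges of a few titratable atoms in place, then push the new
# parameters into the existing Context. The updateParametersInContext() call
# is where the overhead discussed in this issue is incurred.
for index, new_charge in zip(titratable_atom_indices, new_charges):
    charge, sigma, epsilon = nonbonded.getParticleParameters(index)
    nonbonded.setParticleParameters(index, new_charge, sigma, epsilon)
nonbonded.updateParametersInContext(context)
```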
It looks like NonbondedForce updates the long-range dispersion correction every time updateParametersInContext() is called, using a CPU implementation, and CustomNonbondedForce uses a similar CPU implementation that involves numerical integration.
It looks like NonbondedForce.calcDispersionCorrection() is quite fast, however. Using the latest OpenMM, calling NonbondedForce.updateParametersInContext() on an explicitly solvated T4 lysozyme system (22K atoms) without changing any parameters increases the base timestep time from 10 ms to 80 ms without a dispersion correction and to 90 ms with a dispersion correction, suggesting that 70 ms of overhead comes from the copy/update operations and 10 ms from recomputing the Lennard-Jones dispersion correction. That's a 7x slowdown just from the copy/update operations, brought to 8x by the dispersion recomputation. (This is on a GTX-680.)
I'm a little unclear on what you're doing. If you modify anything that could change the dispersion correction, it will have to recalculate it. But that's a one-time calculation. The next time step will be slow, and then it should go back to being fast again. Or are you calling updateParametersInContext() on every time step?
> Or are you calling updateParametersInContext() on every time step?
In the constant-pH code (recall #1661), we use updateParametersInContext() to change only the charges of a few atoms every timestep over nonequilibrium switching trajectories of 20-40 ps. This appears to incur the overhead of recomputing the long-range dispersion correction every timestep. The same thing happens with our saltswap code, in which a few charges and LJ parameters are changed to transmute waters into ions.
In the original issue above, we and collaborators in the Mobley lab have an alchemically modified system where a CustomNonbondedForce is used to compute interactions between the alchemically modified part of the system (usually just a few atoms) and the rest of the system. We had previously been retaining the long-range correction in CustomNonbondedForce, but have discovered that this incurs the overhead of recomputing the dispersion correction every time the lambda_sterics global parameter is changed. In a nonequilibrium switching trajectory (e.g. in NCMC), we change this nearly every timestep for a 20 ps - 1 ns switching simulation, incurring a huge amount of overhead. I think we can work around this for cases where only a few atoms are in the alchemical region, but it may be useful to think about whether there is some way to speed up this computation, since omitting the correction decreases the phase-space overlap, resulting in higher rejection rates for NCMC.
> it may be useful to think about whether there is some way to speed up this computation
Do you mean in NonbondedForce, CustomNonbondedForce, or both?
To compute the coefficient, we need to integrate the interaction for every pair of atom types. In the case of NonbondedForce that's pretty fast because we can compute it analytically. But for CustomNonbondedForce we need to do it numerically, which is a lot slower. I can look at the code to see if there are any ways to speed it up, though. One possibility would be to parallelize it.
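To make the cost concrete, here is a rough sketch (not OpenMM's actual implementation) of the kind of tail integral that must be evaluated numerically for every distinct pair of per-particle parameter combinations when the softcore expression above is used, whereas the plain Lennard-Jones tail has a closed form:

```python
import numpy as np
from scipy.integrate import quad

def softcore_lj(r, sigma, epsilon, lam, alpha=0.5, a=1.0, b=1.0, c=6.0):
    """Softcore LJ energy from the expression above (reduced units)."""
    reff = sigma * (alpha * (1.0 - lam) ** b + (r / sigma) ** c) ** (1.0 / c)
    x = (sigma / reff) ** 6
    return (lam ** a) * 4.0 * epsilon * x * (x - 1.0)

def tail_integral(sigma, epsilon, lam, rcut=1.0):
    """integral_{rcut}^{inf} U(r) r^2 dr for one pair of parameter combinations."""
    value, _ = quad(lambda r: softcore_lj(r, sigma, epsilon, lam) * r * r, rcut, np.inf)
    return value

# The correction requires one such integral per pair of distinct parameter
# combinations, and all of them must be redone whenever lambda changes.
```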
> Do you mean in NonbondedForce, CustomNonbondedForce, or both?
Possibly both. I'd want to have @bas-rustenburg do some profiling first to see how much the NonbondedForce speedup would help in the constant-pH case. We know the numerical integration in CustomNonbondedForce is much slower, though, causing a 100x slowdown in practical cases.
> To compute the coefficient, we need to integrate the interaction for every pair of atom types. In the case of NonbondedForce that's pretty fast because we can compute it analytically. But for CustomNonbondedForce we need to do it numerically, which is a lot slower. I can look at the code to see if there are any ways to speed it up, though. One possibility would be to parallelize it.
Parallelization would likely help, as would encoding it as a GPU kernel.
> Possibly both. I'd want to have @bas-rustenburg do some profiling first to see how much the NonbondedForce speedup would help in the constant-pH case.
I can take a look into this next week. Will keep you posted.
Thanks to @gregoryross, we have some additional data suggesting that the overhead of recomputing the dispersion correction for NonbondedForce is relatively small when 20 ps NCMC switching trajectories include calls to NonbondedForce.updateParametersInContext():
Splitting = V R O R V , disp. correction = True , time in seconds = 66.13 +/- 0.09
Splitting = V R O R V , disp. correction = False , time in seconds = 63.3 +/- 0.3
This suggests the numerical integration to recompute the long-range correction in CustomNonbondedForce is really the only long-range correction calculation that could benefit greatly from further optimization.
@gregoryross has also determined that it could be very helpful to see if there's a way to further speed up NonbondedForce.updateParametersInContext(). We call this every timestep for 20-40 ps NCMC switching trajectories in our constant-pH and variable-salt-concentration implementations. Calling NonbondedForce.updateParametersInContext() every timestep makes the whole NCMC integration take about three times as long (roughly 200% longer), regardless of whether the dispersion correction is in use:
With updateParametersInContext() each step: time in seconds = 63.3 +/- 0.3
Without updateParametersInContext() each step: time in seconds = 21.55 +/- 0.06
I'm not sure whether the dominant contribution there is the formation of the parameter vectors, their upload to the GPU, or the call to cu.invalidateMolecules(). We're only changing parameters for a few atoms, so I'm not sure if there's a way to exploit that to further speed this up. I'll attach this information to the relevant constant-pH issue as well.
Thanks, that's good information to have.
I'm working on trying to speed this up. There are some minor optimizations I can make, but nothing dramatic.
Do you really need to be changing it every time step? If you're making the change over 20 ps, that's about 10,000 time steps. What if you made the change in 100 increments, updating the parameters every 100 time steps?
> I'm working on trying to speed this up. There are some minor optimizations I can make, but nothing dramatic.
Performing the numerical integration for all pair types in parallel on the GPU wouldn't help, would it? Or would that just be a mess?
> Do you really need to be changing it every time step? If you're making the change over 20 ps, that's about 10,000 time steps. What if you made the change in 100 increments, updating the parameters every 100 time steps?
Our temporary workaround is just to disable the long-range correction for the alchemical region. That avoids the overhead entirely.
Updating every 100 steps (or at least every barostat update) could work, I think. I'm wondering if we can do that entirely with the current API by computing the initial long-range correction, taking 25-100 steps with dispersion correction and barostat disabled, re-enabling the dispersion correction and barostat for a single barostat step, taking another 25-100 steps with the dispersion correction and barostat disabled, and so on until the end of the 10,000-step switching. We can explore this a bit and see if that works out.
What I'm suggesting is much simpler than that. Assuming your current code looks something like
for i in range(10000):
    setLambda((i+1)/10000.0)
    integrator.step(1)
change that to
for i in range(100):
    setLambda((i+1)/100.0)
    integrator.step(100)
That way the dispersion correction only gets recalculated 100 times instead of 10,000 times.
Theory and experiment both indicate that it is optimal to break the nonequilibrium switching into many small increments, rather than fewer larger increments. This is because the optimal protocol is a geodesic in thermodynamic space, where the "diagonal" protocol has shorter thermodynamic length than the "city block" protocol. However, now that we're aware there's a significant overhead penalty in updating the CustomNonbondedForce long-range correction every timestep, we should probably try to find the optimal balance here, likely somewhere between 20-200 steps/update.
Even if you updated it every 10 time steps, that would reduce the cost by a factor of 10. And it's currently three times (i.e. 200%) slower with the updating than without, so then it would only be about 20% slower.
I think we might be talking about different things. Were you referring to the CustomNonbondedForce computation of long-range corrections via numerical integration? We found that this is ~100x slower, so updating it every 10 steps would still make it 10x slower.
I think we're happy that the NonbondedForce dispersion correction calculation is not a dominant part of the updateParametersInContext() call:
Splitting = V R O R V , disp. correction = True , time in seconds = 66.13 +/- 0.09
Splitting = V R O R V , disp. correction = False , time in seconds = 63.3 +/- 0.3
but we're not sure what the most time-consuming part of the updateParametersInContext() call actually is. There, updating every 10 steps would amortize most of the overhead away, but it also reduces the NCMC acceptance rates.
I find it hard to believe it really has much effect on the acceptance rate whether you do the update in 100 pieces, 1000 pieces, or 10,000 pieces. As long as the individual changes are small enough, you'll stay very close to the ideal path. But try it and see.
For example, going from 4096 perturbation steps x 32 MD steps/perturbation to 2048 perturbation steps x 64 MD steps/perturbation visibly decreases the acceptance rate without decreasing the time per switch by a significant amount:
[plot: NCMC acceptance rate as a function of the number of perturbation steps and MD steps per perturbation; image not reproduced here]
> I find it hard to believe it really has much effect on the acceptance rate whether you do the update in 100 pieces, 1000 pieces, or 10,000 pieces. As long as the individual changes are small enough, you'll stay very close to the ideal path. But try it and see.
We can explore this further in more detail now that we're able to access high acceptance rates consistently. (The plot above was from a while back when we were not able to do this due to multiple technical issues.)
From your graph, it looks like the main dependence is on the total number of time steps (the product of the two axes). There's a big change when you move from one diagonal to another, but little change when you move along a diagonal.
Just making a note that we've observed a major slowdown in CustomNonbondedForce dispersion correction computation in going from OpenMM 7.1.1 to the OpenMM 7.2 dev conda builds, and are trying to track down the source in this thread:
https://github.com/choderalab/yank/issues/705#issuecomment-313899251
It could be differences in switching to the new omnia-linux-anvil docker build image, differences in the build.sh, or it may be related to changes made in https://github.com/pandegroup/openmm/pull/1841
We're running into an odd problem with a CustomIntegrator and an alchemical system containing Custom*Force terms, where changing a context parameter seems to result in a sudden increase in the per-timestep time by 60x.
On a GTX-TITAN, for example, the per-timestep time remains low until the context parameter lambda_sterics is modified from 1.0, at which point the time per timestep rockets from 10 ms to 600 ms.
I've attached a simple script that illustrates this using a serialized system, integrator, and state.
I've checked the "usual suspects" (e.g. the energies are not nan and the system is not blowing up). Any thoughts on what might be going on here?
sudden-slowdown.zip
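For reference, here is a minimal sketch of the kind of per-timestep timing the attached script performs; the file names, step counts, and warm-up are illustrative rather than taken from the zip:

```python
import time
import openmm

# Load the serialized system, integrator, and state (file names are assumed).
with open('system.xml') as f:
    system = openmm.XmlSerializer.deserialize(f.read())
with open('integrator.xml') as f:
    integrator = openmm.XmlSerializer.deserialize(f.read())
with open('state.xml') as f:
    state = openmm.XmlSerializer.deserialize(f.read())

context = openmm.Context(system, integrator)
context.setState(state)
integrator.step(10)  # warm-up so kernel compilation is not included in the timing

for lambda_sterics in [1.0, 0.99]:
    context.setParameter('lambda_sterics', lambda_sterics)
    start = time.time()
    integrator.step(100)
    context.getState(getEnergy=True)  # force completion of queued GPU work
    elapsed_ms = 1000.0 * (time.time() - start) / 100
    print('lambda_sterics = %.2f : %.1f ms/step' % (lambda_sterics, elapsed_ms))
```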