ramess101 / MBAR_ITIC

Merging the MBAR and ITIC methodologies to optimize a TraPPE force switch model

Multidimensional Optimization #1

Open ramess101 opened 7 years ago

ramess101 commented 7 years ago

Currently the code is designed just to optimize epsilon for a given sigma. This is done in part because the golden section code is intended for 1-D optimization. There are a few options for modifying the code to optimize both epsilon and sigma:

  1. Make the 1-D optimization alternate between epsilon and sigma
  2. Use a different optimization algorithm
  3. Modify the code to loop through several sigma values and just optimize each independently

Option 3 is the easiest to implement. We would just create a for loop in Ethane_automated_optimization_multiple_loops that tracks different sigma values. This approach would be nice for creating a plot of the optimal epsilon with respect to sigma, but it would require many more simulations than necessary (proportional to the number of sigma values).

Option 1 would also be easy to implement, just tedious bookkeeping. But this is more of a temporary band-aid than a real solution; we would not use this approach for a higher-dimensional optimization.

Option 2 is the best approach but requires an investigation of optimization algorithms. Specifically, we want a method that does not require a large number of function evaluations. Preferably the method would be gradient-free, most likely a genetic algorithm of some sort.

I think that for now we could just do Option 3 with five different sigmas. Then we can see how well Option 2 works.
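For concreteness, a minimal sketch of what the Option 3 loop might look like, using scipy's bounded 1-D minimizer in place of the hand-written golden-section code (names, ranges, and the `sse_itic` placeholder are illustrative, not the actual script's API):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def sse_itic(eps, sig):
    """Placeholder for the Gromacs rerun + ITIC objective evaluation."""
    raise NotImplementedError

sigmas = np.linspace(0.373, 0.381, 5)   # nm, five trial sigma values
optimal_eps = {}
for sig in sigmas:
    # 1-D optimization of epsilon (kJ/mol) at fixed sigma
    res = minimize_scalar(lambda eps: sse_itic(eps, sig),
                          bounds=(0.8, 1.2), method='bounded')
    optimal_eps[sig] = res.x            # optimal epsilon at this sigma
```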

ramess101 commented 7 years ago

I have included Option 3 in the Ethane_automated_optimization_multiple_loops script.

ramess101 commented 7 years ago

After improving the code so that Python does not need to exit to perform Gromacs reruns, it is much easier to implement Python optimizers. This is beneficial for many reasons, one of which is that we can perform a multidimensional optimization more easily. Currently the "Ethane_subprocess" code uses the hand-written golden section algorithm, while "Ethane_scipy" is an attempt to implement scipy optimizers. Some of these have the disadvantage of requiring a large number of function evaluations or analytic gradients, and most are not compatible with a bounded problem. L-BFGS-B handles bounds but takes an unnecessarily large number of iterations; by increasing the tolerance we can reduce the number of iterations without affecting the final answer, since we only need epsilon to a few digits. The unbounded approaches could still be used by including an if statement in the objective function that returns an artificially large value outside of the bounds, so that Gromacs does not try to run extremely different parameter sets.
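A hedged sketch of the two bound-handling options just described (the parameter ranges and the `sse_itic` placeholder are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

BOUNDS = [(0.85, 1.15),    # epsilon, kJ/mol
          (0.365, 0.385)]  # sigma, nm

def sse_itic(eps, sig):
    """Placeholder for the Gromacs rerun + MBAR/ITIC objective."""
    raise NotImplementedError

def objective(theta):
    # For unbounded optimizers: an artificially large value outside the
    # bounds keeps Gromacs from rerunning extreme parameter sets.
    for x, (lo, hi) in zip(theta, BOUNDS):
        if not lo <= x <= hi:
            return 1e10
    return sse_itic(*theta)

# L-BFGS-B honors the bounds directly; a loose ftol trims iterations since
# epsilon is only needed to a few digits.
res = minimize(objective, x0=[1.0, 0.377], method='L-BFGS-B',
               bounds=BOUNDS, options={'ftol': 1e-4})
```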

Perhaps we should write our own optimizer, specifically the Dud one described in the paper you sent me.

ramess101 commented 7 years ago

I have been working on several different multidimensional optimizers. I am still trying to figure out which optimizers are best for the task at hand (i.e. require few function evaluations, need no analytic derivatives, handle noise, scale well with dimensionality, accept bounds, etc.). To test the different optimizers it helps to know what the objective-function landscape looks like, so I have scanned the parameter space using a 40x40 (epsilon by sigma) grid. This is computationally tractable because I use only a single reference simulation (eps = 1.0143 kJ/mol, sig = 0.377 nm) and use MBAR to reweight the rerun simulations for the 1600 grid parameter sets.
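For reference, the reweighting step is essentially the following sketch, assuming pymbar's `MBAR` class (array contents are placeholders filled from the Gromacs rerun energies; in pymbar 4 the method is named `compute_expectations`):

```python
import numpy as np
from pymbar import MBAR

K, N = 1601, 5000            # 1 sampled reference + 1600 grid points (sizes illustrative)
u_kn = np.zeros((K, N))      # reduced potential of configuration n under parameter set k
N_k = np.zeros(K, dtype=int)
N_k[0] = N                   # all samples come from the single reference state

mbar = MBAR(u_kn, N_k)
A_n = np.zeros(N)            # per-configuration observable (e.g. U or P)
A_k, dA_k = mbar.computeExpectations(A_n)   # reweighted estimate at every grid point
```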

Here are contour plots of the SSE for the different observables that may be used in an objective function. rhoL, rhov, and Psat are obtained from ITIC and therefore correspond to the 5 Tsats, while P, Z, and U correspond to each of the different (T, rho) state points simulated along the IT and the ICs.

image

There is some redundancy in these plots since Psat is intimately related to rhov. Similarly, the P values along the isochores strongly influence rhoL (as they determine Tsat). Finally, since Z = P/(rho*R*T), the SSE of Z is effectively the same as the SSE of P but with a weighting factor of T and rho. So I think looking at rhoL, Psat, and U is sufficient to get real insight.

You will notice from the contour plots that the optimum for rhoL is actually reasonably close to that for Psat but quite different from the optimum for U. Furthermore, the contours for Psat are much narrower, which suggests a significant deterioration when moving away from the minimum. To give you an idea of how well we can predict Psat and rhoL simultaneously, here is a plot comparing the REFPROP VLE values to the current TraPPE model and to the TraPPEfs optimum I obtained using only rhoL in the objective function. I used the L-BFGS-B optimizer available through scipy to optimize the TraPPEfs model. The optimal parameters were epsilon = 0.9509 kJ/mol and sigma = 0.3771 nm. The optimizer required only 16 rerun iterations to obtain the parameters to this precision.

image

These two figures suggest to me that we can do a reasonable job of matching both rhoL and Psat, but if we want a more accurate model (for internal energy as well) we need to use AUA and/or a Mie potential. This is what we concluded previously, but with these contour plots we can now say it more definitively.

Let me know what you think.

ramess101 commented 7 years ago

Optimization methods that I have tested:

  1. Golden-section search
  2. Steepest descent
  3. scipy's L-BFGS-B
  4. scipy's fsolve
  5. Leapfrog

The literature (Hulsmann et al.) suggests that steepest descent should be the best. However, my results suggest that steepest descent is much worse than methods 3 and 4. That said, my implementation of the Armijo step in steepest descent might need to be looked at. Leapfrog requires more function evaluations than methods 3 and 4, but it should scale nicely with dimensionality and avoid getting stuck in local minima. Also, for leapfrogging you can initialize the players once (a significant portion of the cost) without reference to the objective function, whereas the other methods must be completely rerun for a new objective function. Therefore, leapfrogging may be convenient when changing the objective function to see what happens to the optimum.
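For reference, a generic sketch of the leapfrogging idea (not the exact code in this repository): the worst player repeatedly leaps over the best into the reflected hypercube, which gradually contracts the search region.

```python
import numpy as np

def leapfrog(f, bounds, n_players=10, n_iter=200, seed=None):
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, dtype=float).T
    # Players are initialized once, independent of the objective, so they
    # can be reused when the objective function changes.
    players = lo + rng.random((n_players, lo.size)) * (hi - lo)
    scores = np.array([f(p) for p in players])
    for _ in range(n_iter):
        best, worst = np.argmin(scores), np.argmax(scores)
        # Leap the worst player over the best, into the reflected hypercube.
        leap = players[best] + rng.random(lo.size) * (players[best] - players[worst])
        players[worst] = np.clip(leap, lo, hi)
        scores[worst] = f(players[worst])
    return players[np.argmin(scores)]
```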

The "Dud" method that Richard emailed me did not seem to be as promising as these approaches, in my opinion. But I have not tried to implement it yet.

ramess101 commented 7 years ago

Thanks Rich,

Very interesting. For Mostafa's benefit: I think the optimal epsilon for TraPPEfsw is 114.4 K. The 0.377 nm for sigma is the same as what we found for our sf potential, but eps/k was 122 K.

I worry a little about the optimal value of sigma. I wonder what would happen if you used a different value of sigma in your reference simulation, since you seem to have come back to where you started in that regard. What optimal sigma would you get if you used 0.376 in your reference simulation?

I am a little confused about "the current TraPPE model." Is that the new model for ethane/ethylene that came out in 2017? If so, it seems we are beating Siepmann's latest and greatest for ethane in Psat while maintaining most of the accuracy in liquid density. It would be nice to know the %AAD of Psat, denLiq, and Udep for both models. I guess Hvap must be off if Udep is off, but the Psat vs T looks ok. How do we explain that?

I wonder what it would look like to remake the plot you had in your thesis where you showed the regions of acceptable accuracy for Psat and liquid density on a sigma vs epsilon plot and saw they did not overlap. I guess those regions would be a little closer to overlapping now. Does that mean that simply applying fsw improves the potential model in some sense?

JRE

ramess101 commented 7 years ago

Richard,

I will try with a different reference sigma to see what happens.

The TraPPE model is the original 1998 model. I wanted to use a UA TraPPE model for a fair comparison with TraPPEfs since we use the 0.154 nm bond-length model. The most recent (2017) TraPPE model is AUA, which performs much better than TraPPE-UA and TraPPEfs since it has an additional fitting parameter. That is why I think the next step is to optimize TraPPEfs using an AUA approach.

Yes, I can try to make the acceptable-regions plot. The plot will not be as refined since I will be using the results from my 40x40 grid. I believe we will see that the regions appear closer, which would suggest that somehow fsw improves the model fit. The only explanation I have is that the cutoff approach assumes the RDF is 1 beyond r_cut, which is typically not true. So maybe the tail corrections introduce a slight error that affects VLE. Or it could have something to do with the ITIC approach, specifically how I am incorporating the NIST B2 and B3 values. In other words, perhaps using NIST's B2 and B3 improves Psat slightly. I know you have validated ITIC, but perhaps I should validate my specific implementation by comparing what I get with the TraPPE model.

ramess101 commented 7 years ago

OK, so the preliminary results suggest that sigma_ref does matter, but that it mainly just affects how long it takes to get to the optimum (both in number of direct simulations and number of reruns). The figure below shows the path that the L-BFGS-B optimizer takes:

image

Hopefully it is clear that the red points go with the sig_ref = 0.377 nm path (blue line/points) while the purple points go with the sig_ref = 0.38 nm path (green line/points). The "optimal" and "guess" points are just to help you see where the optimizations start and finish. Notice that both use the same (poor) guess in epsilon.

The main conclusions from this figure are:

  1. The optima are different for the two reference sigmas
  2. The sig_ref = 0.38 nm path requires more iterations to converge

We can get some more insight by plotting the objective function with respect to the iterations:

image

This figure demonstrates a few important points. First, we see again that a sig_ref of 0.38 nm results in more iterations. We also see that both paths appear to converge before reaching the L-BFGS-B tolerance setting; i.e. we probably could stop the iterations at 15 for the blue points and maybe 30 for the orange points.

But the main conclusion is that the optimum for the orange points is much worse than the optimum for the blue points. There are two possible explanations:

  1. The optimum obtained with a sig_ref of 0.38 nm is not the true optimum (i.e. the optimizer got stuck)
  2. The optimum with a sig_ref of 0.38 nm is genuinely different (and worse) than the optimum with a sig_ref of 0.377 nm

To determine which explanation is correct we could try:

  1. Scanning the entire parameter space (like I did previously using 0.377 nm as the ref) and see where the optimal is.
  2. Using a different optimizer or changing L-BFGS-B settings
  3. Evaluating the objective function using sig_ref = 0.38 nm but with the optimal parameter set obtained with sig_ref = 0.377 nm.

Option 1 is very expensive (well, relative to the others; it only takes about 12 hours), so I will avoid it if I can. Option 2 is simple enough, but L-BFGS-B seemed to be one of the best optimizers, so I don't think that is the problem. Option 3 is the easiest, so I went ahead and did it. The result I got was:

Direct simulations using: sig_ref = 0.38 nm

Rerun/MBAR evaluations using:

  optimal 1: eps = 0.951 kJ/mol, sig = 0.3771 nm, objective function = 1855
  optimal 2: eps = 0.961 kJ/mol, sig = 0.3786 nm, objective function = 265

Since the objective function for optimal 1 is so much higher than that for optimal 2, the optimum found using a sig_ref of 0.38 nm really is different from the optimum found with a sig_ref of 0.377 nm. To help quantify the difference the reference makes: the objective function for optimal 1 evaluated using a sig_ref of 0.377 nm was 83. Essentially, this demonstrates that MBAR struggles when sigma changes too much, so the apparent optimum will likely lie along the optimal valley but within the region where MBAR is accurate.

For these reasons, if we start with a poor guess for sigma we should iterate: run a new batch of direct simulations at the apparent optimum and repeat the MBAR process. I am going to see how quickly this process converges; hopefully I will have results by tomorrow.
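In pseudocode, the outer loop I have in mind looks roughly like this (`direct_sim` and `optimize_mbar` are hypothetical stand-ins for the direct-simulation and MBAR-optimization steps):

```python
import numpy as np

def iterate_references(direct_sim, optimize_mbar, theta_guess,
                       tol=np.array([1e-3, 2e-4]), max_cycles=10):
    """Re-reference at each apparent optimum until the optimum found on the
    MBAR surface agrees with its own reference simulation."""
    theta_ref = np.asarray(theta_guess, dtype=float)   # (eps kJ/mol, sig nm)
    for _ in range(max_cycles):
        ref_data = direct_sim(theta_ref)       # new batch of direct simulations
        theta_opt = optimize_mbar(ref_data)    # inner optimization (e.g. leapfrog)
        converged = np.all(np.abs(theta_opt - theta_ref) < tol)
        theta_ref = theta_opt
        if converged:
            break                              # optimum reproduces its own reference
    return theta_ref
```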

ramess101 commented 7 years ago

Also, as we perform more direct simulations (i.e. with different sigmas), MBAR should become more accurate. So the two methods are:

  1. Start with a guess and perform additional direct simulations at the new optimal
  2. Scan the sigma parameter space with a few different values to feed into MBAR

Modifying the code to handle multiple reference parameter sets is one of the key tasks ahead.

ramess101 commented 7 years ago
2. Using a different optimizer or changing L-BFGS-B settings

Here are the results using the Leapfrog algorithm. Again we see that a sig_ref of 0.377 nm yields a better optimum, although now the number of function evaluations is about the same (because leapfrog does not use the reference as the initial guess).

image

Also, it is interesting that the Leapfrog algorithm did better than L-BFGS-B at finding the actual minimum for sig_ref = 0.38 nm. This can be seen in this figure, where we compare the objective function values for the different optimizers:

image

Notice that in the previous figure Leapfrog obtained a single point that was lower than the L-BFGS-B minimum for sig_ref = 0.38 nm. This is likely because L-BFGS-B depends on gradients, which become very flat in the valley. By contrast, leapfrog does a better job of sampling around, which introduces more randomness in the objective function values sampled but also avoids getting trapped.

Here we see how different the optimal epsilon and sigma are for the two optimizers. Note that the red points were obtained with sig_ref = 0.377 nm while the purple points used sig_ref = 0.38 nm.

image

So these results demonstrate that the optimizer is not to blame for the discrepancy. In addition, we gained some insight as to the strengths and weaknesses of these optimizers.

ramess101 commented 7 years ago

Here we can see how the optimum changes with different reference epsilon and sigma. The different iterations are obtained using leapfrog. The lightly colored points are the optima (lowest SSE). Also, notice that I have inverted the axes compared to previous plots. Sorry.

This first figure uses sig_ref = 0.377 nm:

image

This second figure uses sig_ref = 0.38 nm:

image

This third figure uses the eps_optimal, sig_optimal from sig_ref = 0.38 nm (i.e. the optimum in Figure 2 is used as the reference in Figure 3). Notice that this second iteration is very similar to the optimum from the first figure.

image

Finally, the fourth figure uses the optimum from the third figure (which is similar to the optimum from the first figure):

image

Notice that the optimal sigma is coming back towards the 0.375 nm value of TraPPE. Also, notice that I had to change the epsilon axis to include these lower epsilon values; these epsilons are still greater than the TraPPE value (98 K = 0.815 kJ/mol).

This last figure is the most surprising result. At first I thought this divergence was troubling, but in reality the ITIC results obtained with this new optimum are much better. You can see this by looking at the scale of log(SSE) in the different plots or just by visually inspecting the VLE plot. Since rhoL is the only objective property I am using right now, this last optimum looks much better in rhoL. The figure below shows the VLE using the original "optimal" around 0.377 nm and the other optimum around 0.375 nm (more like 0.3756 nm):

image

Again, the main reason the optimum is closer to 0.375 nm is that I am fitting only rhoL. Clearly, we do see a significant deterioration in Psat (and rhov):

image

This makes more sense, since changing the LJ to LJfs should not really improve the VLE. I also like that sigma is closer to the TraPPE sigma and that it is mainly just epsilon that has to change for TraPPEfs.

ramess101 commented 7 years ago

This plot shows that the optimum after an additional iteration is now very close to the reference. This suggests that we have actually converged, because the direct simulation results match the MBAR results at the optimum.

image

And here is after one more iteration:

image

We have converged at this point to within 0.0002 nm and 0.001 kJ/mol.

It is important to note that so far I am not including multiple references. Hopefully with additional references we would be able to converge in fewer than 4 iterations. Another key point is that we are able to use "global" optimizers with this methodology.

ramess101 commented 7 years ago

Hi Rich,

Greetings from Istanbul. I read over the results on GitHub. It looks to me like BFGS finds the optimum faster as long as you use the proper reference/direct simulation.

It would be nice to see summary statistics like %AAD and %BIAS in Psat, rhoL, and Udep for each optimum.

I am leaning toward AUA for the next step in this effort.

ramess101 commented 7 years ago

To clarify, there is really only one optimum when the objective function is rhoL, and that optimum is the last one obtained. The previous optima are just a result of MBAR not being perfect in its ability to extrapolate to different parameter sets. L-BFGS-B is faster than leapfrog but much more susceptible to getting trapped in local optima. The logical usage of these algorithms is to use leapfrog to survey a large parameter space and then L-BFGS-B once the possible optima have been located. In this simple case leapfrog is unnecessary, but since the reruns are pretty cheap I prefer leapfrog.

Here are the %AAD and %BIAS for rhoL, Psat, and Udep:

%AAD:  rhoL = 0.042, Psat = 31.1, Udep = 6.45

%bias: rhoL = 0.0074, Psat = 31.1, Udep = -6.45

To clarify, these values are only for the final iteration (the "real" optimum). Remember that I included only rhoL in my optimization; this optimum is not intended to be a more accurate force field for other properties. For the AUA optimization I will include rhoL and Psat, since those are used for the new TraPPE model. If we want to be more consistent with TraPPE we could also include Tc, although it is worth noting that we are already in a different temperature regime and using different data for our fit.
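For reference, these are the definitions I would assume for the statistics above (`sim` and `ref` are arrays of simulated and REFPROP values):

```python
import numpy as np

def pct_aad(sim, ref):
    """Percent average absolute deviation from the reference data."""
    return 100.0 * np.mean(np.abs((sim - ref) / ref))

def pct_bias(sim, ref):
    """Percent bias (signed); cancellation can leave it far below the
    %AAD, as with rhoL above."""
    return 100.0 * np.mean((sim - ref) / ref)
```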

ramess101 commented 7 years ago

These plots demonstrate the improvement you can get using an AUA model. Here I have optimized the TraPPEfs-AUA potential using rhoL and Psat, while TraPPEfs-UA uses only rhoL, similar to how the respective TraPPE models were developed. The bond length for TraPPEfs-AUA is 0.23 nm since that is the value used by TraPPE-AUA (or TraPPE-UA2, as they termed it). Alternatively, we could optimize the bond length as a third parameter. For now, this is a nice demonstration of how quickly MBAR can convert the TraPPE model to a TraPPEfs model (this took only a few hours):

image

The TraPPEfs-AUA has:

%AAD:  rhoL = 0.09, Psat = 2.15, Udep = 2.69

%bias: rhoL = -0.005, Psat = -0.08, Udep = 0.5

ramess101 commented 7 years ago

I am trying to determine the best approach for assessing convergence. Recall that there are two types of convergence: convergence for a given reference parameter set, and overall convergence of the final parameters. The former is more straightforward because it can be specified in the optimization algorithm. The overall convergence is trickier because we must account for random fluctuations in the simulation results (i.e. we may never truly converge to within the desired precision if the simulations have a lot of scatter). In addition, the uncertainties in MBAR can cause two consecutive optima to differ.

Another related challenge is specifying the bounds. For the first reference simulation I use a fairly wide range in the optimization; however, this leads to slow convergence of the MBAR optimization. Subsequent optimizations likely do not need such a wide range, so my code should reduce the size of the range with each reference simulation.
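A minimal sketch of the bound-shrinking idea (the factor of 2 and the array layout are illustrative):

```python
import numpy as np

def shrink_bounds(bounds, theta_opt, factor=2.0):
    """Halve the search range, recentered on the current optimum, while
    never stepping outside the original bounds."""
    bounds = np.asarray(bounds, dtype=float)          # shape (n_params, 2)
    half_width = (bounds[:, 1] - bounds[:, 0]) / (2.0 * factor)
    lo = np.maximum(theta_opt - half_width, bounds[:, 0])
    hi = np.minimum(theta_opt + half_width, bounds[:, 1])
    return np.column_stack((lo, hi))
```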

ramess101 commented 7 years ago

These figures help demonstrate some of the ways we can improve the optimization methodology. The plots show the SSE of saturated liquid density for ethane. The reference system is what was used in the direct simulation, and the iterations represent the different parameter sets used in a rerun and MBAR analysis. The leapfrog optimization was used in each case. The model is the TraPPEAUAfs form, meaning that the bond length is 0.23 nm and LJfs is used. The initial parameter set (i.e. the reference in the top figure) corresponds to the new TraPPE model, and each subsequent reference is the optimum from the previous reference.

image

There are a few main conclusions from these figures:

  1. The leapfrog optimization is nice for checking for multiple minima, but after the first optimization it is probably overkill; i.e. leapfrog is good for figure 1 (and maybe figure 2), while the rest could use a simple gradient approach.

  2. The bounds for the optimization could also decrease considerably after each optimization. How much to shrink the bounds by is something I have not yet automated; currently I am working on having the bounds reduce by a factor of 2 for each reference simulation.

  3. The optimum does not change very much from Figures 3-6; it appears to just be bouncing around due to simulation uncertainties and maybe MBAR uncertainties. The stopping criterion for the algorithm needs to take this into consideration. Currently it just tries to reach a certain precision in epsilon and sigma, but this precision may not be feasible. One way to overcome this is to include multiple references once we have converged to the optimal region. In other words, if references 2-5 were used in addition to reference 6, the uncertainty would decrease considerably, both for the simulations (because we would have more replicates) and for MBAR (because we would have more samples). One idea for implementing this is to include all references after the first, or all references once the optima are within some tolerance, or to look at the number of effective samples for each reference relative to the previous optimum and decide whether it is worth including (see the sketch after this list).

  4. A single reference system is actually sufficient in this case (i.e. the optima do not change much after figure 1). I attribute this to the fact that sigma is nearly optimal already; we have seen that a poor sigma value may require a few cycles of different references.
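A sketch of what pooling multiple references in MBAR might look like (sizes are illustrative; the effective-sample call is named per pymbar 3):

```python
import numpy as np
from pymbar import MBAR

n_refs, n_per_ref, n_cand = 5, 5000, 1600
N_total = n_refs * n_per_ref
u_kn = np.zeros((n_refs + n_cand, N_total))  # reduced potentials from reruns
N_k = np.zeros(n_refs + n_cand, dtype=int)
N_k[:n_refs] = n_per_ref                     # samples now come from all references

mbar = MBAR(u_kn, N_k)
# A small effective sample number at a candidate state flags poor overlap,
# i.e. a region where the pooled references cannot be trusted.
n_eff = mbar.computeEffectiveSampleNumber()
```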

jrelliottoh commented 7 years ago

Hi Rich, I don't know if it's interesting to you, but Mostafa found ITIC works to Tr=0.95 if we include B3. From there, critical scaling extrapolation should be quite reliable. JRE

ramess101 commented 7 years ago

That is good news. I have been including B3 for Tr of 0.85, just to be safe. Can you elaborate on how you obtained B3? I.e. are you using REFPROP or the slope of (Z-1)/rho vs rho from simulation?

mrshirts commented 7 years ago

So here's an option. Since a decent amount of data is collected, and we're pretty sure that things like rho_l vary smoothly, why not construct a polynomial fit or a Gaussian process model from the data collected at step N, optimize on that surrogate model, simulate at the minimum, and then rebuild the model using N+1 simulations? Standard optimization methods try very hard NOT to assume smoothness in order to be robust, but if the functions are smooth there are alternate ways.
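A minimal sketch of that surrogate idea using scikit-learn's GP regressor (an assumed tool choice; `load_grid_results` is a hypothetical loader for the data already collected):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X, y = load_grid_results()   # hypothetical: X is (n_sims, 2) eps/sig pairs, y the objective

gp = GaussianProcessRegressor(kernel=RBF(length_scale=[0.05, 0.002]),
                              normalize_y=True).fit(X, y)

# Optimize on the cheap surrogate, then simulate at its minimum and refit
# with the (N+1)st result.
res = minimize(lambda theta: gp.predict(theta.reshape(1, -1))[0],
               x0=X[np.argmin(y)], method='L-BFGS-B',
               bounds=[(0.85, 1.15), (0.365, 0.385)])
```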

ramess101 commented 7 years ago

I do think that Gaussian process models could be useful, especially for Bayesian methods and for finding the optimum quickly. However, I wonder how smoothly the properties vary with respect to lambda. I think we could construct these surrogate models rather easily for a constant lambda in the 2-dimensional epsilon-sigma space, but it might be tricky if we include lambda as a third dimension.

But I would like to reiterate that I do not think the real problem is the optimizer; the optimization works within the MBAR space. So the Gaussian process model would only have a real benefit if it extrapolates reliably beyond where MBAR is accurate.

mrshirts commented 7 years ago

Yes, the idea would be to extrapolate beyond MBAR.

I think you should contact Josh Fass (on the Open Forcefield group), and we can perhaps set up a time to talk. I think the amount of data you have right now is enough to set something up and test how well it would work.

jrelliottoh commented 7 years ago

Hi Rich. Sorry for the slowness. Mostafa's test on B3/TrMax used the REFPROP value of B3. We would need to test how well our linear fits reproduce B3 to get a better idea of TrMax when using MD for B3. This should be straightforward to check with the TraPPE-UA model, for which B2-B6 are available from the Schultz-Kofke paper. I talked to Mostafa about doing this, but I don't know whether or when he will get to it.

ramess101 commented 7 years ago

I think you should contact Josh Fass (on the Open Forcefield group), and we can perhaps set up a time to talk.

I have started a direct message on Slack

ramess101 commented 7 years ago

Mostafa's test on B3/TrMax used the REFPROP value of B3. We would need to test how well our linear fits reproduce B3 to get a better idea of TrMax when using MD for B3.

OK, I have been using the REFPROP B3 as well. The Kofke approach might be more work than necessary, and I would think that our linear-fit estimate of B3 is pretty crude. But the main reason for using the REFPROP B3 value is that we don't need to perform simulations at all the different temperatures. I.e. estimating B3 along the isotherm is not too hard, but getting B3 at each Tsat would be laborious. Of course, the main downside of relying on REFPROP is the limited chemical space.
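For reference, the linear-fit estimate mentioned above amounts to the following: along a low-density isotherm, (Z - 1)/rho = B2 + B3*rho + ..., so a straight-line fit gives B2 as the intercept and B3 as the slope.

```python
import numpy as np

def virials_from_isotherm(rho, Z):
    """Estimate B2 and B3 from a linear fit of (Z - 1)/rho vs rho."""
    y = (Z - 1.0) / rho
    B3, B2 = np.polyfit(rho, y, 1)   # polyfit returns [slope, intercept]
    return B2, B3
```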

jrelliottoh commented 7 years ago

The point about Kofke's virials is that they are known already. I'm not suggesting we implement his method, just that we can evaluate the accuracy of our MD fit method by comparing to his B3.

J Richard Elliott


ramess101 commented 7 years ago

The point about Kofke's virials is that they are known already. I'm not suggesting we implement his method, just that we can evaluate the accuracy of our MD fit method by comparing to his B3.

OK, that makes sense. Yes, that could be very beneficial.