Scaling epsilon - Githubissues

ramess101 commented 6 years ago

I wrote a script that scales the energy by a factor (say 1.05) to show how MBAR would perform if you scaled all the epsilons by 1.05. However, I then realized that these results are for hexane, which has intramolecular interactions (bending, torsions, and 1-5, 1-6 nonbondeds). So by scaling U by 1.05 I was also scaling the intramolecular energies. Or at least that is what I thought would happen, but it turns out that the GOMC energies in the histograms are just the intermolecular nonbonded energies. This seems strange to me, since this is just the residual energy (neglecting the ideal gas contribution).

The fact that using the U_nb alone results in a good VLE curve means that it must be valid. However, MBAR typically requires the total energy. For example, for ITIC we used total energy to determine the weights and then used U_nb for the ITIC integration. The intramolecular energy is not constant with respect to temperature, so I cannot imagine that it just cancels out.

I need to talk with both Jeff and Michael about this.

ramess101 commented 6 years ago

For now, these are the results I got using MBAR-GCMC when I scale the rerun energies by 1.05:

Note that the critical temperature should increase.

ramess101 commented 6 years ago

I think it could be advantageous to start with Argon to make sure I get the expected epsilon dependence. And because Argon does not have any intramolecular contributions to worry about.

ramess101 commented 6 years ago

The "MBAR-GCMC" points are the actual 1.05 epsilon vapor-liquid coexistence curve, i.e. these results were obtained by performing GCMC simulations of the 1.05 epsilon force field. The Potoff points for epsilon (not scaled) are included as a point of reference:

The point here is that epsilon of 1.05 should increase the critical point and make the VLCC wider.

Note that a couple liquid densities did not converge properly. This is clear when plotting the chemical potential:

Here are the histograms for the 1.05 epsilon simulations:

ramess101 commented 6 years ago

In order to determine why the scaled epsilon results are not what they should be, I am now plotting the sampled states of the two force fields.

Here are the sampled states for eps:

Here are the sampled states for eps = 1.05 eps:

And here I have combined them on the same graph, without distinguishing by state points:

So these do appear to be matching very similar regions of the N-U space. This suggests that MBAR should be able to handle this reweighting well.

ramess101 commented 6 years ago

The MBAR-GCMC results that I plotted previously:

Are supposed to be the predicted VLE curve for eps=1.05 eps_Potoff with eps_ref = eps_Potoff. However, the updated code appears to do a better job of capturing the correct trend for the same test:

The points around 440 and 450 did not converge for some reason, and the near critical points could have issues with how I am determining what is a vapor and what is a liquid (Ncut).

This plot compares the actual 1.05 eps_Potoff results with those predicted with MBAR (ignore the anomalies for now):

From this figure we see that the predicted (green circles) over estimate the liquid densities compared to the direct simulation (blue squares).

ramess101 commented 6 years ago

This plot shows the one-dimensional histograms for N with the two force fields:

The shift in N for the liquid phase histograms could explain the difficulty in predicting some of the higher temperature VLE points. In other words, there are not enough samples in that region of N to predict reliable densities. I will need to think about that some more...

ramess101 commented 6 years ago

I believe I have located the problem why some state points do not converge to the correct mu. It is found in the Golden search optimizer. The bounds are assumed to have a large objective function than the guess. But this is not always the case. Should be an easy fix.

ramess101 commented 6 years ago

After fixing the Golden search optimizer, I was able to get better convergence.

Same trend as before, namely, MBAR shows some deviations at higher temperatures.

ramess101 commented 6 years ago

With MBAR-GCMC, I was able to evaluate 50 different values of epsilon between 0.98 and 1.02 epsilon_Potoff. Even though my code is not yet fully optimized, it only took 3 hours to generate these results on my laptop.

I do not yet have vapor pressures working, but rhov is a reasonable substitute. Clearly, the Potoff epsilon values are already very good for rhol and rhov of hexane, but this approach should allow for some refinement when needed.

ramess101 commented 6 years ago

Weidler and Gross recently published an article, Individualized force fields for alkanes, olefins, ethers and ketones based on the transferable anisotropic Mie potential, where they optimize a single epsilon scaling parameter for each molecule.

To provide a frame of reference for our results above, here are their optimized scaling factors for each molecule:

For hexane they report a value of 1.0004. This is on the same order of magnitude with the optimal scaling factor for the Potoff Mie potenial.

ramess101 commented 6 years ago

@msoroush @jpotoff @mostafa-razavi @jrelliottoh

I have analyzed the GCMC-GOMC output files for isobutane, neopentane, 2,2,4-trimethylpentane. With MBAR I was able to compute the percent deviation in rhol and rhov for a range of epsilon scaling factors.

The optimal epsilon scaling factor depends on the compound. For isobutane it is practically 1. For neopentane it appears to be slightly greater than 1, ca. 1.002. While 2,2,4-trimethylpentane has the largest deviation from 1, with a value around 0.995.

Note that by performing this analysis with only 5000 snapshots (instead of O(100,000) as provided) I can generate these plots for 50 different epsilon scaling factors in just a few minutes.

The red square and blue circle are the deviations calculated using the literature values reported by Mick et al. This serves as a reference point to verify that the MBAR results are consistent for epsilon/epsilon_Potoff = 1. The vapor densities show the largest difference between MBAR and histogram reweighting. This discrepancy tends toward zero if we include all of the snapshots, instead of this much smaller subset.

jpotoff commented 6 years ago

@ramess101 @msoroush @mostafa-razavi @jrelliottoh These are the molecules used in the optimization of the CH and C parameters.

There was some compromise involved in optimizing the parameters because there are slight shifts in the optimal parameters going from molecule to molecule.

msoroush commented 6 years ago

@ramess101, @jpotoff

I can add the histograms for the rest of the branched molecules by tomorrow.

msoroush commented 6 years ago

@ramess101, I added all 32 branched molecules.

ramess101 commented 6 years ago

@msoroush

Great! Thanks!

ramess101 commented 6 years ago

Now that we have vapor pressure working, we can repeat this analysis with rhov, rhol, and Psat all plotted:

Note that I generated these figures using only 5 epsilon values. Also, since I am trimming the number of snapshots I decided to shift the AD curves such that they equal the Potoff AD values at eps/eps_Potoff=1. In the future I will rerun these with 50 epsilon values, without trimming the snapshots and without shifting AD to match.

ramess101 commented 6 years ago

A significant bug was found in my hexane simulations for epsilon scaled by 1.05. Namely, eps_CH2 was properly scaled (64.05/61 = 1.05) but eps_CH3 had the wrong value (121.3125/121.25 = 1.0005). The correct value is 127.3125 K. More than half of hexane's interactions do not involve CH3 (16 CH2-CH2 interactions, 4 CH3-CH3 interactions, and 8 CH3-CH2 interactions), but this should still impact the coexistence curve somewhat. This could explain the discrepancy reported previously between MBAR predicted from GCMC with Potoff and MBAR with direct eps_scaled GCMC:

ramess101 commented 6 years ago

I am now rerunning the 1.05 epsilon_Potoff GCMC simulations to see if I get better agreement with the MBAR epsilon_ref = epsilon_Potoff simulations. I would expect that for a larger epsilon the critical temperature will increase which should lead to better agreement at high temperatures.

ramess101 commented 5 years ago

@msoroush @jpotoff

I have repeated the epsilon scaling process for the alkynes that have REFPROP data.

mrshirts commented 5 years ago

Can you summarize what this should be telling us? Is this an expected result, does it support/refute a hypothesis?

ramess101 commented 5 years ago

@mrshirts

In a recent study, Joachim Gross proposed "Individualized", a.k.a. non-transferable force field parameters. The idea is that for some molecules we have a large amount of experimental data and we want to match these data very accurately. However, to avoid overfitting Gross proposed a 1-dimensional optimization, where we start with a transferable force field (they used TAMie, we will use Potoff Mie) and then we scale all the epsilons by the same factor. This is a very easy application of MBAR, because we don't even need to store the configurations. We can just scale all the non-bonded energies by the epsilon scaling factor.

In a previous post I reported the Gross optimal epsilon scaling values. They were typically between 0.99 and 1.01. This demonstrates that the TAMie transferable parameters are already really good, but some improvement is possible with this epsilon scaling. We observe a similar range for the Potoff force field, suggesting that Potoff is also already quite transferable. Whether this was an expected result or not depends on your belief in transferability. But it does agree with the TAMie results, and it demonstrates a simple application of MBAR that does not require configurations or basis functions.

ramess101 commented 5 years ago

@jpotoff @msoroush

I would like to report eps_scaling optimal values for the branched alkanes and alkynes that we have in REFPROP. To be consistent with your previous parameterizations, I would like to use the same scoring function. However, I am not 100% certain how to include the derivative terms. For Hvap, did you just perform a linear interpolation of Psat vs 1/T? Currently, I am only including the rhol, rhov, and Psat non-derivative terms in my scoring function. Is that going to be a problem? Also, we are likely using different data sets. I don't recall whether you used REFPROP, DIPPR, etc.

msoroush commented 5 years ago

@ramess101

We used DIPPR for alkyne paper. We included the derivative of error in each properties. We calculate the error of rhol, rhov, and Psat at each temperature and calculate the slope of error with respect to T. This parameter helps to optimize a parameter where error is same across all temperature and avoid calculated VLE data crossing the experimental data.

jpotoff commented 5 years ago

@ramess101 @msoroush
Reference experimental data used in the model optimization process were taken from the NIST Chemistry WebBook for isobutane, neopentane, and 2-methylbutane. Everything else for branched alkanes came from Smith, B. D.; Srivastava, R., Thermodynamic Data for Pure Compounds: Part A Hydrocarbons and Ketones. Elsevier: Amsterdam, 1986.

ramess101 commented 5 years ago

@msoroush

OK, so to approximate the slope do you just fit a line to the error of rhol, rhov, and Psat? If so, do you include just neighboring points or several points in the linear regression?

ramess101 commented 5 years ago

@jpotoff

OK, so the NIST chemistry WebBook is typically taken from REFPROP when the compounds are in REFPROP, which is true for the three you mentioned.

msoroush commented 5 years ago

@ramess101 We did not fit to linear line. We calculated slope for neighboring temperature (Ti+1 - Ti) and then take an average over all temperature.

ramess101 commented 5 years ago

@msoroush

Well if you determined the slope from just two temperatures (Ti+1) and Ti, then you are essentially fitting a linear line to two data points, correct?

To clarify, when computing this term:

You first compute err(rhol(Tj)) using:

You then compute the derivative using neighboring points. For example, the derivative of j=0 would be:

(Err(rhol(T1))- Err(rhol(T0))/(T1-T0)

msoroush commented 5 years ago

Yes it is correct. I assume that at the end, you will divide it by total number of temperature point.

ramess101 commented 5 years ago

@msoroush

I assume that at the end, you will divide it by total number of temperature point.

Yes.

ramess101 commented 5 years ago

@jpotoff @msoroush

OK, I have performed a preliminary analysis for 5 branched alkanes and 3 alkynes. I have used the REFPROP data values, since I already had those ready. It will be fairly straightforward to utilize the DIPPR values for the alkynes, although the difference will likely be negligible. The scoring function includes the derivative terms, but it does not include heat of vaporization yet, as I still need to roll that into MBAR. But the preliminary results suggest that the 1-dimensional epsilon scaling is going to be very close to 1. Which again suggests that the Potoff force field is already quite transferable. In fact, I think that if we performed a UQ analysis we could find that the Potoff and epsilon scaled Potoff force fields are indistinguishable.

Here are the optimal epsilon scaling values to the same precision as reported by Gross:

{'1butyne': 1.0037, '224TMPentane': 0.9997, '22DMPropane': 1.0024, '2MButane': 1.0027, '2MPentane': 1.0039, '2MPropane': 1.0014, 'ethyne': 1.0000, 'propyne': 0.9980}

Here are the 1-dimensional optimization plots. I used the GOLDEN search optimization scheme, which typically converges in approximately 10 iterations. Notice that the optimal is quite noisy. This is due to the numerical uncertainty in the simulation results. We could use a more robust optimizer that handles stochastic optimization better and/or we could include the simulation uncertainty by randomly resampling which snapshots are included in the analysis. This might be overkill though.

These were the optimal scores:

{'1butyne': 1.2064697029571354, '224TMPentane': 0.40068996230987253, '22DMPropane': 0.49816089166059713, '2MButane': 0.37726031361206847, '2MPentane': 0.49362709119365411, '2MPropane': 0.38971457813168026, 'ethyne': 0.45227684766174286, 'propyne': 0.45463054764025351}

mrshirts commented 5 years ago

include the simulation uncertainty by randomly resampling which snapshots are included in the analysis.

This is probably the better, cleaner way to do it.

ramess101 commented 5 years ago

@jpotoff @msoroush

I have pulled all the alkyne data from DIPPR. Because DIPPR doesn't report rhov values, I realized that the scoring function in the alkyne paper only uses rhol and Psat (with modified weights). So I will make sure to repeat the optimizations using only those two properties (and their derivative errors).

I also verified that the NIST webbook data you used for isobutane, neopentane, and 2-methylbutane are exactly the same as REFPROP v 10.1. The webbook data were from REFPROP v 7, but the values have not changed since then.

I'm not sure what to do about the other branched alkanes with data from Smith, B. D.; Srivastava, R., Thermodynamic Data for Pure Compounds: Part A Hydrocarbons and Ketones. Elsevier: Amsterdam, 1986. I know DIPPR and TDE considered these values when developing their own correlations. I'm not sure about REFPROP. But obtaining the values from the original text would be tedious at best. If you have text files with these values already, I would greatly appreciate you sending them to me. Otherwise, our other options are to use REFPROP (when available) and/or DIPPR/TDE. Any preference? I am pretty confident the difference in the data will be negligible, but it would be nice to report that we used the same data set as the original studies.

jpotoff commented 5 years ago

@ramess101 @msoroush I'm sure we have a spreadsheet with all of the Smith and Srivastava data somewhere.

ramess101 commented 5 years ago

@msoroush @jpotoff

We included the derivative of error in each properties. We calculate the error of rhol, rhov, and Psat at each temperature and calculate the slope of error with respect to T. This parameter helps to optimize a parameter where error is same across all temperature and avoid calculated VLE data crossing the experimental data.

I am somewhat confused again regarding the derivative term in the scoring function. Should we actually use the absolute value of the slope (derivative)?

From the way the equation is written, it appears that we calculate the absolute value of the relative difference, and then compute the slope from that with respect to temperature. But then the slope will be negative when the absolute error decreases with respect to temperature. However, don't we want the error to be close to constant, i.e., a slope near zero? So we don't want the slope to be a large positive or large negative value.

I think we want to take the absolute value of the slope, because otherwise the scoring function decreases when the slope is negative, which would favor our fits being better at high temperatures than low temperatures. Does that make sense?

jrelliottoh commented 5 years ago

If you use DIPPR to get properties, be sure to limit the range of temperature to where they have experimental T-dependent data. Also, for vapor pressure, DIPPR does not cross list boiling temperature as a boiling point. e.g. imidazole: several experimental Tb values are listed but the highest vp listed in T-dep data is about 7kPa. I might have some text files ready for the compounds of interest. Let me know any compounds you need after Mohammed finds what he has.JRE

On Thursday, October 18, 2018, 3:42:46 PM GMT-3, Richard Messerly <notifications@github.com> wrote:

@msoroush @jpotoff

We included the derivative of error in each properties. We calculate the error of rhol, rhov, and Psat at each temperature and calculate the slope of error with respect to T. This parameter helps to optimize a parameter where error is same across all temperature and avoid calculated VLE data crossing the experimental data.

I am somewhat confused again regarding the derivative term in the scoring function. Should we actually use the absolute value of the slope (derivative)?

From the way the equation is written, it appears that we calculate the absolute value of the relative difference, and then compute the slope from that with respect to temperature. But then the slope will be negative when the absolute error decreases with respect to temperature. However, don't we want the error to be close to constant, i.e., a slope near zero? So we don't want the slope to be a large positive or large negative value.

I think we want to take the absolute value of the slope, because otherwise the scoring function decreases when the slope is negative, which would favor our fits being better at high temperatures than low temperatures. Does that make sense?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

ramess101 commented 5 years ago

@jrelliottoh @jpotoff

I agree that it is important to verify that the DIPPR correlations are reliable over the temperature region we use in the optimization. However, I also want to use the same data (including the same temperature range) as Mick et al. and Barhaghi et al. so that this epsilon-scaling analysis is consistent with their previous work. We should certainly verify the ranges used by Mick et al. and Barhaghi et al. for future considerations though.

ramess101 commented 5 years ago

@jpotoff @mrshirts @msoroush @jrelliottoh

Here is the current figure for epsilon scaling to be used in the manuscript.

You will notice that I decided not to perform the analysis for all of the branched alkanes. Although @msoroush did provide me with the Smith and Srivastava data, I made this decision for two reasons. First, I think it would just be a mess to include 20+ compounds on one panel. Second, the epsilon scaling approach is only recommended for compounds that have highly reliable data. So I decided to just analyze the branched alkanes that are in REFPROP, which gives us enough compounds for a good figure and ensures that the data are of high quality.

I am still trying to figure out if there is a way to get reduce the large fluctuations around the optimal. Also, these results are from the golden ratio optimization. I am going to repeat this using evenly spaced values of psi (around 20 values) so that there are fewer points near the optimal and more points at the extreme values of psi.

ramess101 commented 5 years ago

I think the branched alkanes are noisier because they include vapor densities in the scoring function. Vapor density estimates have larger uncertainties and so the scores are a little more sporadic.

ramess101 commented 5 years ago

@jpotoff @mrshirts @msoroush @jrelliottoh

OK, I found the problem. It was not an MBAR issue or a scoring function issue, it was a tolerance issue. Specifically, the tolerance was too high that determines when the pressures of the two phases are equal (the integration of the exp(Ci)).

Here are the results for a single compound after I changed the tolerance from 0.0001 to 0.000001:

Note that this 2 orders of magnitude decrease in tolerance require 9 more iterations to reach convergence (22 instead of 13, so almost a factor of 2 increase). Obviously we could run some diagnostics to determine the most "stable"/optimal tolerance value, but for now I am just going to keep this as the new tolerance.

ramess101 commented 5 years ago

@jpotoff @mrshirts @msoroush @jrelliottoh

Here is the updated figure:

Note that 3-methylpentane is somewhat of an outlier for branched alkanes. Previously I used the "generalized" MiPPE parameters for 3MC5 because the optimal psi was closer to 1(around 1.005, see previous comment). Plotted here is using the "S/L" MiPPE parameters as the base epsilon for all compounds.

mrshirts commented 5 years ago

Looks pretty!

ramess101 / MBAR_GCMC

Scaling epsilon #4