Bayesian Inference - Githubissues

ramess101 commented 7 years ago

One of my main concerns related to the implementation of Bayesian inference with ITIC is how to implement experimental data. ITIC works well when comparing to REFPROP because we can evaluate REFPROP predictions for rhol, Psat, etc. at any temperature. However, for Bayesian inference (and really any optimization) we probably want to use experimental data. Now, this might not necessarily be the case if we want to use REFPROP to predict U and Z along the IT and ICs, but for the purpose of VLE optimization I think we should use data. Furthermore, when we go to compounds beyond REFPROP we would need to either use experimental data or smoothed TRC correlations.

So the main problem is that ITIC will give us VLE points at Tsat values that are different than Tsat_exp, i.e. we cannot prescribe the exact Tsat values. Instead, with ITIC we prescribe the rhol values. We could use experimental rhol for the isochores and compare Tsat_ITIC with Tsat_exp, but that is not an ideal solution in my opinion for a number of reasons. Conceptually, we usually consider Tsat the independent variable. Plus, this does not solve the issue of Psat. Also, it would make comparison to other force fields difficult since everyone reports their SSE/RMS with respect to rhol, Psat, etc. But the biggest reason of all is that this would necessitate running simulations along an IC that corresponds to every experimental data point.

I think the easiest and best way to obtain ITIC rhol, Psat at the experimental Tsats is to just fit a line to the ITIC points. In other words, fit the Antoine equation to Psat and rectilinear + scaling to rhol and rhov. These correlations usually give agreement with the simulation data to within the simulation uncertainty.

These plots show how well the smoothed correlations (black lines) match the simulation data (red points):

For liquid density I also included a green line. This line represents a fit to just liquid density data, whereas the black line was fit to rhol and rhov (effectively) through the rectilinear and scaling law relationship. The green line is essentially the same as the black line up until you extrapolate beyond the data. So the green line is useful for interpolation as it is a slightly better fit to rhol than the black line.

Again, the main advantage of using smoothed correlations of the ITIC results as opposed to the ITIC results themselves is that now we can perform a Bayesian analysis using all the experimental data we want.

ramess101 commented 7 years ago

The VLE_model_fit.py code is written to convert ITIC results into rhol, rhov, and Psat correlations. These are now implemented in parameter_space_SSE_analysis_Mie.py.

mrshirts commented 7 years ago

However, for Bayesian inference (and really any optimization) we probably want to use experimental data.

Correct.

Now, this might not necessarily be the case if we want to use REFPROP to predict U and Z along the IT and ICs, but for the purpose of VLE optimization I think we should use data. Furthermore, when we go to compounds beyond REFPROP we would need to either use experimental data or smoothed TRC correlations.

Yes.

So the main problem is that ITIC will give us VLE points at Tsat values that are different than Tsat_exp, i.e. we cannot prescribe the exact Tsat values.

OK.

Instead, with ITIC we prescribe the rhol values. We could use experimental rhol for the isochores and compare Tsat_ITIC with Tsat_exp, but that is not an ideal solution in my opinion for a number of reasons. Conceptually, we usually consider Tsat the independent variable. Plus, this does not solve the issue of Psat. Also, it would make comparison to other force fields difficult since everyone reports their SSE/RMS with respect to rhol, Psat, etc. But the biggest reason of all is that this would necessitate running simulations along an IC that corresponds to every experimental data point.

Hmm. I don't think I quite understood the whole thing here -- perhaps we can talk it over when we meet next week?

I think that our main thinking in Bayesian inference is that we use the data we have, and see how much that constrains the parameters, and by extension the properties computed for those parameters. In most cases, our solution is that we can use that information to tell us where to collect more data.

But that's the general case -- and the end point is "collect more data" or "be satisfied at the limits of your force field".

For this specific case, it's not clear that's enough.

ramess101 commented 7 years ago

Hmm. I don't think I quite understood the whole thing here -- perhaps we can talk it over when we meet next week?

After rereading my comment I can understand your confusion. The point I was trying to make is that for ITIC we get a Tsat that corresponds to the prescribed rhol (the isochoric density). We cannot guarantee that Tsat is the same as the experimental Tsat for the experimental rhol (or Psat). So we want to fit the ITIC results (rhol and Psat) to a line so that we can predict rhol and Psat at every temperature. Does that make more sense? If not, we can certainly include that in our conversation next week.

mrshirts commented 7 years ago

Hmm. Let's go over it Tuesday. If fundamentally one is trying to predict something that is the solution of a function of an expectation f()=0, especially if both the function and the estimator of x are approximate, then the error analysis gets a bit complicated.

ramess101 commented 7 years ago

We just spoke about this and concluded that we can use correlations to represent the molecular simulation results. The key is just to incorporate the fit uncertainty into the Bayesian analysis. Since ITIC only provides about 5 data points quantifying the fit uncertainty might not be reliable using standard methods. We might just want to assign a constant uncertainty that is related to the uncertainty in the molecular simulation results themselves.

ramess101 commented 7 years ago

@mrshirts

I have successfully integrated Bayesian inference with the MBAR-ITIC basis function methodology. Currently, I am running a Markov Chain for lambda = 16. My only concern is that it is taking a bit longer than I had hoped. Specifically, for 10,000 Markov Chain iterations it is going to take about 15 hours. This is manageable, but I would like to reduce this some more. I have not performed any rigorous timing analysis to determine where the bottleneck is, but my suspicion is that it is in solving the MBAR self-consistency equations. My code currently creates a new MBAR instance from scratch each time I propose a new "rerun" parameter set. According to the discussion below I should not have to do that:

In other words, I should be able to just solve the MBAR equations once and then for the non-sampled parameter sets I don't need to resolve the equations, right? It looks like this is what you did in the lj_bayesian code (lj_construct_ukln):

I see that when creating an MBAR instance you set "initial_f_k=f_ki". Is that the only real difference?

Currently this is what my code looks like:

So it looks like I am already calculating the FreeEnergyDifferences (Deltaf_ij) so do I just need to store Deltaf_ij? Then, in future instances of MBAR I just need to include "initial_f_k=Deltaf_ij"? If so, I think the biggest issue will be figuring out the shape/dimensions required by "initial_f_k".

The present system only has four reference parameter sets (epsilon, sigma), so maybe solving the MBAR equations is not taking very long. However, for systems with more reference parameter sets I will want to figure this out anyways.

mrshirts commented 7 years ago

Correct, adding unsampled states does not require any additional iterative math. That code was written by Levi, so I can't remember the details exactly. If you set the initial f_k equal to the solution, then it will at least just do one iteration or so before exiting. Getting it to do NO iterations would require a bit of tweaking, so I'd first see how much time you save by starting with the full-precision solutions from a previous step, and observing how many iterations it takes to solve the equations, and how much time is saved.

ramess101 commented 7 years ago

@mrshirts

OK, I think I have properly implemented this approach with the initial_f_k set to a stored value. However, I did not observe much of a speed-up. I did a simple speed diagnosis and I concluded that MBAR is not the slow step in this specific system (again I only have 4 references). That being said, I would like to verify that I am using the stored free energies properly.

Here is the output when I set verbose=True.

Notice in this screen shot that the final free energies for the four reference systems (column 0-3) are identical to the initial free energies. I believe this is expected. However, the final column (that corresponds to the unsampled parameter set) changes. I think it makes sense that the final column must change from the initial value (the initial value of the fifth column was arbitrarily set to be the same as the fourth column). But is this how it is supposed to be set up? In other words, are the free energies for the reference systems supposed to stay constant but not the nonsampled parameter set? Is there a way that I can check how many iterations MBAR used? I assume it was only a few since the other free energies are constant but it would be nice to see how many iterations I eliminated compared to having initial guesses of 0. I figure I can just add a counter and print statement to pymbar, but I was wondering if there is an easier way already coded up.

mrshirts commented 7 years ago

For the mbar initialization, you should be able to set the optional keyword verbose=true, and it SHOULD show you the number of iterations (mod possibly some changes done in the Chodera lab to use scipy optimizers)

The self-consistent iteration is only every done over the sampled states, so no need to worry about that. It will then go back and run the one-step process to get the free energies of the unsampled states. That's all done under the hood.

ramess101 commented 7 years ago

OK, I did not see any iterations output when verbose=True but I will figure that out. Based on what you said about the sampled and nonsampled states I think it is working properly. Thanks

ramess101 commented 7 years ago

@mrshirts

I have successfully implemented the Bayesian analysis for a 16-6 potential for ethane. Here is a plot of the parameter space that was sampled in the production stage of the Markov Chain. I have included the references used by MBAR and the Potoff value.

The main conclusions are that Potoff are within the uncertainty region and that the references provide an adequate coverage of the possible sigma and epsilon values (i.e. our results should be reliable).

Currently the color scale says that it represents logPosterior but it is really just log(Likelihood)+log(Prior) since Markov Chains do not require the normalization denominator of Bayes rule. Really all I want is for the colormap to give an idea as to the reliability of each parameter set. Michael, what do you think is the best way to plot the probability of each parameter set? I don't think logPosterior is technically correct and values of -4000 are not very meaningful. Should I try to calculate the true posterior? For that I would need the normalization term. Isn't there a way to calculate the normalization term after the Markov chain has finished? Do I integrate likelihood*prior of all the sampled points?

Anyways, although finding the best way to plot the results is important, I am really satisfied with the way the results came out.

ramess101 commented 7 years ago

@mrshirts

Here are plots of the properties used in the force field parameterization (rhol and Psat from 135-265 K):

Here are the percent deviation plots (where deviations are taken with respect to REFPROP):

Note that the Bayesian points are the MBAR_ITIC values obtained for a subset of the Markov Chain. Because the ITIC approach solves for Tsat iteratively the Bayesian points are not all at the same Tsat.

These uncertainties might seem a little bit large. Based strictly on the experimental data that would be correct. However, the uncertainty also accounts for the uncertainty in the ITIC approach. In other words, I expanded the uncertainty in the data by the uncertainty in the ITIC simulation results which I approximated. For example, ITIC has higher uncertainties and bias at higher temperatures because B3 becomes more significant. Therefore, I am in favor of more conservative uncertainties, although I may have been too conservative with my simulation uncertainty estimates. I think having the uncertainties in Psat being around 5-10% is probably better. This would require some iteration to find a good estimate of the simulation uncertainties.

* I recently discovered that I had an error in my MCMC code. The error was that the likelihoods from Psat were not being included in the summation. In other words, the results shown previously are actually the MCMC results when only rhol is considered. This makes more sense because the Psat uncertainties should not have been this large. In a future comment I will present updated MCMC results.

ramess101 commented 7 years ago

@mrshirts

Here are plots of properties not included in the parameterization. Although U and Z affect the ITIC results, they were not used directly in the Bayesian uncertainty analysis. So this is a moderate test of transferability and model adequacy outside of the training set. The results suggest that the Mie 16-6 potential is not capable of reproducing these properties at extreme temperatures and densities. However, it should be noted that the REFPROP correlations have higher uncertainties at those conditions. Quantifying that uncertainty is an ongoing task.

Here is Z (plotted as Z and Z-1/rho) along the isotherm:

Z for the isochores:

U for the isotherms:

And U for the isochores:

Clearly, even with the parameter uncertainties the Mie 16-6 potential cannot match rhol, Psat while also matching Z at high temperatures and high densities. The deviations for U are noticeable as well. Potoff has told me that he has some concern with heat capacities for the 16-6 potential. I think these results confirm that concern as the slope of U with respect to T has fairly strong deviations for the higher density isochores (IC0 and IC1).

ramess101 commented 6 years ago

@mrshirts

Michael, what do you think is the best way to plot the probability of each parameter set? I don't think logPosterior is technically correct and values of -4000 are not very meaningful.

So I found it helpful to normalize my logp plot by making the maximum logp = 0. Then I transform logp to just "p" so that it can be interpreted like a probability (although this is not strictly correct). This at least makes the maximum likelihood be p=1 and the different p values show a relative probability.

ramess101 commented 6 years ago

@mrshirts

Rather than perform a full MCMC run, I found it much faster to just scan the parameter space for a given value of lambda and calculate the log(posterior). The nice thing about this approach is that I can change how the posterior is calculated after the fact (i.e. change the standard deviations, the data set, etc.) and not have to repeat the whole MCMC run.

As stated in a previous comment, the results I presented above are erroneous because they only included rhol in the calculation of logp. By including Psat the uncertainty region is reduced considerably. This is mainly because the contours for rhol and Psat run in different directions, that is, rhol has a strong positive correlation between epsilon and sigma whereas Psat has a slightly negative correlation.

This can be best observed by plotting exp(logp-logp.max()) (as explained previously this is proportional to the probability) for rhol, Psat, and then combined. The plots that follow show the ~95% credible regions for different lambda values. I have included the reference points as squares with the corresponding colors.

This first figure is for just liquid density:

Notice the strong correlation in the parameters.

This is for vapor pressure

Note that the scan starts at 0.375 for all values of lambda and was cut-off at 0.38 nm for lam = 14-16 but was extended for lam =17-18. So these regions do NOT encapsulate the entire 95% credible region for vapor pressure. Vapor pressure is much more sensitive to epsilon than to sigma, so this region extends very far in sigma. For my purposes I did not extend the scan further because I was mainly concerned in the credible region where both rhol and Psat are included.

The combined credible region is:

Notice that most of the correlation between epsilon and sigma has been removed by combining rhol and Psat. Although we have greatly improved the uncertainty using rhol and Psat, it appears that we are not able to discern between the different values of lambda using just these two properties over this temperature range. We either need to extend the range of rhol and Psat or we need to include additional properties. For example, as shown previously, U and Z along the isochore can provide a lot of insight into the optimal value of lambda.

It is important to discuss the error model that was used to obtain these results. Specifically, recall that I combine the experimental uncertainty (which is very small) with the simulation uncertainty which I attribute to the simulation package, the ITIC analysis, and the second virial coefficients. This plot shows the deviation between the REFPROP correlation with respect to the TRC source experimental data. The uncertainty that TDE assigned and my simulation uncertainties are included.

My simulation uncertainties are empirical error models that were determined by comparing my VLE results with those of Potoff and Siepmann (see Issue #2). In brief, liquid density is very consistent at lower T but can deviate strongly at higher T. I used 0.75 Tc as a switch-point. Vapor pressure can deviate strongly at lower T (below 0.6 Tc) and can still show fairly strong deviations at higher T. Essentially the simulation uncertainty is showing the possible deviations we could have if we compared our ITIC-Gromacs results with another method, like GEMC-Cassandra. If we reduce these uncertainties, either by improving our methods or understanding better why our results deviate from Potoff and Siepmann, we would get smaller credible regions from our analysis.

@mostafa-razavi

How do you feel about my simulation uncertainty model? Do you think that this is too conservative? Recall the deviations I demonstrated previously in Issue #2. Are those anomalies? In your experience, do the results you get using ITIC-MD agree better with GEMC and GCMC than my error models suggest? In your thesis you demonstrated small deviations in vapor pressure but, as I understand it, this was using NIST values rather than simulations.

When you compared different simulation packages the deviations seemed to be quite a bit larger:

If you quantified these deviations, would they be <<1% in rhol and <<5% in Psat?

mrshirts commented 6 years ago

I think with only 2-3 parameters, then full enumeration of the posterior makes sense. For higher dimensions, it's not so easy to do a full parameter sweep.

Getting different results with Psat and Rho makes sense.

the different values of lambda using just these two properties over this temperature range.

Looks like it. How many temperatures did you look at?

It would be interesting to see the difference in total probability for the different high-probably regions with different lambda; Maybe a 3D plot instead of a heatmap? Or give the fraction of the total probability in each lambda region.

ramess101 commented 6 years ago

@mrshirts

I think with only 2-3 parameters, then full enumeration of the posterior makes sense. For higher dimensions, it's not so easy to do a full parameter sweep.

True.

How many temperatures did you look at?

This includes all experimental data from 137-260 K for both properties. ITIC is limited to this range (0.45-0.85 Tc). We could probably go to lower temperatures but I do not think that would be very beneficial.

Or give the fraction of the total probability in each lambda region.

If we reduce the uncertainty in our simulation output so that the error model looks like this:

We not only get smaller uncertainties in epsilon and sigma for a given lambda (as expected when we reduce the uncertainty in the likelihood), but we get a different lambda distribution:

The primary difference between the values of lambda is the vapor pressure curve. That is, the optimal lambda=14 values tend to overpredict Psat, especially at low T. While lambda = 18 underpredicts Psat, especially at low T. Lambda=15 and 16 do the best job for Psat at all temperatures. So by reducing the uncertainty in Psat, especially at low T, we are favoring lambda of 15 and 16 more.

ramess101 commented 6 years ago

@mrshirts @mostafa-razavi

Since the error model that we associate with our simulation results greatly influences the information gain from Bayesian inference, I wanted to focus on this a little bit more. In Issue #2 I pointed out the differences I observed between my results and those of Potoff and Siepmann for propane and butane. Here I present a similar comparison for ethane, since this is really what we want to know anyways.

Here the different points represent the percent deviation between the values I obtain using the Potoff and TraPPE models with ITIC and those values reported, respectively, by Potoff et al. and Siepmann (for both the original 1998 study by Martin et al. and the TraPPE 2014 validation).

The blue and red dashed lines demonstrate two uncertainty models that I have used in my Bayesian analysis. The red one is more conservative and captures almost all the data, at least to within their uncertainties. By contrast, the blue model has smaller uncertainties and although it agrees well for rhol it misses almost all of the TraPPE values for Psat, despite the large uncertainties. If we want to use the smaller uncertainties in Psat, we need to have a good reason to trust the Psat values Potoff obtained with GCMC more than those obtained by Siepmann with GEMC. I know that GEMC struggles to converge for vapor pressures at very low temperatures, but at these temperatures I would be surprised if Siepmann's GEMC results were not converging.

Remember, this comparison has nothing to do with force field accuracy, since I used the same force fields as the respective authors.

ramess101 commented 6 years ago

Here are the updated MCMC results for the 16-6 potential using the "Original Simulation Uncertainties":

Obviously, the uncertainties are much smaller compared to my previous post, where only rhol was considered.

ramess101 commented 6 years ago

@mrshirts

I am working on some figures for the Mie-Bayesian manuscript. Here is the figure where I compare different lambda values for ethane.

The primary conclusions from this are:

The evidence supports a 15-6 and 16-6 almost equally
A significant drop-off in evidence is observed for the 14-6 and 17-6
The deviations in liquid density are all pretty similar, except that the 14-6 has a much higher RMS
The deviations in vapor pressure show systematic trends, where the 15-6 is most centered around 0
Even the 14-6 over predicts pressure at high pressures by about 5%. The deviations get worse for higher lambda.

ramess101 commented 6 years ago

I have modified the figure so that the insets all use AD%, since that is the metric I have been using through the paper. I have also removed the lambda evidence, since that probably can have it's own figure. Now I include a 2-d AD% plot to show the Pareto front that is formed for the different values of lambda:

ramess101 commented 6 years ago

This updated figure includes the 13-6 results:

Clearly the 13-6 is best for extrapolating to high pressures, but it is the worst for VLE

ramess101 commented 6 years ago

@mrshirts

I have included this figure in the manuscript to demonstrate the Pareto front in more detail:

ramess101 / MBAR_ITIC

Bayesian Inference #8