ramess101 / IFPSC_10

Submission to Industrial Fluid Properties Simulation Challenge 10

Uncertainty in CH and C parameters #17

Closed ramess101 closed 5 years ago

ramess101 commented 6 years ago

@jpotoff @msoroush @mostafa-razavi @jrelliottoh

In a previous study we developed joint distributions for the eps_CH3, sig_CH3 and eps_CH2, sig_CH2 parameter sets. These various combinations are denoted as MCMC parameter sets because we used Markov Chain Monte Carlo to sample from the posterior distribution using Bayesian inference. The results from this analysis are shown below (note that I have cut out parts of these figures that would only distract from our discussion):

image

image

The question remains as to how we want to quantify the uncertainty in CH and C. Although I could perform the same Bayesian analysis for these sites, it would take considerable effort and might not be the best approach anyway (different objective function, data sets, compounds, etc.). Rather, I propose that we just use the uncertainties as they have been provided in Mick et al. Here are the heat maps for these site types:

image

From these heat maps I can construct an approximate multivariate normal distribution. All I need to do is assign the variances in eps_CH, sig_CH, eps_C, and sig_C, along with their corresponding covariances. This is actually quite easy, although it requires some decision making on our part. This type of uncertainty is a hybrid of Type A and Type B: we are using statistical measures (the scoring function), but we are not using statistical methods to determine which scoring function values we would consider "acceptable." We could attempt to use an F-test of sorts, but with a multi-property weighted scoring function this gets more ambiguous.

Instead, I use the variation between the short, general, and long parameters as well as the scoring function to assign the following standard deviations:

eps_CH: 0.5 K
sig_CH: 0.05 Å
eps_C: 0.1 K
sig_C: 0.08 Å

The covariance between eps and sigma was assigned by visual inspection of the scoring function.
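A minimal sketch of how such a distribution could be sampled for one site type, using only the Python standard library. The standard deviations for CH are the ones listed above; the correlation coefficient `rho` is a placeholder (its value was assigned visually and is not stated here), and the samples are expressed as offsets from the central parameter set rather than as absolute values. The function name `sample_eps_sig` is hypothetical.

```python
import math
import random

def sample_eps_sig(sd_eps, sd_sig, rho, n, mean_eps=0.0, mean_sig=0.0, seed=0):
    """Draw n (eps, sig) pairs from a bivariate normal with correlation rho."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        # Apply the 2x2 Cholesky factor of the covariance matrix by hand:
        # eps depends on z1 only, sig mixes z1 and z2 to get correlation rho.
        eps = mean_eps + sd_eps * z1
        sig = mean_sig + sd_sig * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        samples.append((eps, sig))
    return samples

# Standard deviations for CH from the post; rho = 0.9 is an assumed placeholder.
ch_offsets = sample_eps_sig(sd_eps=0.5, sd_sig=0.05, rho=0.9, n=200)
```

Adding each offset to the chosen central eps and sig values gives the sampled parameters. Changing `rho` is the direct way to test whether the assumed eps/sig correlation is too strong or too weak.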

Here is my first pass at assigning uncertainties. Since the challenge compound is a "long" branched alkane, I am using the "long" parameter set as my maximum likelihood estimate. Note that I use the same plot region as that of Mick et al. in an attempt to help visualize the scoring function compared with the MCMC points:

image

image

Does anyone object to these uncertainties? In other words, do you think they are too large, too small, or that the correlation between eps and sig is too strong or too weak?

If no one objects, I will begin simulating 200 different MCMC parameter sets which are independently sampled from the CH3, CH2, CH, and C parameter spaces.
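As a rough illustration of what "independently sampled" means here, the sketch below builds full parameter sets by drawing one (eps, sig) pair per site type, with each site type drawn without reference to the others. The function name, the dictionary layout, and the tiny sample pools are all hypothetical; in practice the pools would be the MCMC chains for CH3 and CH2 and the sampled distributions for CH and C.

```python
import random

def combine_site_samples(site_samples, n_sets, seed=0):
    """Build n_sets parameter sets, drawing each site type independently."""
    rng = random.Random(seed)
    combined = []
    for _ in range(n_sets):
        param_set = {}
        for site, samples in site_samples.items():
            # eps and sig stay paired (they are correlated within a site),
            # but the choice for one site ignores the choices for the others.
            eps, sig = rng.choice(samples)
            param_set[f"eps_{site}"] = eps
            param_set[f"sig_{site}"] = sig
        combined.append(param_set)
    return combined

# Placeholder pools of (eps, sig) pairs; the numbers are not real parameters.
pools = {
    "CH3": [(1.0, 0.10), (1.1, 0.11)],
    "CH2": [(2.0, 0.20), (2.1, 0.21)],
    "CH": [(3.0, 0.30), (3.1, 0.31)],
    "C": [(4.0, 0.40), (4.1, 0.41)],
}
parameter_sets = combine_site_samples(pools, n_sets=200)
```

Keeping each (eps, sig) pair intact preserves the within-site correlation, while the independent draws across sites encode the assumption that CH3, CH2, CH, and C are uncorrelated with one another.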

Alternatively, if @jpotoff or @msoroush have the actual scoring function values, we could fit a model to this surface and perform MCMC on that model. This would certainly be more rigorous, but I was not sure if the scoring function values are readily accessible for all the different parameter sets.

jrelliottoh commented 6 years ago

I think this still assumes that the CH3 on a linear alkane is the same as the CH3 on a branched alkane. I have doubts about that. I agree that we should get a baseline with this hypothesis, but we should also plan for testing the hypothesis. Will your proposed approach lend itself to testing that next hypothesis? JRE

Sent by email in reply to Richard Messerly's comment of Wednesday, August 15, 2018, 5:22:15 PM EDT.


ramess101 commented 6 years ago

@jrelliottoh

Thanks for the reply.

Yes, this assumes that the CH3 parameters are transferable. It actually assumes that CH3, CH2, CH, and C are all independent of one another (while the epsilon and sigma of each site are strongly correlated). So it does not address that issue. I simply do not think we have enough time to reparameterize the CH3 sites before the October 7th deadline. But we can try to push towards that if we think it is necessary.

ramess101 commented 6 years ago

I think this issue needs to be revisited: https://github.com/ramess101/MBAR_ITIC/issues/12

Here we suggested that the uncertainty can be reduced for the MCMC sets by subsampling all 200. I now have reason to believe that this is not adequately representing the uncertainty. I think we do actually want to subsample 40 MCMC sets from the 200 and then bootstrap from these 40. More results to come...
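One possible reading of "subsample 40, then bootstrap from these 40" is sketched below. It assumes each of the 200 parameter sets yields a single predicted property value and that the quantity of interest is a 95% interval on the mean; both are assumptions, as is the function name. The linked issue may define the procedure differently.

```python
import random
import statistics

def subsample_then_bootstrap(values, n_sub=40, n_boot=1000, seed=0):
    """Subsample n_sub values without replacement, then bootstrap their mean."""
    rng = random.Random(seed)
    subset = rng.sample(values, n_sub)  # the 40 sets that would be simulated
    boot_means = []
    for _ in range(n_boot):
        # Resample the subset with replacement and record the mean.
        resample = [rng.choice(subset) for _ in range(n_sub)]
        boot_means.append(statistics.fmean(resample))
    boot_means.sort()
    lower = boot_means[int(0.025 * n_boot)]
    upper = boot_means[int(0.975 * n_boot) - 1]
    return lower, upper

# Placeholder predictions for 200 parameter sets (synthetic numbers).
data_rng = random.Random(42)
predictions = [data_rng.gauss(10.0, 1.0) for _ in range(200)]
low, high = subsample_then_bootstrap(predictions)
```

Repeating the call with different `seed` values shows how much the interval depends on which 40 sets happen to be drawn.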

ramess101 commented 6 years ago

Thank you @msoroush for providing me with the scoring function values! After analyzing these results, I got MCMC parameter sets fairly similar to what I had before.

Here are my previous results (where I assumed a multivariate normal distribution and assigned the standard deviations based on the Potoff Figures):

image

Here are the results where I actually use the scoring function values without assuming a normal distribution:

image

image

The plots are quite similar. I interpolated the scoring function values whenever sigma or epsilon is within the "simulated region" but I used my normal distribution from before outside of this region. So for CH about half of the points are from interpolation whereas the other half are from the model fit.
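The hybrid target described above can be sketched as follows: interpolate the tabulated scores inside the simulated region, fall back to a bivariate normal outside it, and sample with a plain Metropolis walk. This is an illustration only. Bilinear interpolation, the mapping from score to log-probability (`-score / scale`), and all function names are assumptions, not the method actually used.

```python
import math
import random

def bilinear(xs, ys, values, x, y):
    """Interpolate values[i][j], tabulated at (xs[i], ys[j]), inside the grid."""
    i = max(k for k in range(len(xs) - 1) if xs[k] <= x)
    j = max(k for k in range(len(ys) - 1) if ys[k] <= y)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    u = (y - ys[j]) / (ys[j + 1] - ys[j])
    return ((1 - t) * (1 - u) * values[i][j] + t * (1 - u) * values[i + 1][j]
            + (1 - t) * u * values[i][j + 1] + t * u * values[i + 1][j + 1])

def log_normal(eps, sig, mean, sd, rho):
    """Unnormalized log-density of a bivariate normal."""
    x = (eps - mean[0]) / sd[0]
    y = (sig - mean[1]) / sd[1]
    return -(x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho * rho))

def make_log_target(grid_eps, grid_sig, scores, mean, sd, rho, scale=1.0):
    """Interpolated score inside the simulated region, normal model outside."""
    def log_target(eps, sig):
        inside = (grid_eps[0] <= eps <= grid_eps[-1]
                  and grid_sig[0] <= sig <= grid_sig[-1])
        if inside:
            # Assumed mapping: a lower score means a higher probability.
            return -bilinear(grid_eps, grid_sig, scores, eps, sig) / scale
        return log_normal(eps, sig, mean, sd, rho)
    return log_target

def metropolis(log_target, start, step, n, seed=0):
    """Random-walk Metropolis sampler over (eps, sig)."""
    rng = random.Random(seed)
    current = start
    current_lp = log_target(*current)
    chain = []
    for _ in range(n):
        proposal = (current[0] + rng.gauss(0.0, step[0]),
                    current[1] + rng.gauss(0.0, step[1]))
        lp = log_target(*proposal)
        if rng.random() < math.exp(min(0.0, lp - current_lp)):
            current, current_lp = proposal, lp
        chain.append(current)
    return chain
```

Because the two branches are not matched at the edge of the grid, the target is discontinuous there, which would produce an abrupt change in sample density at the "simulated region" boundary.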

Note that I only plotted 200 MCMC points for clarity. To really see the distribution, you should include more samples. For example, here is C with 5000 points:

image

You can see more clearly the abrupt change at the "simulated region" boundary. Fortunately, C does not sample outside this region often and CH is well represented with a normal distribution.

The main difference between my previous results and these new MCMC sets is the pronounced curvature in the C MCMC sets, which a normal distribution could not reproduce. Again, this is most obvious when you plot more MCMC points.