ramess101 / MBAR_ITIC

Merging the MBAR and ITIC methodologies to optimize a TraPPE force switch model

Viscosity (PoU) #12

Open ramess101 opened 6 years ago

ramess101 commented 6 years ago

We are interested in how well the Mie potential can predict viscosity. Preliminary results show that viscosity is overpredicted by the Potoff 16-6 potential and underpredicted by TraPPE 12-6. This is primarily true at higher densities (higher viscosities), so we are testing the Mie potential for ethane at a saturation temperature of 137 K and a density of 600 kg/m3.

One goal of this exercise is to compare the magnitudes of the force field parameter uncertainty and the statistical/numerical uncertainty. The numerical uncertainty in viscosity is fairly large, so originally I thought that the parameter uncertainty would be small by comparison, at least for a given value of lambda. When varying lambda I expected large differences in viscosity, but I expected the contribution from the epsilon/sigma uncertainty to be negligible. However, I discovered that this is not necessarily the case. The epsilon/sigma uncertainties for the 16-6 potential are large enough that they increase the overall uncertainty in viscosity (i.e., the viscosity uncertainty is not just the statistical/numerical uncertainty). This can be seen in the following figure:

[figure: bootstrapped viscosity histograms for the 30 epsilon/sigma MCMC parameter sets (40 replicates each), with "upper", "lower", and black "combined" distributions highlighted]

This figure was obtained by performing 40 replicate simulations for each of 30 epsilon/sigma pairs obtained from MCMC. Each histogram (except for the black "combined" histogram) corresponds to a different epsilon/sigma parameter set, so each represents just the statistical/numerical uncertainty from the 40 replicates for a single epsilon/sigma. These uncertainties are obtained by bootstrapping the 40 replicates and repeating the fit of the Green-Kubo integral according to the method proposed by Maginn. The "upper" and "lower" non-transparent histograms are selected from the 30 epsilon/sigma sets to demonstrate that they are statistically different; that is, the numerical uncertainties of the "upper" and "lower" sets do not have significant overlap. The black "combined" histogram is obtained by summing the bins of all 30 epsilon/sigma parameter sets, and the 95% credible interval corresponds to this "combined" uncertainty.

Since the combined histogram is much wider and has a lower peak than any individual histogram, it is clear that the parameter uncertainty contributes significantly. This was pretty surprising. 40 replicate simulations is not a lot (in fact it is about the minimum) for a good viscosity estimate; with 100 replicates the statistical/numerical uncertainties would be even smaller and the impact of the parameter uncertainty would be even more pronounced. But for our purposes this is good enough. Note that despite these large uncertainties (about ±6%), the Mie 16-6 potential still overpredicts viscosity at this state point by about 30% (i.e., the lower 95% bound of ~0.455 is 30% greater than experiment/REFPROP). I will repeat this process for the 14-6 and 15-6 potentials.
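For reference, here is a minimal Python sketch of this per-parameter-set bootstrap, assuming the 40 running Green-Kubo integrals are precomputed; the double-exponential fit form is presumably that of the time-decomposition method of Zhang, Otani, and Maginn (J. Chem. Theory Comput., 2015), and the array names (`time`, `eta_t`) are illustrative, not from this repository:

```python
import numpy as np
from scipy.optimize import curve_fit

def double_exp(t, A, alpha, tau1, tau2):
    # Double-exponential form for the running Green-Kubo integral
    # (time-decomposition method of Zhang, Otani, and Maginn, 2015).
    return (A * alpha * tau1 * (1.0 - np.exp(-t / tau1))
            + A * (1.0 - alpha) * tau2 * (1.0 - np.exp(-t / tau2)))

def bootstrap_viscosity(time, eta_t, n_boot=1000, seed=None):
    """Bootstrap replicate running integrals and refit each resample.

    time  : (n_times,) array of times
    eta_t : (n_replicates, n_times) array of running GK integrals
    """
    rng = np.random.default_rng(seed)
    n_reps = eta_t.shape[0]
    etas = np.empty(n_boot)
    for i in range(n_boot):
        # Resample the replicates with replacement and average them.
        idx = rng.integers(0, n_reps, size=n_reps)
        mean_eta = eta_t[idx].mean(axis=0)
        popt, _ = curve_fit(double_exp, time, mean_eta,
                            p0=[mean_eta[-1], 0.5, 1.0, 10.0],
                            maxfev=10000)
        A, alpha, tau1, tau2 = popt
        # The t -> infinity limit of the fit is the viscosity estimate.
        etas[i] = A * (alpha * tau1 + (1.0 - alpha) * tau2)
    return etas  # histogram these for a single epsilon/sigma set
```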

The big question is: can we get the combined uncertainty with fewer than 40*30 = 1200 simulations? This is not that computationally intensive (I only ran 20 jobs in parallel and this finished in 30 hours), but it would be nice if we could use, say, 200 simulations. Can we just use 200 epsilon/sigma samples from MCMC and not perform any replicates? In this scenario you would not be able to estimate viscosity for a given epsilon/sigma (you need at least 10 replicates to do a reasonable job of that). Instead, the hope is that you could go directly to the overall uncertainty (parameter + numerical). The data analysis would be similar to how the numerical uncertainties are obtained for multiple replicates of a single epsilon/sigma; that is, we would randomly select which epsilon/sigma are included in the average. I believe, but am not certain, that the bootstrapped uncertainty will incorporate both parameter and numerical uncertainty when done this way: the numerical uncertainty should manifest itself through the large fluctuations in the Green-Kubo integral for a single run, while the parameter uncertainty should result from randomly sampling the various epsilon/sigma parameter sets. I have tried this approach with only 30 epsilon/sigma parameter sets and the results were inconclusive, so I am now going to try it with 200 parameter sets and see if the uncertainty is similar to the combined uncertainty shown above.
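As a rough sketch of what the analysis for this scenario might look like, reusing `bootstrap_viscosity` from the sketch above, with `eta_t_mcmc` as a hypothetical (200, n_times) array holding one single-run running integral per MCMC parameter set:

```python
# One single-run Green-Kubo integral per MCMC epsilon/sigma sample.
# Resampling which parameter sets enter the average should mix the
# parameter and numerical uncertainties in a single bootstrap.
etas_combined = bootstrap_viscosity(time, eta_t_mcmc, n_boot=1000)
lower, upper = np.percentile(etas_combined, [2.5, 97.5])  # 95% interval
```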

ramess101 commented 6 years ago

The results I got for 200 MCMC parameter sets do not appear to represent the overall uncertainty:

[figure: bootstrapped viscosity distribution from the 200 single-run MCMC parameter sets]

Notice that this uncertainty is much smaller than the "combined" uncertainty in the previous comment; it appears to be about the same size as the numerical uncertainties alone. This is disappointing, and I still don't understand why the parameter uncertainty does not manifest itself.

ramess101 commented 6 years ago

OK, I believe that the overall uncertainty IS the same. However, for a fair comparison of the "30 MCMC + 40 replicates" and the "200 MCMC, no replicates" results, you must be careful that their respective analyses are comparable. Originally (the figure in the previous comment) I bootstrapped all 200 MCMC samples at once. This effectively reduces the numerical uncertainty, which is proportional to 1/sqrt(N): the numerical uncertainty for the 200 MCMC samples should be about sqrt(40/200) ≈ 45% of that for the 40 replicates. Since numerical uncertainty is the primary contributor, the overall uncertainty will also be about half as large. This does not completely explain the large decrease, but it is a significant reason for the discrepancy. Specifically, the black region in the first comment (30 MCMC + 40 replicates) has a range of 0.505 - 0.455 = 0.05, while the blue region in the second comment (200 MCMC, no replicates) has a range of 0.48 - 0.465 = 0.015, a ratio of about 0.3.
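A quick check of this argument, comparing the predicted shrink factor against the observed ratio of the ranges quoted above:

```python
import numpy as np
print(np.sqrt(40 / 200))                  # predicted shrink factor, ~0.45
print((0.48 - 0.465) / (0.505 - 0.455))   # observed ratio of ranges, ~0.3
```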

If instead we randomly select 40 samples from the 200 MCMC samples (repeating this hundreds of times) and perform the bootstrapping on each subset of 40 samples, we get a region that is much closer to the 30 MCMC + 40 replicates results. The results should not be identical, because the 200 MCMC samples are a better representation of the true Markov chain than the 30 MCMC samples.

[figure: viscosity distribution from bootstrapping random subsets of 40 drawn from the 200 MCMC samples]

Also, notice that the peak is shifted (as we saw in the second comment) because the 200 MCMC samples give a better representation of the true maximum likelihood. For example, it appears that the "upper" region from the 30 MCMC samples is a low-probability point that is overrepresented in the 30 samples, while its contribution is suppressed in the 200 MCMC samples.

The results shown previously were intended to validate that 200 MCMC samples can give the same distributions as the 30 MCMC + 40 replicates approach. But we can also obtain smaller uncertainties by subsampling the 200 MCMC samples. The figure below demonstrates this, where we subsample 40 MCMC samples at a time with 200 subsamples. Note that this is different from what we did in the previous comment because here we perform the bootstrapping on each of the subsamples. The purpose of this is to effectively reduce the numerical uncertainty while still properly accounting for parameter uncertainty (a rough sketch of this scheme is given at the end of this comment).

[figure: comparison of the subsampling scheme (200 subsamples of 40 MCMC samples, bootstrapped) with the previous results]

These results are very important because they mean that we CAN estimate the overall uncertainty from a single set of 200 MCMC samples. In addition, we are able to reduce the numerical uncertainty without requiring replicates at each MCMC sample.
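For concreteness, here is a rough sketch of the subsample-and-bootstrap scheme described above (200 subsets of 40 drawn from the 200 single-run MCMC samples, bootstrapping within each subset and pooling the results); it reuses `time`, `eta_t_mcmc`, and `bootstrap_viscosity` from the earlier sketches, and the counts follow the numbers quoted in this thread:

```python
# Draw many subsets of 40 from the 200 single-run MCMC samples,
# bootstrap within each subset, and pool the fitted viscosities.
n_subsets, subset_size = 200, 40
rng = np.random.default_rng(0)
pooled = []
for _ in range(n_subsets):
    # Select which epsilon/sigma parameter sets enter this subset.
    subset = rng.choice(eta_t_mcmc.shape[0], size=subset_size, replace=False)
    pooled.append(bootstrap_viscosity(time, eta_t_mcmc[subset], n_boot=50))
pooled = np.concatenate(pooled)
lower, upper = np.percentile(pooled, [2.5, 97.5])  # compare to "combined"
```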