ramess101 / MBAR_GCMC


Derivation #1

Closed (ramess101 closed 5 years ago)

ramess101 commented 6 years ago

[five derivation images attached]

ramess101 commented 6 years ago

mrshirts

Hi, Rich- MBAR and histogram reweighting are actually the same thing; MBAR just treats the densities as delta functions, and so calculates things a different way, but the information content is the same. The issue I'm having with this particular paper is I'm not sure which observables are actually being used to determine the phase coexistence line. When we do solid-solid phase coexistence, we actually calculate the chemical potentials at each T,P, and look to where they are equal. In this case, I'm not actually sure what equation is being solved for the T and P that satisfy it.

ramess101

I know that they look and feel very similar, and that in some specific cases histogram reweighting is the same as the simple Bennett acceptance ratio, but I have not been able to exactly equate the two methods in my head. Perhaps sometime we could talk about this in the Church hallways :innocent: or actually work out a derivation for fun

mrshirts

Sure, and I'd be happy to show the equality in this context, but I'm having a hard time figuring out what is being calculated for phase equilibrium . . .

ramess101

I have a code that performs the analysis to solve for rho_liq^sat and P_vap^sat. I can send that to you. It is not totally clear from this paper; I think there was one other reference that helped me figure it out.

mrshirts

That would be useful (or the paper a bit more useful). I guess the question is how do you determine which point is the saturation point.

image

The one-slide difference between multiple-histogram reweighting and MBAR.

ramess101

Here is the best I can explain the steps:

  1. Perform a GCMC simulation near, but below, the critical point. This samples from both the liquid and vapor phases.

  2. “Solve histogram reweighting” to determine the chemical potential that would match the saturated densities at a temperature 10 K lower.

  3. Repeat this process until you have sampled several (say 6) liquid-phase points, 2 vapor-phase points, and 1 near-critical point.

  4. Reweight these simulations to interpolate the entire coexistence curve.

Now as for how you “solve histogram reweighting”...

mrshirts

"interpolate the entire coexistence curve" Can you turn that into an equation? Not the "interpolate" part. Coexistence curves are implicit functions. What is the implicit function you are solving to find the T,P that satisfy the implicit function?

ramess101

You take those 9 or so simulations and solve Equations 14-15 iteratively until they converge. The different simulations correspond to different mu and T (V is normally the same for each state point). Once you have determined the C_i weights from this iterative process, you can then use Equations 14 and 17 to predict the average density for any mu and T. To determine the mu that corresponds to saturation, you integrate the area under the vapor and liquid peaks, i.e. Equation 18. Finally, the vapor pressure is obtained by determining the relative pressure difference between the given mu and T and a reference, i.e. an ideal gas, using Equation 18 again.
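A minimal sketch of what "solving Equations 14-15 iteratively" can look like in practice (this is not the code from the thread; the function name, array layout, and convergence scheme are illustrative assumptions for the standard Ferrenberg-Swendsen self-consistency):

```python
import numpy as np
from scipy.special import logsumexp

def solve_weights(f, N, E, betas, mus, n_iter=500, tol=1e-10):
    """Self-consistent solution for the per-run log weights C_i.

    f     : (K, B) histogram counts from each of K runs over B (N, E) bins
    N, E  : (B,) particle number and energy of each bin
    betas : (K,) 1/kT of each run
    mus   : (K,) chemical potential of each run
    """
    K, B = f.shape
    n_samples = f.sum(axis=1)          # total samples per run
    C = np.zeros(K)                    # C_i plays the role of ln Xi_i
    for _ in range(n_iter):
        # log of the Eq. 14 denominator for every bin
        log_denom = logsumexp(
            np.log(n_samples)[:, None]
            - betas[:, None] * E[None, :]
            + betas[:, None] * mus[:, None] * N[None, :]
            - C[:, None],
            axis=0,
        )
        # unnormalized log density of states given the current weights
        log_p = np.log(f.sum(axis=0).clip(min=1e-300)) - log_denom
        # Eq. 15: exp(C_i) = sum over bins reweighted to (mu_i, beta_i)
        C_new = logsumexp(
            log_p[None, :]
            - betas[:, None] * E[None, :]
            + betas[:, None] * mus[:, None] * N[None, :],
            axis=1,
        )
        C_new -= C_new[0]              # fix the arbitrary additive constant
        if np.max(np.abs(C_new - C)) < tol:
            break
        C = C_new
    return C
```

Once converged, the same `log_p` array can be reweighted to any (mu, T) of interest, which is the "predict the average density for any mu and T" step.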

mrshirts

area under the vapor and liquid peaks, i.e. Equation 18

Hmm. Equation 18 is a function of mu, V, and beta, so I'm not sure which variable is being integrated over.

ramess101

From Equation 15 you can see that C_i is the integral of the probability distribution in N and E. Well, exp(C_i) actually.

mrshirts

Hmm: OK, we have "conditions for phase coexistence are equality of temperature, chemical potential, and pressure; the first two are satisfied by construction." So is it about identifying what the coexistence pressure is? I.e., the T and mu at which P_liquid(T,mu) = P_vapor(T,mu)?

ramess101

Yes

mrshirts

"integral of the probability distribution in N and E". So it's the integral in N and E. And by "liquid peak" and "vapor peak" mean you separate the histogram values into "liquid" and "vapor" (which for a large enough system will be relatively straightforward, since there will be low N / high U and high N/ low U points, and not much in the middle).

ramess101

T can be essentially any value in the range that you simulated, so people typically report saturated properties every 10 K or so. But you need to solve for the mu that will result in equal pressure. Yes, the reweighted histograms have clearly marked vapor and liquid peaks (both N and E are very different between the two states). Separating the two peaks does become a bit ambiguous near Tc, so people usually only go to about 0.95 Tc.
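The "solve for the mu that gives equal pressure" step can be sketched as a one-dimensional root solve. A hedged illustration (the helper `log_pN` and the dividing value `N_divide` are hypothetical stand-ins for the histogram-reweighting machinery discussed above):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import logsumexp

def saturation_mu(log_pN, N, mu_lo, mu_hi, N_divide):
    """Find the chemical potential at which the areas under the vapor
    (N < N_divide) and liquid (N >= N_divide) peaks of P(N) are equal.

    log_pN(mu) -> unnormalized log P(N) over the particle-number grid N,
    reweighted to chemical potential mu at fixed T.
    """
    vap, liq = N < N_divide, N >= N_divide

    def area_difference(mu):
        lp = log_pN(mu)
        # log of (liquid area / vapor area); zero at coexistence
        return logsumexp(lp[liq]) - logsumexp(lp[vap])

    return brentq(area_difference, mu_lo, mu_hi)
```

For a bimodal P(N) this drives the two integrated peak areas to equality, which is the coexistence condition described in the thread.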

mrshirts

Ah, OK, this is making more sense. Actually, it still ends up being "at which point are the free energies equal?", since the "free energy" in grand canonical is beta*P*V, rather than G.

ramess101

That’s true. I didn’t think about that.

mrshirts

"Find where P_liquid(T,mu) = P_vapor(T,mu) was the missing ingredient there.

ramess101

Yeah. For such a popular and effective method (much lower uncertainty than GEMC), I wish there were a Best Practices tutorial or something :smile:

mrshirts

Teach enough grad thermo and it starts to come naturally. :slightly_smiling_face: One more clarification: explain "chemical potential that would match the saturated densities at a temperature 10 K lower." There are two densities, the liquid and the vapor (and presumably we need both branches).

ramess101

Sorry, that wasn't very clear at all. What I meant to say is that you perform this exact same analysis we have been discussing (i.e. solving for the mu that gives equal pressures at a lower T), but only using the N and E states you sampled with the near-critical simulation, which sampled both liquid and vapor. In other words, after each additional GCMC run you repeat this process to find the best guess for the mu that should give equilibrium at a lower T. You continue performing additional GCMC runs at lower T, using the same V, with the estimated mu from analyzing the previous runs.

mrshirts

Do the simulations fall into either the liquid or vapor branch stochastically? (or am I missing something here . . .)

ramess101

Oh no, sorry: at lower T you intentionally start the simulation with a liquid-phase density and it will never sample the vapor. Usually the 2 or 3 highest T will sample both, and the rest will be either all liquid or all vapor. You typically only need two vapor runs at low T, since they will likely cover/sample the entire range of N relevant to a vapor. So one near-critical run samples both stochastically, the next two lower T might sample both, then you have maybe 4 lower T that are initialized in the liquid phase, and two lower T that are in the vapor phase.

Note: This is not exactly correct. See Potoff's point below that you want to use a mu that will ensure either liquid or vapor, so that you avoid a metastable state.

mrshirts

Ah, OK, that's all fitting together now. I think it will be pretty straightforward to work this out, I just need an hour or two to sort out the exact computations. Thanks for the patience with me on this!

ramess101

That would be great. I am confident that combining HR with MBAR can be done. My biggest concern is the computational cost of recomputing tens of thousands of snapshots to get reliable histograms, but my biggest hope is that the small system sizes (around 100-200 molecules) will result in better extrapolation in eps/sig. No problem, I am still no expert on this, so it is a very good exercise for me to try to explain it. Plus, I appreciate you looking into this when it is really my problem :pray:

mrshirts

If I can get credit on a publication by doing a few hours of work, that's worth it for me :slightly_smiling_face: Teamwork! Or alternatively, figuring out how to do things better is literally what a professor is supposed to be doing. And this problem is close enough to my expertise that it makes sense.

ramess101

Haha, if you figure this out you could be lead author if you want :laughing: But yes, this is certainly a topic that I knew you would learn quickly. I don't want to tell you how many times I had to study this and other articles before it clicked for me :-1::skin-tone-2: *"if you figure" => "when you figure"

mrshirts

Quick question. Seems like this allows you to find the mu where P_liq=P_vap, but doesn't actually give you what that pressure is. Do you carry out a couple more vapor simulations at different volume or mu to connect to a state where ideal gas is good approximation, and you can assume PV = NRT?

ramess101

Yes, you need to have an ideal gas state point so that you can determine the absolute values for the pressure using Equation 18. I forget what the common practice is for deciding the rho-T for the ideal gas state point though. I think it needs to be along the saturation curve, so they probably choose the lowest Tsat vapor simulation as ideal (Zvap is usually very close to 1 around 0.6 Tc). But you certainly do need to perform at least two vapor GCMC runs at low T (0.6-0.8 Tc). And you need an ideal gas reference for determining the absolute pressure.
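The ideal-gas anchoring described here amounts to a few lines once the relative ln Xi between state points is available from reweighting (Equation 18). A sketch, with illustrative names; the key identities are ln Xi = beta*P*V in grand canonical and beta*P*V = &lt;N&gt; for an ideal gas:

```python
def saturation_pressure(dlnXi, N_avg_ref, beta, V):
    """Absolute saturation pressure from a low-T vapor reference run
    where PV = N kT is a good approximation.

    dlnXi     : ln Xi(state) - ln Xi(reference), from reweighting
    N_avg_ref : average particle number in the reference vapor run
    beta, V   : inverse temperature and box volume of the target state
    """
    # For the ideal-gas reference: ln Xi_ref = beta_ref * P_ref * V = <N>_ref
    lnXi = N_avg_ref + dlnXi
    # ln Xi = beta * P * V at the target state
    return lnXi / (beta * V)
```

With dlnXi = 0 and the same beta, this simply recovers P = &lt;N&gt; kT / V for the reference vapor itself.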

This plot helps visualize all the state points that you need to simulate. You can see the near-critical run (510 K) covers both phases, the 480 K run also samples from both phases, but the next 5 liquid runs (high N) and two vapor runs (low N) only sample one phase. Again, the simulations are performed sequentially starting near-critical and moving to lower T by estimating the best mu (mu_sat) to simulate at the next T.

image

(Sorry, that plot was not really related to your previous comments. I just did not want to forget to post it.)

This publication does a better job explaining how to determine the pressures.

image

ramess101 commented 6 years ago

mrshirts

OK, here’s my MBAR-ified version:

  1. Carry out a simulation relatively near, but below the critical point.

  2. Divide the samples into liquid and vapor phases. Need to identify a dividing surface in N and/or E. Qualitatively, one could just histogram and draw a line. If the simulation is sufficiently below the critical point, the exact N/E cutoff should not affect the results much. There are likely some mathematical ways to estimate the dividing line more precisely. Getting the uncertainty of the process right needs to involve the choice of N. Because ln Xi = beta P V, and all samples have the same volume, ln(# liquid / # vapor) = beta (P_liquid - P_vapor) V at the given mu, T, so we can estimate how far off we are from the coexistence point.

  3. We then need to estimate, at T - 10 K, the mu that gives ln(# liquid / # vapor) = 0 (equivalently, # liquid / # vapor = 1).

a. The probability of each sample in the grand canonical ensemble is given by exp(-beta U + beta mu N)/Xi. We perform importance sampling to estimate Delta P at a new T and mu.

i. We first solve for Xi at the new T and mu. This can be simply done by knowing that the probabilities at the new state must be normalized, i.e.

1 = sum_n exp(-Delta ln Xi - (beta_2 - beta_1) U_n + (beta_2 mu_2 - beta_1 mu_1) N_n),

which can be solved to give:

exp(Delta ln Xi) = sum_n exp(-(beta_2 - beta_1) U_n + (beta_2 mu_2 - beta_1 mu_1) N_n).

This is just Zwanzig (i.e. 1-state MBAR) in grand canonical instead of Helmholtz.

ii. We then solve the nonlinear equation for mu such that sum_{i in liquid} w_i(T-10, mu) = sum_{i in vapor} w_i(T-10, mu).

iii. These two steps can be done by creating a function that assembles the u_kn matrix at a given T and mu, calculating the weights, dividing them into liquid and vapor (presumably using the same dividing line as above?), subtracting the two sums over liquid and vapor, and plugging that into any Python root solver for mu that doesn't require derivatives (one could use derivatives as well, but that is almost certainly overkill and would require more work).

iv. You can estimate whether you have stepped too far in T by looking at n_eff.

b. Given the new mu, we start simulations in the high-N and low-N volumes (which can be determined by averaging over the liquid and vapor samples).

c. Repeat the process, except now we have more simulations that we can use to find the new Xi at T-20, and we solve the nonlinear equation such that sum_{i in liquid} w_i(T-20, mu) = sum_{i in vapor} w_i(T-20, mu), with the weights determined from MBAR.

d. Repeat, moving down as far as one is interested.

Notes:

  1. A vapor simulation low enough to assume PV=NRT must be included to connect the pressure difference to the absolute pressure. In theory, some additional ones could be done away from the coexistence curve if necessary and linked with MBAR.

  2. Assuming a decent definition of the liquid/vapor interface is made at first, one can go back with a different dividing line between vapor and liquid, and recalculate the coexistence curve.

  3. One can go back and of course recalculate all of the intermediate point T,mu points filling in, solving the nonlinear equation for MBAR.
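The core of steps i-iii above might look something like the following single-reference sketch (full MBAR over all runs would replace the one-state Zwanzig weights; the function name, arguments, and the dividing line N_divide are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import logsumexp

def coexistence_mu(U, N, beta0, mu0, beta_new, mu_lo, mu_hi, N_divide):
    """Reweight GCMC samples drawn at (beta0, mu0) to beta_new, then solve
    for the mu at which the summed weights of the liquid (N >= N_divide)
    and vapor (N < N_divide) samples are equal.

    U, N : per-sample potential energies and particle numbers
    """
    def weight_imbalance(mu_new):
        # Zwanzig / 1-state MBAR log weights in grand canonical:
        # log w_n ∝ -(beta_new - beta0) U_n + (beta_new mu_new - beta0 mu0) N_n
        log_w = -(beta_new - beta0) * U + (beta_new * mu_new - beta0 * mu0) * N
        # normalize; the log of the normalizer is Delta ln Xi (step i),
        # up to an additive constant
        log_w -= logsumexp(log_w)
        liq = logsumexp(log_w[N >= N_divide])
        vap = logsumexp(log_w[N < N_divide])
        return liq - vap               # zero when P_liq = P_vap

    # step iii: any derivative-free root solver works here
    return brentq(weight_imbalance, mu_lo, mu_hi)
```

Step iv (checking n_eff) would be a separate diagnostic on the normalized weights, e.g. n_eff = 1 / sum(w**2).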

I think there can potentially be significant savings since we are looking only at molecules that have approximately the same coexistence curve. Once we have configurations that are representative covering all of the force fields, we shouldn't need that many more. Assuming the liquids don't have structure that is too different, I would imagine that after doing one coexistence curve, one can perform additional simulations at each T at neighboring mu-dmu, mu+dmu, mu-2dmu, mu+2dmu, etc. Then one should be able to do most of the reweighting just from this set of simulations. One would simply solve the implicit pressure equation (the sums of weights of liquid and vapor are equal; find the mu at which they agree for each T). If the structure of the liquid gets too different as a function of parameters, then additional simulations of the different force fields might be necessary.

I might be overly optimistic, of course. If the structure is too different, then one would need to have a parallel set of the T,mu simulations for differently structured force fields.

How does this sound? Happy to talk some more about the details. We’ve managed to get a stomach virus as well as a bad cold here, so maybe not in person today. When I manage to get up and get kids cleaned, I’ll look at the outlines, but might not be until tonight.

ramess101 commented 6 years ago

ramess101

OK, so the methodology sounds exactly the same as histogram reweighting, we would just be using the MBAR approach instead of constructing histograms, right? In other words, we would be performing the same simulation sequence, estimating mu for the next T, solving where P is equal, etc., but using MBAR instead of a histogram approach, right? So would we be using MBAR in place of Equation 14 to solve for the p(N,E;mu,beta) histogram (i.e. we don't need to store the 2-d histogram f_i(N,E) anymore) or do we not even need the p(N,E;mu,beta) histogram anymore? I think you are saying that we don't need the p(N,E;mu,beta) histogram and just use the MBAR weights to solve for equilibrium by equating \sum_i,liquid w_i(T-10,mu) = \sum i,vapor w_i(T-10,mu), right?

The multiple reference force fields approach is still a bit unclear to me. What I had originally thought is that we would use MBAR to predict the histograms of N and E (f_i(N,E)) for a different eps/sig. In other words, I was thinking that we would use MBAR to predict f_i(N,E) by reweighting all the snapshots and then use the standard histogram reweighting approach (Equation 14) to solve for p(N,E;mu,beta). However, if I understand you correctly, you want to get rid of histograms altogether, i.e. not even solve for p(N,E;mu,beta), right? Or are you saying that we use MBAR to predict p(N,E;mu,beta) directly for a given eps/sig and then follow the standard method of solving for VLE by equating pressures?

In summary, my main question for both cases is: by using MBAR are we avoiding histograms altogether (both f_i(N,E) and p(N,E;mu,beta)), or do we still need to solve for p(N,E;mu,beta) but by using MBAR we do not need to store the 2-D histogram from each run (f_i(N,E)) anymore?

A quick thought: The more I compare the two methods, MBAR and histogram reweighting, the more I am convinced they are the same. Before I thought there was an extra term in Equation 14, the f_i(N,E) term, but I think that is because Equation 14 is for predicting an observable (the histogram). So the "MBAR weighting" (W_ni) is the RHS of Equation 14 when you exclude the f_i(N,E) term which is what we usually call O(x). Does that sound right?

For what Potoff wants to do, we would not be changing the force field dramatically, essentially just refining the existing Mie parameters. So I think the structure would stay similar and this should not require too many reference force fields.

ramess101 commented 6 years ago

mrshirts

OK, so the methodology sounds exactly the same as histogram reweighting, we would just be using the MBAR approach instead of constructing histograms, right?

Right. It’s the same statistical thermodynamics idea, just using MBAR to calculate the quantities of interest. It will have a slightly higher resolution, but maybe not too much.

So would we be using MBAR in place of Equation 14 to solve for the p(N,E;mu,beta) histogram (i.e. we don't need to store the 2-d histogram f_i(N,E) anymore) or do we not even need the p(N,E;mu,beta) histogram anymore? I think you are saying that we don't need the p(N,E;mu,beta) histogram and just use the MBAR weights to solve for equilibrium by equating \sum_i,liquid w_i(T-10,mu) = \sum i,vapor w_i(T-10,mu), right?

Well, the downside is you need to store the configurations. That’s the difference with MBAR and histogram reweighting techniques: with the histogram reweighting, you bin a bunch of information together, with some information loss (But some advantages; no saving the configurations).

MBAR weights to solve for equilibrium by equating \sum_i,liquid w_i(T-10,mu) = \sum i,vapor w_i(T-10,mu), right?

Yes: we have eliminated the histograms in favor of a bunch of delta functions. Equating the sums of the weights is equivalent to ln P_liquid = ln P_vapor in grand canonical.

In summary, my main question for both cases is: by using MBAR are we avoiding histograms altogether (both f_i(N,E) and p(N,E;mu,beta)), or do we still need to solve for p(N,E;mu,beta) but by using MBAR we do not need to store the 2-D histogram from each run (f_i(N,E)) anymore?

We replace histograms with delta functions (i.e. weights). The rest of the theory is the same. We ditch the histogram, but need the configurations; we are actually keeping more information, but we lose less.

The more I compare the two methods, MBAR and histogram reweighting, the more I am convinced they are the same

Well, that's what I've been saying. :) See the figure I posted last night in the chat. A set of weights associated with points is literally a histogram with zero-width bins. Some disadvantages, but many advantages.

A quick thought: The more I compare the two methods, MBAR and histogram reweighting, the more I am convinced they are the same. Before I thought there was an extra term in Equation 14, the f_i(N,E) term, but I think that is because Equation 14 is for predicting an observable (the histogram). So the "MBAR weighting" (W_ni) is the RHS of Equation 14 when you exclude the f_i(N,E) term which is what we usually call O(x). Does that sound right?

Sort of. Eq 14 is MBAR but integrated over each N,E histogram bin. Imagine each bin is infinitely narrow in both directions. Then f_i(N,E) is just 1 in all occupied bins, and zero in all the others. Then you have MBAR.

For what Potoff wants to do, we would not be changing the force field dramatically, essentially just refining the existing Mie parameters. So I think the structure would stay similar and this should not require too many reference force fields.

Right, if you aren’t trying to jump from lambda=14 to lambda=16, maybe you just need a couple of mu’s. OR maybe you will need a couple of reference force fields for different sigmas. Though it’s possible you may only need different reference forcefields at lower T, depending on overlap.

One important thing: I don't think you CAN reweight the N,E histograms to different force fields. The reason is that two samples in the same E bin might belong to different E bins after applying the new force field. An example: if E = 0, it might be because two particles are very far apart, or because they sit exactly at r = sigma, where the LJ potential crosses zero. When the force field changes, they don't change to the same E. MBAR, on the other hand (i.e. zero-width histograms), doesn't have that problem; every energy gets regenerated.
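A tiny numerical illustration of this degeneracy (hypothetical pair separations, reduced units): two separations with identical LJ energy map to different Mie 16-6 energies, so an E-histogram bin built with one force field cannot simply be relabeled for the other.

```python
import numpy as np
from scipy.optimize import brentq

def lj(r, eps=1.0, sig=1.0):
    """Lennard-Jones (Mie 12-6) pair energy."""
    x6 = (sig / r) ** 6
    return 4 * eps * (x6 * x6 - x6)

def mie(r, n=16.0, m=6.0, eps=1.0, sig=1.0):
    """Mie n-m pair energy."""
    C = (n / (n - m)) * (n / m) ** (m / (n - m))
    return C * eps * ((sig / r) ** n - (sig / r) ** m)

# Two separations with identical LJ energy (-0.5 eps), one on each side
# of the potential minimum at 2^(1/6) sigma:
rmin = 2 ** (1 / 6)
r1 = brentq(lambda r: lj(r) + 0.5, 1.0, rmin)
r2 = brentq(lambda r: lj(r) + 0.5, rmin, 3.0)
assert abs(lj(r1) - lj(r2)) < 1e-10   # same bin in an LJ E-histogram

# After switching to Mie 16-6 they no longer share an energy:
print(mie(r1), mie(r2))               # ~ -0.57 vs ~ -0.40
```

Per-sample reweighting (MBAR) sidesteps this because each configuration's energy is recomputed individually rather than read off a shared bin label.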

ramess101 commented 6 years ago

ramess101

OK, that helped a lot. Thanks

Well, the downside is you need to store the configurations.

Fortunately, since we are interested in Mie potentials, we can just store the basis functions for the energy. So storing large amounts of configurations should not be a problem.
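A sketch of the basis-function idea for a Mie n-6 potential (illustrative names; the point is that only two pair sums per snapshot need to be stored, and the total energy for any (eps, sig) at fixed n is then a trivial combination):

```python
import numpy as np

def mie_prefactor(n, m=6.0):
    """C(n, m) = (n/(n-m)) * (n/m)^(m/(n-m))."""
    return (n / (n - m)) * (n / m) ** (m / (n - m))

def basis_functions(r_pairs, n, m=6.0):
    """Store only two pair sums per snapshot: S_n = sum r^-n, S_m = sum r^-m.
    r_pairs holds all pair distances in one configuration."""
    return np.sum(r_pairs ** -n), np.sum(r_pairs ** -m)

def mie_energy(basis, eps, sig, n, m=6.0):
    """Total Mie n-m energy for any (eps, sig) from the stored basis sums:
    U = C(n, m) * eps * (sig^n * S_n - sig^m * S_m)."""
    S_n, S_m = basis
    return mie_prefactor(n, m) * eps * (sig ** n * S_n - sig ** m * S_m)

# Example: a 3-particle configuration, Mie 16-6
r_pairs = np.array([1.1, 1.5, 2.0])
basis = basis_functions(r_pairs, n=16.0)
# Energies at two different (eps, sig) without revisiting the configuration:
u_a = mie_energy(basis, eps=1.0, sig=1.0, n=16.0)
u_b = mie_energy(basis, eps=1.2, sig=1.05, n=16.0)
```

Recomputing U for a new eps/sig then costs two multiplications per snapshot instead of a full pairwise energy evaluation (a new repulsive exponent n would still require fresh S_n sums).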

Well, that’s what I’ve been saying. :) See the figure I posted last night in the chat.

Ha, very true. I think I am officially convinced :-)

Sort of. Eq 14 is MBAR but integrated over each N,E histogram. Imagine each bin is infinitely narrow in both directions. Then f_i(N,E) is just 1 in all occupied histograms, and zero in all the others. Then you have MBAR.

Right, that makes sense.

Right, if you aren’t trying to jump from lambda=14 to lambda=16, maybe you just need a couple of mu’s. OR maybe you will need a couple of reference force fields for different sigmas. Though it’s possible you may only need different reference forcefields at lower T, depending on overlap.

That is a good point. You could probably just include a few additional reference force fields at the lower T where overlap might be worse. This would be very interesting to investigate.

One important thing: I don’t think you CAN reweight the N,E histograms to different force fields. The reason is that two samples in the same E bin might belong to different E bins after applying the force field. An example: if E = 0, it might be because two particles are very far apart, or because they are at sigma*(1/2)^-6 (with a LJ force field). When the force field changes, they don’t change to the same E. MBAR, on the other hand (i.e. zero width histograms) doesn't have that problem; every energy gets regenerated.

Yes, I thought of that too. The approach that I was considering was a little bit different than this, in an attempt to avoid this very problem. Basically, I was thinking that you would recompute the energies for all the configurations, then use MBAR to determine the weights for each configuration, and then you would re-bin the histograms but using the weights to count the number of configurations in each bin. In other words, two configurations that are originally in the same E bin could (and would likely) get placed in different E bins after the reweighting. And they would no longer contribute 1 to their respective bins, but would rather contribute the weight to that bin. Does that make sense? It is probably a moot point, but it would be good to make sure that my MBAR understanding is sound.
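For what it's worth, the weighted re-binning described here is essentially a one-liner with numpy (synthetic data below; in the real workflow the energies would be recomputed under the new force field and the weights would come from MBAR):

```python
import numpy as np

# Hypothetical per-snapshot data: recomputed energies E, particle
# numbers N, and MBAR weights w for the target state.
rng = np.random.default_rng(0)
E = rng.normal(-500.0, 25.0, size=1000)
N = rng.integers(80, 120, size=1000)
w = rng.random(1000)
w /= w.sum()

# Re-bin with each configuration contributing its weight, not a count of 1:
hist, N_edges, E_edges = np.histogram2d(N, E, bins=[20, 25], weights=w)
assert np.isclose(hist.sum(), 1.0)   # the weights are normalized
```

This reproduces the "each configuration contributes its weight to its new bin" picture, though as noted in the reply below, working with the weights directly avoids the binning step altogether.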

Also, here are two additional (and somewhat related) advantages to using GCMC instead of ITIC when coupled with MBAR:

  1. ITIC requires accurate pressures which require a higher number of effective samples (to get a good average) than energy. By contrast, GCMC does not require estimates for pressure.
  2. ITIC requires larger systems to get good results, while GCMC uses much smaller systems. This should help with overlap (the Neff problem) for large changes in force field parameters. Also, even if you are not using basis functions (e.g. Exp-6), the smaller systems will be faster to recompute, although you do need to store about 20 times as many configurations for GCMC as for ITIC.
ramess101 commented 6 years ago

mrshirts

Yes, I thought of that too. The approach that I was considering was a little bit different than this, in an attempt to avoid this very problem. Basically, I was thinking that you would recompute the energies for all the configurations, then use MBAR to determine the weights for each configuration, and then you would re-bin the histograms but using the weights to count the number of configurations in each bin. In other words, two configurations that are originally in the same E bin could (and would likely) get placed in different E bins after the reweighting. And they would no longer contribute 1 to their respective bins, but would rather contribute the weight to that bin. Does that make sense? It is probably a moot point, but it would be good to make sure that my MBAR understanding is sound.

I think so, but it seems more confusing than necessary . . .

Also, here are two additional (and somewhat related) advantages to using GCMC instead of ITIC when coupled with MBAR:

They both seem reasonable. One question I do have: is there a system-size dependence on the coexistence line? I would imagine that liquids would have slightly larger correlation lengths than vapor, and thus likely the free energy of the liquid with respect to the free energy of the vapor would go down a bit with increased size, shifting the line to higher T / lower P. Thoughts on this? Any data on it?

ramess101 commented 6 years ago

I think so, but it seems more confusing than necessary

Yes, this was just my naive way of applying MBAR to the histogram reweighting approach, before you explained how MBAR could replace the histograms completely.

One question I do have: is there a system-size dependence on the coexistence line? I would imagine that liquids would have slightly larger correlation lengths than vapor, and thus likely the free energy of the liquid with respect to the free energy of the vapor would go down a bit with increased size, shifting the line to higher T / lower P. Thoughts on this? Any data on it?

Yes, this is a very well studied issue. Essentially, the literature demonstrates that below about 0.95 Tc you do not need to worry about finite-size effects. However, if you are trying to estimate the critical point you should correct for finite-size effects by using different system sizes and then extrapolating the results to the infinite-size limit. But for our purposes (VLE) we do not need to correct for finite-size effects.

So I currently have a working version of the histogram reweighting approach for VLE. We can use this to benchmark the MBAR results for a single force field, i.e. no change in the non-bonded interactions.

ramess101 commented 6 years ago

ramess101 to GOMC-WSU:

In my meeting with Dr. Potoff today, we briefly discussed using MBAR with GCMC. We need to validate that MBAR is giving the same results as histogram reweighting for the case where epsilon and sigma do not change. Unfortunately, our histogram reweighting code appears to have a bug in it. Specifically, the VLE properties we obtain have systematic deviations from the values reported in the literature.

For example, we simulated hexane using the Mie 16-6 potential with GOMC.

These are the histograms we obtained for the different chemical potentials and temperatures:

image

As you can see here, our densities do not agree perfectly with the values reported by Potoff et al. (especially at higher T):

image

The strange thing is that our higher-temperature results get worse as we include histograms from the lower-temperature simulations. For example, this is what we get at the highest temperature if we only include the histograms at the "bridge":

image

This suggests that we are not weighting the histograms properly.

Dr. Potoff thought that we could do a couple of things to resolve this issue:

  1. You send me your histogram reweighting code
  2. I send you my raw simulation output files for hexane and you run your code to see what results you get
  3. You send me your raw simulation output for some system (he thought you should have the data for the noble gases readily available) and I use my code to see if I get the same results
  4. I send you my histogram reweighting code

I think we should try options 1, 2, and 3 before we consider 4.

If you could send me your histogram reweighting code and/or simulation output files that would be great.

Here are my hexane GOMC output files for the number of particles and energies (first row has Temperature, Chemical potential, and box dimensions)

hexane_GCMC_output.zip

Thanks

ramess101 commented 6 years ago

jpotoff

Since we have your histogram data, we will start by pushing it through our analysis code and see what we get

ramess101 commented 6 years ago

Ok, I'm pretty sure your code is working, and the issue is the chemical potentials used to generate the histograms. I ran your data through our code and got results that are very similar to yours.

image

Many of your low temperature runs in the liquid phase should be sampling both the liquid and vapor phases, since you are running at close to the coexistence chemical potential.

image

However, because of the large free energy barrier, the simulation is never going to tunnel between phases. What you have is a metastable simulation, which throws the reweighting off. I'm surprised that your results are as good as they are. Normally one will get weird unphysical behavior in the coexistence curve in these situations.

To account for this, we run slightly off the coexistence curve so the simulation is sampling only the liquid or vapor phase. If you try your analysis code on the attached group of histograms you should get the same answer.

image

histograms.zip

ramess101 commented 6 years ago

ramess101

OK, great, this is very reassuring. I am glad that it is our state points that are the problem and not that we have a bug in our code. Changing the chemical potential in a script is a lot faster than debugging code :+1:

It appears that shifting the chemical potentials at lower temperatures is a subtle but very important step. Running the histograms you provided me with through our histogram reweighting code gives excellent agreement:

image

Thank you for helping resolve this issue.