Fundamental questions on supercells/configurations selection, and physically robust ECI fitting

sjtuzhanglei commented 2 years ago

Dear CASM developers,

First, thank you for answering my previous questions. I hope you had a peaceful and enjoyable holiday.

As I go deeper into the package and the CE method, I am confused about a few convergence issues:

In theory, the larger the supercells and the more configurations in the training set, the better. How do I determine the supercell size cutoff? Is there a general rule of thumb, or a theoretical reference you can point me to?
I also noticed that configurations in the training set can be linearly dependent. How can I check for this and make sure the configurations are as diverse as possible (for a robust ECI fit)?
What is a general rule of thumb for choosing the basis set? I start from purely pairwise interactions and use the electrostatic energy of two point charges in a dielectric medium as guidance. What is your opinion? When do triplets and quadruplets become necessary?
In CASM, an ECI value is an average over symmetrically equivalent clusters (it uses crystal symmetry information but is agnostic of chemical occupancy). How can I distinguish the ECIs of different chemical occupancies, since different occupancies should have different ECI strengths and signs? I think this would be more physically meaningful, e.g., positive ECIs mean repulsion and negative ones mean attraction.

Best regards,

bpuchala commented 2 years ago

In theory, the larger the supercells and the more configurations in the training set, the better. How do I determine the supercell size cutoff? Is there a general rule of thumb, or a theoretical reference you can point me to?

As a rule of thumb, about 10-30 calculations per non-zero coefficient are often needed for a decent fit with prediction errors on the order of 1 meV/atom. In the end, the predicted results, such as hull ground states, equilibrium phases, and phase boundaries, are the test of model fitness.
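As a rough illustration of the arithmetic behind this rule of thumb, the sketch below turns a guessed number of non-zero coefficients into a range of training-set sizes. The function name and its parameters are hypothetical and not part of CASM; the 10-30 range is taken directly from the comment above.

```python
def training_set_size_range(n_nonzero_eci, per_coeff_low=10, per_coeff_high=30):
    """Return (low, high) numbers of calculations suggested by the
    10-30 calculations-per-coefficient rule of thumb (hypothetical helper)."""
    if n_nonzero_eci < 0:
        raise ValueError("number of coefficients must be non-negative")
    return n_nonzero_eci * per_coeff_low, n_nonzero_eci * per_coeff_high


# Example: a fit expected to keep about 15 non-zero ECI.
low, high = training_set_size_range(15)
print(f"Roughly {low} to {high} calculations for 15 non-zero coefficients")
```

This only sizes the training set; whether the fit is good enough is still judged by the predicted ground states and phase boundaries.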

I also noticed that configurations in the training set can be linearly dependent. How can I check for this and make sure the configurations are as diverse as possible (for a robust ECI fit)?

This will arise with basis function truncation. Modifying your bspecs.json to include more cluster functions would allow for a finer fit.
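One generic way to check for such dependencies is to inspect the rank and condition number of the correlation matrix (rows are configurations, columns are basis functions). The sketch below uses NumPy only; the function name, the tolerance, and the small example matrix are assumptions for illustration, and exporting the correlation matrix from CASM is not shown.

```python
import numpy as np


def diagnose_correlation_matrix(corr, tol=1e-10):
    """Report rank, condition number, and duplicate rows of a
    (n_configurations x n_functions) correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    singular_values = np.linalg.svd(corr, compute_uv=False)
    # Count singular values that are non-negligible relative to the largest.
    rank = int(np.sum(singular_values > tol * singular_values[0]))
    smallest = singular_values[-1]
    condition = float(singular_values[0] / smallest) if smallest > 0 else float("inf")
    # Configurations with identical correlations add no new information.
    n_unique_rows = np.unique(np.round(corr, 8), axis=0).shape[0]
    return {
        "n_configurations": corr.shape[0],
        "n_functions": corr.shape[1],
        "rank": rank,
        "full_column_rank": rank == corr.shape[1],
        "condition_number": condition,
        "n_unique_rows": n_unique_rows,
    }


# Made-up example: the last configuration repeats the first one.
example = [
    [1.0, 1.0, 1.0],
    [1.0, -1.0, 1.0],
    [1.0, 0.0, -1.0],
    [1.0, 1.0, 1.0],
]
report = diagnose_correlation_matrix(example)
print(report)
```

A rank below the number of functions, or a very large condition number, suggests that the training set cannot distinguish some of the candidate functions from each other.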

What is a general rule of thumb for choosing the basis set? I start from purely pairwise interactions and use the electrostatic energy of two point charges in a dielectric medium as guidance. What is your opinion? When do triplets and quadruplets become necessary?

Generally we include triplets and some quadruplets in the potential basis set and let ourselves be guided by the data. If I'm curious about things like this, I tend to use recursive feature elimination (RFE) to select varying numbers of functions and plot the RMS error to get a basic sense of how they affect the fit.
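The idea of scanning the number of selected functions can be sketched as follows. This is a minimal NumPy version of recursive feature elimination with a plain least-squares fit, run on synthetic data; it is not the CASM fitting workflow, and the function name, the random correlations, and the "true" ECI values are all made up for illustration.

```python
import numpy as np


def rfe_rms_curve(corr, energies):
    """Recursive feature elimination with ordinary least squares.

    Repeatedly fits all active functions, records the training RMS error,
    then drops the function with the smallest absolute coefficient.
    Returns a list of (n_functions, rms, kept_column_indices).
    """
    corr = np.asarray(corr, dtype=float)
    energies = np.asarray(energies, dtype=float)
    active = list(range(corr.shape[1]))
    curve = []
    while active:
        coef, *_ = np.linalg.lstsq(corr[:, active], energies, rcond=None)
        residual = energies - corr[:, active] @ coef
        rms = float(np.sqrt(np.mean(residual ** 2)))
        curve.append((len(active), rms, tuple(active)))
        weakest = int(np.argmin(np.abs(coef)))
        del active[weakest]
    return curve


# Synthetic data: 60 configurations, 8 candidate functions, 3 of which matter.
rng = np.random.default_rng(0)
corr = rng.uniform(-1.0, 1.0, size=(60, 8))
true_eci = np.array([0.5, -0.3, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0])
energies = corr @ true_eci + rng.normal(scale=1e-3, size=60)

curve = rfe_rms_curve(corr, energies)
for n_functions, rms, kept in curve:
    print(f"{n_functions} functions: RMS = {rms:.4f}, kept = {kept}")
```

Plotting the RMS against the number of functions typically shows a sharp rise once functions that matter start being removed. Note that the training RMS can only go down as functions are added, so it should be read together with a cross-validation score.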

In CASM, an ECI value is an average over symmetrically equivalent clusters (it uses crystal symmetry information but is agnostic of chemical occupancy). How can I distinguish the ECIs of different chemical occupancies, since different occupancies should have different ECI strengths and signs? I think this would be more physically meaningful, e.g., positive ECIs mean repulsion and negative ones mean attraction.

One can usually rationalize the coefficients obtained from a fit, but this quickly gets confusing because multi-body interactions are summed. I would generally avoid too much focus on "understanding" individual interactions at the expense of the overall fit. That said, this type of understanding can be gained by looking at a system with point defects relative to the ground state orderings, along with the casm bset --functions output, to see what the individual functions are.

sjtuzhanglei commented 2 years ago

Thanks again!

I have seen CE papers where people do two things differently. Could you comment on their pros and cons?

Chemically distinct pair correlations vs. site wise pair correlations (no chemical occupancy distinctions, just an average of possible occupancies)
Fixed supercell enumerations (generally a large supercell with a given lattice orientation) vs. enumeration from the smallest ones to bigger ones, with various possible lattice vector transformations (this seems a convention also in ATAT, there must be some advantage of using different shapes of supercells, right? like better sparsity of configurational ordering maybe?)

Thanks,

bpuchala commented 2 years ago

Chemically distinct pair correlations vs. site wise pair correlations (no chemical occupancy distinctions, just an average of possible occupancies)

Can you clarify what you mean by these?

Fixed supercell enumerations (generally a large supercell with a given lattice orientation) vs. enumeration from the smallest ones to bigger ones, with various possible lattice vector transformations (this seems a convention also in ATAT, there must be some advantage of using different shapes of supercells, right? like better sparsity of configurational ordering maybe?)

Using multiple supercells allows calculating all possible configurations using the smallest possible DFT calculation. Fixing the supercell might be easier to implement, but will only give you a subset of configurations and not the most primitive version of each configuration.
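To make "multiple supercells" concrete: the distinct supercell lattices of a given volume can be listed as integer transformation matrices in Hermite normal form (HNF). The sketch below enumerates them in pure Python for small volumes. It is a textbook construction rather than CASM's implementation, the function name is hypothetical, and it does not remove supercells that are equivalent under the crystal symmetry, which a real enumeration would also do.

```python
def hermite_normal_forms(volume):
    """Return all 3x3 lower-triangular HNF integer matrices with determinant
    `volume`. Each one defines a distinct supercell lattice of that volume
    (before any reduction by crystal symmetry)."""
    matrices = []
    for a in range(1, volume + 1):
        if volume % a:
            continue
        for c in range(1, volume // a + 1):
            if (volume // a) % c:
                continue
            f = volume // (a * c)
            # Off-diagonal entries are bounded by the diagonal entry of their row.
            for b in range(c):
                for d in range(f):
                    for e in range(f):
                        matrices.append(((a, 0, 0), (b, c, 0), (d, e, f)))
    return matrices


# Number of distinct supercell lattices for volumes 1 to 4: 1, 7, 13, 35.
for volume in range(1, 5):
    print(volume, len(hermite_normal_forms(volume)))
```

Even at volume 4 there are many differently shaped cells, which is why enumerating over shapes reaches orderings that a single fixed supercell of the same size cannot hold.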

sjtuzhanglei commented 2 years ago

Absolutely!

Chemically distinct pair correlations vs. site wise pair correlations (no chemical occupancy distinctions, just an average of possible occupancies)

In a conventional CE, the ECIs correspond only to the symmetry of the cluster, not to the chemical information (occupation state). Each is therefore an average over all possible occupation states for a cluster of a given symmetry. Does that make sense?

Check these papers, where they implement user-defined ECIs, differentiating interactions between specified chemical species and picking only the ones they think are "physically meaningful", e.g. significant electrostatic interactions.

http://aluru.web.engr.illinois.edu/Journals/CM19.pdf https://pubs.rsc.org/en/content/articlelanding/2017/cp/c7cp04106c

sjtuzhanglei commented 2 years ago

Using multiple supercells allows calculating all possible configurations using the smallest possible DFT calculation. Fixing the supercell might be easier to implement, but will only give you a subset of configurations and not the most primitive version of each configuration.

I am wondering: for long-range interactions (e.g. electrostatic, 1/r^2; elastic, 1/r), do we need supercells or a cluster basis set as large as possible to capture the significant interactions? It is useless to have clusters larger than the training set supercells. So we need supercells and configurations large enough that the electrostatic interaction between the furthest point charges in the lattice is negligible, right?

bpuchala commented 2 years ago

In a conventional CE, the ECIs correspond only to the symmetry of the cluster, not to the chemical information (occupation state). Each is therefore an average over all possible occupation states for a cluster of a given symmetry. Does that make sense?

It's not clear to me what you mean here, so I'm going to avoid commenting on general statements like this. If you have specific equations or terms you want to consider, I might be able to comment on them.

Check these papers, where they implement user-defined ECIs, differentiating interactions between specified chemical species and picking only the ones they think are "physically meaningful", e.g. significant electrostatic interactions.

http://aluru.web.engr.illinois.edu/Journals/CM19.pdf https://pubs.rsc.org/en/content/articlelanding/2017/cp/c7cp04106c

Both of these papers consider multi-component systems in which the total composition is restricted to a single degree of freedom (i.e. x in "Sr(Ti1−xFex)O3−x/2" and "RExCe1−xO2−x/2"). In these problems, each site has only binary alloying, so the cluster expansion looks like a binary cluster expansion, and each basis function and its associated ECI can be identified with a particular cluster occupation.

I would avoid thinking of the selection and fitting of coefficients in the manner of the second paper. The discussion of choosing coefficients in section 3.3 seems likely to cause misunderstanding of the cluster expansion. They sort of wave away the difference between the pair, triplet, quadruplet, etc. polynomial terms of the cluster expansion by relabeling them as pair terms with different types "Between" the pair cluster sites. This may work, but it seems confusing to me and hard to check.

Second, I think it is usually a mistake to think too much about the specific physics of a problem and how that should affect which cluster basis functions are non-zero. This is not to say that the physics does not affect the final model, but that it is probably less useful for obtaining a good model than simply following the data. The cluster expansion provides a complete basis, but does not guarantee fast convergence. It's not possible to "contain" elastic interactions by having large enough pair terms, but many problems can still be fit well enough without a reciprocal space formulation or composition-dependent basis functions. You have to check the model's predictive ability in the configuration space where you will apply it. If you are having trouble fitting the entire space of interest at first, try starting with a more limited space (for instance, either a limited composition range or limited perturbations from some ordered state) and then try to extend it to see what is possible.

sjtuzhanglei commented 2 years ago

Thanks so much for the detailed explanation and comment on the second paper. It helps a lot!

sjtuzhanglei commented 2 years ago

What would be a good way to estimate the supercell size needed for training?

For bspecs, how should I choose the cluster size cutoff for pairs, triplets, etc.? Should the cluster size be smaller than the largest (x, y, z) dimension of the largest supercell? How do I avoid overfitting?

bpuchala commented 2 years ago

I'll just refer back to my previous comment that it may take some learning from the data and iteration, guided by CV scores to avoid overfitting.
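As a minimal sketch of what "guided by CV scores" can look like, the example below computes a k-fold cross-validation RMS error for a least-squares fit and compares a small basis with a much larger one on synthetic data. It uses NumPy only and is not the CASM fitting code; the function name, the fold count, and the synthetic correlations and energies are assumptions for illustration.

```python
import numpy as np


def kfold_cv_rms(corr, energies, n_folds=5, seed=0):
    """k-fold cross-validation RMS error of an ordinary least-squares fit."""
    corr = np.asarray(corr, dtype=float)
    energies = np.asarray(energies, dtype=float)
    order = np.random.default_rng(seed).permutation(len(energies))
    squared_errors = []
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        # Fit on the training folds only, then predict the held-out fold.
        coef, *_ = np.linalg.lstsq(corr[train_idx], energies[train_idx], rcond=None)
        predicted = corr[test_idx] @ coef
        squared_errors.extend((energies[test_idx] - predicted) ** 2)
    return float(np.sqrt(np.mean(squared_errors)))


# Synthetic data: 40 configurations, 30 candidate functions, 3 of which matter.
rng = np.random.default_rng(1)
corr = rng.uniform(-1.0, 1.0, size=(40, 30))
true_eci = np.zeros(30)
true_eci[:3] = [0.5, -0.3, 0.2]
energies = corr @ true_eci + rng.normal(scale=0.01, size=40)

cv_sparse = kfold_cv_rms(corr[:, :3], energies)  # only the functions that matter
cv_all = kfold_cv_rms(corr, energies)            # every candidate function
print(f"CV RMS with 3 functions:  {cv_sparse:.4f}")
print(f"CV RMS with 30 functions: {cv_all:.4f}")
```

With nearly as many functions as training configurations, the larger basis fits the training data more closely but predicts held-out configurations worse, which is the overfitting signature the CV score is meant to expose.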

prisms-center / CASMcode

Fundamental questions on supercells/configurations selection, and physically robust ECI fitting #251