Structure.to(fmt='cif') Precision Issues

materialsproject / pymatgen

Python Materials Genomics (pymatgen) is a robust materials analysis code that defines classes for structures and molecules with support for many electronic structure codes. It powers the Materials Project.

https://pymatgen.org

Other

1.53k stars 868 forks source link

Structure.to(fmt='cif') Precision Issues #1784

Closed cpashartis closed 4 years ago

cpashartis commented 4 years ago

It makes it difficult to reproduce bugs or get a good nearest neighbour precision when the CIF writer has no precision CIF writer seems to have no precision for output, can give some errors in calculations if a certain precision is needed.

Add the number of decimals or precision to output when saving a file as CIF Perhaps add **kwargsinto .to method for Structure. Side note: json seems to store more digits...

mkhorton commented 4 years ago

Can you clarify what you mean by "CIF writer has no precision"?

If you want to increase the number of decimal places we print in the CIF we can do that.

gmatteo commented 4 years ago

Perhaps he refers to the default format used to write floats when structure.to() is invoked.

Indeed, there's no option to specify the number of digits used to print lattice vectors and atomic positions to text files so we have to rely on the default behaviour of the different writer instances.

We encountered a similar problem when reading POSCAR files produced by pymatgen in Abinit because, by default, the get_string method of Poscar uses significant_figures = 6.

Wyckoff positions such as 1/3, 2/3 are truncated and other codes may not detect all the symmetries of the system because we loose (too much) information when converting from the internal python representation to text file (especially for hexagonal and rhombohedral lattices).

It seems to me that many POSCAR files available on the materials project website have 6 significant_figures and this leads to frequent problems when new Abinit users try to use the POSCAR files to start an Abinit calculation because the code won't detect all the symmetry operations --> larger IBZ. We have an input variable to "refine" the structure in Abinit when the text file is not accurate enough but not all the users are aware of this trick.

In my opinion, the default number of figures for the lattice vectors and the atomic positions should be increased in order to facilitate code interoperability.

mkhorton commented 4 years ago

In my opinion, the default number of figures for the lattice vectors and the atomic positions should be increased in order to facilitate code interoperability.

This seems sensible to me, thanks for the exposition.

shyuep commented 4 years ago

Six decimals seem already excessive. You mean there are symmetry codes with that kind of tolerance. If u want fractions, that can be done using the python fractions module.

On Tue, Mar 3, 2020 at 16:58 Matthew Horton notifications@github.com wrote:

In my opinion, the default number of figures for the lattice vectors and the atomic positions should be increased in order to facilitate code interoperability.

This seems sensible to me, thanks for the exposition.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/materialsproject/pymatgen/issues/1784?email_source=notifications&email_token=AAEM4UZU23ED723UBHNJ6XTRFWRTPA5CNFSM4LAM7GA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENVYTFI#issuecomment-594250133, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEM4UY2YC3YLRRO42JWZITRFWRTPANCNFSM4LAM7GAQ .

-- Shyue Ping

mkhorton commented 4 years ago

I’ve seen errors with symmetry myself when “only” six decimals are present, though I’ve seen it in the context of constructing a cell from a space group.

On Tue, Mar 3, 2020 at 17:16, Shyue Ping Ong notifications@github.com wrote:

Six decimals seem already excessive. You mean there are symmetry codes with that kind of tolerance. If u want fractions, that can be done using the python fractions module.

On Tue, Mar 3, 2020 at 16:58 Matthew Horton notifications@github.com wrote:

In my opinion, the default number of figures for the lattice vectors and the atomic positions should be increased in order to facilitate code interoperability.

This seems sensible to me, thanks for the exposition.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub < https://github.com/materialsproject/pymatgen/issues/1784?email_source=notifications&email_token=AAEM4UZU23ED723UBHNJ6XTRFWRTPA5CNFSM4LAM7GA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENVYTFI#issuecomment-594250133 , or unsubscribe < https://github.com/notifications/unsubscribe-auth/AAEM4UY2YC3YLRRO42JWZITRFWRTPANCNFSM4LAM7GAQ

.

-- Shyue Ping

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/materialsproject/pymatgen/issues/1784?email_source=notifications&email_token=AAWWWRG36T3MAPGM36MXPETRFWTT3A5CNFSM4LAM7GA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENVZX4Q#issuecomment-594254834, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWWWRGAL2LWMD5FQDTD3HDRFWTT3ANCNFSM4LAM7GAQ .

gmatteo commented 4 years ago

Well, it seems to me that also pymatgen is trying to fix this finite precision issue with an ad-hoc solution:

       # check for finite precision frac co-ordinates (e.g. 0.6667 instead of 0.6666666...7)
        # this can sometimes cause serious issues when applying symmetry operations
        important_fracs = (1 / 3., 2 / 3.)
        fracs_to_change = {}
        for label in ('_atom_site_fract_x', '_atom_site_fract_y', '_atom_site_fract_z'):
            if label in data.data.keys():
                for idx, frac in enumerate(data.data[label]):
                    try:
                        frac = str2float(frac)
                    except Exception:
                        # co-ordinate might not be defined e.g. '?'
                        continue
                    for comparison_frac in important_fracs:
                        if abs(1 - frac / comparison_frac) < 1e-4:
                            fracs_to_change[(label, idx)] = str(comparison_frac)
        if fracs_to_change:
            self.warnings.append("Some fractional co-ordinates rounded to ideal values to "
                                 "avoid issues with finite precision.")
            for (label, idx), val in fracs_to_change.items():
                data.data[label][idx] = val

The question is: why do we spend a lot of CPU-hours to obtain a relaxed structure with ab-initio quality when a significant part of the information on the crystalline structure is then lost when we present the final result to the user or when we print these results to txt file in order to restart a calculation? Note that beyond KS-DFT calculations are more sensitive to the quality of the input structure and this is the reason why we mainly use json files to perform our studies.

cpashartis commented 4 years ago

I didn't expect this to blow up as much as it did, due to my time difference I didn't follow up yet, apologies.

I think @gmatteo gave a reasonable argument where the lack of precision could cause some issues (I appreciate it).

Ideally in my case, the only reason I noticed this issue was creating a test for some code I have written that sorts neighbours into different shells, my precision was too high when searching for the shells, and when I tried to reproduce the bug locally by exporting as a CIF file (and then reloading it) the bug was non-existent anymore.

I also agree that typically, we should not worry about extra precision, but I think from a numerical standpoint, the option for extra precision should be implemented and should not be hard coded. Also, from @gmatteo 's comment, it could be the case that some DFT codes might have issues restarting from a structure from a CIF file if it is not as close to what was used initially - say from a wave function restart with the previous structure loaded from a CIF. Although, I'm not sure how common this case is.

setten commented 4 years ago

great discussion, I'm currently developing an other multy code workflow framework using pymatgen. Here we pull all magic numbers and parameter into a settings json file. The next parameter will be write precision. @shyuep please add the write precision in the to method. not all users have the same needs. I don't think we should be deciding what is good enough for our users. Cheers,

shyuep commented 4 years ago

I have set the default float precision to 8, and added a significant_figures arg (which also works in Structure.to and consistent with the naming used in Poscar). This should handle all these complaints. It should be noted that fundamentally cif, and all floating point translations to strings, is "lossy". Taking a DFT relaxed structure and transforming it to CIF before reading it back to a second DFT run is not a good practice. There are reasons why other formats (e.g., JSON) exist.

For future reference: pls be respectful during discussions and especially when asking for other people to do stuff. I am more than happy to hear feedback and we can always help. But often, it is also useful for someone to take on the mantle of simply editing the CifWriter code to implement a simple thing like float precision formatting. It was literally a few minor changes to the format strings + changes to associated tests. Generally the tone has been good, but @setten, you really shouldn't be issuing me orders. It just rubs me the wrong way.

setten commented 4 years ago

Dear Shyue, Please don't get me wrong, I never intended to order you around, anybody else would probably just have added it in a less conformal way probably even causing you even more work. The discussion just seemed to be spinning out of control. Best, Michiel