materialsproject / pymatgen

Python Materials Genomics (pymatgen) is a robust materials analysis code that defines classes for structures and molecules with support for many electronic structure codes. It powers the Materials Project.
https://pymatgen.org
Other
1.52k stars 865 forks source link

Magic Number of "max_sites=-20" in MP2020Compatibility Leads to Hidden Error #2529

Open JiQi535 opened 2 years ago

JiQi535 commented 2 years ago

Is your feature request related to a problem? Please describe. Yes. The current get_adjustments() function in MP2020Compatibility fails to get the correct energy adjustments for compositions with over 20 atoms and specific mixed anions, e.g. Li20O9Cl2. For Li20O9Cl2, it can only get the energy correction for O2-, while correction for Cl- is missed. https://github.com/materialsproject/pymatgen/blob/1f1056cab8b9358f4d1ac2012f2a5732cd29bcf1/pymatgen/entries/compatibility.py#L1041-L1047 It is because of using a magic number of "max_sites=-20" to set 20 as the cutoff for the maximum number of sites in a reduced formula whose oxidation states will be guessed. Together with the below codes, energy corrections for anions that are not the most electronegative are missed. https://github.com/materialsproject/pymatgen/blob/1f1056cab8b9358f4d1ac2012f2a5732cd29bcf1/pymatgen/entries/compatibility.py#L1061-L1071

Without documentation, this current setting can lead to hidden error when analysing the Ehull of materials with over 20 atoms and mixed anions, like missing Cl- in Li20O9Cl2 and missing N3- in Li20Cl17N.

Describe the solution you'd like We can set "max_sites=-1" so that this error no longer exist, while reduced formula will be tried to find so that composition will be as small as possible to keep performance. I believe that being correct is more important than having good speed, and some compromise in performance should be acceptable.

@rkingsbury Hey Ryan, I saw your helpful post about MP2020Compatibility before, so I'm quoting you here. Thanks in advance! 😁

rkingsbury commented 2 years ago

@JiQi535 thanks for posting. Your understanding of what's happening is 100% correct, and I hadn't realized that this wasn't documented (my apologies for that). The oxidation state guessing was a feature that we added somewhat at the last minute and the documentation must have slipped through the cracks. That said, you should be seeing a warning along the lines of "Failed to guess oxidation states..." as per this line. Were you not seeing that in your work?

I remember we did try a default value of max_sites=-1 but that led to unacceptable performance hangs with modestly large structures. My concern with setting to -1 is that a user might find process_entries hanging and not be able to figure out why. Note that any materials you obtain from MPRester() will have oxidation states already populated, and hence the corrections will be applied correctly even for larger structures.

I would suggest a few things:

  1. make max_sites a kwarg in the __init__ , keeping the default value at -20
  2. Add documentation explaining what the kwarg does and that a value of -1 will try regardless of the no. of sites
  3. If the warning is not working as intended and notifying users when oxidation state guessing is being skipped, we need to fix that. There is a unit test here that perhaps needs to be expanded?

Let me know what you think about the above. If this sounds good to you I'd be happy to review a PR with the changes. Also flagging @mkhorton for comments.