materialsproject / pymatgen

Python Materials Genomics (pymatgen) is a robust materials analysis code that defines classes for structures and molecules with support for many electronic structure codes. It powers the Materials Project.
https://pymatgen.org
Other
1.52k stars 864 forks source link

Argument `check_occu` is not working as intended (in `_get_structure` and `parse_structures`) #3816

Closed KunhuanLiu closed 5 months ago

KunhuanLiu commented 6 months ago

Python version

3.11.8

Pymatgen version

2024.2.8

Operating system version

No response

Current behavior

Hi all, I was trying to deal with a crystal structure (in .cif format) that had warnings on having occupancy issues:

"UserWarning: Some occupancies ([1.0, ... (x876)]) sum to > 1! If they are within the occupancy_tolerance, they will be rescaled. The current occupancy_tolerance is set to: 1.0
  warnings.warn(msg)

I am unsure if it is the structure's error given that many other programs (ase, VESTA, etc) were not reporting error, so I tried to turn off the occupancy check and check myself. When I called CifParser function parse_structures with check_occu = False, the same warning and errors were logged. Checking the source code I think there is something implemented that is inconsistent with the algorithm, by a quick glance: Under _get_structures():

 # If check_occu is True or the occupancy is greater than 0, create comp_d
            if not check_occu or occu > 0:
                coord = (x, y, z)
                match = get_matching_coord(coord)
                comp_dict = {el: max(occu, 1e-8)}

I think not check_occu is not consistent with the algorithm commented above. Meanwhile, the warning I'm receiving comes from the following code snippet that is not switched off by "check_occu":

        if any(occu > 1 for occu in sum_occu):
            msg = (
                f"Some occupancies ({sum_occu}) sum to > 1! If they are within "
                "the occupancy_tolerance, they will be rescaled. "
                f"The current occupancy_tolerance is set to: {self._occupancy_tolerance}"
            )
            warnings.warn(msg)
            self.warnings.append(msg)

My temporarily hacky solution to my structure is to raise the occupancy_tolerance to 1000 to get the structure.

Expected Behavior

Sites that are too close to each other are kept without raising warning and errors that prevent getting the structures.

Minimal example

No response

Relevant files to reproduce this bug

No response

DanielYang59 commented 6 months ago

I'm not a maintainer, but if you could provide the cif file and code snippet to reproduce this issue, that would be very helpful :)

KunhuanLiu commented 6 months ago

Here is a minimally reproducible example with the structure (Crystal.cif) attached, highlighting that the argument check_occu cannot turn off the occupancy check. In this example structure, Site1 -- Site3 and Site4--6 are problematic positions in my structure. Atoms C4 and C5 mimic the normal atom sites in my structure.

To reproduce the bug:

from pymatgen.io.cif import CifParser
parser = CifParser("Crystal.cif", site_tolerance = 1e-4, occupancy_tolerance = 1.0)
b = parser.parse_structures(check_occu = False)

Error that I encountered:

Some occupancies ([2.0, 2.0, 2.0, 1.0, 1.0]) sum to > 1! If they are within the occupancy_tolerance, they will be rescaled. The current occupancy_tolerance is set to: 1.0
No structure parsed for section 1 in CIF.
...
in CifParser.parse_structures(self, primitive, symmetrized, check_occu, on_error)
   1219     warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
   1221 if len(structures) == 0:
-> 1222     raise ValueError("Invalid CIF file with no structures!")
   1223 return structures

ValueError: Invalid CIF file with no structures!

Expected behavior: occupancy would not be checked and a structure with all atoms should be given. I think code review could be more helpful to see the issue though. Crystal.txt

DanielYang59 commented 6 months ago

Hi @KunhuanLiu, I just had a look at this issue, and the inconsistency between comment and implementation you mentioned is indeed an issue: https://github.com/materialsproject/pymatgen/blob/2e1c3016742d930cbdc5f4129e4036819e7755d3/pymatgen/io/cif.py#L985-L989

But in our case, that's not the cause of the failure.

Cause of code failure

The issue is the bad default value of occupancy_tolerance for CifParser (and lack of proper error message): https://github.com/materialsproject/pymatgen/blob/2e1c3016742d930cbdc5f4129e4036819e7755d3/pymatgen/io/cif.py#L290

And the condition for scaling occupancy: https://github.com/materialsproject/pymatgen/blob/2e1c3016742d930cbdc5f4129e4036819e7755d3/pymatgen/io/cif.py#L1085-L1086

Which means the default occupancy_tolerance value would prevent invalid occupancy (in our case 2.0) being scaled, the thus the code breaks as Structure does not allow such invalid occupancy: https://github.com/materialsproject/pymatgen/blob/2e1c3016742d930cbdc5f4129e4036819e7755d3/pymatgen/io/cif.py#L1105

I would talk to @janosh on changing the bad default value of occupancy_tolerance soon, and I appreciate your report.

To fix your code

To fix your code, you need to set the occupancy_tolerance to a value greater than the max occupancy (2.0 in your case), for example:

from pymatgen.io.cif import CifParser

parser = CifParser("Crystal_min.cif", occupancy_tolerance=2.0)

structures = parser.parse_structures(check_occu=False)

Hope it helps :)

KunhuanLiu commented 6 months ago

Thanks @DanielYang59 , for investigating the case. Totally agree with your thoughts; I came up with a similar manual fix (I offered a test case where manually setting the occupancy tolerance to 2.0 can address the issue. In my structure, I had to use occupancy tolerance = 1000.) Of course, it will be difficult to manually check every structure (and set tolerance) for high throughput screening...

In addition to attending to the behavior of occupancy tolerance, I think the argument check_occu according to the API reference would be very useful if it can work properly: "site occupancy will not be checked, allowing unphysical occupancy != 1..."

Again thank you for the attention and time!

DanielYang59 commented 6 months ago

In addition to attending to the behavior of occupancy tolerance, I think the argument check_occu according to the API reference would be very useful if it can work properly: "site occupancy will not be checked, allowing unphysical occupancy != 1..."

Maybe the behavior should be changed such that if not check_occu, any occupancy would be allowed with a warning message. What do you think? Feel free to comment on https://github.com/materialsproject/pymatgen/pull/3819.

Again thank you for the attention and time!

No worries at all.

KunhuanLiu commented 5 months ago

Sorry about the delayed response -- I noticed another thing about the current check_occu behavior. I am sorry I cannot offer more help reviewing the algorithm right now.

Maybe the behavior should be changed such that if not check_occu, any occupancy would be allowed with a warning message. What do you think? Feel free to comment on https://github.com/materialsproject/pymatgen/pull/3819.

I think that's what I thought the argument should do based on the API documentation description. When not check_occu, all atom sites, regardless of occup_tolerance, will be returned.

I want to note the following behavior when I use check_occu = False though:

b = parser.parse_structures(check_occu = False)[0]
# from pymatgen.analysis.structure_analyzer import get_neighbors
for site in b:
    neighbors = b.get_neighbors(site, 0.1)  # 0.5 Å as an example threshold for overlap
    if neighbors:
        print(f"Overlap suspected near {site.label} at {site.frac_coords}:\n{neighbors}")

Returned atoms do not have labels:

Overlap suspected near C at [0.1191 0.5596 0.5713]:
[PeriodicNeighbor: C (-5.305, 16.0, 13.02) [0.1191, 0.5595, 0.5713]]
Overlap suspected near C at [0.4404 0.5595 0.5713]:
[PeriodicNeighbor: C (5.306, 16.0, 13.02) [0.4405, 0.5596, 0.5713]]
Overlap suspected near C at [0.4405 0.8809 0.5713]:
[PeriodicNeighbor: C (-0.001651, 25.19, 13.02) [0.4404, 0.8809, 0.5713]]
Overlap suspected near C at [0.4404 0.8809 0.5713]:
[PeriodicNeighbor: C (0.001651, 25.19, 13.02) [0.4405, 0.8809, 0.5713]]
Overlap suspected near C at [0.4405 0.5596 0.5713]:
[PeriodicNeighbor: C (5.305, 16.0, 13.02) [0.4404, 0.5595, 0.5713]]
Overlap suspected near C at [0.1191 0.5595 0.5713]:
[PeriodicNeighbor: C (-5.306, 16.0, 13.02) [0.1191, 0.5596, 0.5713]]
DanielYang59 commented 5 months ago

Hi thanks for following up. I have updated the behavior and docstring in #3819.

Meanwhile I have a question though, as per the docstring, "unphysical occupancy" refers to occupancy != 1, but the implementation just handles cases where occupancy > 1, is there any chance for the occupancy to be smaller than zero and still need to be included (I'm not aware of such cases)?

I would look into the site label soon, thanks for reporting.

DanielYang59 commented 5 months ago

The label issue is fixed by 05c11d4fbe95222eb4582b9986200dab42b36a00. Thanks.

esoteric-ephemera commented 5 months ago

Hi @KunhuanLiu : the CIF file you provided does not appear to be valid, as the three sites at the end, Site6, Site7, and Site8, are duplicates of Site3, Site2, and Site1, respectively. Removing these latter three sites successfully produces a structure with pymatgen

Pymatgen does not currently include functionality to fix bad CIFs. This isn't an issue with the occupancy tolerance function

KunhuanLiu commented 5 months ago

Hello @esoteric-ephemera , this ticket inquiries whether the implementation of the check_occu argument is consistent with the algorithm described in both the source code comment and the API documentation.

Removing these latter three sites successfully produces a structure with pymatgen

Example structure serves as a minimal example to highlight the impact of the issue.

Pymatgen does not currently include functionality to fix bad CIFs.

Thank you for sharing. Can you clarify what you mean? I am not requesting such functionality here. Can you comment on how or why check_occu = False should not return a structure when the perceived occupancy exceeds the user-set occu_tolerance? This can be important as @DanielYang59 has implemented several changes in #3819 and your review can be helpful. Just to make sure I am not proposing something based on misunderstanding of the intended behavior.

I also don't think that "bad cifs" (a term that should be defined more clearly) are the sole reasons for such errors. A structure with large unit cell and dense atoms can have atoms such as H that are "close" to each other under the 1e-4 default site tolerance (I cannot publish the structure here). If check_occu = False is not intended to keep the overlapping or nonphysical atoms, then the documentation, instead of the code, needs revision. Of course, it's then the user's responsibility to catch error or adjust tolerance parameters if they decide to continue their analysis on those structures.

This isn't an issue with the occupancy tolerance function

I never thought it was in this case. Setting a higher occupancy tolerance is an unorthodox trick to let pymatgen return my structure. The question concerns the proper behavior with the check_occu argument, and I will appreciate your comments on this

esoteric-ephemera commented 5 months ago

@KunhuanLiu: the example is not a reasonable CIF because it contains multiply-occupied sites. The site occupancy represents fractionally what elements occupy a given site, e.g., a disordered structure will have at least two sites with occupancies < 1, and an ordered structure will have all site occupancies == 1. A site occupancy > 1 is not physical

A "bad CIF" in this context means a CIF that is either out of spec per the IUCr or is unphysical. The ICUr specifically states that the occupancy represents:

The fraction of the atom type present at this site. The sum of the occupancies of all the atom types at this site may not significantly exceed 1.0 unless it is a dummy site.

Right now, pymatgen does not support this kind of "dummy site" because it's not compatible with other functionality in pymatgen. pymatgen does support disordered structures, where at least two elements occupy the same site with fractional (< 1) occupancy.

check_occ is not intended to ensure non-overlapping sites. It is intended to ensure that the fractional composition on a given site does not exceed 1

DanielYang59 commented 5 months ago

Hi @esoteric-ephemera thanks a lot for your comment.

Combine of sites

the CIF file you provided does not appear to be valid, as the three sites at the end, Site6, Site7, and Site8, are duplicates of Site3, Site2, and Site1, respectively.

Pymatgen does not currently include functionality to fix bad CIFs.

It appears CifParser is able to handle such cifs: https://github.com/materialsproject/pymatgen/blob/bb68c780834fea588f35420e1835ed3b21894a44/pymatgen/io/cif.py#L306-L329

And: https://github.com/materialsproject/pymatgen/blob/bb68c780834fea588f35420e1835ed3b21894a44/pymatgen/io/cif.py#L328-L329

The cif @KunhuanLiu provided:

 Site1  C 0.11910   0.55960   0.57130   1.0
 Site2  C 0.44040   0.55950   0.57130   1.000  << 
 Site3  C 0.44050   0.88090   0.57130   1.000
 C4     C 0.07117   0.21573   0.93523   1.000
 C5 C 0.14457   0.21573   0.93523   1.000
 Site6  C 0.44040   0.88090   0.57130   1.000
 Site7  C 0.44050   0.55960   0.57130   1.000  << would be combined into Site2
 Site8  C 0.11910   0.55950   0.57130   1.000

Site2 and Site7 (and two other pairs similarly) are not duplicate, but within the site_tolerance, and were therefore combined into a single site (with occupancy summed).

Handle of "unphysical" (occupancy > 1) sites

Right now, pymatgen does not support this kind of "dummy site" because it's not compatible with other functionality in pymatgen.

In parse_structures, there is: https://github.com/materialsproject/pymatgen/blob/bb68c780834fea588f35420e1835ed3b21894a44/pymatgen/io/cif.py#L1263-L1266

Which should allow user to parse cif with occupancy > 1, once check_occu is turned off?

KunhuanLiu commented 5 months ago

@esoteric-ephemera Thanks so much for clarifying, and I apologize for my oversight on the fact that in a realistic crystal structure, no atom sites would have summed occupancy > 1. I understand and support the design that raises the error ValueError: Invalid CIF file with no structures! when there are sites with occupancy > occupancy_tolerance and not rescaled to 1.

I am still unsure what the algorithm should do when check_occu is set to FALSE, but according to @esoteric-ephemera it should still not return a bad CIF with site occupancy > 1 that is caused by merging atom sites.

I'm very thankful to the attention and help from @DanielYang59, and I want to make sure we are not changing the code implementation unnecessarily. It looks like I had some misunderstanding caused by warning messages and documentation.

To make sure we are aligned on the intended behavior, here's my (corrected) understanding:

@DanielYang59 thanks for your attention. Based on our discussion I think the original version has a very reasonable design and raises ValueError as intended. It sounds reasonable to me to refer the structures with occupancy > 1 as bad cifs. Changes around tolerance are not necessary. I think there can be some improvement on the documentation.

https://github.com/materialsproject/pymatgen/blob/2e1c3016742d930cbdc5f4129e4036819e7755d3/pymatgen/io/cif.py#L1011-L1018 Here, ({sum_occu}) sum to > 1! is confusing and can lead to a very very long warning message when there are lots of atoms. This happened in my case and made it difficult to correctly understand the warning message. I suggest the following:

unphysical_occu = [occu for occu in sum_occu if occu > 1]
...         f"Some occupancies sum to > 1! If they are within " 
         "the occupancy_tolerance, they will be rescaled. " 
           f"The current occupancy_tolerance is set to: {self._occupancy_tolerance}. Sites with occupancy sum > 1: {unphysical_occu}" 

@esoteric-ephemera Can please you confirm that the behavior described above on check_occu is correct? I also wonder if this is a good place to close the ticket, or to open new issues with sharpened focus. Thanks for all the attention!

esoteric-ephemera commented 5 months ago

Sorry for the delay. @DanielYang59:

Site2 and Site7 (and two other pairs similarly) are not duplicate, but within the site_tolerance, and were therefore combined into a single site (with occupancy summed).

They're duplicates in terms of position and chemical species, their labels differ. The right way to specify a site like this is with 0.5 occupancy on Site2 and 0.5 occupancy on Site7. That specifies a disordered structure with 50-50 occupancy C labelled Site2 and C labelled Site7

Which should allow user to parse cif with occupancy > 1, once check_occu is turned off?

It might let you do that but I'm not sure what it parses will be meaningful / usable in pymatgen. I think pymatgen should parse CIFs that conform (within a tolerance) to the IUCr standards and not extrapolate when a CIF deviates too far from those standards.

@KunhuanLiu:

Can please you confirm that the behavior described above on check_occu is correct? I also wonder if this is a good place to close the ticket, or to open new issues with sharpened focus. Thanks for all the attention!

This is right except occupancy_tolerance is used to allow for occupancies slightly larger than 1 (this often arises from rounding errors in hand-written CIFs, such as those from the ICSD)

check_occu = False removing the site labels is probably necessary because pymatgen just merges sites with >100% occupancy and the same chemical identity

As for closing the issue / @DanielYang59's PR: I would prefer to just amend the ValueError message for clarity - the site tolerance shouldn't be changed unless there's an in-spec CIF that isn't being parsed correctly

DanielYang59 commented 5 months ago

Sorry for the delay. @DanielYang59:

All good. Thanks a lot for your comments :)

Site2 and Site7 (and two other pairs similarly) are not duplicate, but within the site_tolerance, and were therefore combined into a single site (with occupancy summed).

They're duplicates in terms of position and chemical species, their labels differ. The right way to specify a site like this is with 0.5 occupancy on Site2 and 0.5 occupancy on Site7. That specifies a disordered structure with 50-50 occupancy C labelled Site2 and C labelled Site7

Sorry I don't think they're duplicates, as their x/y coordinates differ (slightly by 1e-4):

 Site2  C 0.44040   0.55950   0.57130   1.000  
 Site7  C 0.44050   0.55960   0.57130   1.000  

So if the tolerance is set to a stricter value, they could be two individual sites. But in our case, the default tolerance is such that these two sites are indeed considered at the same position (but I would not consider them duplicates).

https://github.com/materialsproject/pymatgen/blob/14b7357d373c1eae88a3ca014fb14ffba44df75f/pymatgen/io/cif.py#L328-L329

Which should allow user to parse cif with occupancy > 1, once check_occu is turned off?

It might let you do that but I'm not sure what it parses will be meaningful / usable in pymatgen. I think pymatgen should parse CIFs that conform (within a tolerance) to the IUCr standards and not extrapolate when a CIF deviates too far from those standards.

I'm unsure about the purpose of having such behavior either, but as per the docstring: https://github.com/materialsproject/pymatgen/blob/14b7357d373c1eae88a3ca014fb14ffba44df75f/pymatgen/io/cif.py#L1263-L1266

So I might assume there is a reason to allow for such cif files.

If you or anyone else with more knowledge on cif decide to remove this functionality, we might need to remove check_occu because it seems to causing confusing (for example this issue). Because occupancy would be checked no matter how it is set.

As for closing the issue / @DanielYang59's PR: I would prefer to just amend the ValueError message for clarity - the site tolerance shouldn't be changed unless there's an in-spec CIF that isn't being parsed correctly

Yes I have already updated the code with a more descriptive error message.

I didn't change the site tolerance but updated the behavior of check_occu to allow for any occupancy once turned off (to be consistent with the docstring). But certainly we need further confirmation on whether we want such behavior or not.