Closed KunhuanLiu closed 5 months ago
I'm not a maintainer, but if you could provide the cif file and code snippet to reproduce this issue, that would be very helpful :)
Here is a minimally reproducible example with the structure (Crystal.cif) attached, highlighting that the argument check_occu
cannot turn off the occupancy check. In this example structure, Site1 -- Site3 and Site4--6 are problematic positions in my structure. Atoms C4 and C5 mimic the normal atom sites in my structure.
To reproduce the bug:
from pymatgen.io.cif import CifParser
parser = CifParser("Crystal.cif", site_tolerance = 1e-4, occupancy_tolerance = 1.0)
b = parser.parse_structures(check_occu = False)
Error that I encountered:
Some occupancies ([2.0, 2.0, 2.0, 1.0, 1.0]) sum to > 1! If they are within the occupancy_tolerance, they will be rescaled. The current occupancy_tolerance is set to: 1.0
No structure parsed for section 1 in CIF.
...
in CifParser.parse_structures(self, primitive, symmetrized, check_occu, on_error)
1219 warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
1221 if len(structures) == 0:
-> 1222 raise ValueError("Invalid CIF file with no structures!")
1223 return structures
ValueError: Invalid CIF file with no structures!
Expected behavior: occupancy would not be checked and a structure with all atoms should be given. I think code review could be more helpful to see the issue though. Crystal.txt
Hi @KunhuanLiu, I just had a look at this issue, and the inconsistency between comment and implementation you mentioned is indeed an issue: https://github.com/materialsproject/pymatgen/blob/2e1c3016742d930cbdc5f4129e4036819e7755d3/pymatgen/io/cif.py#L985-L989
But in our case, that's not the cause of the failure.
The issue is the bad default value of occupancy_tolerance
for CifParser
(and lack of proper error message): https://github.com/materialsproject/pymatgen/blob/2e1c3016742d930cbdc5f4129e4036819e7755d3/pymatgen/io/cif.py#L290
And the condition for scaling occupancy: https://github.com/materialsproject/pymatgen/blob/2e1c3016742d930cbdc5f4129e4036819e7755d3/pymatgen/io/cif.py#L1085-L1086
Which means the default occupancy_tolerance
value would prevent invalid occupancy (in our case 2.0
) being scaled, the thus the code breaks as Structure
does not allow such invalid occupancy: https://github.com/materialsproject/pymatgen/blob/2e1c3016742d930cbdc5f4129e4036819e7755d3/pymatgen/io/cif.py#L1105
I would talk to @janosh on changing the bad default value of occupancy_tolerance
soon, and I appreciate your report.
To fix your code, you need to set the occupancy_tolerance
to a value greater than the max occupancy (2.0
in your case), for example:
from pymatgen.io.cif import CifParser
parser = CifParser("Crystal_min.cif", occupancy_tolerance=2.0)
structures = parser.parse_structures(check_occu=False)
Hope it helps :)
Thanks @DanielYang59 , for investigating the case. Totally agree with your thoughts; I came up with a similar manual fix (I offered a test case where manually setting the occupancy tolerance
to 2.0 can address the issue. In my structure, I had to use occupancy tolerance = 1000
.) Of course, it will be difficult to manually check every structure (and set tolerance) for high throughput screening...
In addition to attending to the behavior of occupancy tolerance,
I think the argument check_occu
according to the API reference would be very useful if it can work properly: "site occupancy will not be checked, allowing unphysical occupancy != 1..."
Again thank you for the attention and time!
In addition to attending to the behavior of
occupancy tolerance,
I think the argumentcheck_occu
according to the API reference would be very useful if it can work properly: "site occupancy will not be checked, allowing unphysical occupancy != 1..."
Maybe the behavior should be changed such that if not check_occu
, any occupancy would be allowed with a warning message. What do you think? Feel free to comment on https://github.com/materialsproject/pymatgen/pull/3819.
Again thank you for the attention and time!
No worries at all.
Sorry about the delayed response -- I noticed another thing about the current check_occu
behavior. I am sorry I cannot offer more help reviewing the algorithm right now.
Maybe the behavior should be changed such that if not check_occu, any occupancy would be allowed with a warning message. What do you think? Feel free to comment on https://github.com/materialsproject/pymatgen/pull/3819.
I think that's what I thought the argument should do based on the API documentation description. When not check_occu
, all atom sites, regardless of occup_tolerance
, will be returned.
I want to note the following behavior when I use check_occu = False
though:
b = parser.parse_structures(check_occu = False)[0]
# from pymatgen.analysis.structure_analyzer import get_neighbors
for site in b:
neighbors = b.get_neighbors(site, 0.1) # 0.5 Å as an example threshold for overlap
if neighbors:
print(f"Overlap suspected near {site.label} at {site.frac_coords}:\n{neighbors}")
Returned atoms do not have labels:
Overlap suspected near C at [0.1191 0.5596 0.5713]:
[PeriodicNeighbor: C (-5.305, 16.0, 13.02) [0.1191, 0.5595, 0.5713]]
Overlap suspected near C at [0.4404 0.5595 0.5713]:
[PeriodicNeighbor: C (5.306, 16.0, 13.02) [0.4405, 0.5596, 0.5713]]
Overlap suspected near C at [0.4405 0.8809 0.5713]:
[PeriodicNeighbor: C (-0.001651, 25.19, 13.02) [0.4404, 0.8809, 0.5713]]
Overlap suspected near C at [0.4404 0.8809 0.5713]:
[PeriodicNeighbor: C (0.001651, 25.19, 13.02) [0.4405, 0.8809, 0.5713]]
Overlap suspected near C at [0.4405 0.5596 0.5713]:
[PeriodicNeighbor: C (5.305, 16.0, 13.02) [0.4404, 0.5595, 0.5713]]
Overlap suspected near C at [0.1191 0.5595 0.5713]:
[PeriodicNeighbor: C (-5.306, 16.0, 13.02) [0.1191, 0.5596, 0.5713]]
Hi thanks for following up. I have updated the behavior and docstring in #3819.
Meanwhile I have a question though, as per the docstring, "unphysical occupancy" refers to occupancy != 1
, but the implementation just handles cases where occupancy > 1
, is there any chance for the occupancy to be smaller than zero and still need to be included (I'm not aware of such cases)?
I would look into the site label soon, thanks for reporting.
The label issue is fixed by 05c11d4fbe95222eb4582b9986200dab42b36a00. Thanks.
Hi @KunhuanLiu : the CIF file you provided does not appear to be valid, as the three sites at the end, Site6
, Site7
, and Site8
, are duplicates of Site3
, Site2
, and Site1
, respectively. Removing these latter three sites successfully produces a structure with pymatgen
Pymatgen does not currently include functionality to fix bad CIFs. This isn't an issue with the occupancy tolerance function
Hello @esoteric-ephemera , this ticket inquiries whether the implementation of the check_occu
argument is consistent with the algorithm described in both the source code comment and the API documentation.
Removing these latter three sites successfully produces a structure with pymatgen
Example structure serves as a minimal example to highlight the impact of the issue.
Pymatgen does not currently include functionality to fix bad CIFs.
Thank you for sharing. Can you clarify what you mean? I am not requesting such functionality here. Can you comment on how or why check_occu = False
should not return a structure when the perceived occupancy exceeds the user-set occu_tolerance
? This can be important as @DanielYang59 has implemented several changes in #3819 and your review can be helpful. Just to make sure I am not proposing something based on misunderstanding of the intended behavior.
I also don't think that "bad cifs" (a term that should be defined more clearly) are the sole reasons for such errors. A structure with large unit cell and dense atoms can have atoms such as H that are "close" to each other under the 1e-4 default site tolerance (I cannot publish the structure here). If check_occu = False
is not intended to keep the overlapping or nonphysical atoms, then the documentation, instead of the code, needs revision. Of course, it's then the user's responsibility to catch error or adjust tolerance parameters if they decide to continue their analysis on those structures.
This isn't an issue with the occupancy tolerance function
I never thought it was in this case. Setting a higher occupancy tolerance is an unorthodox trick to let pymatgen return my structure. The question concerns the proper behavior with the check_occu
argument, and I will appreciate your comments on this
@KunhuanLiu: the example is not a reasonable CIF because it contains multiply-occupied sites. The site occupancy represents fractionally what elements occupy a given site, e.g., a disordered structure will have at least two sites with occupancies < 1, and an ordered structure will have all site occupancies == 1. A site occupancy > 1 is not physical
A "bad CIF" in this context means a CIF that is either out of spec per the IUCr or is unphysical. The ICUr specifically states that the occupancy represents:
The fraction of the atom type present at this site. The sum of the occupancies of all the atom types at this site may not significantly exceed 1.0 unless it is a dummy site.
Right now, pymatgen does not support this kind of "dummy site" because it's not compatible with other functionality in pymatgen. pymatgen does support disordered structures, where at least two elements occupy the same site with fractional (< 1) occupancy.
check_occ
is not intended to ensure non-overlapping sites. It is intended to ensure that the fractional composition on a given site does not exceed 1
Hi @esoteric-ephemera thanks a lot for your comment.
the CIF file you provided does not appear to be valid, as the three sites at the end, Site6, Site7, and Site8, are duplicates of Site3, Site2, and Site1, respectively.
Pymatgen does not currently include functionality to fix bad CIFs.
It appears CifParser
is able to handle such cifs:
https://github.com/materialsproject/pymatgen/blob/bb68c780834fea588f35420e1835ed3b21894a44/pymatgen/io/cif.py#L306-L329
The cif @KunhuanLiu provided:
Site1 C 0.11910 0.55960 0.57130 1.0
Site2 C 0.44040 0.55950 0.57130 1.000 <<
Site3 C 0.44050 0.88090 0.57130 1.000
C4 C 0.07117 0.21573 0.93523 1.000
C5 C 0.14457 0.21573 0.93523 1.000
Site6 C 0.44040 0.88090 0.57130 1.000
Site7 C 0.44050 0.55960 0.57130 1.000 << would be combined into Site2
Site8 C 0.11910 0.55950 0.57130 1.000
Site2
and Site7
(and two other pairs similarly) are not duplicate, but within the site_tolerance
, and were therefore combined into a single site (with occupancy summed).
Right now, pymatgen does not support this kind of "dummy site" because it's not compatible with other functionality in pymatgen.
In parse_structures
, there is:
https://github.com/materialsproject/pymatgen/blob/bb68c780834fea588f35420e1835ed3b21894a44/pymatgen/io/cif.py#L1263-L1266
Which should allow user to parse cif with occupancy > 1, once check_occu
is turned off?
@esoteric-ephemera Thanks so much for clarifying, and I apologize for my oversight on the fact that in a realistic crystal structure, no atom sites would have summed occupancy > 1. I understand and support the design that raises the error ValueError: Invalid CIF file with no structures!
when there are sites with occupancy > occupancy_tolerance and not rescaled to 1.
I am still unsure what the algorithm should do when check_occu
is set to FALSE
, but according to @esoteric-ephemera it should still not return a bad CIF with site occupancy > 1 that is caused by merging atom sites.
I'm very thankful to the attention and help from @DanielYang59, and I want to make sure we are not changing the code implementation unnecessarily. It looks like I had some misunderstanding caused by warning messages and documentation.
To make sure we are aligned on the intended behavior, here's my (corrected) understanding:
site_tolerance
: decides whether sites are merged and the resulted site has occupancy as sum of all the site occupancyoccupancy_tolerance
: used to scale down sites with occupancy <= occupancy_tolerancecheck_occu
: setting to True
(default value), check if any site has occupancy > 1; if yes, raise ValueError
. Setting to False
, ignores _atom_site_occupancy
. However, it is intended to still raise ValueError
when any site has occupancy >1. (Needs confirmation)check_occu = False
removes atom site labels.@DanielYang59 thanks for your attention. Based on our discussion I think the original version has a very reasonable design and raises ValueError as intended. It sounds reasonable to me to refer the structures with occupancy > 1 as bad cifs
. Changes around tolerance
are not necessary. I think there can be some improvement on the documentation.
https://github.com/materialsproject/pymatgen/blob/2e1c3016742d930cbdc5f4129e4036819e7755d3/pymatgen/io/cif.py#L1011-L1018
Here, ({sum_occu}) sum to > 1!
is confusing and can lead to a very very long warning message when there are lots of atoms. This happened in my case and made it difficult to correctly understand the warning message.
I suggest the following:
unphysical_occu = [occu for occu in sum_occu if occu > 1]
... f"Some occupancies sum to > 1! If they are within "
"the occupancy_tolerance, they will be rescaled. "
f"The current occupancy_tolerance is set to: {self._occupancy_tolerance}. Sites with occupancy sum > 1: {unphysical_occu}"
@esoteric-ephemera Can please you confirm that the behavior described above on check_occu
is correct? I also wonder if this is a good place to close the ticket, or to open new issues with sharpened focus. Thanks for all the attention!
Sorry for the delay. @DanielYang59:
Site2
andSite7
(and two other pairs similarly) are not duplicate, but within the site_tolerance, and were therefore combined into a single site (with occupancy summed).
They're duplicates in terms of position and chemical species, their labels differ. The right way to specify a site like this is with 0.5 occupancy on Site2
and 0.5 occupancy on Site7
. That specifies a disordered structure with 50-50 occupancy C labelled Site2
and C labelled Site7
Which should allow user to parse cif with occupancy > 1, once check_occu is turned off?
It might let you do that but I'm not sure what it parses will be meaningful / usable in pymatgen. I think pymatgen should parse CIFs that conform (within a tolerance) to the IUCr standards and not extrapolate when a CIF deviates too far from those standards.
@KunhuanLiu:
Can please you confirm that the behavior described above on check_occu is correct? I also wonder if this is a good place to close the ticket, or to open new issues with sharpened focus. Thanks for all the attention!
This is right except occupancy_tolerance
is used to allow for occupancies slightly larger than 1 (this often arises from rounding errors in hand-written CIFs, such as those from the ICSD)
check_occu = False
removing the site labels is probably necessary because pymatgen just merges sites with >100% occupancy and the same chemical identity
As for closing the issue / @DanielYang59's PR: I would prefer to just amend the ValueError message for clarity - the site tolerance shouldn't be changed unless there's an in-spec CIF that isn't being parsed correctly
Sorry for the delay. @DanielYang59:
All good. Thanks a lot for your comments :)
Site2
andSite7
(and two other pairs similarly) are not duplicate, but within the site_tolerance, and were therefore combined into a single site (with occupancy summed).They're duplicates in terms of position and chemical species, their labels differ. The right way to specify a site like this is with 0.5 occupancy on
Site2
and 0.5 occupancy onSite7
. That specifies a disordered structure with 50-50 occupancy C labelledSite2
and C labelledSite7
Sorry I don't think they're duplicates, as their x/y coordinates differ (slightly by 1e-4):
Site2 C 0.44040 0.55950 0.57130 1.000
Site7 C 0.44050 0.55960 0.57130 1.000
So if the tolerance is set to a stricter value, they could be two individual sites. But in our case, the default tolerance is such that these two sites are indeed considered at the same position (but I would not consider them duplicates).
Which should allow user to parse cif with occupancy > 1, once check_occu is turned off?
It might let you do that but I'm not sure what it parses will be meaningful / usable in pymatgen. I think pymatgen should parse CIFs that conform (within a tolerance) to the IUCr standards and not extrapolate when a CIF deviates too far from those standards.
I'm unsure about the purpose of having such behavior either, but as per the docstring: https://github.com/materialsproject/pymatgen/blob/14b7357d373c1eae88a3ca014fb14ffba44df75f/pymatgen/io/cif.py#L1263-L1266
So I might assume there is a reason to allow for such cif files.
If you or anyone else with more knowledge on cif decide to remove this functionality, we might need to remove check_occu
because it seems to causing confusing (for example this issue). Because occupancy would be checked no matter how it is set.
As for closing the issue / @DanielYang59's PR: I would prefer to just amend the ValueError message for clarity - the site tolerance shouldn't be changed unless there's an in-spec CIF that isn't being parsed correctly
Yes I have already updated the code with a more descriptive error message.
I didn't change the site tolerance but updated the behavior of check_occu
to allow for any occupancy once turned off (to be consistent with the docstring). But certainly we need further confirmation on whether we want such behavior or not.
Python version
3.11.8
Pymatgen version
2024.2.8
Operating system version
No response
Current behavior
Hi all, I was trying to deal with a crystal structure (in .cif format) that had warnings on having occupancy issues:
I am unsure if it is the structure's error given that many other programs (ase, VESTA, etc) were not reporting error, so I tried to turn off the occupancy check and check myself. When I called CifParser function
parse_structures
withcheck_occu
= False, the same warning and errors were logged. Checking the source code I think there is something implemented that is inconsistent with the algorithm, by a quick glance: Under_get_structures()
:I think
not check_occu
is not consistent with the algorithm commented above. Meanwhile, the warning I'm receiving comes from the following code snippet that is not switched off by "check_occu":My temporarily hacky solution to my structure is to raise the occupancy_tolerance to 1000 to get the structure.
Expected Behavior
Sites that are too close to each other are kept without raising warning and errors that prevent getting the structures.
Minimal example
No response
Relevant files to reproduce this bug
No response