Python Materials Genomics (pymatgen) is a robust materials analysis code that defines classes for structures and molecules with support for many electronic structure codes. It powers the Materials Project.
CIF files are currently parsed for composition data using the label field (_atom_site_label).
CIF files should instead be parsed using the composition field (_atom_site_type_symbol), with the result stored as site occupancy data.
Example code
A CIF file can specify site occupancy. For example, the cubic phase of Ge2Sb2Te5 has one Wyckoff site occupied by Te, while the other site is occupied by 40% Sb, 40% Ge, and 20% vacancies. The relevant part of the (correct) CIF file for this structure reads:
loop_
_atom_site_label
_atom_site_type_symbol
_atom_site_occupancy
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
Ge Ge 0.4000 0.5000 0.5000 0.5000
Ge Sb 0.4000 0.5000 0.5000 0.5000
Te Te 1.0000 0.0000 0.0000 0.0000
The first column is a label, the second is the element that occupies the site, and the third is its fractional occupancy. Below are the results of reading in the CIF file from which the excerpt above is taken.
In [50]: gst = mg.Structure.from_file('Ge2Sb2Te5_Fm3m.cif')
In [51]: gst
Out[51]:
Structure Summary
Lattice
angles : 90.0 90.0 90.0
volume : 219.36532779099997
PeriodicSite: Ge:0.800 (3.0155, 3.0155, 3.0155) [0.5000, 0.5000, 0.5000]
PeriodicSite: Ge:0.800 (3.0155, 0.0000, 0.0000) [0.5000, 0.0000, 0.0000]
PeriodicSite: Ge:0.800 (0.0000, 0.0000, 3.0155) [0.0000, 0.0000, 0.5000]
PeriodicSite: Ge:0.800 (0.0000, 3.0155, 0.0000) [0.0000, 0.5000, 0.0000]
PeriodicSite: Te (0.0000, 0.0000, 0.0000) [0.0000, 0.0000, 0.0000]
PeriodicSite: Te (0.0000, 3.0155, 3.0155) [0.0000, 0.5000, 0.5000]
PeriodicSite: Te (3.0155, 3.0155, 0.0000) [0.5000, 0.5000, 0.0000]
PeriodicSite: Te (3.0155, 0.0000, 3.0155) [0.5000, 0.0000, 0.5000]
Note that there is no evidence of Sb in the structure at all! The two partially occupied entries have instead been merged into a single Ge site with occupancy 0.8. It would appear that the Structure.from_file method simply parses the labels as if they were composition, which is clearly incorrect. I came across this when I received a (different) CIF file exported from Materials Studio (CASTEP among other codes) and found that the user had changed the composition but not the labels. Materials Studio had no trouble with this, as it correctly used the composition for calculations based on the structure; the CIF file kept the older labels alongside the correct compositions, which, strictly speaking, is still correct.
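To make the difference concrete, here is a minimal standard-library sketch that reads the three data rows from the excerpt above twice: once taking the species from the label column and once from the type symbol column. It is only an illustration of the two interpretations, not pymatgen's parser; the names `ROWS` and `site_compositions` are made up for this example.

```python
from collections import defaultdict

# Data rows copied from the CIF excerpt above, in loop order:
# label, type_symbol, occupancy, fract_x, fract_y, fract_z
ROWS = """\
Ge Ge 0.4000 0.5000 0.5000 0.5000
Ge Sb 0.4000 0.5000 0.5000 0.5000
Te Te 1.0000 0.0000 0.0000 0.0000
"""


def site_compositions(rows, species_column):
    """Sum occupancies per position and species.

    species_column selects which column names the species:
    0 for _atom_site_label, 1 for _atom_site_type_symbol.
    (Hypothetical helper for illustration only.)
    """
    sites = defaultdict(lambda: defaultdict(float))
    for line in rows.splitlines():
        fields = line.split()
        if not fields:
            continue
        species = fields[species_column]
        occupancy = float(fields[2])
        position = tuple(float(value) for value in fields[3:6])
        sites[position][species] += occupancy
    return {position: dict(comp) for position, comp in sites.items()}


by_label = site_compositions(ROWS, species_column=0)
by_symbol = site_compositions(ROWS, species_column=1)

mixed_site = (0.5, 0.5, 0.5)
print("from labels:      ", by_label[mixed_site])
print("from type symbols:", by_symbol[mixed_site])

# Whatever is not occupied on the mixed site is the vacancy fraction.
vacancy_fraction = 1.0 - sum(by_symbol[mixed_site].values())
```

Reading the labels reproduces the single Ge site with occupancy 0.8 seen in the output above, while reading the type symbols gives Ge and Sb at 0.4 each, leaving roughly 0.2 for vacancies.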
Correct behavior:
The Structure.from_file method should use the type symbol field (the second column above), not the label field, to determine the atom type of the site, and the occupancy field to determine its fraction. I have included the CIF file from which the excerpt above is taken for reference.
Suggested solution (if any)
The Structure.from_file() method should parse the composition fields (_atom_site_type_symbol and _atom_site_occupancy), not the label field, to determine the contents of a given site in a structure.
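One possible shape for that rule is sketched below: prefer `_atom_site_type_symbol` when it is present, and fall back to the label only when it is missing or set to one of the CIF placeholder values (`.` or `?`). The helper name `species_for_row` and the dict-per-row representation are assumptions made for this example and are not part of pymatgen; the leading-letters regex is a simplification that does not check the result against the periodic table.

```python
import re

# CIF uses "." (inapplicable) and "?" (unknown) as placeholder values.
_PLACEHOLDERS = {None, "", ".", "?"}


def species_for_row(row):
    """Return the element symbol for one atom-site row.

    `row` maps CIF tags to their string values. The type symbol is
    preferred; the label is used only as a fallback.
    (Hypothetical helper, not pymatgen's API.)
    """
    raw = row.get("_atom_site_type_symbol")
    if raw in _PLACEHOLDERS:
        raw = row["_atom_site_label"]
    # Keep only the leading element letters, so "Ge1" or "Sb3+" work too.
    match = re.match(r"[A-Z][a-z]?", raw)
    if match is None:
        raise ValueError(f"cannot determine element from {raw!r}")
    return match.group(0)


# The second row of the excerpt: labelled Ge, but actually Sb.
mislabelled = {"_atom_site_label": "Ge", "_atom_site_type_symbol": "Sb"}
# A row from a file that has no type symbol column at all.
label_only = {"_atom_site_label": "Te1"}

print(species_for_row(mislabelled))
print(species_for_row(label_only))
```

With this ordering, a file whose labels are stale but whose type symbols are correct, such as the Materials Studio export described above, would still be read with the intended composition.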
Files (if any)
Ge2Sb2Te5_(Fm3m).cif