materialsproject / pymatgen

Python Materials Genomics (pymatgen) is a robust materials analysis code that defines classes for structures and molecules with support for many electronic structure codes. It powers the Materials Project.
https://pymatgen.org
Other
1.51k stars 863 forks source link

Bug with structure read from cif file #3373

Open epatyukova opened 1 year ago

epatyukova commented 1 year ago

Structure with collection code 133424 from ICSD (Pr Sm Pb Si2 S8) is read incorrectly by Structure.from_file(). In particular there should be 18 Pb2+ sites, but there are 30 after the read. There is clearly a bug.

ICSD_CollCode133424.cif.

janosh commented 1 year ago

@epatyukova You're right, there is a problem here. Fwiw, the problem is more widespread than pymatgen.

@esoteric-ephemera Did some investigation and it turns out Vesta and ase also parse this CIF file incorrectly. As @computron has said in the past, CIF is a troublesome format.

ase completely drops the Sm(Samarium) from the CIF file. Vesta also has the wrong stoichiometry though doesn't drop any elements outright.

# %%
from ase.io import read as ase_read
from pymatgen.io.ase import AseAtomsAdaptor
from pymatgen.io.cif import CifParser

# %%
cif_filename = "./ICSD_CollCode133424.cif"
cif_ase = ase_read(cif_filename)
NPb_sites_ase = len([site for site in cif_ase.get_chemical_symbols() if site == "Pb"])

# %%
pmg_from_ase = AseAtomsAdaptor.get_structure(cif_ase)
pmg_prim_from_ase = pmg_from_ase.get_primitive_structure()
print(
    f"ASE atoms --> PMG: composition = {dict(pmg_from_ase.composition.as_dict())}, {pmg_from_ase.get_space_group_info()}"
)
print(
    f" --> primitive cell: composition = {dict(pmg_prim_from_ase.composition.as_dict())}, {pmg_prim_from_ase.get_space_group_info()}"
)

# %%
for i, site_tol in enumerate([1.0e-2, 1.0e-3, 1.0e-4, 1.0e-5]):
    cif_pmg = CifParser(cif_filename, site_tolerance=site_tol).get_structures()[0]
    print(f"{site_tol=}")
    rounded_comp = {
        element: round(cif_pmg.composition.as_dict()[element], 3) for element in cif_pmg.composition.as_dict()
    }
    print(f"  Composition={rounded_comp}")
    try:
        print(f"  Space group = {cif_pmg.get_space_group_info()}")
    except Exception:
        pass
    print(f"  ASE struc primitive equiv? {pmg_prim_from_ase==cif_pmg}")
    print(f"  Total number of sites = {cif_pmg.num_sites} (PMG), {cif_ase.get_global_number_of_atoms()} (ASE)")
    pmg_str_sites = [str(site.species.remove_charges()) for site in cif_pmg.sites]
    NPb_sites_pmg = len([site for site in pmg_str_sites if site == "Pb0.3333"])
    print(f"  Number of Pb sites = {NPb_sites_pmg} (PMG), {NPb_sites_ase} (ASE)")

Not sure there's a straightforward solution here. Pinging @Andrew-S-Rosen @mkhorton @shyuep in case anyone else can offer more insights.

Andrew-S-Rosen commented 1 year ago

Any ideas off-hand why the Sm is an issue here? Is it a symmetry issue? Nothing jumps out at me.

Andrew-S-Rosen commented 1 year ago

What is the expected behavior here? The Sm, Pr, and Pb all have fractional occupancies. ASE does not support fractional occupancies to the best of my knowledge. And VESTA seems okay unless I'm missing something?

image

janosh commented 1 year ago

@epatyukova You could try passing primitive=False to get_structures. Looks like that gives better stoichiometry.

janosh commented 1 year ago

@Andrew-S-Rosen Can you export as POSCAR and check that the composition still matches the CIF?

Andrew-S-Rosen commented 1 year ago

@janosh: In short, I don't think any of the codes are misbehaving here.

It doesn't make much sense to export as POSCAR seeing as there are fractional occupancies, but ignoring that fact, this is what I get when making the POSCAR from VESTA:

Praseodymium samarium lead disilicon octasulfide
1.0
        8.9306001663         0.0000000000         0.0000000000
       -4.4653000832         7.7341266151         0.0000000000
        0.0000000000         0.0000000000        26.4927997589
   Sm   Pr   Pb   Si    S
   18   18   30   12   48

These all agree with the CIF's _atom_site_symmetry_multiplicity values except for Pb, which is 30 here as opposed to 18.

But if you look at the CIF, the positions for the Pb are slightly different than Sm and Pr even though they are describing the same atomic sites and should be symmetrically equivalent, presumably.

Changing the fractional coordinates of the Pb line to match that of Sm and Pr and then remaking the POSCAR yields the expected 18 18 18 12 48.

To me, this seems like a case of a janky CIF.

epatyukova commented 1 year ago

Dear all,

Thank you very much for looking into the issue. I'm using structure type to process large amounts of data. Chemically it is possible that Pb has other fractional coordinates, shifted with respect to coordinates of Sm and Pr. I will just then do an additional check that multiplicity produced by Structure is the same as in the cif file.

Best wishes, Elena.

сб, 7 окт. 2023 г. в 01:04, Andrew S. Rosen @.***>:

@janosh https://github.com/janosh: In short, I don't think any of the codes are misbehaving here.

It doesn't make much sense to export as POSCAR seeing as there are fractional occupancies, but ignoring that fact, this is what I get when making the POSCAR from VESTA:

Praseodymium samarium lead disilicon octasulfide 1.0 8.9306001663 0.0000000000 0.0000000000 -4.4653000832 7.7341266151 0.0000000000 0.0000000000 0.0000000000 26.4927997589 Sm Pr Pb Si S 18 18 30 12 48

These all agree with the CIF's _atom_site_symmetry_multiplicity values except for Pb, which is 30 here as opposed to 18.

But if you look at the CIF, the positions for the Pb are slightly different than Sm and Pr even though they are describing the same atomic sites and should be symmetrically equivalent, presumably.

Changing the fractional coordinates of the Pb line to match that of Sm and Pr and then remaking the POSCAR yields the expected 18 18 18 12 48.

— Reply to this email directly, view it on GitHub https://github.com/materialsproject/pymatgen/issues/3373#issuecomment-1751514716, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFNA6BNPTMKGIDTI3H65Q73X6CMAVAVCNFSM6AAAAAA5STCF2GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONJRGUYTINZRGY . You are receiving this because you were mentioned.Message ID: @.***>

Andrew-S-Rosen commented 11 months ago

@janosh: I believe this should be closed. Pymatgen can't be expected to read garbage, ambiguous CIFs.

epatyukova commented 11 months ago

It is still an interesting case of error. We also discovered that this file is visualised by my VESTA (M1 processor, MAC) and VESTA of my colleague (Intel, linux) differently (both of us have the latest VESTA version)... See pictures showing Pb sites with all other atoms removed.

Have no idea what the reason is.

чт, 2 нояб. 2023 г. в 07:01, Andrew S. Rosen @.***>:

@janosh https://github.com/janosh: I believe this should be closed. Pymatgen can't be expected to read garbage, ambiguous CIFs.

— Reply to this email directly, view it on GitHub https://github.com/materialsproject/pymatgen/issues/3373#issuecomment-1790118450, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFNA6BO4ACBVZHSN7X6LU7TYCMZMBAVCNFSM6AAAAAA5STCF2GVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOJQGEYTQNBVGA . You are receiving this because you were mentioned.Message ID: @.***>