materialsproject / pymatgen

Python Materials Genomics (pymatgen) is a robust materials analysis code that defines classes for structures and molecules with support for many electronic structure codes. It powers the Materials Project.
https://pymatgen.org
Other
1.48k stars 855 forks source link

Chemical system depends on assigned charges #2248

Closed CompRhys closed 2 years ago

CompRhys commented 3 years ago
>>> from pymatgen.core.composition import Composition
>>> Composition({"Fe3+": 2, "O2-":3}).chemical_system
'Fe3+-O2-'
>>> Composition({"Fe": 2, "O":3}).chemical_system
'Fe-O'
>>> Composition({"Fe3+": 2, "Fe2+": 1, "O2-":4}).chemical_system
'Fe2+-Fe3+-O2-'

Not sure if this is a bug or not, it not the behaviour i would have expected so if planned I will look at doing a docstring improvement.

mkhorton commented 3 years ago

Definitely not desired behaviour!

That was my code so mea culpa. Was there a reason you closed this issue?

CompRhys commented 3 years ago

It was related to much hash concerns about Site and there the takeaway for me was that because two compositions are not equal if they have assigned charges then it doesn't matter that the hashes aren't equal #2235. The chemical system is related to this because a few months back I needed a hash with less collisions for my PatchedPhaseDiagram and so the hash for Composition was changed to hash(frozenset(Composition._data.keys())) as it was barely slower than the sum of species charges but a lot more discriminating. In the end because so few sites are disordered and the discriminative hash is ~10% slower I decided against the PR for Site.

I was going to submit a PR later to add this example to the docstring:

>>> Composition({"Si4+": 1, "O2-":2}).chemical_system
'O2--Si4+'
>>> Composition({"Si": 1, "O":2}).chemical_system
'O-Si'
mkhorton commented 3 years ago

A docstring improvement would be welcome (given that this is the current behaviour), but I do think we should change this to just return the elements. I'm less concerned with the hashing and more with expected/correct behaviour.

CompRhys commented 3 years ago
    @property
    def chemical_system(self) -> str:
        """
        Get the chemical system of a Composition, for example "O-Si" for
        SiO2. Chemical system is a string of a list of elements
        sorted alphabetically and joined by dashes, by convention for use
        in database keys.
        """
        return "-".join(sorted([el.symbol for el in self.elements]))

gives

Python 3.7.10 (default, Jun  4 2021, 14:48:32) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pymatgen.core.composition import Composition
>>> Composition({"Si": 1, "O":2}).chemical_system
'O-Si'
>>> Composition({"Si4+": 1, "O2-":2}).chemical_system
'O-Si'

haven't tested speed but Species.__str__ accesses symbol so it shouldn't be slower.

If happy with this I can make PR and add a test for charged composition chemical systems?

shyuep commented 3 years ago

I think we should keep this definition of chemsys. We either treat Species as distinct or not by default. Pymatgen treats them as different. And there are instances where we need to define Fe2+-Fe3+-O as opposed to just Fe-O system. If we want to ignore oxidation states, we can make it a different property or method.

CompRhys commented 3 years ago

would it not make more sense to update chemical_system to match it's current docstring and then add another option for the species system @mkhorton has already said this definition was a bug?

also in the species system just as a matter to ease of manipulation using regex the use of dash as a deliminator seems like a poor choice as we have things like "O2--Cu2+".split(deliminator) = ["O2", "", "Cu2+"] which could cause hidden errors if like me you think the current behaviour is unexpected.