ppdebreuck / modnet

MODNet: a framework for machine learning materials properties
MIT License
76 stars 32 forks source link

CompositionOnly featurizer preset does not use oxidation states #46

Open ml-evs opened 3 years ago

ml-evs commented 3 years ago

As above. We should probably add the oxidation state featurizers used in the DeBreuck2020Featurizer to the CompositionOnly featurizer too.

ancarnevali commented 2 years ago

Hi! Was this implemented or is it still pending?

ppdebreuck commented 2 years ago

Hi! Thanks for pulling this up. This hasn't been implemented but can easily be added manually. Small example:

from modnet.featurizers.presets import DeBreuck2020Featurizer
from modnet.preprocessing import MODData
from pymatgen.core import Composition

materials = [Composition("SiO2"),Composition("Li20")]
class CompositionOxidFeaturizer(DeBreuck2020Featurizer):
    def __init__(self):
        super().__init__()
        self.structure_featurizers = ()
        self.site_featurizers = ()

data = MODData(materials=materials, featurizer=CompositionOxidFeaturizer())
data.featurize()
print(data.df_featurized)

This will add 9 extra features compared to the current default one. Let me know if this helps. I might try this with some of the matbench tasks and implement this (maybe default).

ml-evs commented 2 years ago

I will just add one extra bit to @ppdebreuck's suggestion, which is to add the fast_oxid attribute to the featurizer, otherwise pymatgen (the underlying library matminer uses for oxidation states) will try to allow for every site in a structure to have a different oxidation state which scales horribly with the number of sites. Doing something like:

class CompositionOxidFeaturizer(DeBreuck2020Featurizer):
    def __init__(self):
        super().__init__()
        self.structure_featurizers = ()
        self.site_featurizers = ()
        self.fast_oxid = True

will use more sensible defaults for oxidation state calculations that you can then test: https://github.com/ppdebreuck/modnet/blob/06ca45f104a697e65d02d5405d26aec78f712972/modnet/featurizers/featurizers.py#L161-L170)

(also see the relevant code in matminer/pymatgen)

ancarnevali commented 2 years ago

Thank you for your inputs! so if I understand correctly, with this class we just add features, and the "base" featurizer is launched with data.featurize() later on. By the way, since this would be a composition-only task, why are there self.structure_featurizers = () and self.site_featurizers = ()? Is it just needed to define them as empty?

ml-evs commented 2 years ago

Thank you for your inputs! so if I understand correctly, with this class we just add features, and the "base" featurizer is launched with data.featurize() later on.

Yep, the base featurizer just knows to take each listed matminer featurizer in turn and apply it to the relevant sites, structure or composition.

By the way, since this would be a composition-only task, why are there self.structure_featurizers = () and self.site_featurizers = ()? Is it just needed to define them as empty?

Yeah, this is just convenience/laziness. You could instead inherit from the base featurizer and list out all of the matminer featurizer classes you wanted to use, but as we have them all already in the DeBreuck2020 preset it is easier base on that and disable the ones we don't want.

ml-evs commented 1 year ago

At some point the IonProperty featurizer in matminer started using oxidation states (maybe it always did). We never ran into an issue with this as the poor scaling was probably being masked by the time taken for other featurizers, or we were only using it on simple compositions. As shown in #126, the featurizers basically hang if a sufficiently complex composition is attempted.

Possible workarounds: