ppdebreuck / modnet

MODNet: a framework for machine learning materials properties
MIT License

complex compositions take very long to featurize #126

Open Pepe-Marquez opened 1 year ago

Pepe-Marquez commented 1 year ago

I would like to run modnet on a dataset that contains compositions with very complex stoichiometries. One example would be C100H3815Br21I279N2185Pb100.

To reproduce, here is some example code:

import pandas as pd
from modnet.models import MODNetModel
from modnet.preprocessing import MODData
from pymatgen.core import Composition

data = {'composition': ['Cu2ZnSnSe4', 'Cu2ZnSnS4', 'CsPbI3', 'CH3NH3PbI3', 'C100H3815Br21I279N2185Pb100' ],
        'target': [1.0, 1.5, 1.78, 1.6, 1.63]}
df_simple = pd.DataFrame(data)
df_simple["composition"] = df_simple["composition"].map(Composition)

data = MODData(
    materials=df_simple["composition"], # you can provide composition objects to MODData
    targets=df_simple["target"], # you can provide target values to MODData
    target_names=["target"],
)

data.featurize()

Am I doing something wrong here? Would there be a workaround to get these complex compositions to run more smoothly through the featurizer?

Thanks!

ml-evs commented 1 year ago

Hi @Pepe-Marquez, my guess is that the pymatgen/matminer oxidation state solver is choking up on that complex composition. By default, it allows every "site" of a particular species (not strictly sites in this case, but it is the same thing in practice) to have a different oxidation state compatible with its species, so it scales very poorly with number of "sites".
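To give a rough feel for the scaling, here is a back-of-the-envelope sketch. It assumes the solver has to consider every multiset of oxidation states for each element (atoms of one element being interchangeable), which may not match what pymatgen does internally. The per-element state counts in `n_states` are made-up illustrative numbers, not values from pymatgen's tables, and the helper names are hypothetical.

```python
from math import comb


def multiset_count(n_atoms: int, n_states: int) -> int:
    """Ways to give each of n_atoms interchangeable atoms one of n_states states."""
    return comb(n_atoms + n_states - 1, n_states - 1)


def total_assignments(amounts: dict, n_states: dict) -> int:
    """Product of per-element multiset counts for one composition."""
    total = 1
    for element, n_atoms in amounts.items():
        total *= multiset_count(n_atoms, n_states[element])
    return total


# ASSUMPTION: illustrative numbers of allowed oxidation states per element.
n_states = {"C": 3, "H": 2, "N": 3, "Pb": 2, "I": 2, "Br": 2, "Cs": 1}

simple = {"Cs": 1, "Pb": 1, "I": 3}  # CsPbI3
complex_ = {"C": 100, "H": 3815, "Br": 21, "I": 279, "N": 2185, "Pb": 100}

n_simple = total_assignments(simple, n_states)
n_complex = total_assignments(complex_, n_states)
print(f"CsPbI3: {n_simple} candidate assignments")
print(f"complex formula: {n_complex:.2e} candidate assignments")
```

Even with only two or three states per element, the count for the large formula is many orders of magnitude above the simple one, which is consistent with the featurizer appearing to hang.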

You can customize the featurization pipeline to circumvent this. We have a specific workaround for structure featurizers, but not for composition only. I have prepared a hack below that disables the one featurizer that uses oxidation states... we are looking to optimise this process in the upcoming release, so keep an eye out!

import pandas as pd
from modnet.models import MODNetModel
from modnet.preprocessing import MODData
from modnet.featurizers.presets import CompositionOnlyMatminer2023Featurizer
from pymatgen.core import Composition

featurizer = CompositionOnlyMatminer2023Featurizer()
featurizer.composition_featurizers = [f for f in featurizer.composition_featurizers if f.__class__.__name__ != "IonProperty"]

data = {'composition': ['Cu2ZnSnSe4', 'Cu2ZnSnS4', 'CsPbI3', 'CH3NH3PbI3', 'C100H3815Br21I279N2185Pb100' ],
        'target': [1.0, 1.5, 1.78, 1.6, 1.63]}
df_simple = pd.DataFrame(data)
df_simple["composition"] = df_simple["composition"].map(Composition)

data = MODData(
    materials=df_simple["composition"], # you can provide composition objects to MODData
    targets=df_simple["target"], # you can provide target values to MODData
    target_names=["target"],
    featurizer=featurizer,
)

data.featurize()

This now runs in about 10 seconds on my laptop.
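If you want to reuse the filtering step, it can be wrapped in a small helper that also reports what was actually removed, so a typo in the class name does not go unnoticed. This is only a sketch: `drop_featurizers` is a hypothetical helper, not part of modnet, and the stand-in classes below merely mimic the `composition_featurizers` attribute used above so the snippet runs without modnet or matminer installed.

```python
from types import SimpleNamespace


def drop_featurizers(preset, excluded_names):
    """Remove featurizers by class name in place; return the names removed."""
    excluded = set(excluded_names)
    kept, removed = [], []
    for featurizer in preset.composition_featurizers:
        name = type(featurizer).__name__
        (removed if name in excluded else kept).append(featurizer)
    preset.composition_featurizers = kept
    return [type(f).__name__ for f in removed]


# Stand-in classes (ASSUMPTION: they only imitate the real featurizer names).
class ElementProperty: ...
class IonProperty: ...
class Stoichiometry: ...


preset = SimpleNamespace(
    composition_featurizers=[ElementProperty(), IonProperty(), Stoichiometry()]
)
removed = drop_featurizers(preset, {"IonProperty"})
remaining = [type(f).__name__ for f in preset.composition_featurizers]
print("removed:", removed)
print("remaining:", remaining)
```

With the real preset you would pass the `featurizer` object from the snippet above instead of `preset`; an empty `removed` list means nothing matched.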

ml-evs commented 1 year ago

Some more background at #46 (that I had completely forgotten about)

Pepe-Marquez commented 1 year ago

This fixed the error for me. Thanks for the help! Happy to close if you think it's ready

ml-evs commented 1 year ago

Awesome, thanks for letting us know. I think I'll actually keep it open until we fix it in the default preset.