Open Pepe-Marquez opened 1 year ago
Hi @Pepe-Marquez, my guess is that the pymatgen/matminer oxidation state solver is choking on that complex composition. By default, it allows every "site" of a particular species (not strictly sites in this case, but it is the same thing in practice) to have a different oxidation state compatible with its species, so it scales very poorly with the number of "sites".
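To give a rough feel for that scaling, here is a minimal sketch that counts how many distinct oxidation-state assignments exist when every atom of an element may independently take any of its allowed states. It only illustrates the combinatorial growth and is not pymatgen's actual algorithm. The function names and the number of allowed states per element are assumptions made up for the example.

```python
from math import comb


def count_assignments(n_atoms, n_states):
    """Ways to split n_atoms identical atoms among n_states oxidation states
    (a stars-and-bars count of multisets)."""
    return comb(n_atoms + n_states - 1, n_states - 1)


def total_assignments(composition, states_per_element):
    """Product of the per-element counts: the size of a naive search space."""
    total = 1
    for element, n_atoms in composition.items():
        total *= count_assignments(n_atoms, states_per_element[element])
    return total


# Hypothetical numbers of allowed oxidation states (illustrative only,
# not taken from pymatgen's tables).
states = {"Cs": 1, "C": 3, "H": 2, "Br": 2, "I": 2, "N": 3, "Pb": 2}

small = {"Cs": 1, "Pb": 1, "I": 3}  # CsPbI3
large = {"C": 100, "H": 3815, "Br": 21, "I": 279, "N": 2185, "Pb": 100}

print(total_assignments(small, states))  # prints 8
print(total_assignments(large, states))  # many orders of magnitude larger
```

Under these assumptions the search space for the large formula is vastly bigger than for the simple perovskite, which is consistent with the slowdown seen during featurization.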
You can customize the featurization pipeline to circumvent this. We have a specific workaround for structure featurizers, but not for composition only. I have prepared a hack below that disables the one featurizer that uses oxidation states... we are looking to optimise this process in the upcoming release, so keep an eye out!
import pandas as pd
from modnet.models import MODNetModel
from modnet.preprocessing import MODData
from modnet.featurizers.presets import CompositionOnlyMatminer2023Featurizer
from pymatgen.core import Composition

# Start from the composition-only preset and drop the one featurizer
# that relies on oxidation-state guessing.
featurizer = CompositionOnlyMatminer2023Featurizer()
featurizer.composition_featurizers = [
    f for f in featurizer.composition_featurizers
    if f.__class__.__name__ != "IonProperty"
]

data = {
    'composition': ['Cu2ZnSnSe4', 'Cu2ZnSnS4', 'CsPbI3', 'CH3NH3PbI3', 'C100H3815Br21I279N2185Pb100'],
    'target': [1.0, 1.5, 1.78, 1.6, 1.63],
}
df_simple = pd.DataFrame(data)
# Convert the formula strings to pymatgen Composition objects.
df_simple["composition"] = df_simple["composition"].map(Composition)

data = MODData(
    materials=df_simple["composition"],  # you can provide composition objects to MODData
    targets=df_simple["target"],  # you can provide target values to MODData
    target_names=["target"],
    featurizer=featurizer,
)
data.featurize()
This now runs in about 10 seconds on my laptop.
Some more background at #46 (that I had completely forgotten about)
This fixed the error for me. Thanks for the help! Happy to close if you think it's ready
Awesome, thanks for letting us know. I think I'll actually keep it open until we fix it in the default preset.
I would like to run modnet on a dataset in which I have compositions with very complex stoichiometries. One example would be
C100H3815Br21I279N2185Pb100
To reproduce, this could be used as example code:
Am I doing something wrong here? Would there be a workaround to get these complex compositions running smoother through the featurizer?
Thanks!