kjappelbaum / mofdscribe

An ecosystem for digital reticular chemistry
https://mofdscribe.readthedocs.io/en/latest/
MIT License
43 stars 7 forks source link

[BRAINSTORM] refactor to use `MOF` objects as inputs #421

Open kjappelbaum opened 1 year ago

kjappelbaum commented 1 year ago

I'm thinking of switching to a different design pattern to make the implementation of new featurizers, especially based on fragments, graphs, and building blocks, easier.

I started introducing more and more conditionals on the input types for featurizers, be it graphs/molecules/structures which make the code really complicated -- especially if we also want to cache as much as possible.

All of this would be much easier if the inputs for the featurizers would always be MOF objects. I already started implementing MOF objects for mofchecker and moffragmentor and they typically have (memoized) methods for computing graph/fragments/hashes/...

If we were to use them, the object would simply "cache" the graph and the featurizer would no longer need to worry about it. Also, the featurizers could simply get the object attributes it needs (and the caller would not need to worry about what to provide, i.e. this logic is only implemented in one place).

In the long run, this might also make the implementation of the datasets easier as, again, the logic is only on the structure object and not partially re-implemented on the datasets.

Advantages

Disadvantages

Usage example

from mofdscribe import MOF
mof = MOF.from_file("some_file.cif")
bu = BUFeaturizer(LSOP())
feats = bu.featurize(mof)

Implementation idea

class MOF: 
    def __init__(self, structure, structure_graph): 
        self.structure = structure 
        self.structure_graph = structure_graph

    @classmethod
    def from_structure(cls, structure): 
        structure_graph = get_structure_graph(structure)
        return cls(structure, structure_graph

    @classmethod
    def from_file(): 
        ... 
    @cached_property
    def fragments(): 
        ...

    @cached_property
    def scaffold_hash():
        ... 

The featurizers would have the following signatures

class GraphFeaturizer: 
    def featurize(self, mof): 
        structure_graph = mof.structure_graph
        return self._featurize(structure_graph)

class StructureFeaturizer: 
    def featurize(self, mof): 
        structure = mof.structure
        return self._featurize(structure)

MatminerFeaturizer = StructureFeaturizer

class SiteFeaturizer: 
      def featurize(self, mof, i): 
        structure = mof.structure
        return self._featurize(structure, structuregraph, i)

class BUFeaturizer: 
    def __init__(self, featurizer):
        if isinstance(featurizer, MultiFeaturizer): 
            raise ValueError("featurizer must be a single featurizer")

    def featurize(self, mof): 
        fragments = mof.fragments

In the last case, for the BUFeaturizer, the featurizer can inspect if it needs to get the molecules/structures/graphs from the fragments and then call the _featurize method of the featurizer passed in the constructor.

The MOFMultipleFeaturizer should probably always loop over structures for featurize_many.

Also, if I'd do something like featurizer = MOFMultipleFeaturizer([BUFeaturizer(LSOP()), BUFeaturizer(Dimensionality()), RACS()]) it should work without problems as all featurizers always accept MOF objects.

from mofdscribe import MOF 
from fastcore.xtras import save_pickle 

for file in files: 
    mof = MOF.from_file(file) 
    fragments = mof.fragment
    save_pickle(fragments) 
    features = featurizer.featurize(MOF)
kjappelbaum commented 1 year ago

also, testing the implementation now, another benefit seems to be that I no longer need to think about the primitive option. From files, this is the default. However, if users already have pymatgen.core.Structure objects we assume that they know what they do