hackingmaterials / matminer

Data mining for materials science
https://hackingmaterials.github.io/matminer/

BaseFeaturizer() doesn't support certain types of featurizers well #135

Closed computron closed 6 years ago

computron commented 6 years ago

Many featurizers don't have a set number or type of features that they return - the features depend on the input. For example, the BagofBonds featurizer (currently a PR) will return one value for each potential bond combination in the material. For Li2O this would mean Li-Li, Li-O, and O-O. But for Li (metal) the only feature is Li-Li. Or for Na2S it is Na-Na, Na-S, S-S. So the features are different for each entry. There are multiple featurizers like this, like ChemicalSRO.

There are two issues here:

  1. You get this clunky situation where feature_labels() only works well after you've called featurize(), and is only pertinent to the most recent featurize() call. This makes the Featurizer a stateful object, which is less desirable and more prone to breakage.
  2. Sometimes you need to override featurize_dataframe() to get this to work.

It would be better if someone could design an improvement to BaseFeaturizer() to support these kinds of features. I should say that text mining handles this kind of use case all the time. For example, the CountVectorizer in scikit-learn will return one "feature" for every distinct word in a block of text, and the features will be different for each text block. The way to do this is to use the fit_transform() method in scikit-learn.

I am wondering if something similar is needed / helpful for some of our featurizers.
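
To make that concrete, here is a minimal sketch of the scikit-learn pattern being referenced (plain sklearn usage, with made-up "documents" standing in for materials; get_feature_names_out() requires scikit-learn >= 1.0):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["LiLi LiO OO", "NaNa NaS SS"]   # stand-ins for two entries
vec = CountVectorizer()
X = vec.fit_transform(docs)             # fit() learns the vocabulary, transform() counts it
print(vec.get_feature_names_out())      # the feature labels are only known after fitting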

WardLT commented 6 years ago

I think this is a good point @computron.

Could we support this case by adding a 'fit' operation that takes a list of entries and then sets the necessary parameters of the featurization (e.g., which bonds to list)?

That would also conceptually separate actions that prepare a featurizer to be used from those used when running/training a model.

Another option that I was considering while reading these PRs (sorry that I have not had time to contribute) was to advocate that all options that affect which features are computed should go in the initializer.
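
To illustrate what such a fit step might look like for something like BagofBonds (a hypothetical sketch, not existing matminer code; it assumes pymatgen Structure inputs):

from itertools import combinations_with_replacement

class BagOfBondsSketch:

    def fit(self, structures):
        # Record every element pair seen anywhere in the training set;
        # this fixes which bond columns the featurizer will report.
        pairs = set()
        for s in structures:
            elems = sorted({el.symbol for el in s.composition})
            pairs.update(combinations_with_replacement(elems, 2))
        self.allowed_bonds_ = sorted(pairs)
        return self

    def feature_labels(self):
        # Labels are determined by fit(), not by the most recent featurize() call
        return ["{}-{} bond fraction".format(a, b) for a, b in self.allowed_bonds_]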


computron commented 6 years ago

It is probably best if we try to list all possible options before deciding on one.

Just to continue listing potential options, another one is to change featurize() so that it returns a NamedTuple of (values, labels). We would get rid of the feature_labels() method altogether - featurize() would return both data and labels. One reason to do this is that the initial sketch of the BaseFeaturizer assumed that the feature labels were an attribute of the featurizer (plus any initialization options). But, after coding many of these featurizers, in many cases they really turn out to be an attribute of the featurizer plus the data thrown into it.

So something like:

values, labels = MyFeaturizer.featurize(data)

or a different usage to highlight the NamedTuple aspect:

x = MyFeaturizer.featurize(data)
values = x.values
labels = x.labels
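
For reference, a minimal sketch of what that return type could look like (the helper methods here are made up for illustration):

from collections import namedtuple

FeaturizeResult = namedtuple("FeaturizeResult", ["values", "labels"])

def featurize(self, comp):
    # Heterogeneous featurizer: one value per bond type actually present
    bonds = self._bonds_in(comp)   # hypothetical helper, e.g. ["Li-Li", "Li-O", "O-O"]
    return FeaturizeResult(values=[self._bond_fraction(comp, b) for b in bonds],
                           labels=bonds)
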
WardLT commented 6 years ago

I agree, listing the options is a good plan. One more to add to the list: having a separate base class for featurizers that do not fit the current pattern well. That makes the list:

  1. No changes. All options that affect featurize, feature_labels, et cetera must be set in the initializer.
  2. Add a training operation (fit) to BaseFeaturizer. The training operation sets attributes of the featurizer that may not be known before training data is available.
  3. Featurize produces both labels and features. Each entry (row) can produce different features, and the feature_labels operation becomes redundant.
  4. No changes to BaseFeaturizer, and we develop a new class. Perhaps featurizers that generate a "bag of words" type feature are indeed different enough to warrant a different API.

Am I missing any?

Of this current list, I think Option 2 is the best (e.g., the fit_transform method proposed by @computron, if I am not misunderstanding his proposal).

The downside of Option 2 is that there are indeed some ML libraries that expect inputs besides rectangular matrices. Some libraries operate on bag-of-words-like representations, others on sequences/graphs. The current BaseFeaturizer does not work well with these. Perhaps those will require a separate base class that does not force outputs to be table-like.

PS. Once we agree on a pattern, I can draft an 'Implementation Guide' for BaseFeaturizer that captures our plan. This will also address issue #93.

dyllamt commented 6 years ago

I looked into the fit_transform method for CountVectorizer. To assign a total number of features across several text documents, the method builds the combined vocabulary over all of the input documents. In the context of featurizers, this is analogous to computing feature_labels for each input sample.

For some featurizers you might be able to infer the number of features (for each sample) from the input data (e.g. many of the structure featurizers create site-specific features, so you could count the number of crystal sites. The SiteStatsFingerprint is a good example of this). However, this is not the general case, and you're back to computing feature_labels for all samples before calling fit_transform.

Therefore, I don't think that there is a way around returning both features and feature_labels for every sample when you have a featurizer that can return heterogeneous features. To prevent GeneralFeaturizer from being a stateful object you'll need to return both features and labels with every featurize call.

However, you bring up a good point that matminer is supposed to generate easy-to-use features, and having featurizers that don't return the same number of features doesn't really fit the bill. A user might have to develop a feature-prep pipeline that combines several site-specific features in a meaningful way (e.g. sum all the site-specific features that are on tetrahedral sites together and then drop all the other features).

To separate featurizers that are "ML ready" from those that may require extra post-processing, I created a HeterogeneousFeaturizer, which is a child of the BaseFeaturizer class. GeneralFeaturizers that are children of this new class do not use feature_labels, but instead return a dictionary of features/labels with featurize. It's easy to generate a sparse DataFrame from a list of dictionaries.
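
For context, the rough shape of that pattern (an illustrative sketch only; the helper name is made up):

import pandas as pd

class HeterogeneousFeaturizerSketch:

    def featurize(self, comp):
        # Return {label: value} instead of a fixed-length list of values
        return {"{}-{}".format(a, b): frac
                for (a, b), frac in self._bond_fractions(comp).items()}

# A list of such dicts becomes a DataFrame with NaN wherever a bond type
# does not occur in a given entry:
# df = pd.DataFrame([f.featurize(c) for c in compositions])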

dyllamt commented 6 years ago

This is just a temporary solution for now, but it offers some advantages, like relieving the need for a feature_labels function in general.

You might have two classes HeterogeneousFeaturizer and HomogeneousFeaturizer that are identical, but allow you to semantically differentiate between different featurizing behavior, so the user knows what they're getting with each featurizer.

WardLT commented 6 years ago

Thanks for this detailed reply, @dyllamt ! I think the HeterogeneousFeaturizer might be a good solution for features that really don't fit the mold of BaseFeaturizer. How rare do you expect that to be?

You make a good analogy in treating the fit function as equivalent to computing labels before features. I agree that there are some examples where figuring out the feature labels ahead of time is easy, and I think this could be the general case. As another example, for the PRDF features you could determine which elements are present in your dataset, which can define both the features that will be computed and their labels.

To give us something concrete to discuss, I'm going to implement the PRDF using the fit approach (Option 2) above. We can then contrast that with your HeterogeneousFeaturizer implementation and, perhaps, we could use that to discuss which approach we like better. [Hopefully, I'll get the time to do that today or early tomorrow]
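
As a quick sketch of that PRDF idea (assuming pymatgen Structure inputs and an illustrative attribute name):

def fit(self, structures, y=None):
    # The elements present in the dataset determine which pair
    # distribution functions - and therefore which labels - get computed.
    self.elements_ = sorted({el.symbol for s in structures for el in s.composition})
    return self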

computron commented 6 years ago

Hi all,

OK, I thought a bit about this. Here's my current suggestion, which echoes Option 2 above but with some differences.

  1. We make all featurizers follow the Transformer pattern in scikit-learn. See these links: http://scikit-learn.org/stable/data_transforms.html and http://www.dreisbach.us/articles/building-scikit-learn-compatible-transformers/ . This will also allow using matminer featurizers in sklearn pipelines if desired: https://signal-to-noise.xyz/post/sklearn-pipeline/ and naturally integrate matminer features with other sklearn tools.

  2. BaseFeaturizer will subclass TransformerMixin from sklearn.

  3. To make this work, the BaseFeaturizer must implement:

    • fit(X matrix): for simple featurizers, this can be empty. For things like BagofBonds, it can generate the list of potential bonds and store as an internal variable.
    • transform(X matrix): this is just like our current "featurize_dataframe()" function, except it takes in a numpy matrix instead of a dataframe & cols.
    • fit_transform (X matrix) - OPTIONAL. One can implement this if there is an efficient process to do both fit and transform, otherwise this is automatically implemented by sequentially calling fit() and transform() (via TransformerMixin).

Some changes that will need to be made to BaseFeaturizer:

  1. Need to add the fit method to BaseFeaturizer. The default implementation can be a no-op (pass), which the simple featurizers will use.
  2. Need to add the transform method to BaseFeaturizer. Currently, I am thinking of a default implementation based on how featurize_dataframe operates (i.e., using the user-defined fit() and featurize() methods instead of expecting the user to implement transform).
  3. Probably change featurize_dataframe to mainly just call transform.
  4. Keep feature_labels around the same as before, but note that one might need to call fit() first (which might set an internal variable used to produce those feature labels). This means that BaseFeaturizer() is stateful - i.e., feature_labels returns the labels from the most recent fit() operation only and might throw an error if it's called before calling fit(). Under the new rules, that's OK.
  5. Note also that, for certain featurizers, featurize() could also throw an error if fit() is not called first. Again, under the new rules, that's OK.
  6. Possibly add a method to BaseFeaturizer called requires_fit() with a default implementation returning False. The more complex featurizers can override this to return True. This might help communicate to the user which featurizers require a fit. It's possible that the BaseFeaturizer's implementation of transform or featurize_dataframe will leverage this function.

For the current set of featurizer implementations, technically one does not need to change anything for the vast majority of featurizers. All that code can stay the same - the fit() method already has a default implementation of pass, the transform() method is handled by the BaseFeaturizer, and fit_transform comes from the TransformerMixin. However, the more complex featurizers like BagofBonds or Dos that are heterogeneous should be rewritten to take advantage of fit() - this will simplify their code and make it more natural, as well as make featurize_dataframe work out-of-the-box.

Note that as a positive result of all these changes, all the matminer featurizers will be compatible with standard sklearn Transformers and use the same language - which sounds to me like the right way to go.
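
To sketch roughly how these pieces could fit together (simplified; the method bodies and the pandas handling are illustrative, not the final implementation):

import pandas as pd
from sklearn.base import TransformerMixin

class BaseFeaturizer(TransformerMixin):

    def fit(self, X, y=None):
        # Default: nothing to learn; complex featurizers override this
        return self

    def featurize(self, x):
        raise NotImplementedError

    def feature_labels(self):
        raise NotImplementedError

    def requires_fit(self):
        # Override to return True if fit() must run before featurize()
        return False

    def transform(self, X):
        # Apply featurize() to each entry (one input column assumed here)
        return [self.featurize(x) for x in X]

    def featurize_dataframe(self, df, col_id):
        # Thin wrapper around transform() that appends labelled columns
        feats = self.transform(df[col_id])
        return df.join(pd.DataFrame(feats, columns=self.feature_labels(),
                                    index=df.index))

fit_transform then comes for free from TransformerMixin, which simply calls fit() followed by transform().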

computron commented 6 years ago

(see also this link re:Pipelines: https://bradzzz.gitbooks.io/ga-seattle-dsi/content/dsi/dsi_05_classification_databases/2.2-lesson/readme.html)

WardLT commented 6 years ago

:+1: I'm on board with this plan.

WardLT commented 6 years ago

A quick thought: is the desired behavior of transform close to what we currently have named featurize_many?

dyllamt commented 6 years ago

I see why you were both in favor of this paradigm now! I didn't see the vision before.

I think you're right about transform and featurize_many @WardLT. sklearn's FeatureUnion might also take care of MultipleFeaturizer.

I noticed that in sklearn's CountVectorizer the fit method is just a call to fit_transform. That might be convenient for BagofBonds and Dos, where computing the features gives you the correct column headers. BagofBonds seems to already be coded in that style.
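
In other words, roughly (sketch only):

def fit(self, X, y=None):
    # Works only if fit_transform is implemented directly (as CountVectorizer
    # does); with TransformerMixin's default fit_transform this would recurse,
    # since that default calls fit() itself.
    self.fit_transform(X)
    return self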

WardLT commented 6 years ago

I'm currently in the process of trying to capture @computron's description with an update of BaseFeaturizer and an implementation of the PRDF. Is anyone else already doing that?

dyllamt commented 6 years ago

Not me right now.

WardLT commented 6 years ago

And, it occurs to me that we should probably make BaseFeaturizer inherit from BaseEstimator from scikit-learn. I think that will provide us the ability to work with Pipelines, with the only additional coding requirement being that we cannot use *args and **kwargs in the initializers (see here). Sound reasonable?
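
For example (sketch; the parameter names are placeholders), the practical requirement is just that every constructor argument is an explicit keyword stored under the same attribute name:

from sklearn.base import BaseEstimator, TransformerMixin

class ExampleFeaturizer(BaseEstimator, TransformerMixin):

    def __init__(self, cutoff=10.0, bin_size=0.1):
        # No *args/**kwargs: BaseEstimator.get_params() introspects the
        # __init__ signature, which is what Pipeline and GridSearchCV rely on.
        self.cutoff = cutoff
        self.bin_size = bin_size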

computron commented 6 years ago

I am mainly chasing around my daughter this long weekend so I am not implementing - @WardLT I think you are good to go!

Subclassing BaseEstimator sounds good to me.

As an aside, at first I thought we might want to have *args and **kwargs in certain Featurizer constructors (e.g., to pass those arguments into various pymatgen objects), but I think we can do without them. For example, we can directly take in a pymatgen object in the constructor and skip the need for passing *args and **kwargs.

computron commented 6 years ago

Closing this now based on https://github.com/hackingmaterials/matminer/pull/156