matthewwardrop / formulaic

A high-performance implementation of Wilkinson formulas for Python.
MIT License

Suggestions for creating `get_feature_names_out` for Scikit Learn ColumnTransformer compatibility? #176

Closed by GuiMarthe 3 months ago

GuiMarthe commented 4 months ago

Hey folks! Awesome library being built here! I'm trying to set up compatibility with scikit-learn's ColumnTransformer, which returns NumPy arrays by default and provides a get_feature_names_out method for when you want to inspect the transformed output as a pandas DataFrame.

How could that be built? I see the documentation suggests a Transformer implementation, so a get_feature_names_out method could be added right there.

GuiMarthe commented 4 months ago

Ah, I think I've found an ok implementation.

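from sklearn.exceptions import NotFittedError

# FormulaicTransformer is the transformer class suggested in the formulaic docs;
# this assumes its fit() stores the fitted ModelSpec on `self.model_spec_`.
def formulaic_get_feat_names_out(self, names):
    # `names` (the input feature names) is accepted for sklearn compatibility but unused.
    if not hasattr(self, 'model_spec_'):
        raise NotFittedError('Model not fitted yet. Unable to get feature names.')
    # Flatten the per-term column lists into one list of output column names.
    return [column for term in self.model_spec_.structure for column in term.columns]

FormulaicTransformer.get_feature_names_out = formulaic_get_feat_names_out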

I'm not sure whether I need to handle names when they are given and the model is not fitted, which I think is the case for sklearn's expected implementation. But in this case it works.

matthewwardrop commented 3 months ago

Hi @GuiMarthe ,

Thanks for reaching out! I don't use sklearn much in my day-to-day work. Is this just a method I should add to the example that makes it also work with the ColumnTransformer interface?

The easiest way to use the ModelSpec to get column names is just: model_spec.column_names. See https://matthewwardrop.github.io/formulaic/guides/model_specs/#anatomy-of-a-modelspec-instance for more details.
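As a rough illustration of the column_names suggestion, the method could read the names straight from the fitted spec. This is only a sketch under stated assumptions: ModelSpecFeatureNamesMixin and DemoTransformer are hypothetical names, the transformer is assumed to store its fitted ModelSpec on self.model_spec_, and both the exception class and the spec object are local stand-ins so the snippet runs without formulaic or scikit-learn installed.

```python
from types import SimpleNamespace


class NotFittedError(ValueError, AttributeError):
    """Local stand-in for sklearn.exceptions.NotFittedError."""


class ModelSpecFeatureNamesMixin:
    """Hypothetical mixin; assumes fit() stored a ModelSpec on `self.model_spec_`."""

    def get_feature_names_out(self, input_features=None):
        # `input_features` is accepted for scikit-learn API compatibility but
        # ignored: the output names come entirely from the fitted spec.
        if not hasattr(self, "model_spec_"):
            raise NotFittedError("Model not fitted yet. Unable to get feature names.")
        return list(self.model_spec_.column_names)


class DemoTransformer(ModelSpecFeatureNamesMixin):
    """Hypothetical transformer used only to exercise the mixin."""


demo = DemoTransformer()
# Stand-in for a fitted ModelSpec; a real one would supply `column_names` itself.
demo.model_spec_ = SimpleNamespace(column_names=("Intercept", "a", "b", "a:b"))
print(demo.get_feature_names_out())  # ['Intercept', 'a', 'b', 'a:b']
```

In real use you would import NotFittedError from sklearn.exceptions instead of redefining it. scikit-learn's own transformers return an array of strings rather than a list, so wrapping the result in numpy.asarray(..., dtype=object) may match the expected interface more closely.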

matthewwardrop commented 3 months ago

Ah... I see it documented here: https://scikit-learn.org/stable/glossary.html#term-get_feature_names_out . I'll add it to the example.