Sorry to tack on to an already extensive issue, but I've got another possible option and another related problem to consider.
`BaseEstimator` objects, including all our estimators, have a `_check_feature_names` method that can set feature names from a dataframe. I edited the original post to include this as Option 4.

However, while experimenting with this I discovered...
Successfully adding feature names to the estimator introduces new warnings when calling `predict`. This is because, like `fit`, the transformation converts dataframe inputs to arrays. When `predict` is run without feature names on a model that was fitted with feature names, `sklearn` complains.
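To illustrate with a plain `KNeighborsRegressor` (not our estimators, so treat this as a sketch of the behavior rather than our exact failure):

```python
import pandas as pd
from sklearn.datasets import load_linnerud
from sklearn.neighbors import KNeighborsRegressor

X, y = load_linnerud(return_X_y=True, as_frame=True)
est = KNeighborsRegressor().fit(X, y)  # fitted with feature names from the dataframe

# Passing a bare array (which is what our transformed X effectively is) warns along
# the lines of: "X does not have valid feature names, but KNeighborsRegressor was
# fitted with feature names"
est.predict(X.to_numpy())
```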
Short of overriding `predict` to skip the check for feature names (definitely not a good approach), I'm not sure there's any way to solve this secondary issue other than modifying `X` between transforming it and passing it in to `super().predict()`. So with that in mind, maybe Option 3 (subclassing `ndarray` to add a `columns` attribute and make it look like a dataframe) is the only remaining solution that tackles both problems?
It's possible I'm too deep into this issue and I'm just getting tunnel vision, so please check my logic and let me know if there might be other solutions I haven't considered.
@aazuspan, such a great writeup of the issue and possible solutions. It takes a lot of time to put this much effort into an issue like this, so I really appreciate it. I think I fully understand the issue and offer a few responses:
> I think we should move the actual transformation out of the `fit` method for each estimator and into a `fit` method for `TransformedKNeighborsMixin`

Yes, I agree with this approach. Just for my own edification, because now both superclasses (`TransformedKNeighborsMixin` and `IDNeighborsRegressor`) will have `fit` methods and you'll want to call `TransformedKNeighborsMixin.fit` first, will you call that explicitly (i.e. `TransformedKNeighborsMixin.fit(self, X)`) and then call `super().fit(X)`? But the return of `fit` should be the estimator itself, so I'm a bit confused how this would work. Could you instead just create the `TransformedKNeighborsMixin.transform` method?
> I think that leaves us with a few options ...

This one took me a while to noodle on. Of the three options, I feel that option 3 (subclassing `np.ndarray` using the `RealisticInfoArray` pattern on the page you linked) might be the best choice. I like that it doesn't require any tweaks to the `sklearn` code at all and it should be transparent to the user. (While I was writing, I just saw your additional post which narrows in on this remedy!) I do like this approach (option 3) better than option 4.

Could this pattern also give us opportunities to store the dataframe index (IDs) as another attribute if they'd be lost otherwise with regular arrays?
> I wonder if we should prioritize getting those checks to pass before we add any more functionality

I think this is a splendid idea. Given that it's already popped up two issues, it seems like an important step to get into place earlier rather than later. (Just uncommented the compatibility test and, oof, that `GNNRegressor` is going to take some work!)

Correct me if I'm wrong, but it doesn't seem like `check_dataframe_column_names_consistency` is actually part of the suite of checks that gets run on these estimators? How did you know that this would have tripped that check?
> Just for my own edification, because now both superclasses (`TransformedKNeighborsMixin` and `IDNeighborsRegressor`) will have `fit` methods and you'll want to call `TransformedKNeighborsMixin.fit` first, will you call that explicitly (i.e. `TransformedKNeighborsMixin.fit(self, X)`) and then call `super().fit(X)`?

My thinking was that we'll call `super().fit` in each case and rely on MRO to fit them in the order below (using `EuclideanKNNRegressor` as an example):

1. `EuclideanKnnRegressor.fit` - Sets the `transform_` attribute.
2. `IDNeighborsRegressor.fit` - Potentially sets the `index_` attribute.
3. `TransformedKNeighborsMixin.fit` - Checks for the `transform_` attribute, applies the transformation, and in the case of dataframes, stores the transformed `X` in our array subclass.
4. `KNeighborsRegressor.fit` - Actually does the fitting.

Admittedly, it is a little tough to track the flow of data there, but I think this will end up cleaner than a more functional approach, since we'll need to run step 3 when we call `fit`, `predict`, or `kneighbors` on all of the transformed estimators.
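To make that chain concrete, here's a rough sketch of the structure I'm picturing (using `StandardScaler` as a stand-in for our transformers, so the details are illustrative rather than the actual implementation):

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler


class IDNeighborsRegressor(KNeighborsRegressor):
    def fit(self, X, y):
        # Step 2: potentially set an index_ attribute, then continue down the MRO.
        return super().fit(X, y)


class TransformedKNeighborsMixin(KNeighborsRegressor):
    def fit(self, X, y):
        # Step 3: apply the stored transformer (and, for dataframes, wrap the result
        # in our array subclass) before handing off to KNeighborsRegressor.fit.
        return super().fit(self.transform_.transform(X), y)


class EuclideanKNNRegressor(IDNeighborsRegressor, TransformedKNeighborsMixin):
    def fit(self, X, y):
        # Step 1: set the transform_ attribute, then let super() walk the MRO.
        self.transform_ = StandardScaler().fit(X)
        return super().fit(X, y)
```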
> Could you instead just create the `TransformedKNeighborsMixin.transform` method?

I think moving step 3 above into a single method like this is a good idea, which can be called from `fit`, `predict`, or `kneighbors`.
> I do like this approach (option 3) better than option 4.

I'm on the same page with option 3, and after putting together a quick implementation, I think it's a pretty clean solution overall. It was going to be a lot to paste in here, so I threw it into a Gist if you want to take a look and let me know if you have any thoughts!

Any preference on the name for the `ndarray` subclass? As you'll see, my first thought was `NamedFeatureArray`, but I'm open to ideas!
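For anyone following along without opening the Gist, the core of the idea is just numpy's standard subclassing pattern with a `columns` attribute bolted on (a minimal sketch, not the full implementation):

```python
import numpy as np


class NamedFeatureArray(np.ndarray):
    """An ndarray that carries a `columns` attribute, so sklearn will find feature
    names on it the same way it does for a dataframe."""

    def __new__(cls, array, columns=None):
        obj = np.asarray(array).view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.columns = getattr(obj, "columns", None)
```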
> Could this pattern also give us opportunities to store the dataframe index (IDs) as another attribute if they'd be lost otherwise with regular arrays?
Good idea! My loose thinking is that we can store IDs as a fit attribute on the model before the data is transformed, but there may be a snag there that would be better handled by storing it on the arrays. I'll keep this in mind.
> Given that it's already popped up two issues, it seems like an important step to get into place earlier rather than later.
Great! I have a working fix for this issue using option 3 (pending your feedback on names and implementation), but I suppose I should probably hold off on making a PR... Don't want to dig us into a deeper hole.
> Correct me if I'm wrong, but it doesn't seem like `check_dataframe_column_names_consistency` is actually part of the suite of checks that gets run on these estimators?

You're 100% right. I assumed `check_estimator` ran everything in that module, and I'm not totally sure why it doesn't... I suppose we'll have to run some manually then, unless there's some configuration I'm missing?
> After putting together a quick implementation, I think it's a pretty clean solution overall

Once again, you amaze me 🤯. Super elegant solution and neat way of taking advantage of MRO. (I do always get a little baffled by calling `super()` in a superclass, in this case thinking that it's going to call the superclass of `TransformedKNeighborsMixin`, but it's really calling the next-in-line superclass of `EuclideanKNNRegressor`, right? In this way, it's just picking up information as it goes down the MRO chain. Neat!)

I like the private `_transform` method in that class as well - that looks like a good use.
> As you'll see, my first thought was `NamedFeatureArray`

That name sounds perfectly fine to me - faithful to the concept of features in `sklearn`.
> Good idea! My loose thinking is that we can store IDs as a fit attribute on the model before the data is transformed
You're too kind ... your solution (now that I understand it) seems like a better approach.
> Great! I have a working fix for this issue using option 3 (pending your feedback on names and implementation), but I suppose I should probably hold off on making a PR
I'll leave this decision up to you. It seems like your fix here resolves a couple of the checks, so I can't imagine that you mess anything up by creating a PR for this issue before tackling the checks, but you're the better judge here.
> I suppose we'll have to run some manually then, unless there's some configuration I'm missing

I didn't see anything that includes that check (along with a few others) as part of a wrapping function like `yield_all_checks`. I'm not sure why they are left out of the set of checks.
Thanks for the deep thinking on this one.
> I do always get a little baffled by calling `super()` in a superclass, in this case thinking that it's going to call the superclass of `TransformedKNeighborsMixin`, but it's really calling the next-in-line superclass of `EuclideanKNNRegressor`, right? In this way, it's just picking up information as it goes down the MRO chain.

Yeah, trying to figure out `super` with multiple inheritance can be a little mind-bending, but I think you've got it. I think the trick is that in each of those calls to `fit`, `self` is still an instance of `EuclideanKnnRegressor`, which is how `super` is able to access the next method in the MRO line.
> It seems like your fix here resolves a couple of the checks, so I can't imagine that you mess anything up by creating a PR for this issue before tackling the checks
Thanks for helping me think it through--wanted to make sure I wasn't shooting us in the foot just to get a quick fix merged!
> Thanks for the deep thinking on this one.
Likewise! I'm mostly used to working on code in a vacuum, so being able to bounce ideas around is a big help.
Resolved by #22
Hey @aazuspan, continuing my bad habit of responding to already closed issues ...
I found this video earlier this week. The main point of the video is that all transformers should support column names if either the global `set_config(transform_output="pandas")` or the transformer-specific `transformer.set_output(transform="pandas")` is set and the passed `X` is a dataframe. What was interesting to me was that some transformers (like `PolynomialFeatures`) create new output names because they set more columns as a result of the fit and transform.
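For reference, the two configuration options look like this with a builtin transformer (ours will need the changes below before this works for them):

```python
from sklearn import set_config
from sklearn.datasets import load_linnerud
from sklearn.preprocessing import StandardScaler

X, y = load_linnerud(return_X_y=True, as_frame=True)

# Global: ask every transformer to return dataframes
set_config(transform_output="pandas")

# Per-transformer: configure a single instance
scaler = StandardScaler().set_output(transform="pandas")
print(type(scaler.fit_transform(X)))  # <class 'pandas.core.frame.DataFrame'>
```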
After a fairly deep dive and using `PolynomialFeatures` as inspiration, I think I understand how this works a bit better. I think, at the very least, we need to support column names from our transformers (including the ones that transform the `X` array into reduced-dimension arrays like `CCATransformer`). I think three main things need to happen:

1. Add `_SetOutputMixin` as a base class on our transformers so that `set_output` is available to us.
2. Call `self._check_feature_names` or `self._validate_data` in `fit` with `reset=True`. This will set `self.feature_names_in_` correctly if `X` is a dataframe.
3. Create a `get_feature_names_out` method on each transformer that correctly sets the output transform names. Without it, `transform` complains as I think you found out.

What I'm not entirely clear on is whether this obviates the need of `NamedFeatureArray`. I think it's entirely possible that you took an even deeper dive than I did, so it's possible that I'm still missing something. I'm going to keep this issue closed for now, but let me know what you think.
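Roughly, I'm imagining each transformer ending up with this shape (a loose sketch with a made-up `SketchTransformer` to show where the three pieces land, not working sknnr code):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin  # TransformerMixin brings in _SetOutputMixin in recent sklearn


class SketchTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # (2) _validate_data with reset=True sets n_features_in_ and, for dataframe
        # inputs, feature_names_in_.
        X = self._validate_data(X, reset=True)
        # ... actual fitting would happen here ...
        return self

    def transform(self, X):
        # Placeholder transformation; the real transformers do their projection here.
        return self._validate_data(X, reset=False)

    def get_feature_names_out(self, input_features=None):
        # (3) with this defined, set_output(transform="pandas") from (1) is usable.
        return np.asarray([f"sketch{i}" for i in range(self.n_features_in_)], dtype=object)
```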
Great point! I came across some articles mentioning that sklearn transformers support dataframes as a config option while researching this, but wasn't thinking about our transformers as part of the public API at that point. If we want them to be usable outside of our estimators (which I'm pretty sure we do), I think you're 100% right that they need to support those config options.
> Add `_SetOutputMixin` as a base class on our transformers so that `set_output` is available to us.

After poking around, it looks like all `BaseTransformer` subclasses, including ours, already subclass `_SetOutputMixin`. ~~In fact, `StandardScalerWithDOF` and `CCATransformer` already have `set_output` methods. `MahalanobisTransformer` doesn't, but I haven't totally worked out why yet. I think it comes down to this `_auto_wrap_is_configured` check, but I haven't wrapped my head around that yet.~~
EDIT: In order for a transformer that subclasses `_SetOutputMixin` to access the `set_output` method, it needs to pass the `_auto_wrap_is_configured` check, which requires that the transformer has a `get_feature_names_out` attr. So I think solving that first will automatically give us access to `set_output`.
> Call `self._check_feature_names` or `self._validate_data` in `fit` with `reset=True`. This will set `self.feature_names_in_` correctly if `X` is a dataframe.

You may be a step ahead of me, but for this to work we need our transformer to return arrays when given arrays and dataframes when given dataframes, right? Are we doing this by calling `set_output` on the transformer based on the input `X` type in `TransformedKNeighborsMixin._apply_transform`? That was my first thought, but I haven't worked through all the potential difficulties there.
> Create a `get_feature_names_out` method on each transformer that correctly sets the output transform names. Without it, `transform` complains as I think you found out.

I noticed that `StandardScalerWithDOF` already gets this method from `OneToOneFeatureMixin`, but I'm assuming we can't use that for our transformers that apply dimensionality reduction?
EDIT: To use `OneToOneFeatureMixin.get_feature_names_out`, an estimator must set `n_features_in_` when it's fit. Aside from `StandardScalerWithDOF`, none of our transformers do this because they never call `fit` from a superclass. We can fix this by running `self._validate_data` at the start of each custom `fit` method which, among other things, sets that attr. It can also be used to validate data types and other aspects of the data, so maybe we can just copy the `StandardScaler` implementation? In any case, I'm pretty sure we need to set this attr regardless of how we handle `get_feature_names_out`, and this will probably show up in the estimator checks in #21 as well.
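As a tiny illustration of that last point (a sketch, not our code): once `_validate_data` has set `n_features_in_` (and `feature_names_in_` for dataframes), `OneToOneFeatureMixin.get_feature_names_out` works with no extra code for the one-to-one transformers:

```python
from sklearn.base import BaseEstimator, OneToOneFeatureMixin, TransformerMixin
from sklearn.datasets import load_linnerud


class PassthroughTransformer(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # Sets n_features_in_ (and feature_names_in_ when X is a dataframe),
        # which is all OneToOneFeatureMixin needs.
        self._validate_data(X, reset=True)
        return self

    def transform(self, X):
        return self._validate_data(X, reset=False)


X, _ = load_linnerud(return_X_y=True, as_frame=True)
print(PassthroughTransformer().fit(X).get_feature_names_out())
# ['Chins' 'Situps' 'Jumps']
```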
> What I'm not entirely clear on is whether this obviates the need of `NamedFeatureArray`

Good question! I think if we can get our transformers to always respect their input data types within the estimators (without requiring users to set any config), we should be able to remove `NamedFeatureArray` because our `X_transformed` will now be a dataframe if a dataframe goes in.
@aazuspan, great deep dive!
> In order for a transformer that subclasses `_SetOutputMixin` to access the `set_output` method, it needs to pass the `_auto_wrap_is_configured` check, which requires that the transformer has a `get_feature_names_out` attr. So I think solving that first will automatically give us access to `set_output`.
Yes, that was the same conclusion that I came to as well.
> To use `OneToOneFeatureMixin.get_feature_names_out`, an estimator must set `n_features_in_` when it's fit. Aside from `StandardScalerWithDOF`, none of our transformers do this because they never call `fit` from a superclass. We can fix this by running `self._validate_data` at the start of each custom `fit` method which, among other things, sets that attr. It can also be used to validate data types and other aspects of the data, so maybe we can just copy the `StandardScaler` implementation? In any case, I'm pretty sure we need to set this attr regardless of how we handle `get_feature_names_out`, and this will probably show up in the estimator checks in #21 as well.
Just so I'm clear, this is an explanation of how this works for those estimators that can set the output feature names to be the same as the input feature names, correct? In our case, this would (currently) be `StandardScalerWithDOF` and `MahalanobisTransformer`, right? You're not suggesting that we use the `OneToOneFeatureMixin` on `CCATransformer` or `CCorATransformer`, correct? Only that we need to ensure that we call `self._validate_data` in their `fit` methods, right? That was where I had landed as well.
If you think you have a clear understanding of what needs to be done, I'd love for you to take a first stab at this. But I'm happy to circle back to this as well.
> In our case, this would (currently) be `StandardScalerWithDOF` and `MahalanobisTransformer`, right? You're not suggesting that we use the `OneToOneFeatureMixin` on `CCATransformer` or `CCorATransformer`, correct? Only that we need to ensure that we call `self._validate_data` in their `fit` methods, right? That was where I had landed as well.

Exactly! Any thoughts on how to implement `get_feature_names_out` for `CCA` and `CCorA`? What do you think about using `ClassNamePrefixFeaturesOutMixin` like `PCA` does, where the feature names would be sequentially prefixed based on the name of the transformer?
> If you think you have a clear understanding of what needs to be done, I'd love for you to take a first stab at this.
Happy to! I think I have a relatively clear picture of how this will work, although that may change once I get into the nitty gritty details. One thing I'm particularly unsure on is how best to test this, so let me know if you have any thoughts there. In any case, I'll hold off until MSN is merged since that will be affected.
> What do you think about using `ClassNamePrefixFeaturesOutMixin` like `PCA` does, where the feature names would be sequentially prefixed based on the name of the transformer?
That's perfect! Exactly what I was thinking in terms of naming.
I just noticed that the `ClassNamePrefixFeaturesOutMixin.get_feature_names_out` method requires that an `_n_features_out` attr is set on the transformer. For `PCA`, this is handled with this property.

My understanding is that we plan to make the number of output components (is there a more accurate term?) configurable for the `CCA` and `CCorA` ordinations, right? In that case, I imagine we would set an attribute when the transformer is instantiated, similar to `PCA`, and we could reference that for `_n_features_out`.

But maybe there's a more direct way we can do this now, just by checking the shape of an attribute on the transformer after it's fit. This is more similar to how `PCA` checks the shape of its `components_` attr.

Since you have a much better idea of the inner workings of these transformers, what do you think about these options, and if you want to go with the second one, can you point me in the direction of attrs that would store the output shapes for `CCATransformer` and `CCorATransformer`?
> Since you have a much better idea of the inner workings of these transformers, what do you think about these options, and if you want to go with the second one, can you point me in the direction of attrs that would store the output shapes for `CCATransformer` and `CCorATransformer`?

Great question! (and one I've meant to circle back to ...). For `CCATransformer`, after the transformer is fit, we can use `self.cca_.eigenvalues.shape[0]` (it's a 1-d numpy array, so we could use `len` if preferred). And your intuition is right on, we'll want to have an option to set the number of output axes to use, so I think the logic goes like this:
1. An `__init__` in `CCATransformer` that sets an attribute (something like `n_axes`) which would default to `None`.
2. In `fit`, either modify `self.n_axes` or create a new estimator attribute that stores `self.cca_.eigenvalues.shape[0]` if `self.n_axes` is `None`, otherwise stores `min(self.n_axes, self.cca_.eigenvalues.shape[0])`. This step may not be entirely necessary and we could just code this into the property below.
3. A `_n_features_out` property that references the attribute from above (or implements that logic).
4. A `transform` that only uses up to `_n_features_out` axes.

For `CCorATransformer`, I think it's probably safest to use `self.ccora_.projector.shape[0]`. There is a property in there called `n_vec` which would return the same number (and is used in `projector`), but I don't know how much of the internals of `CCorA` (and, for that matter, `CCA`) we want to expose. Because we already use `projector` in `transform`, that seems safest to me.
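To sketch that logic out in code (stand-in math only; the real implementation would read `self.cca_.eigenvalues` after the existing fitting, and the class and attribute names here are placeholders):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class CCATransformerSketch(TransformerMixin, BaseEstimator):
    def __init__(self, n_axes=None):
        self.n_axes = n_axes

    def fit(self, X, y=None):
        X = self._validate_data(X, reset=True)
        # Stand-in for the real ordination: pretend every input feature yields an axis.
        self.eigenvalues_ = np.ones(X.shape[1])
        return self

    @property
    def _n_features_out(self):
        # ClassNamePrefixFeaturesOutMixin.get_feature_names_out would read this.
        n_eig = self.eigenvalues_.shape[0]
        return n_eig if self.n_axes is None else min(self.n_axes, n_eig)

    def transform(self, X):
        check_is_fitted(self)
        X = self._validate_data(X, reset=False)
        # Stand-in projection, sliced to the first _n_features_out axes.
        return X[:, : self._n_features_out]
```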
(I can't promise that I won't go back and fiddle a bit with the API of both of these classes, including adding a `projector` property to `CCA`, but to keep you going, I won't do anything now.)
Hopefully that gives you enough to go on now?
@aazuspan, I thought that it might be better to track the dimensionality reduction in a separate issue [#33], given that there are more considerations than just naming the axes. For now, if you just want to go with the properties I named above to get this working, I think that would be better. Thoughts?
That's exactly what I was looking for, thanks! I agree that we should track dimensionality reduction separately, but can move ahead here.
A couple more questions. First, I set up a quick test for this that fits each transformer (currently just using the `linnerud` dataset) and compares the shape of `get_feature_names_out` to the number of features in the `X` data, assuming that they should be the same.
```python
import pytest
from sklearn.datasets import load_linnerud


@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer):
    """Test that all transformers get feature names out."""
    X, y = load_linnerud(return_X_y=True)
    feature_names = transformer.fit(X=X, y=y).get_feature_names_out()
    assert feature_names.shape == (X.shape[1],)
```
This fails for `CCATransformer`, and after playing around with it, it looks like `CCA` will always reduce dimensionality by one feature. Am I understanding that right?

If that's the case, I'll probably switch to testing each transformer's feature names separately so we can also confirm that the names are correct, too. Does that sound like a good plan?
Second question: how do you feel about the fact that the output feature names for `CCATransformer` and `CCorATransformer` will be `ccatransformer0` and `ccoratransformer0`, respectively (because they subclass `ClassNamePrefixFeaturesOutMixin`)? Those are obviously a little wordy and it feels kind of odd to include `transformer` in the feature names. We could think about shortening the names of the transformers, making them more akin to `PCA`, although we would obviously have to distinguish our `CCA` from the builtin `CCA`. We don't necessarily need to decide on this right now, unless we wanted to modify the implementation of `get_feature_names_out` to shorten feature names without renaming classes.
> This fails for `CCATransformer`, and after playing around with it, it looks like `CCA` will always reduce dimensionality by one feature. Am I understanding that right?

`CCA` shouldn't always reduce dimensionality by one - it should be data-dependent, e.g.:
```python
import pandas as pd
from sklearn.datasets import load_linnerud
from sknnr.transformers import CCATransformer

X, y = load_linnerud(return_X_y=True)
txfr = CCATransformer().fit(X, y)
print(X.shape[1], txfr.cca_.eigenvalues.shape[0])
# prints (3, 2)

X = pd.read_csv("../tests/data/moscow_env.csv").iloc[:, 1:]
y = pd.read_csv("../tests/data/moscow_spp.csv").iloc[:, 1:]
txfr = CCATransformer().fit(X, y)
print(X.shape[1], txfr.cca_.eigenvalues.shape[0])
# prints (28, 28)
```
Was there something in the code that made you think it would always reduce dimensionality by one? It very well could be that I'm overlooking something!
Another question - did this not fail for `CCorATransformer`? This is where the statistical fit test usually reduces the number of axes quite a bit.
> Second question: how do you feel about the fact that the output feature names for `CCATransformer` and `CCorATransformer` will be `ccatransformer0` and `ccoratransformer0`, respectively (because they subclass `ClassNamePrefixFeaturesOutMixin`)?
Yeah, not ideal, eh? My preference would be that the names were `cca0` and `ccora0`, but I understand the issues that causes with renaming those classes (and the fact that our `CCorA` is really supposed to produce the same results as sklearn's `CCA`). I don't think I'm in favor of renaming the transformers. Does it feel too hacky to post-fix the output of `ClassNamePrefixFeaturesOutMixin.get_feature_names_out` to splice out `transformer`? Having said that, I can live with whatever feels right to you (even a rename of the transformer class names).
I see, thanks for explaining!
> Was there something in the code that made you think it would always reduce dimensionality by one? It very well could be that I'm overlooking something!
No, this was my very naive empirical test of throwing a bunch of randomized numpy arrays of different shapes at it and checking the outputs!
Given that no dimensionality reduction occurs with the Moscow data, how do you feel about this as a test for `get_feature_names_out`?
```python
@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer, moscow_euclidean):
    """Test that all transformers get feature names out."""
    X = moscow_euclidean.X
    y = moscow_euclidean.y
    feature_names = transformer.fit(X=X, y=y).get_feature_names_out()
    assert feature_names.shape == (X.shape[1],)
```
> Another question - did this not fail for `CCorATransformer`? This is where the statistical fit test usually reduces the number of axes quite a bit.

It didn't, at least not with `linnerud` or the `np.random.normal` arrays I was using.
> Does it feel too hacky to post-fix the output of `ClassNamePrefixFeaturesOutMixin.get_feature_names_out` to splice out `transformer`?

Actually, just implementing `get_feature_names_out` from scratch for those two transformers would probably be simpler than calling the superclass and modifying the output. I would probably lean towards that, with the only downside being a little bit of duplication between the two transformers. Something like:
```python
def get_feature_names_out(self, input_features=None) -> np.ndarray:
    return np.asarray([f"ccora{i}" for i in range(self._n_features_out)], dtype=object)
```
> No, this was my very naive empirical test of throwing a bunch of randomized numpy arrays of different shapes at it and checking the outputs!
I think this is still interesting, though. Did they always return n-1 dimensions based on n features? The actual code that sets the number of eigenvalues is here, which basically takes the minimum of the rank of the least-squares regression or the number of positive eigenvalues. Although I don't fully understand matrix rank and how it relates to least-squares regression, I think rank differs based on whether you have under-, well-, or over-determined systems, which differs on the shape of the arrays passed. I might play around with this a bit more to try to understand what should be expected, but can't promise that I'll be able to provide a coherent answer!
> Given that no dimensionality reduction occurs with the Moscow data, how do you feel about this as a test for `get_feature_names_out`?

I'm struggling with this one. If the test is meant to show the expected behavior (i.e. the number of features should always equal the number of axes), I think it could be misleading. Based on what you've already found along with the optimization of "meaningful" axes in both `CCA` and `CCorA`, I wouldn't necessarily expect this to be true. But if the test is meant to capture what is happening with this specific data, then I'm OK with it. I just think the non-deterministic nature of these two ordination methods makes testing tricky.
> It didn't, at least not with `linnerud` or the `np.random.normal` arrays I was using.

That's because I gave you the wrong attribute! Sorry about that. The number of output features should be `self.ccora_.projector.shape[1]` (the columns, not rows). If you run `transform` on the Moscow test data, you'll get this:
```python
import pandas as pd
from sknnr.transformers import CCorATransformer

X = pd.read_csv("../tests/data/moscow_env.csv").iloc[:, 1:]
y = pd.read_csv("../tests/data/moscow_spp.csv").iloc[:, 1:]
txfr = CCorATransformer().fit(X, y)
print(txfr.ccora_.projector.shape[1])
# Prints 5
print(txfr.transform(X).shape)
# Prints (165, 5)
```
> Actually, just implementing `get_feature_names_out` from scratch for those two transformers would probably be simpler than calling the superclass and modifying the output.
I like this option better if you're OK with this. But let me know if you feel differently.
> Did they always return n-1 dimensions based on n features?
Yep, here's the code I was experimenting with if you want to take a closer look:
```python
import numpy as np
from sknnr.transformers import CCATransformer

for n_features in range(1, 20):
    n_samples = 30
    X = np.random.normal(loc=10, size=(n_samples, n_features))
    y = np.random.normal(loc=10, size=(n_samples, n_features))
    n_features_out = CCATransformer().fit(X, y).cca_.eigenvalues.shape[0]
    print(n_features, n_features_out)
```
> I might play around with this a bit more to try to understand what should be expected, but can't promise that I'll be able to provide a coherent answer!
That's okay, I'm not sure I'll ever fully grok the stats, but as long as someone does and the tests pass, I'm happy!
> If the test is meant to show the expected behavior (i.e. the number of features should always equal the number of axes), I think it could be misleading. Based on what you've already found along with the optimization of "meaningful" axes in both CCA and CCorA, I wouldn't necessarily expect this to be true. But if the test is meant to capture what is happening with this specific data, then I'm OK with it.

Well said! I felt slightly uneasy about the test, and I think you captured what I didn't like about it. I do think we should have some test of output shape for `get_feature_names_out`, since it would otherwise be very easy for a bug to go unnoticed there. What do you think about comparing the shape of the feature names to the shape of the transformed `X`, like below? It requires a little more computation, but probably not enough to ever notice with the small number of transformers we have.
```python
@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer, moscow_euclidean):
    """Test that all transformers get feature names out."""
    fit_transformer = transformer.fit(X=moscow_euclidean.X, y=moscow_euclidean.y)
    feature_names = fit_transformer.get_feature_names_out()
    X_transformed = fit_transformer.transform(X=moscow_euclidean.X)
    assert feature_names.shape == (X_transformed.shape[1],)
```
> That's because I gave you the wrong attribute! Sorry about that. The number of output features should be `self.ccora_.projector.shape[1]` (the columns, not rows).
No problem! I hate to say it, but I think this is exactly what Github Copilot suggested when I first started writing the test...
> I like this option better if you're OK with this. But let me know if you feel differently.
I have a slight reservation that none of the `sklearn` transformers include "Transformer" in the name, but I also don't have a better suggestion. Maybe we can revisit naming again in the future once the rest of the API is in place, but for now I'm perfectly happy implementing `get_feature_names_out` from scratch, since it's such a simple method.
> Yep, here's the code I was experimenting with if you want to take a closer look

OK, I think I got it. If you have more features in `y` than you do in `X`, then the number of eigenvalues should be equal to the number of features in `X`. Try this:
```python
import numpy as np
from sknnr.transformers import CCATransformer

for n_features in range(1, 20):
    n_samples = 30
    X = np.random.normal(loc=10, size=(n_samples, n_features))
    y = np.random.normal(loc=10, size=(n_samples, n_features + 1))
    n_features_out = CCATransformer().fit(X, y).cca_.eigenvalues.shape[0]
    print(n_features, n_features_out)
```
It may be more predictable than I think, but it might require doing some checking of the input array shape to determine. I think if you have completely collinear columns in `y`, that will reduce the rank as well.
> What do you think about comparing the shape of the feature names to the shape of the transformed `X`, like below?
I like this approach much better. Thanks for being flexible!
> I hate to say it, but I think this is exactly what Github Copilot suggested
Replaced by the machines!
> I have a slight reservation that none of the `sklearn` transformers include "Transformer" in the name

That is pretty interesting, given the prevalence of estimators with `Classifier` and `Regressor` in their names. I'm happy to revisit as well, but I think moving forward with the custom `get_feature_names_out` makes sense!
New question for you, @grovduck!
When you fit a `GNNRegressor` or `MSNRegressor` with a dataframe, what do you think `feature_names_in_` should be?
Currently, `test_estimators_support_dataframes` assumes that it should be the names of the features from the dataframe (e.g. `PIPO_BA`, `PSME_BA`, etc). However, implementing feature names for our transformers broke that test because `feature_names_in_` now returns the names of the components (e.g. `cca0`, `cca1`, etc) that were returned from the transformer.
The docstring for `KNeighborsRegressor` says that `feature_names_in_` should return "names of features seen during fit", which is technically the transformed components in our case. However, my concern is that users will expect to see the names of the features that were used to fit the transformer, not the estimator.
EDIT: Another consideration is that once dimensionality reduction is implemented, there will also be a shape mismatch between the two sets of names.
I think I tentatively lean towards returning the names that were actually used to fit the estimator (`cca...`), making that clear in the docstring, and pointing users to `estimator.transform_.feature_names_in_` if they want the features that were used for the transformer. But maybe that's exposing the user too much to the implementation details of the estimator?
What do you think?
Oof, this is a good question and getting these names to work is a bit of a pain, eh? First off, I don't think I have a good answer, but I'll ramble for a bit ...
I think fundamentally, I'm still viewing these estimators as more or less pipelines even if we're not using that workflow. In that sense, our estimators are composed of a transformer and a regressor. Each of these may have their own `feature_names_in_` - the transformer has the original feature names (e.g. `PIPO_BA`) and the regressor has the transformed feature names (e.g. `cca0`), but the container itself should have the original feature names rather than the transformed feature names.
In a bit of a thought experiment, here's a pipeline with a `PCA` and a `KNeighborsRegressor`.
```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import load_linnerud
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    [("PCA", PCA(n_components=2)), ("KNN", KNeighborsRegressor(n_neighbors=3))]
)
X, y = load_linnerud(return_X_y=True, as_frame=True)

# Component-wise
transformer = pipe[:-1].fit(X, y)
transformer.set_output(transform="pandas")
print(transformer.feature_names_in_)
# Reports ['Chins', 'Situps', 'Jumps']
print(transformer.get_feature_names_out())
# Reports ['pca0' 'pca1']
X_transformed = transformer.transform(X)
print(X_transformed.columns)
# Reports ['pca0', 'pca1']
regressor = pipe[-1].fit(X_transformed, y)
print(regressor.feature_names_in_)
# Reports ['pca0', 'pca1']

# All at once
model = pipe.fit(X, y)
model.set_output(transform="pandas")
print(model.feature_names_in_)
# Reports ['Chins', 'Situps', 'Jumps']
predicted = model.predict(X)
print(predicted.columns)
# AttributeError: 'numpy.ndarray' object has no attribute 'columns'
```
So this seems to follow what I would expect, other than the last step where we're not returning a dataframe (but instead a numpy array) from the call to `predict`. It seems like we can accommodate for that (i.e. returning a dataframe) by post-fixing the numpy output if we set `feature_names_in_` to the original attribute names for these estimators.

I feel like I'm totally off on a tangent, so reel me back in.
> Oof, this is a good question and getting these names to work is a bit of a pain, eh?
Yeah, this is turning into a tougher issue than I expected!
> In that sense, our estimators are composed of a transformer and a regressor. Each of these may have their own `feature_names_in_` - the transformer has the original feature names (e.g. `PIPO_BA`) and the regressor has the transformed feature names (e.g. `cca0`), but the container itself should have the original feature names rather than the transformed feature names.
You're probably right here, and the pipeline is a good analogy. The challenge is getting our estimators to work as pipelines while also meeting all the assumptions and checks associated with regressors.
My first thought to solve this was to define a `feature_names_in_` property on the `TransformedKNeighborsMixin` that returns the attr from the wrapped transformer (note this requires defining property setters and deleters to prevent sklearn from modifying that attr when fitting):
```python
class TransformedKNeighborsMixin(KNeighborsRegressor):
    """
    Mixin for KNeighbors regressors that apply transformations to the feature data.
    """

    @property
    def feature_names_in_(self):
        return self.transform_.feature_names_in_

    @feature_names_in_.setter
    def feature_names_in_(self, value):
        ...

    @feature_names_in_.deleter
    def feature_names_in_(self):
        ...
```
The problem with this is that `BaseEstimator._check_feature_names` gets called during prediction and throws an error because the feature names we overrode don't match the names that were used to fit the estimator, and the only solution I've come up with is to override `_check_feature_names`. Maybe that's okay given that we're essentially delegating responsibility for feature names to the transformer, but it feels a little heavy handed...
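For concreteness, the heavy-handed version would look something like this on top of the property above (just a sketch; `_check_feature_names` is private sklearn API, so this could break on upgrade):

```python
class TransformedKNeighborsMixin(KNeighborsRegressor):
    # ... feature_names_in_ property (and setter/deleter) from above ...

    def _check_feature_names(self, X, *, reset=False):
        # Delegate the name-consistency check to the fitted transformer, since it is
        # the component that actually saw the original (untransformed) feature names.
        self.transform_._check_feature_names(X, reset=False)
```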
Any alternatives you can think of, or does this approach seem okay to you?
> So this seems to follow what I would expect, other than the last step where we're not returning a dataframe (but instead a numpy array) from the call to `predict`. It seems like we can accommodate for that (i.e. returning a dataframe) by post-fixing the numpy output if we set `feature_names_in_` to the original attribute names for these estimators.
I also assumed (and maybe even claimed before) that sklearn estimators return dataframes when predicting from dataframes, but it looks like that's not the case and `predict` will always return an array, even outside a pipeline. They can accept dataframes and should store feature names, but they never seem to output dataframes based on a few quick experiments I ran. If that's the case, I'm not sure we want to change that behavior for our estimators. What do you think?
> Any alternatives you can think of, or does this approach seem okay to you?
Yes, I definitely understand what you're saying about it feeling a bit heavy handed, but I fully trust that you've thought through the other possibilities and this might be the only way around this issue. I will defer to your judgment here.
> They can accept dataframes and should store feature names, but never seem to output dataframes based on a few quick experiments I ran. If that's the case, I'm not sure we want to change that behavior for our estimators. What do you think?
Good point! I think I was incorrectly thinking about transformers when I wrote that (thinking back to the video I saw). In that video, he explicitly says that dataframe output is not yet supported for `predict`, `predict_proba`, etc. Sorry for being confusing on this one - I agree that we keep the regressor output as arrays.
Thanks for all your hard work on this one - it doesn't sound like it's been the most enjoyable dive.
Resolved (hopefully for good!) by #34
Hey @grovduck, I'm starting to sound like a broken record, but there's another issue blocking dataframe indexes in #2. This is a bit of a long one, so I tried to lay it out below.
## The problem
All of our `TransformedKNeighborsMixin` estimators are incompatible with dataframes in that they don't store feature names. I didn't think to check this before, but updating the dataframe test to check for feature names fails for everything but `RawKNNRegressor`.

This happens because they all run `X` through transformers that convert the dataframes to arrays before they get to `KNeighborsRegressor.fit` where the features would be retrieved and stored. The same thing would happen with `sklearn` transformers, so I think we should probably solve this in `TransformedKNeighborsMixin` rather than in the transformers.

EDIT: As detailed in the next post, once we solve the issue of losing feature names when fitting, we need to also retain feature names when predicting to avoid warnings.
## Possible solutions
First of all, I think we should move the actual transformation out of the `fit` method for each estimator and into a `fit` method for `TransformedKNeighborsMixin`. That should probably be done regardless of this issue just to reduce some duplication, and it also allows us to make sure everything gets fit the same way. Then, I think we need to modify that `fit` method to make sure it sets appropriate feature names after transformation.

To get feature names, all that sklearn does is look for a `columns` attribute on `X`. If we could copy that `columns` attribute onto the transformed array before passing it to `KNeighborsRegressor.fit` we'd be set, but there's no way to directly set attributes on Numpy arrays because they are implemented in C.

I think that leaves us with a few options:

1. Use `sklearn.utils.validation._get_feature_names` to get and validate the feature names before transforming, then manually set them as `feature_names_in_` after fitting. I don't love this because it requires us to use a private method that could disappear, get renamed, change return types, etc. The upside is that we would know our feature names are retrieved consistently with `sklearn`.
2. Copy `sklearn.utils.validation._get_feature_names` into our code base. That bypasses the private method issue, but adds some maintenance cost and we would need to carefully consider how to do that consistently with the `sklearn` license. As with option 1, we would still need to handle setting the `feature_names_in_` attribute.
3. Subclass `ndarray` to support a `columns` attribute and pass that in to fit. As long as `sklearn` doesn't change how they identify features (which seems unlikely), we could let `sklearn` handle getting and setting feature names, and I think it would be transparent to users. I did confirm that the `_fit_X` attribute seems to store a numpy array regardless of what goes into it. Like option 2, this adds some maintenance cost.
4. Call the `_check_feature_names` method with the non-transformed `X` after fitting. This will set feature names on the model and fix the issue of losing feature names when fitting. The downside is that we're again using a private method.

I don't love any of these options, so let me know what you think or if any other solutions occur to you.
## Estimator checks
I noticed that the `sklearn.estimator_checks` would have caught this, so I wonder if we should prioritize getting those checks to pass before we add any more functionality? I think that may be a big lift, but it would at least prevent us from accidentally breaking estimators in the future. Also, it may be easier to do now than after they get more complex, and it would keep us from accidentally writing a lot of redundant tests.

EDIT: This would also catch warnings for predicting without feature names that I mention in the next post.