Sorry to tack on to an already extensive issue, but I've got another possible option and another related problem to consider.
`BaseEstimator` objects, including all our estimators, have a `_check_feature_names` method that can set feature names from a dataframe. I edited the original post to include this as Option 4.

However, while experimenting with this I discovered...
Successfully adding feature names to the estimator introduces new warnings when calling `predict`. This is because, like `fit`, the transformation converts dataframe inputs to arrays. When `predict` is run without feature names on a model that was fitted with feature names, `sklearn` complains.
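To illustrate with a plain `KNeighborsRegressor` (not our estimators, so treat this as a sketch of the behavior rather than our exact failure):

```python
import pandas as pd
from sklearn.datasets import load_linnerud
from sklearn.neighbors import KNeighborsRegressor

X, y = load_linnerud(return_X_y=True, as_frame=True)
est = KNeighborsRegressor().fit(X, y)  # fitted with feature names from the dataframe

# Passing a bare array (which is what our transformed X effectively is) warns along
# the lines of: "X does not have valid feature names, but KNeighborsRegressor was
# fitted with feature names"
est.predict(X.to_numpy())
```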
Short of overriding `predict` to skip the check for feature names (definitely not a good approach), I'm not sure there's any way to solve this secondary issue other than modifying `X` between transforming it and passing it in to `super().predict()`. So with that in mind, maybe Option 3 (subclassing `ndarray` to add a `columns` attribute and make it look like a dataframe) is the only remaining solution that tackles both problems?
It's possible I'm too deep into this issue and I'm just getting tunnel vision, so please check my logic and let me know if there might be other solutions I haven't considered.
@aazuspan, such a great writeup of the issue and possible solutions. It takes a lot of time to put this much effort into an issue like this, so I really appreciate it. I think I fully understand the issue and offer a few responses:
> I think we should move the actual transformation out of the `fit` method for each estimator and into a `fit` method for `TransformedKNeighborsMixin`

Yes, I agree with this approach. Just for my own edification, because now both superclasses (`TransformedKNeighborsMixin` and `IDNeighborsRegressor`) will have `fit` methods and you'll want to call `TransformedKNeighborsMixin.fit` first, will you call that explicitly (i.e. `TransformedKNeighborsMixin.fit(self, X)`) and then call `super().fit(X)`? But the return of `fit` should be the estimator itself, so I'm a bit confused how this would work. Could you instead just create the `TransformedKNeighborsMixin.transform` method?
> I think that leaves us with a few options ...

This one took me a while to noodle on. Of the three options, I feel that option 3 (subclassing `np.ndarray` using the `RealisticInfoArray` pattern on the page you linked) might be the best choice. I like that it doesn't require any tweaks to the `sklearn` code at all and it should be transparent to the user. (While I was writing, I just saw your additional post which narrows in on this remedy!) I do like this approach (option 3) better than option 4.

Could this pattern also give us opportunities to store the dataframe index (IDs) as another attribute if they'd be lost otherwise with regular arrays?
> I wonder if we should prioritize getting those checks to pass before we add any more functionality

I think this is a splendid idea. Given that it's already popped up two issues, it seems like an important step to get into place earlier rather than later. (Just uncommented the compatibility test and, oof, that `GNNRegressor` is going to take some work!)

Correct me if I'm wrong, but it doesn't seem like `check_dataframe_column_names_consistency` is actually part of the suite of checks that gets run on these estimators? How did you know that this would have tripped that check?
> Just for my own edification, because now both superclasses (`TransformedKNeighborsMixin` and `IDNeighborsRegressor`) will have `fit` methods and you'll want to call `TransformedKNeighborsMixin.fit` first, will you call that explicitly (i.e. `TransformedKNeighborsMixin.fit(self, X)`) and then call `super().fit(X)`?

My thinking was that we'll call `super().fit` in each case and rely on MRO to fit them in the order below (using `EuclideanKNNRegressor` as an example):

1. `EuclideanKnnRegressor.fit` - Sets the `transform_` attribute.
2. `IDNeighborsRegressor.fit` - Potentially sets the `index_` attribute.
3. `TransformedKNeighborsMixin.fit` - Checks for the `transform_` attribute, applies the transformation, and in the case of dataframes, stores the transformed `X` in our array subclass.
4. `KNeighborsRegressor.fit` - Actually does the fitting.

Admittedly, it is a little tough to track the flow of data there, but I think this will end up cleaner than a more functional approach, since we'll need to run step 3 when we call `fit`, `predict`, or `kneighbors` on all of the transformed estimators.
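To make that chain concrete, here's a rough sketch of the structure I'm picturing (using `StandardScaler` as a stand-in for our transformers, so the details are illustrative rather than the actual implementation):

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler


class IDNeighborsRegressor(KNeighborsRegressor):
    def fit(self, X, y):
        # Step 2: potentially set an index_ attribute, then continue down the MRO.
        return super().fit(X, y)


class TransformedKNeighborsMixin(KNeighborsRegressor):
    def fit(self, X, y):
        # Step 3: apply the stored transformer (and, for dataframes, wrap the result
        # in our array subclass) before handing off to KNeighborsRegressor.fit.
        return super().fit(self.transform_.transform(X), y)


class EuclideanKNNRegressor(IDNeighborsRegressor, TransformedKNeighborsMixin):
    def fit(self, X, y):
        # Step 1: set the transform_ attribute, then let super() walk the MRO.
        self.transform_ = StandardScaler().fit(X)
        return super().fit(X, y)
```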
> Could you instead just create the `TransformedKNeighborsMixin.transform` method?

I think moving step 3 above into a single method like this is a good idea, which can be called from `fit`, `predict`, or `kneighbors`.
> I do like this approach (option 3) better than option 4.

I'm on the same page with option 3, and after putting together a quick implementation, I think it's a pretty clean solution overall. It was going to be a lot to paste in here, so I threw it into a Gist if you want to take a look and let me know if you have any thoughts!

Any preference on the name for the `ndarray` subclass? As you'll see, my first thought was `NamedFeatureArray`, but I'm open to ideas!
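For anyone following along without opening the Gist, the core of the idea is just numpy's standard subclassing pattern with a `columns` attribute bolted on (a minimal sketch, not the full implementation):

```python
import numpy as np


class NamedFeatureArray(np.ndarray):
    """An ndarray that carries a `columns` attribute, so sklearn will find feature
    names on it the same way it does for a dataframe."""

    def __new__(cls, array, columns=None):
        obj = np.asarray(array).view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.columns = getattr(obj, "columns", None)
```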
> Could this pattern also give us opportunities to store the dataframe index (IDs) as another attribute if they'd be lost otherwise with regular arrays?
Good idea! My loose thinking is that we can store IDs as a fit attribute on the model before the data is transformed, but there may be a snag there that would be better handled by storing it on the arrays. I'll keep this in mind.
> Given that it's already popped up two issues, it seems like an important step to get into place earlier rather than later.
Great! I have a working fix for this issue using option 3 (pending your feedback on names and implementation), but I suppose I should probably hold off on making a PR... Don't want to dig us into a deeper hole.
> Correct me if I'm wrong, but it doesn't seem like `check_dataframe_column_names_consistency` is actually part of the suite of checks that gets run on these estimators?

You're 100% right. I assumed `check_estimator` ran everything in that module, and I'm not totally sure why it doesn't... I suppose we'll have to run some manually then, unless there's some configuration I'm missing?
> After putting together a quick implementation, I think it's a pretty clean solution overall

Once again, you amaze me 🤯. Super elegant solution and neat way of taking advantage of MRO. (I do always get a little baffled by calling `super()` in a superclass, in this case thinking that it's going to call the superclass of `TransformedKNeighborsMixin`, but it's really calling the next-in-line superclass of `EuclideanKNNRegressor`, right? In this way, it's just picking up information as it goes down the MRO chain. Neat!)

I like the private `_transform` method in that class as well - that looks like a good use.
> As you'll see, my first thought was `NamedFeatureArray`

That name sounds perfectly fine to me - faithful to the concept of features in `sklearn`.
> Good idea! My loose thinking is that we can store IDs as a fit attribute on the model before the data is transformed
You're too kind ... your solution (now that I understand it) seems like a better approach.
> Great! I have a working fix for this issue using option 3 (pending your feedback on names and implementation), but I suppose I should probably hold off on making a PR
I'll leave this decision up to you. It seems like your fix here resolves a couple of the checks, so I can't imagine that you mess anything up by creating a PR for this issue before tackling the checks, but you're the better judge here.
> I suppose we'll have to run some manually then, unless there's some configuration I'm missing

I didn't see anything that includes that check (along with a few others) as part of a wrapping function like `yield_all_checks`. I'm not sure why they are left out of the set of checks.
Thanks for the deep thinking on this one.
> I do always get a little baffled by calling `super()` in a superclass, in this case thinking that it's going to call the superclass of `TransformedKNeighborsMixin`, but it's really calling the next-in-line superclass of `EuclideanKNNRegressor`, right? In this way, it's just picking up information as it goes down the MRO chain.

Yeah, trying to figure out `super` with multiple inheritance can be a little mind-bending, but I think you've got it. I think the trick is that in each of those calls to `fit`, `self` is still an instance of `EuclideanKnnRegressor`, which is how `super` is able to access the next method in the MRO line.
> It seems like your fix here resolves a couple of the checks, so I can't imagine that you mess anything up by creating a PR for this issue before tackling the checks
Thanks for helping me think it through--wanted to make sure I wasn't shooting us in the foot just to get a quick fix merged!
> Thanks for the deep thinking on this one.
Likewise! I'm mostly used to working on code in a vacuum, so being able to bounce ideas around is a big help.
Resolved by #22
Hey @aazuspan, continuing my bad habit of responding to already closed issues ...
I found this video earlier this week. The main point of the video is that all transformers should support column names if either the global `set_config(transform_output="pandas")` or the transformer-specific `transformer.set_output(transform="pandas")` is set and the passed `X` is a dataframe. What was interesting to me was that some transformers (like `PolynomialFeatures`) create new output names because they set more columns as a result of the fit and transform.
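For reference, the two configuration options look like this with a builtin transformer (ours will need the changes below before this works for them):

```python
from sklearn import set_config
from sklearn.datasets import load_linnerud
from sklearn.preprocessing import StandardScaler

X, y = load_linnerud(return_X_y=True, as_frame=True)

# Global: ask every transformer to return dataframes
set_config(transform_output="pandas")

# Per-transformer: configure a single instance
scaler = StandardScaler().set_output(transform="pandas")
print(type(scaler.fit_transform(X)))  # <class 'pandas.core.frame.DataFrame'>
```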
After a fairly deep dive and using `PolynomialFeatures` as inspiration, I think I understand how this works a bit better. I think, at the very least, we need to support column names from our transformers (including the ones that transform the `X` array into reduced-dimension arrays like `CCATransformer`). I think three main things need to happen:

1. Add `_SetOutputMixin` as a base class on our transformers so that `set_output` is available to us.
2. Call `self._check_feature_names` or `self._validate_data` in `fit` with `reset=True`. This will set `self.feature_names_in_` correctly if `X` is a dataframe.
3. Create a `get_feature_names_out` method on each transformer that correctly sets the output transform names. Without it, `transform` complains as I think you found out.

What I'm not entirely clear on is whether this obviates the need of `NamedFeatureArray`. I think it's entirely possible that you took an even deeper dive than I did, so it's possible that I'm still missing something. I'm going to keep this issue closed for now, but let me know what you think.
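Roughly, I'm imagining each transformer ending up with this shape (a loose sketch with a made-up `SketchTransformer` to show where the three pieces land, not working sknnr code):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin  # TransformerMixin brings in _SetOutputMixin in recent sklearn


class SketchTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # (2) _validate_data with reset=True sets n_features_in_ and, for dataframe
        # inputs, feature_names_in_.
        X = self._validate_data(X, reset=True)
        # ... actual fitting would happen here ...
        return self

    def transform(self, X):
        # Placeholder transformation; the real transformers do their projection here.
        return self._validate_data(X, reset=False)

    def get_feature_names_out(self, input_features=None):
        # (3) with this defined, set_output(transform="pandas") from (1) is usable.
        return np.asarray([f"sketch{i}" for i in range(self.n_features_in_)], dtype=object)
```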
Great point! I came across some articles mentioning that sklearn transformers support dataframes as a config option while researching this, but wasn't thinking about our transformers as part of the public API at that point. If we want them to be usable outside of our estimators (which I'm pretty sure we do), I think you're 100% right that they need to support those config options.
> Add `_SetOutputMixin` as a base class on our transformers so that `set_output` is available to us.

After poking around, it looks like all `BaseTransformer` subclasses, including ours, already subclass `_SetOutputMixin`. ~~In fact, `StandardScalerWithDOF` and `CCATransformer` already have `set_output` methods. `MahalanobisTransformer` doesn't, but I haven't totally worked out why yet. I think it comes down to this `_auto_wrap_is_configured` check, but I haven't wrapped my head around that yet.~~
EDIT: In order for a transformer that subclasses `_SetOutputMixin` to access the `set_output` method, it needs to pass the `_auto_wrap_is_configured` check, which requires that the transformer has a `get_feature_names_out` attr. So I think solving that first will automatically give us access to `set_output`.
> Call `self._check_feature_names` or `self._validate_data` in `fit` with `reset=True`. This will set `self.feature_names_in_` correctly if `X` is a dataframe.

You may be a step ahead of me, but for this to work we need our transformer to return arrays when given arrays and dataframes when given dataframes, right? Are we doing this by calling `set_output` on the transformer based on the input `X` type in `TransformedKNeighborsMixin._apply_transform`? That was my first thought, but I haven't worked through all the potential difficulties there.
> Create a `get_feature_names_out` method on each transformer that correctly sets the output transform names. Without it, `transform` complains as I think you found out.

I noticed that `StandardScalerWithDOF` already gets this method from `OneToOneFeatureMixin`, but I'm assuming we can't use that for our transformers that apply dimensionality reduction?
EDIT: To use `OneToOneFeatureMixin.get_feature_names_out`, an estimator must set `n_features_in_` when it's fit. Aside from `StandardScalerWithDOF`, none of our transformers do this because they never call `fit` from a superclass. We can fix this by running `self._validate_data` at the start of each custom `fit` method which, among other things, sets that attr. It can also be used to validate data types and other aspects of the data, so maybe we can just copy the `StandardScaler` implementation? In any case, I'm pretty sure we need to set this attr regardless of how we handle `get_feature_names_out`, and this will probably show up in the estimator checks in #21 as well.
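As a tiny illustration of that last point (a sketch, not our code): once `_validate_data` has set `n_features_in_` (and `feature_names_in_` for dataframes), `OneToOneFeatureMixin.get_feature_names_out` works with no extra code for the one-to-one transformers:

```python
from sklearn.base import BaseEstimator, OneToOneFeatureMixin, TransformerMixin
from sklearn.datasets import load_linnerud


class PassthroughTransformer(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        # Sets n_features_in_ (and feature_names_in_ when X is a dataframe),
        # which is all OneToOneFeatureMixin needs.
        self._validate_data(X, reset=True)
        return self

    def transform(self, X):
        return self._validate_data(X, reset=False)


X, _ = load_linnerud(return_X_y=True, as_frame=True)
print(PassthroughTransformer().fit(X).get_feature_names_out())
# ['Chins' 'Situps' 'Jumps']
```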
> What I'm not entirely clear on is whether this obviates the need of `NamedFeatureArray`

Good question! I think if we can get our transformers to always respect their input data types within the estimators (without requiring users to set any config), we should be able to remove `NamedFeatureArray` because our `X_transformed` will now be a dataframe if a dataframe goes in.
@aazuspan, great deep dive!
> In order for a transformer that subclasses `_SetOutputMixin` to access the `set_output` method, it needs to pass the `_auto_wrap_is_configured` check, which requires that the transformer has a `get_feature_names_out` attr. So I think solving that first will automatically give us access to `set_output`.
Yes, that was the same conclusion that I came to as well.
> To use `OneToOneFeatureMixin.get_feature_names_out`, an estimator must set `n_features_in_` when it's fit. Aside from `StandardScalerWithDOF`, none of our transformers do this because they never call `fit` from a superclass. We can fix this by running `self._validate_data` at the start of each custom `fit` method which, among other things, sets that attr. It can also be used to validate data types and other aspects of the data, so maybe we can just copy the `StandardScaler` implementation? In any case, I'm pretty sure we need to set this attr regardless of how we handle `get_feature_names_out`, and this will probably show up in the estimator checks in #21 as well.
Just so I'm clear, this is an explanation of how this works for those estimators that can set the output feature names to be the same as the input feature names, correct? In our case, this would (currently) be `StandardScalerWithDOF` and `MahalanobisTransformer`, right? You're not suggesting that we use the `OneToOneFeatureMixin` on `CCATransformer` or `CCorATransformer`, correct? Only that we need to ensure that we call `self._validate_data` in their `fit` methods, right? That was where I had landed as well.
If you think you have a clear understanding of what needs to be done, I'd love for you to take a first stab at this. But I'm happy to circle back to this as well.
> In our case, this would (currently) be `StandardScalerWithDOF` and `MahalanobisTransformer`, right? You're not suggesting that we use the `OneToOneFeatureMixin` on `CCATransformer` or `CCorATransformer`, correct? Only that we need to ensure that we call `self._validate_data` in their `fit` methods, right? That was where I had landed as well.

Exactly! Any thoughts on how to implement `get_feature_names_out` for `CCA` and `CCorA`? What do you think about using `ClassNamePrefixFeaturesOutMixin` like `PCA` does, where the feature names would be sequentially prefixed based on the name of the transformer?
> If you think you have a clear understanding of what needs to be done, I'd love for you to take a first stab at this.
Happy to! I think I have a relatively clear picture of how this will work, although that may change once I get into the nitty gritty details. One thing I'm particularly unsure on is how best to test this, so let me know if you have any thoughts there. In any case, I'll hold off until MSN is merged since that will be affected.
> What do you think about using `ClassNamePrefixFeaturesOutMixin` like `PCA` does, where the feature names would be sequentially prefixed based on the name of the transformer?
That's perfect! Exactly what I was thinking in terms of naming.
I just noticed that the `ClassNamePrefixFeaturesOutMixin.get_feature_names_out` method requires that an `_n_features_out` attr is set on the transformer. For `PCA`, this is handled with this property.

My understanding is that we plan to make the number of output components (is there a more accurate term?) configurable for the `CCA` and `CCorA` ordinations, right? In that case, I imagine we would set an attribute when the transformer is instantiated, similar to `PCA`, and we could reference that for `_n_features_out`.

But maybe there's a more direct way we can do this now, just by checking the shape of an attribute on the transformer after it's fit. This is more similar to how `PCA` checks the shape of its `components_` attr.

Since you have a much better idea of the inner workings of these transformers, what do you think about these options, and if you want to go with the second one, can you point me in the direction of attrs that would store the output shapes for `CCATransformer` and `CCorATransformer`?
> Since you have a much better idea of the inner workings of these transformers, what do you think about these options, and if you want to go with the second one, can you point me in the direction of attrs that would store the output shapes for `CCATransformer` and `CCorATransformer`?

Great question! (and one I've meant to circle back to ...). For `CCATransformer`, after the transformer is fit, we can use `self.cca_.eigenvalues.shape[0]` (it's a 1-d numpy array, so we could use `len` if preferred). And your intuition is right on, we'll want to have an option to set the number of output axes to use, so I think the logic goes like this:
1. An `__init__` in `CCATransformer` that sets an attribute (something like `n_axes`) which would default to `None`.
2. In `fit`, either modify `self.n_axes` or create a new estimator attribute that stores `self.cca_.eigenvalues.shape[0]` if `self.n_axes` is `None`, otherwise stores `min(self.n_axes, self.cca_.eigenvalues.shape[0])`. This step may not be entirely necessary and we could just code this into the property below.
3. A `_n_features_out` property that references the attribute from above (or implements that logic).
4. A `transform` that only uses up to `_n_features_out` axes.

For `CCorATransformer`, I think it's probably safest to use `self.ccora_.projector.shape[0]`. There is a property in there called `n_vec` which would return the same number (and is used in `projector`), but I don't know how much of the internals of `CCorA` (and, for that matter, `CCA`) we want to expose. Because we already use `projector` in `transform`, that seems safest to me.
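To sketch that logic out in code (stand-in math only; the real implementation would read `self.cca_.eigenvalues` after the existing fitting, and the class and attribute names here are placeholders):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class CCATransformerSketch(TransformerMixin, BaseEstimator):
    def __init__(self, n_axes=None):
        self.n_axes = n_axes

    def fit(self, X, y=None):
        X = self._validate_data(X, reset=True)
        # Stand-in for the real ordination: pretend every input feature yields an axis.
        self.eigenvalues_ = np.ones(X.shape[1])
        return self

    @property
    def _n_features_out(self):
        # ClassNamePrefixFeaturesOutMixin.get_feature_names_out would read this.
        n_eig = self.eigenvalues_.shape[0]
        return n_eig if self.n_axes is None else min(self.n_axes, n_eig)

    def transform(self, X):
        check_is_fitted(self)
        X = self._validate_data(X, reset=False)
        # Stand-in projection, sliced to the first _n_features_out axes.
        return X[:, : self._n_features_out]
```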
(I can't promise that I won't go back and fiddle a bit with the API of both of these classes, including adding a `projector` property to `CCA`, but to keep you going, I won't do anything now.)
Hopefully that gives you enough to go on now?
@aazuspan, I thought that it might be better to track the dimensionality reduction in a separate issue [#33], given that there are more considerations than just naming the axes. For now, if you just want to go with the properties I named above to get this working, I think that would be better. Thoughts?
That's exactly what I was looking for, thanks! I agree that we should track dimensionality reduction separately, but can move ahead here.
A couple more questions. First, I set up a quick test for this that fits each transformer (currently just using the `linnerud` dataset) and compares the shape of `get_feature_names_out` to the number of features in the `X` data, assuming that they should be the same.
```python
import pytest
from sklearn.datasets import load_linnerud


@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer):
    """Test that all transformers get feature names out."""
    X, y = load_linnerud(return_X_y=True)
    feature_names = transformer.fit(X=X, y=y).get_feature_names_out()
    assert feature_names.shape == (X.shape[1],)
```
This fails for `CCATransformer`, and after playing around with it, it looks like `CCA` will always reduce dimensionality by one feature. Am I understanding that right?

If that's the case, I'll probably switch to testing each transformer's feature names separately so we can also confirm that the names are correct, too. Does that sound like a good plan?
Second question: how do you feel about the fact that the output feature names for `CCATransformer` and `CCorATransformer` will be `ccatransformer0` and `ccoratransformer0`, respectively (because they subclass `ClassNamePrefixFeaturesOutMixin`)? Those are obviously a little wordy and it feels kind of odd to include `transformer` in the feature names. We could think about shortening the names of the transformers, making them more akin to `PCA`, although we would obviously have to distinguish our `CCA` from the builtin `CCA`. We don't necessarily need to decide on this right now, unless we wanted to modify the implementation of `get_feature_names_out` to shorten feature names without renaming classes.
> This fails for `CCATransformer`, and after playing around with it, it looks like `CCA` will always reduce dimensionality by one feature. Am I understanding that right?

`CCA` shouldn't always reduce dimensionality by one - it should be data-dependent, e.g.:
```python
import pandas as pd
from sklearn.datasets import load_linnerud
from sknnr.transformers import CCATransformer

X, y = load_linnerud(return_X_y=True)
txfr = CCATransformer().fit(X, y)
print(X.shape[1], txfr.cca_.eigenvalues.shape[0])
# prints (3, 2)

X = pd.read_csv("../tests/data/moscow_env.csv").iloc[:, 1:]
y = pd.read_csv("../tests/data/moscow_spp.csv").iloc[:, 1:]
txfr = CCATransformer().fit(X, y)
print(X.shape[1], txfr.cca_.eigenvalues.shape[0])
# prints (28, 28)
```
Was there something in the code that made you think it would always reduce dimensionality by one? It very well could be that I'm overlooking something!
Another question - did this not fail for `CCorATransformer`? This is where the statistical fit test usually reduces the number of axes quite a bit.
> Second question: how do you feel about the fact that the output feature names for `CCATransformer` and `CCorATransformer` will be `ccatransformer0` and `ccoratransformer0`, respectively (because they subclass `ClassNamePrefixFeaturesOutMixin`)?
Yeah, not ideal, eh? My preference would be that the names were `cca0` and `ccora0`, but I understand the issues that causes with renaming those classes (and the fact that our `CCorA` is really supposed to produce the same results as sklearn's `CCA`). I don't think I'm in favor of renaming the transformers. Does it feel too hacky to post-fix the output of `ClassNamePrefixFeaturesOutMixin.get_feature_names_out` to splice out `transformer`? Having said that, I can live with whatever feels right to you (even a rename of the transformer class names).
I see, thanks for explaining!
> Was there something in the code that made you think it would always reduce dimensionality by one? It very well could be that I'm overlooking something!
No, this was my very naive empirical test of throwing a bunch of randomized numpy arrays of different shapes at it and checking the outputs!
Given that no dimensionality reduction occurs with the Moscow data, how do you feel about this as a test for `get_feature_names_out`?
```python
@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer, moscow_euclidean):
    """Test that all transformers get feature names out."""
    X = moscow_euclidean.X
    y = moscow_euclidean.y
    feature_names = transformer.fit(X=X, y=y).get_feature_names_out()
    assert feature_names.shape == (X.shape[1],)
```
> Another question - did this not fail for `CCorATransformer`? This is where the statistical fit test usually reduces the number of axes quite a bit.

It didn't, at least not with `linnerud` or the `np.random.normal` arrays I was using.
> Does it feel too hacky to post-fix the output of `ClassNamePrefixFeaturesOutMixin.get_feature_names_out` to splice out `transformer`?

Actually, just implementing `get_feature_names_out` from scratch for those two transformers would probably be simpler than calling the superclass and modifying the output. I would probably lean towards that, with the only downside being a little bit of duplication between the two transformers. Something like:
```python
def get_feature_names_out(self, input_features=None) -> np.ndarray:
    return np.asarray([f"ccora{i}" for i in range(self._n_features_out)], dtype=object)
```
> No, this was my very naive empirical test of throwing a bunch of randomized numpy arrays of different shapes at it and checking the outputs!
I think this is still interesting, though. Did they always return n-1 dimensions based on n features? The actual code that sets the number of eigenvalues is here, which basically takes the minimum of the rank of the least-squares regression or the number of positive eigenvalues. Although I don't fully understand matrix rank and how it relates to least-squares regression, I think rank differs based on whether you have under-, well-, or over-determined systems, which differs on the shape of the arrays passed. I might play around with this a bit more to try to understand what should be expected, but can't promise that I'll be able to provide a coherent answer!
> Given that no dimensionality reduction occurs with the Moscow data, how do you feel about this as a test for `get_feature_names_out`?

I'm struggling with this one. If the test is meant to show the expected behavior (i.e. the number of features should always equal the number of axes), I think it could be misleading. Based on what you've already found along with the optimization of "meaningful" axes in both `CCA` and `CCorA`, I wouldn't necessarily expect this to be true. But if the test is meant to capture what is happening with this specific data, then I'm OK with it. I just think the non-deterministic nature of these two ordination methods makes testing tricky.
> It didn't, at least not with `linnerud` or the `np.random.normal` arrays I was using.

That's because I gave you the wrong attribute! Sorry about that. The number of output features should be `self.ccora_.projector.shape[1]` (the columns, not rows). If you run `transform` on the Moscow test data, you'll get this:
```python
import pandas as pd
from sknnr.transformers import CCorATransformer

X = pd.read_csv("../tests/data/moscow_env.csv").iloc[:, 1:]
y = pd.read_csv("../tests/data/moscow_spp.csv").iloc[:, 1:]
txfr = CCorATransformer().fit(X, y)
print(txfr.ccora_.projector.shape[1])
# Prints 5
print(txfr.transform(X).shape)
# Prints (165, 5)
```
> Actually, just implementing `get_feature_names_out` from scratch for those two transformers would probably be simpler than calling the superclass and modifying the output.
I like this option better if you're OK with this. But let me know if you feel differently.
> Did they always return n-1 dimensions based on n features?
Yep, here's the code I was experimenting with if you want to take a closer look:
```python
import numpy as np
from sknnr.transformers import CCATransformer

for n_features in range(1, 20):
    n_samples = 30
    X = np.random.normal(loc=10, size=(n_samples, n_features))
    y = np.random.normal(loc=10, size=(n_samples, n_features))
    n_features_out = CCATransformer().fit(X, y).cca_.eigenvalues.shape[0]
    print(n_features, n_features_out)
```
> I might play around with this a bit more to try to understand what should be expected, but can't promise that I'll be able to provide a coherent answer!
That's okay, I'm not sure I'll ever fully grok the stats, but as long as someone does and the tests pass, I'm happy!
> If the test is meant to show the expected behavior (i.e. the number of features should always equal the number of axes), I think it could be misleading. Based on what you've already found along with the optimization of "meaningful" axes in both CCA and CCorA, I wouldn't necessarily expect this to be true. But if the test is meant to capture what is happening with this specific data, then I'm OK with it.

Well said! I felt slightly uneasy about the test, and I think you captured what I didn't like about it. I do think we should have some test of output shape for `get_feature_names_out`, since it would otherwise be very easy for a bug to go unnoticed there. What do you think about comparing the shape of the feature names to the shape of the transformed `X`, like below? It requires a little more computation, but probably not enough to ever notice with the small number of transformers we have.
```python
@pytest.mark.parametrize("transformer", get_transformer_instances())
def test_transformers_get_feature_names_out(transformer, moscow_euclidean):
    """Test that all transformers get feature names out."""
    fit_transformer = transformer.fit(X=moscow_euclidean.X, y=moscow_euclidean.y)
    feature_names = fit_transformer.get_feature_names_out()
    X_transformed = fit_transformer.transform(X=moscow_euclidean.X)
    assert feature_names.shape == (X_transformed.shape[1],)
```
> That's because I gave you the wrong attribute! Sorry about that. The number of output features should be `self.ccora_.projector.shape[1]` (the columns, not rows).
No problem! I hate to say it, but I think this is exactly what Github Copilot suggested when I first started writing the test...
> I like this option better if you're OK with this. But let me know if you feel differently.
I have a slight reservation that none of the `sklearn` transformers include "Transformer" in the name, but I also don't have a better suggestion. Maybe we can revisit naming again in the future once the rest of the API is in place, but for now I'm perfectly happy implementing `get_feature_names_out` from scratch, since it's such a simple method.
> Yep, here's the code I was experimenting with if you want to take a closer look

OK, I think I got it. If you have more features in `y` than you do in `X`, then the number of eigenvalues should be equal to the number of features in `X`. Try this:
```python
import numpy as np
from sknnr.transformers import CCATransformer

for n_features in range(1, 20):
    n_samples = 30
    X = np.random.normal(loc=10, size=(n_samples, n_features))
    y = np.random.normal(loc=10, size=(n_samples, n_features + 1))
    n_features_out = CCATransformer().fit(X, y).cca_.eigenvalues.shape[0]
    print(n_features, n_features_out)
```
It may be more predictable than I think, but it might require doing some checking of the input array shape to determine. I think if you have completely collinear columns in `y`, that will reduce the rank as well.
> What do you think about comparing the shape of the feature names to the shape of the transformed `X`, like below?
I like this approach much better. Thanks for being flexible!
> I hate to say it, but I think this is exactly what Github Copilot suggested
Replaced by the machines!
> I have a slight reservation that none of the `sklearn` transformers include "Transformer" in the name

That is pretty interesting, given the prevalence of estimators with `Classifier` and `Regressor` in their names. I'm happy to revisit as well, but I think moving forward with the custom `get_feature_names_out` makes sense!
New question for you, @grovduck!
When you fit a `GNNRegressor` or `MSNRegressor` with a dataframe, what do you think `feature_names_in_` should be?
Currently, `test_estimators_support_dataframes` assumes that it should be the names of the features from the dataframe (e.g. `PIPO_BA`, `PSME_BA`, etc). However, implementing feature names for our transformers broke that test because `feature_names_in_` now returns the names of the components (e.g. `cca0`, `cca1`, etc) that were returned from the transformer.
The docstring for `KNeighborsRegressor` says that `feature_names_in_` should return "names of features seen during fit", which is technically the transformed components in our case. However, my concern is that users will expect to see the names of the features that were used to fit the transformer, not the estimator.
EDIT: Another consideration is that once dimensionality reduction is implemented, there will also be a shape mismatch between the two sets of names.
I think I tentatively lean towards returning the names that were actually used to fit the estimator (`cca...`), making that clear in the docstring, and pointing users to `estimator.transform_.feature_names_in_` if they want the features that were used for the transformer. But maybe that's exposing the user too much to the implementation details of the estimator?
What do you think?
Oof, this is a good question and getting these names to work is a bit of a pain, eh? First off, I don't think I have a good answer, but I'll ramble for a bit ...
I think fundamentally, I'm still viewing these estimators as more or less pipelines even if we're not using that workflow. In that sense, our estimators are composed of a transformer and a regressor. Each of these may have their own `feature_names_in_` - the transformer has the original feature names (e.g. `PIPO_BA`) and the regressor has the transformed feature names (e.g. `cca0`), but the container itself should have the original feature names rather than the transformed feature names.
In a bit of a thought experiment, here's a pipeline with a `PCA` and a `KNeighborsRegressor`.
```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import load_linnerud
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    [("PCA", PCA(n_components=2)), ("KNN", KNeighborsRegressor(n_neighbors=3))]
)
X, y = load_linnerud(return_X_y=True, as_frame=True)

# Component-wise
transformer = pipe[:-1].fit(X, y)
transformer.set_output(transform="pandas")
print(transformer.feature_names_in_)
# Reports ['Chins', 'Situps', 'Jumps']
print(transformer.get_feature_names_out())
# Reports ['pca0' 'pca1']
X_transformed = transformer.transform(X)
print(X_transformed.columns)
# Reports ['pca0', 'pca1']
regressor = pipe[-1].fit(X_transformed, y)
print(regressor.feature_names_in_)
# Reports ['pca0', 'pca1']

# All at once
model = pipe.fit(X, y)
model.set_output(transform="pandas")
print(model.feature_names_in_)
# Reports ['Chins', 'Situps', 'Jumps']
predicted = model.predict(X)
print(predicted.columns)
# AttributeError: 'numpy.ndarray' object has no attribute 'columns'
```
So this seems to follow what I would expect, other than the last step where we're not returning a dataframe (but instead a numpy array) from the call to `predict`. It seems like we can accommodate for that (i.e. returning a dataframe) by post-fixing the numpy output if we set `feature_names_in_` to the original attribute names for these estimators.

I feel like I'm totally off on a tangent, so reel me back in.
> Oof, this is a good question and getting these names to work is a bit of a pain, eh?
Yeah, this is turning into a tougher issue than I expected!
> In that sense, our estimators are composed of a transformer and a regressor. Each of these may have their own `feature_names_in_` - the transformer has the original feature names (e.g. `PIPO_BA`) and the regressor has the transformed feature names (e.g. `cca0`), but the container itself should have the original feature names rather than the transformed feature names.
You're probably right here, and the pipeline is a good analogy. The challenge is getting our estimators to work as pipelines while also meeting all the assumptions and checks associated with regressors.
My first thought to solve this was to define a `feature_names_in_` property on the `TransformedKNeighborsMixin` that returns the attr from the wrapped transformer (note this requires defining property setters and deleters to prevent sklearn from modifying that attr when fitting):
```python
class TransformedKNeighborsMixin(KNeighborsRegressor):
    """
    Mixin for KNeighbors regressors that apply transformations to the feature data.
    """

    @property
    def feature_names_in_(self):
        return self.transform_.feature_names_in_

    @feature_names_in_.setter
    def feature_names_in_(self, value):
        ...

    @feature_names_in_.deleter
    def feature_names_in_(self):
        ...
```
The problem with this is that `BaseEstimator._check_feature_names` gets called during prediction and throws an error because the feature names we overrode don't match the names that were used to fit the estimator, and the only solution I've come up with is to override `_check_feature_names`. Maybe that's okay given that we're essentially delegating responsibility for feature names to the transformer, but it feels a little heavy handed...
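For concreteness, the heavy-handed version would look something like this on top of the property above (just a sketch; `_check_feature_names` is private sklearn API, so this could break on upgrade):

```python
class TransformedKNeighborsMixin(KNeighborsRegressor):
    # ... feature_names_in_ property (and setter/deleter) from above ...

    def _check_feature_names(self, X, *, reset=False):
        # Delegate the name-consistency check to the fitted transformer, since it is
        # the component that actually saw the original (untransformed) feature names.
        self.transform_._check_feature_names(X, reset=False)
```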
Any alternatives you can think of, or does this approach seem okay to you?
> So this seems to follow what I would expect, other than the last step where we're not returning a dataframe (but instead a numpy array) from the call to `predict`. It seems like we can accommodate for that (i.e. returning a dataframe) by post-fixing the numpy output if we set `feature_names_in_` to the original attribute names for these estimators.
I also assumed (and maybe even claimed before) that sklearn estimators return dataframes when predicting from dataframes, but it looks like that's not the case and `predict` will always return an array, even outside a pipeline. They can accept dataframes and should store feature names, but they never seem to output dataframes based on a few quick experiments I ran. If that's the case, I'm not sure we want to change that behavior for our estimators. What do you think?
> Any alternatives you can think of, or does this approach seem okay to you?
Yes, I definitely understand what you're saying about it feeling a bit heavy handed, but I fully trust that you've thought through the other possibilities and this might be the only way around this issue. I will defer to your judgment here.
> They can accept dataframes and should store feature names, but never seem to output dataframes based on a few quick experiments I ran. If that's the case, I'm not sure we want to change that behavior for our estimators. What do you think?
Good point! I think I was incorrectly thinking about transformers when I wrote that (thinking back to the video I saw). In that video, he explicitly says that dataframe output is not yet supported for `predict`, `predict_proba`, etc. Sorry for being confusing on this one - I agree that we keep the regressor output as arrays.
Thanks for all your hard work on this one - it doesn't sound like it's been the most enjoyable dive.
Resolved (hopefully for good!) by #34
Hey @grovduck, I'm starting to sound like a broken record, but there's another issue blocking dataframe indexes in #2. This is a bit of a long one, so I tried to lay it out below.
## The problem
All of our `TransformedKNeighborsMixin` estimators are incompatible with dataframes in that they don't store feature names. I didn't think to check this before, but updating the dataframe test to check for feature names fails for everything but `RawKNNRegressor`.

This happens because they all run `X` through transformers that convert the dataframes to arrays before they get to `KNeighborsRegressor.fit` where the features would be retrieved and stored. The same thing would happen with `sklearn` transformers, so I think we should probably solve this in `TransformedKNeighborsMixin` rather than in the transformers.

EDIT: As detailed in the next post, once we solve the issue of losing feature names when fitting, we need to also retain feature names when predicting to avoid warnings.
## Possible solutions
First of all, I think we should move the actual transformation out of the `fit` method for each estimator and into a `fit` method for `TransformedKNeighborsMixin`. That should probably be done regardless of this issue just to reduce some duplication, and it also allows us to make sure everything gets fit the same way. Then, I think we need to modify that `fit` method to make sure it sets appropriate feature names after transformation.

To get feature names, all that sklearn does is look for a `columns` attribute on `X`. If we could copy that `columns` attribute onto the transformed array before passing it to `KNeighborsRegressor.fit` we'd be set, but there's no way to directly set attributes on Numpy arrays because they are implemented in C.

I think that leaves us with a few options:

1. Use `sklearn.utils.validation._get_feature_names` to get and validate the feature names before transforming, then manually set them as `feature_names_in_` after fitting. I don't love this because it requires us to use a private method that could disappear, get renamed, change return types, etc. The upside is that we would know our feature names are retrieved consistently with `sklearn`.
2. Copy `sklearn.utils.validation._get_feature_names` into our code base. That bypasses the private method issue, but adds some maintenance cost and we would need to carefully consider how to do that consistently with the `sklearn` license. As with option 1, we would still need to handle setting the `feature_names_in_` attribute.
3. Subclass `ndarray` to support a `columns` attribute and pass that in to fit. As long as `sklearn` doesn't change how they identify features (which seems unlikely), we could let `sklearn` handle getting and setting feature names, and I think it would be transparent to users. I did confirm that the `_fit_X` attribute seems to store a numpy array regardless of what goes into it. Like option 2, this adds some maintenance cost.
4. Call the `_check_feature_names` method with the non-transformed `X` after fitting. This will set feature names on the model and fix the issue of losing feature names when fitting. The downside is that we're again using a private method.

I don't love any of these options, so let me know what you think or if any other solutions occur to you.
## Estimator checks
I noticed that the `sklearn.estimator_checks` would have caught this, so I wonder if we should prioritize getting those checks to pass before we add any more functionality? I think that may be a big lift, but it would at least prevent us from accidentally breaking estimators in the future. Also, it may be easier to do now than after they get more complex, and it would keep us from accidentally writing a lot of redundant tests.

EDIT: This would also catch warnings for predicting without feature names that I mention in the next post.