@aazuspan - sorry to be late to the game. I've taken a look at what you've done so far and would like to help if I can. I'm wondering if you can help guide me through the process. A few questions:
It looks like all the tests that are marked XPASS are silently passing, whereas the XFAIL ones are silently failing. If you wanted to understand why a certain test is XFAIL, are you commenting out `@pytest.mark.xfail(reason="Incrementally fixing.")` so you actually get the error message? Or is there a better way to do that? Currently, I'm looking at a specific transformer at a time, e.g. `hatch run test:all tests\test_transformers.py -k CCATransformer -v`.

Thanks @grovduck, no rush! I figure this will probably take us a while to work through.

> It looks like all the tests that are marked XPASS are silently passing, whereas the XFAIL ones are silently failing. If you wanted to understand why a certain test is XFAIL, are you commenting out `@pytest.mark.xfail(reason="Incrementally fixing.")` so you actually get the error message? Or is there a better way to do that? Currently, I'm looking at a specific transformer at a time, e.g. `hatch run test:all tests\test_transformers.py -k CCATransformer -v`.
That's exactly what I'm doing - using xpass to keep the CI runs happy but disabling it for local runs so I can see what's going on. It's a little clunky, but it seems to work!
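For reference, my local setup is roughly along these lines (a sketch; the import path and test name here are illustrative, not necessarily what's in the repo):

```python
# A sketch of the xfail-for-CI workflow: mark the whole parametrized check suite
# as xfail so CI stays green, and comment the mark out locally to see the real
# tracebacks. Import path and test name are illustrative.
import pytest
from sklearn.utils.estimator_checks import parametrize_with_checks

from sknnr.transformers import CCATransformer  # illustrative import path


@pytest.mark.xfail(reason="Incrementally fixing.")
@parametrize_with_checks([CCATransformer()])
def test_estimator_checks(estimator, check):
    check(estimator)
```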
Also, I'd recommend making sure your `scikit-learn` is up to date in your test environment - I just entered a shell and used `pip install --upgrade`, but maybe we need to think about pinning for the development environment. There was a recent release that added new estimator checks, and it took me a while to figure out why I had fewer XFAILS than the CI did...
If I were able to provide some fixes, did you actually want me to commit them? What's the best way to avoid stepping on each other's toes and creating merge conflicts? Up until now, it's been one of us doing most of the commits in a particular PR.
Hmm, good question! I haven't touched `CCorA` and won't be working on this much tomorrow, so if you get an itch to dive in you could start there. Otherwise, I'm roadblocked on `CCA` (details at the top), so if you wanted to take a look at that it would be great!
23 of the 24 failing checks are caused by the fact that CCA only works with 2D y arrays. I assume there's no way we can add compatibility for that? So far, I haven't found any way to indicate this constraint to sklearn, and I'm not sure there's any comparable transformer or estimator that we can look at for guidance...
I saw this too (and thanks for the tip on updating `scikit-learn` - I only had 23 failing checks before I did that 😄 ). I started looking at scikit-learn's `CCA` for inspiration here as it's a transformer which takes `X` and `y` matrices. But it looks like it allows a 1D array that it will transpose into a 2D array (i.e. a single feature). I believe canonical correspondence analysis (e.g. our CCA) requires a 2D `y` array and, even if it did allow a 1D array, all of those elements should be nonzero, which would trip up the checks. I will do some reading and testing on whether CCA can take a single vector of non-zero values.

But I'd also like to know how transformers like `StandardScaler` pass these checks. Obviously, most of these checks are designed for estimators that take both `X` and `y`. Is `StandardScaler` passing because it simply ignores the `y` being passed?

One other item that might be a red herring - I noticed that scikit-learn's `CCA` specifies its fit as `fit(X, Y)` (note the capitalization). I assume that is to signify to the user that this is meant to be passed a matrix/array rather than a vector, but because it's a positional argument, I think the `y` argument from the checks is being passed to the `Y` parameter.

Everything is conjecture at this point. I'll dedicate a bit of time to this today, as I would think that many of the issues here will also be applicable to `CCorA`.
> I believe canonical correspondence analysis (e.g. our CCA) requires a 2D `y` array and, even if it did allow a 1D array, all of those elements should be nonzero, which would trip up the checks.
Good to know! There are tags for `requires_positive_X` and `requires_positive_y`, but I don't see any options for nonzero constraints. If we find there's no solution to handling the 1D y checks, I think the `_xfail_checks` tag might be our best option (although we may also want to raise an error for users about requiring 2D arrays).

EDIT: Actually, it looks like the `requires_positive_` tags constrain inputs to "strictly positive" values >0, so maybe that is a solution if we can transpose 1D arrays to 2D?
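If we do end up needing `_xfail_checks`, it's just an entry in the estimator tags that maps check names to reason strings, something like this sketch (the check names below are placeholders, not a vetted list for `CCATransformer`):

```python
# A sketch of the _xfail_checks tag mechanism; the check names listed here are
# examples only, not a vetted list for CCATransformer.
def _more_tags(self):
    return {
        "requires_y": True,
        "_xfail_checks": {
            "check_transformer_data_not_an_array": "CCA requires a 2D y array.",
            "check_transformer_general": "CCA requires a 2D y array.",
        },
    }
```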
> But I'd also like to know how transformers like `StandardScaler` pass these checks. Obviously, most of these checks are designed for estimators that take both `X` and `y`. Is `StandardScaler` passing because it simply ignores the `y` being passed?
Exactly!
> One other item that might be a red herring - I noticed that scikit-learn's `CCA` specifies its fit as `fit(X, Y)` (note the capitalization). I assume that is to signify to the user that this is meant to be passed a matrix/array rather than a vector, but because it's a positional argument, I think the `y` argument from the checks is being passed to the `Y` parameter.
Interesting, I hadn't noticed that. Oddly, that capitalization isn't consistent within `CCA` - `fit` and `transform` use `Y`, but `fit_transform` and `score` use `y`, even though the docstrings for the parameter are the same. If there's some logic there, I'm not seeing it... But in any case, I think you're right that these are relying on positional args to work.
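If it's useful, here's a quick way to see that mixed capitalization for yourself (what it prints depends on the installed scikit-learn version):

```python
# Quick way to inspect the mixed capitalization of the y/Y parameter; the exact
# signatures depend on the installed scikit-learn version.
import inspect
from sklearn.cross_decomposition import CCA

for name in ("fit", "transform", "fit_transform", "score"):
    print(name, inspect.signature(getattr(CCA, name)))
```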
I think I have some definitive answers on `CCA` and `CCorA` in terms of needed input:

- `CCA`: X must be an `n x m` array with m >= 1. Values can be any numeric type but cannot have NAs (I think we currently allow NAs?). If passed a 1D array, we should be able to convert to 2D.
- `CCA`: Y must be an `n x m` array with m >= 2. Values can be any numeric type but row sums must be >= 0, although individual elements can be < 0 (which seems weird to me because they are supposed to represent abundances). I tried (in R) with one feature and no CCA axes are returned. No NAs allowed.
- `CCorA`: Both X and Y can be `n x m` arrays with m >= 1. Values can be any numeric type, but cannot have NAs.

> I believe canonical correspondence analysis (e.g. our CCA) requires a 2D `y` array and, even if it did allow a 1D array, all of those elements should be nonzero, which would trip up the checks.
When I wrote this, it was because we would only have a single element per row and row sums needed to be >= 0, but as I've found out, not all elements need to be strictly non-zero, so I don't think we want to use the `requires_positive_X` and `requires_positive_y` tags.

So, it seems like that leaves us with using `_xfail_checks` at least for `CCA`? In looking through the scikit-learn code repo, it looks like estimators typically only disable a few of these tests when `_xfail_checks` is used, which makes me a bit queasy. But the `y` that is passed is specifically supposed to act as labels or targets for classifiers or regressors -or- ignored for transformers.

I really wish we could find a transformer with a required `y` that has to have at least two features to serve as inspiration. I haven't come across it yet.
This is great @grovduck! Getting a clear idea of exactly what constraints each transformer has will be a big help. Once we have a complete picture, we should probably go through the tags and `_validate_data` calls for each transformer and estimator to make sure they're correct and consistent. I've been playing a little fast and loose with the parameters on those and might be over- or under-constraining them.
> CCA ... Values can be any numeric type but cannot have NAs (I think we currently allow NAs?)
Yes! We'll want to update the `_validate_data` call to `force_all_finite=True`, and maybe add `allow_nan=False` to `_more_tags` just to be explicit (even though it's the default).
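Concretely, something along these lines (a sketch of the idea, not an exact diff; the validation parameters may need adjusting for our transformers):

```python
# Sketch: reject NaN/Inf during fit and state the (default) allow_nan tag
# explicitly; exact validation parameters may need adjusting for CCATransformer.
def fit(self, X, y):
    X, y = self._validate_data(X, y, multi_output=True, force_all_finite=True)
    ...

def _more_tags(self):
    return {"allow_nan": False}
```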
> CCA ... Y must be n x m array with m >= 2
Great, we can enforce this by adding `ensure_min_features=2` to the `_validate_data` check. I added this to `MahalanobisTransformer` in 3ac8471bd69fecf43e60f87bc125b7a22e66c836 (please check to make sure my logic was correct there).
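And for the m >= 2 requirement on `y`, the idea is roughly this (a sketch, not the code from that commit):

```python
# Sketch: require a 2D y with at least two features; check_array raises a
# ValueError otherwise.
import numpy as np
from sklearn.utils import check_array

y = np.array([[1.0, 2.0], [3.0, 4.0]])
y = check_array(y, ensure_2d=True, ensure_min_features=2)
```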
> CCorA ... Values can be any numeric type, but cannot have NAs.
It looks like we're already there for `CCorA`, but again we may want to err on the side of explicitness both in the tags and when calling `_validate_data`.
> So, it seems like that leaves us with using `_xfail_checks` at least for `CCA`? In looking through the scikit-learn code repo, it looks like estimators typically only disable a few of these tests when `_xfail_checks` is used, which makes me a bit queasy.
That's where I'm landing too, but I share your discomfort. I think we're likely missing some valuable test cases because of that incompatibility. Maybe if we conclude that there's no other alternative, we can see if any checks are worth re-implementing for ourselves with 2D data.
> I really wish we could find a transformer with a required `y` that has to have at least two features to serve as inspiration. I haven't come across it yet.
This would be great. I'll do some looking too - maybe there's a relevant 3rd party package...
Is the `multioutput_only` tag essentially what we're looking for? I came across that while looking through the `_MultiOutputEstimator` class.

Adding that tag definitely doesn't fix the checks, but it does replace most of the `axis 1 is out of bounds` errors with `row sums must be greater than 0` errors, so maybe that's progress...
> Is the `multioutput_only` tag essentially what we're looking for? I came across that while looking through the `_MultiOutputEstimator` class.
@aazuspan, I saw this as well. To me, it felt disingenuous to use this because it would seem that the output of this transformer would be multioutput `y` values versus transformed `X` scores. But I can definitely appreciate that this would solve some of the checks. I suppose that a comment explaining why we chose this tag would be helpful.
Yeah, I see your point - it's definitely not using the tag as intended. If it did solve some checks, I agree that we should at least add a clarifying comment, but honestly I'm not sure it will since it seems to just introduce a new issue with the row sums.
Yeah, I see your point - it's definitely not using the tag as intended. If it did solve some checks, I agree that we should at least add a clarifying comment, but honestly I'm not sure it will since it seems to just introduce a new issue with the row sums.

I'm starting to accept that `_xfail_checks` might be unavoidable, at least for `CCA`. If our transformers have inherent constraints that the `sklearn` checks weren't designed to handle, there's not much we can do about that. I'm also noticing that `sklearn` has its own share of hacky special cases to avoid problematic checks...
> I'm also noticing that `sklearn` has its own share of hacky special cases to avoid problematic checks...
Funny! Yeah, that looks like something I would do 😉.
In the interest of not getting completely high-centered, I'm OK moving forward with the `_xfail_checks` plan with `CCA`. I think I prefer that approach to including the `multioutput_only` tag. I'll let you make the final decision, but it feels like we've invested enough time trying to figure this out that it's time to move on.
> In the interest of not getting completely high-centered, I'm OK moving forward with the `_xfail_checks` plan with `CCA`. I think I prefer that approach to including the `multioutput_only` tag. I'll let you make the final decision, but it feels like we've invested enough time trying to figure this out that it's time to move on.
Agreed, moving forward with `_xfail_checks` for `CCA` sounds like a plan. If we have a brainwave on how to deal with this later on we can always revisit it.

I can add those `xfail` tags and remove NaN support as discussed, which should hopefully wrap up that transformer.
> CCorA: Both X and Y can be n x m arrays with m >= 1.
@grovduck I'm trying to wrap up this PR and running into a complication with `CCorA` that suggests `y` must be 2D here as well. The offending line is here:

As we found with `CCA`, most of the checks use a 1D `y` array, but attempting to fit a `StandardScaler` with a 1D array fails because it calls `_validate_data`, which calls `check_array`, which by default requires a 2D X array.
That can be reproduced just by running:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.random.randint(low=0, high=4, size=100)
StandardScaler().fit(y)  # ValueError: Expected 2D array, got 1D array instead
```
If `y` does in fact need to be 2D we can take the same approach we took with `CCA` of xfailing those checks, but if that's not an inherent restriction of `CCorA`, maybe there's a workaround? Let me know if you have any thoughts on this!
> If `y` does in fact need to be 2D we can take the same approach we took with `CCA` of xfailing those checks, but if that's not an inherent restriction of `CCorA`, maybe there's a workaround? Let me know if you have any thoughts on this!
@aazuspan, I need to verify this, but I believe `CCorA` can use 1-d arrays, but they need to be formatted as `n x 1` 2-d arrays. If you look at the code associated with sklearn's `CCA` fit method, they call `check_array` with `ensure_2d=False` and then reshape the 1-D array to a 2-D array if necessary.

I'm guessing we should do the same with `CCorA`, but like I said, I haven't yet tested this. I can try to do so this afternoon.
EDIT: I've tested this and we should be able to input a 1-d `y` array with this change:
```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted
from sklearn.utils import check_array

from . import ComponentReducerMixin, StandardScalerWithDOF
from ._ccora import CCorA


class CCorATransformer(ComponentReducerMixin, TransformerMixin, BaseEstimator):
    def fit(self, X, y):
        self._validate_data(X, reset=True)
        self.scaler_ = StandardScalerWithDOF(ddof=1).fit(X)
        y = check_array(y, input_name="Y", dtype=np.float64, ensure_2d=False)
        if y.ndim == 1:
            y = y.reshape(-1, 1)
        y = StandardScalerWithDOF(ddof=1).fit_transform(y)
        ...
```
Are you able to test if this works? The number of xpassed went from 253 to 300 when I made this change. Sorry that I'm not more clever than that to figure out if it's working correctly.

EDIT 2: I think I read my own notes earlier in the thread and tested this with the `@pytest.mark.xfail(reason="Incrementally fixing.")` commented out. Just one check still fails for `CCorATransformer`, which is `check_estimators_nan_inf`. I assume you fix this with `_more_tags`?
Great find! That seems to resolve the 1D y issue which, as you found, was responsible for all but one of the failed checks.
> Just one check still fails for `CCorATransformer`, which is `check_estimators_nan_inf`. I assume you fix this with `_more_tags`?
In this case we can fix that by adding `self._validate_data(X, force_all_finite=True)` to the `transform` method, which just throws the error that the check wants to see when a NaN or Inf is given. I'm assuming that's expected behavior?
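Something like this sketch is what I have in mind (not the exact change):

```python
# Sketch: validating X in transform raises the error that check_estimators_nan_inf
# expects when NaN or Inf values are passed.
from sklearn.utils.validation import check_is_fitted

def transform(self, X):
    check_is_fitted(self)
    X = self._validate_data(X, reset=False, force_all_finite=True)
    ...
```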
The only outstanding issue with `CCorATransformer` now is one zero-division warning that arises from this line:

I'm thinking about just using `pytest.mark.filterwarnings` to specifically ignore that, unless you think there's some other way we can handle it? I guess I'm not sure what the circumstances are where that would occur with real data, but I'm assuming it's pretty rare if it only happens once in all our test cases.

EDIT: Sorry, this warning actually arises from `MSNRegressor`, but is caused by `CCorATransformer`.
Nevermind, it needs to be in the same repo as the issue.
Good to know!
> I'm thinking about just using `pytest.mark.filterwarnings` to specifically ignore that, unless you think there's some other way we can handle it? I guess I'm not sure what the circumstances are where that would occur with real data, but I'm assuming it's pretty rare if it only happens once in all our test cases.

That seems like a good decision to me. I'm guessing it must be a case where no component is significant (i.e. `mask = t > 0`), but I'm not sure it's worth the effort tracking down when that happens.
Down to only a few failing checks!
I'm currently working on `MSNRegressor()-check_n_features_in_`, which is failing because the `n_features_in_` attribute does not match the number of features that were used to fit the model. This is essentially the same problem we faced with `feature_names_in_`, where the model stores metadata about the features after transformation, when it should instead be storing data on the original, untransformed features.

We can solve this in pretty much the same way we solved that, linked below. Basically, just add an `n_features_in_` property and an empty `_check_n_features` method. The `self.transform_._validate_data` check in `_apply_transform` should catch any mismatch in the number of features.

Just wanted to run that by you and make sure that seems like the right approach! We could just xfail the check, but modifying the estimator to return the original number of training features rather than the number of transformed features seems more consistent with how we're handling `feature_names_in_`.
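In code, that would look roughly like the sketch below (the class name is illustrative, and it mirrors the `_check_feature_names` workaround rather than being the exact diff):

```python
# Sketch: report the number of original training features and skip sklearn's
# default n_features_in_ bookkeeping; the fitted transformer's own validation in
# _apply_transform still catches feature-count mismatches at predict time.
class TransformedEstimatorMixin:  # illustrative class name
    @property
    def n_features_in_(self):
        return self.transform_.n_features_in_

    def _check_n_features(self, X, reset):
        pass
```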
> Just wanted to run that by you and make sure that seems like the right approach!
Yep, I agree that's the right approach (assuming it was the right approach for `_check_feature_names` :smile:). I don't love the fact that we're overriding methods just to silence issues, but I think you've chosen the least disruptive path. Thanks for pushing on these last annoying checks!
Yeah, I feel the same way. I don't love the solution, but the only other option I can come up with would be to totally rewrite `fit` and `predict`, which I like even less.
I just wrapped up the last couple `MSNRegressor` checks, so all of our estimators are now officially passing! I think the only remaining issues to solve are the `StandardScalerWithDOF` questions at the top, which I think I'll need to defer to you on. The end of this PR is finally in sight!
Sorry, I dropped the ball on the `StandardScalerWithDOF` issues (I missed them at the top!)
> - Sparse support and `with_` parameters: Do we want to support sparse data and/or the `with_mean` and `with_std` initialization parameters? `StandardScaler` only accepts sparse data if `with_mean=False`. I lean towards keeping things simple by dropping sparse support and not adding the `with_` parameters since those are presumably incompatible with our estimators and very rare use cases, but let me know if you disagree
Yes, let's drop sparse support. I can't say that I've ever used sparse matrices, so let's keep things simple. Is that equivalent to `self._validate_data(accept_sparse=False)`?
For the `with_` parameters, we don't need to do anything because they will default to `True` on the call to `super`, correct?
> - `check_sample_weights_invariance(kind=zeros)`: This check ensures that sample weights of 0 are equivalent to removing the samples. I think that by recalculating `scale_` without considering sample weights, we're violating this check. The easy solution would be to remove the `sample_weights` parameter, as that would disable the check and prevent the error. If we do want to support sample weights, I think we'll need a trained mathematician to figure this out
I agree with you on this as well. We're really only using this class in the most standard way possible and adjusting the degrees of freedom. So making it as restrictive as possible makes sense to me.
Making those changes looks to fix the failing test. You've done it!!!!
No problem! I had forgotten about them until now, too!
> Is that equivalent to `self._validate_data(accept_sparse=False)`?
Yeah, I think that should do it!
> For the `with_` parameters, we don't need to do anything because they will default to `True` on the call to `super`, correct?
Correct!
> We're really only using this class in the most standard way possible and adjusting the degrees of freedom. So making it as restrictive as possible makes sense to me.
Makes perfect sense!
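Putting those decisions together, the scaler could look roughly like this sketch (the actual class in the repo may differ in its details):

```python
# Sketch: a StandardScaler subclass that only adjusts the degrees of freedom,
# rejects sparse input, keeps the with_* defaults, and drops sample weight support.
from sklearn.preprocessing import StandardScaler


class StandardScalerWithDOF(StandardScaler):
    def __init__(self, ddof=0):
        super().__init__()  # with_mean/with_std keep their default True values
        self.ddof = ddof

    def fit(self, X, y=None):
        X = self._validate_data(X, accept_sparse=False)
        super().fit(X)
        # Recalculate the scale with the requested degrees of freedom.
        self.scale_ = X.std(axis=0, ddof=self.ddof)
        return self
```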
> Making those changes looks to fix the failing test. You've done it!!!!
Heck yeah! Definitely a group effort!! Feel free to commit those changes and remove the pytest xfail. I think it'll be worth doing a final pass to make sure all the tags and validate calls are consistent and complete across estimators and transformers before merging, but we're definitely at the finish line 🎉
@aazuspan, I took one more look at the xfails on `CCATransformer` to see if there was anything that could be done. I revisited this comment and more fully understood it this time:
> Is the `multioutput_only` tag essentially what we're looking for? I came across that while looking through the `_MultiOutputEstimator` class.
>
> Adding that tag definitely doesn't fix the checks, but it does replace most of the `axis 1 is out of bounds` errors with `row sums must be greater than 0` errors, so maybe that's progress...
My initial reaction to that was that these transformers are not `y` multi-output and that it was disingenuous to treat them as such, but in looking more at how it's used (at least in the transformer context), it just asserts that `y` is at least 2-D, which is obviously what we want. But then, as you noticed, we get zero row sums errors instead. Even though it's not applicable, I tried using the `requires_positive_y=True` and `multioutput_only=True` tags together and actually many of the tests passed.
One set that didn't, however, was the `check_transformer_*` tests, but strangely those tests are not calling `_enforce_estimator_tags_y`, so the tags are being ignored in this instance.
The upshot for me is that our xfails are unsatisfying, but the checks are trying to please both transformers and estimators, and I think there are enough inconsistencies between those two cases that some of this is unavoidable.
Other than that, I don't have more of a formal review, because we've been beating this one endlessly. I'm good to merge whenever you are. Thanks for your great work on this one.
Thanks for revisiting those checks! So it sounds like we could fix some of the xfails, but it would require using tags that aren't strictly accurate and wouldn't solve everything anyway? Let me know if I misunderstood your conclusion, but I reluctantly lean towards accepting all the xfails for now while keeping an eye out for how we might remove them down the line, whether that's making our estimators more flexible, finding a better way to trick sklearn into passing them, or rewriting the estimator checks ourselves.
If you're ready to merge, I am too! Great to finally get this one wrapped up!!
Hey @grovduck, I'm opening this draft PR to start tracking the incremental process of working through the transformer/estimator check errors. I'm tentatively thinking we can tackle all of them here, but if it ends up getting too unwieldy we could think about splitting this into just the transformer checks (that's what I'm going to start with), or even just the checks for a single transformer. I figure we can work that out as we go.
I assigned myself to #21 for now, but I'd be more than happy to tackle this PR collaboratively if you get inspired to work on it!
Anything that's going to take more thought or discussion, I'll add below (feel free to edit this) and we can check them off as we figure them out.
To Fix

`StandardScalerWithDOF`

- Sparse support and `with_` parameters: ~Do we want to support sparse data and/or the `with_mean` and `with_std` initialization parameters? `StandardScaler` only accepts sparse data if `with_mean=False`. I lean towards keeping things simple by dropping sparse support and not adding the `with_` parameters since those are presumably incompatible with our estimators and very rare use cases, but let me know if you disagree.~ Sparse support is disabled for simplicity.
- `check_sample_weights_invariance(kind=zeros)`: ~This check ensures that sample weights of 0 are equivalent to removing the samples. I think that by recalculating `scale_` without considering sample weights, we're violating this check. The easy solution would be to remove the `sample_weights` parameter, as that would disable the check and prevent the error. If we do want to support sample weights, I think we'll need a trained mathematician to figure this out 😉~ Sample weights are disabled for simplicity.

`CCATransformer`

- `CCA` only works with 2D `y` arrays. ~I assume there's no way we can add compatibility for that? So far, I haven't found any way to indicate this constraint to `sklearn`, and I'm not sure there's any comparable transformer or estimator that we can look at for guidance...~ We decided there's no workaround for this, so we're falling back to xfailing those checks.

`CCorATransformer`

- The checks pass a 1D `y` array, but `StandardScaler` uses the default `ensure_2d` check when calling `_validate_data`, causing all checks with 1D y arrays to fail. ~We can work around this by xfailing those checks like we did with `CCA`, but maybe there's another solution?~ Fixed by reshaping 1D y arrays to 2D.