lemma-osu / sknnr

scikit-learn compatible estimators for various kNN imputation methods
https://sknnr.readthedocs.io

Estimator checks #44

Closed aazuspan closed 1 year ago

aazuspan commented 1 year ago

Hey @grovduck, I'm opening this draft PR to start tracking the incremental process of working through the transformer/estimator check errors. I'm tentatively thinking we can tackle all of them here, but if it ends up getting too unwieldy we could think about splitting this into just the transformer checks (that's what I'm going to start with), or even just the checks for a single transformer. I figure we can work that out as we go.

I assigned myself to #21 for now, but I'd be more than happy to tackle this PR collaboratively if you get inspired to work on it!

Anything that's going to take more thought or discussion, I'll add below (feel free to edit this) and we can check them off as we figure them out.

To Fix

  • StandardScalerWithDOF
  • CCATransformer
  • CCorATransformer

grovduck commented 1 year ago

@aazuspan - sorry to be slow to the game. [EDIT]. I've taken a look at what you've done so far and would like to help if I can. I'm wondering if you can help guide me through the process. A few questions:

  1. It looks like all the tests that are marked XPASS are silently passing, whereas the XFAIL ones are silently failing. If you wanted to understand why a certain test is XFAIL, are you commenting out @pytest.mark.xfail(reason="Incrementally fixing.") so you actually get the error message? Or is there a better way to do that? Currently, I'm looking at one transformer at a time, e.g. hatch run test:all tests\test_transformers.py -k CCATransformer -v.
  2. If I were able to provide some fixes, did you actually want me to commit them? How best not to step on each other's toes to avoid merge conflicts? Up until now, it's been one of us doing most of the commits in a particular PR.
aazuspan commented 1 year ago

Thanks @grovduck, no rush! I figure this will probably take us a while to work through.

aazuspan commented 1 year ago

It looks like all the tests that are marked XPASS are silently passing, whereas the XFAIL ones are silently failing. If you wanted to understand why a certain test is XFAIL, are you commenting out @pytest.mark.xfail(reason="Incrementally fixing.") so you actually get the error message? Or is there a better way to do that? Currently, I'm looking at one transformer at a time, e.g. hatch run test:all tests\test_transformers.py -k CCATransformer -v.

That's exactly what I'm doing - using xfail to keep the CI runs happy but disabling it for local runs so I can see what's going on. It's a little clunky, but it seems to work!
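For reference, the pattern looks roughly like this (just a sketch, not our actual test code - the test name and import path are illustrative):

import pytest
from sklearn.utils.estimator_checks import parametrize_with_checks

from sknnr.transformers import CCATransformer

@parametrize_with_checks([CCATransformer()])
@pytest.mark.xfail(reason="Incrementally fixing.")  # comment this out locally to see the real tracebacks
def test_cca_transformer_checks(estimator, check):
    check(estimator)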

Also, I'd recommend making sure your scikit-learn is up to date in your test environment - I just entered a shell and used pip install --upgrade, but maybe we need to think about pinning for the development environment. There was a recent release that added new estimator checks, and it took me a while to figure out why I had fewer XFAILS than the CI did...

If I were able to provide some fixes, did you actually want me to commit them? How best not to step on each other's toes to avoid merge conflicts? Up until now, it's been one of us doing most of the commits in a particular PR.

Hmm, good question! I haven't touched CCorA and won't be working on this much tomorrow, so if you get an itch to dive in you could start there. Otherwise, I'm roadblocked on CCA (details at the top), so if you wanted to take a look at that it would be great!

grovduck commented 1 year ago

23 of the 24 failing checks are caused by the fact that CCA only works with 2D y arrays. I assume there's no way we can add compatibility for that? So far, I haven't found any way to indicate this constraint to sklearn, and I'm not sure there's any comparable transformer or estimator that we can look at for guidance...

I saw this too (and thanks for the tip on updating scikit-learn - I only had 23 failing checks before I did that 😄 ). I started looking at scikit-learn's CCA for inspiration here as it's a transformer which takes X and y matrices. But it looks like it allows a 1D array that it will transpose into a 2D array (i.e. a single feature). I believe canonical correspondence analysis (e.g. our CCA) requires a 2D y array and, even if it did allow a 1D array, all of those elements should be nonzero which would trip up the checks, I believe. I will do some reading and testing on whether CCA can take a single vector of non-zero values.

But I'd also like to know how transformers like StandardScaler pass these checks. Obviously, most of these checks are designed for estimators that take both X and y. Is StandardScaler passing because it simply ignores the y being passed? One other item that might be a red herring - I noticed that scikit-learn's CCA specifies its fit as fit(X, Y) (note the capitalization). I assume that is to signify to the user that this is meant to be passed a matrix/array rather than a vector, but because it's a positional argument, I think the y argument from the checks is being passed to the Y parameter.

Everything is conjecture at this point. I'll dedicate a bit of time to this today, as I would think that many of the issues here will also be applicable to CCorA.

aazuspan commented 1 year ago

I believe canonical correspondence analysis (e.g. our CCA) requires a 2D y array and, even if it did allow a 1D array, all of those elements should be nonzero which would trip up the checks, I believe.

Good to know! There are tags for requires_positive_X and requires_positive_y, but I don't see any options for nonzero constraints. If we find there's no solution to handling the 1D y checks, I think the _xfail_checks tag might be our best option (although we may also want to raise an error for users about requiring 2D arrays).

EDIT: Actually, it looks like the requires_positive_ tags constrain inputs to "strictly positive" values >0, so maybe that is a solution if we can reshape 1D arrays to 2D?
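If it comes to that, the tags would be declared through _more_tags, something like this sketch (the specific tag values and the check named in _xfail_checks are placeholders, not a decision):

from sklearn.base import BaseEstimator, TransformerMixin

class CCATransformer(TransformerMixin, BaseEstimator):
    def _more_tags(self):
        return {
            "requires_y": True,
            "requires_positive_y": True,  # placeholder: would make the checks pass strictly positive y
            "_xfail_checks": {
                # placeholder check name and reason; the real list is TBD
                "check_transformer_data_not_an_array": "CCA requires a 2D y array",
            },
        }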

But I'd also like to know how transformers like StandardScaler pass these checks. Obviously, most of these checks are designed for estimators that take both X and y. Is StandardScaler passing because it simply ignores the y being passed?

Exactly!

One other item that might be a red herring - I noticed that scikit-learn's CCA specifies its fit as fit(X, Y) (note the capitalization). I assume that is to signify to the user that this is meant to be passed a matrix/array rather than a vector, but because it's a positional argument, I think the y argument from the checks is being passed to the Y parameter.

Interesting, I hadn't noticed that. Oddly, that capitalization isn't consistent within CCA - fit and transform use Y, but fit_transform and score use y, even though the docstrings for the parameter are the same. If there's some logic there, I'm not seeing it... But in any case, I think you're right that these are relying on positional args to work.

grovduck commented 1 year ago

I think I have some definitive answers on CCA and CCorA in terms of needed input.

CCA

CCorA

I believe canonical correspondence analysis (e.g. our CCA) requires a 2D y array and, even if it did allow a 1D array, all of those elements should be nonzero which would trip up the checks, I believe.

When I wrote this, it was because we would only have a single element per row, and with row sums needing to be > 0, every element would have had to be nonzero. But as I've found out, not all elements need to be strictly positive, so I don't think we want to use the requires_positive_X and requires_positive_y tags.

So, it seems like that leaves us with using _xfail_checks at least for CCA? In looking through the scikit-learn code repo, it looks like estimators typically only disable a few of these tests when _xfail_checks is used, which makes me a bit queasy. But the y that is passed is specifically supposed to act as labels or targets for classifiers or regressors, or be ignored for transformers. I really wish we could find a transformer with a required y that has to have at least two features to serve as inspiration. I haven't come across it yet.

aazuspan commented 1 year ago

This is great @grovduck! Getting a clear idea of exactly what constraints each transformer has will be a big help. Once we have a complete picture, we should probably go through the tags and _validate_data calls for each transformer and estimator to make sure they're correct and consistent. I've been playing a little fast and loose with the parameters on those and might be over or under-constraining them.

CCA ... Values can be any numeric type but cannot have NAs (I think we currently allow NAs?)

Yes! We'll want to update the _validate_data call to force_all_finite=True, and maybe add allow_nan=False to _more_tags just to be explicit (even though it's the default).

CCA ... Y must be n x m array with m >= 2

Great, we can enforce this by adding ensure_min_features=2 to the _validate_data check. I added this to MahalanobisTransformer in 3ac8471bd69fecf43e60f87bc125b7a22e66c836 (please check to make sure my logic was correct there).
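To make that concrete, here's roughly the fit-time validation I have in mind for CCA (a sketch only - the names and parameter choices are my assumptions, not final code):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array

class CCATransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y):
        # NaN/Inf in X raise immediately; n_features_in_ / feature_names_in_ get recorded
        X = self._validate_data(X, reset=True, force_all_finite=True)
        # CCA needs a 2D y with at least two features, also free of NaN/Inf
        y = check_array(y, dtype="numeric", force_all_finite=True, ensure_min_features=2)
        ...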

CCorA ... Values can be any numeric type, but cannot have NAs.

It looks like we're already there for CCorA, but again we may want to err on the side of explicitness both in the tags and when calling _validate_data.

So, it seems like that leaves us with using _xfail_checks at least for CCA? In looking through the scikit-learn code repo, it looks like estimators typically only disable a few of these tests when _xfail_checks is used which makes me a bit queasy

That's where I'm landing too, but I share your discomfort. I think we're likely missing some valuable test cases because of that incompatibility. Maybe if we conclude that there's no other alternative, we can see if any checks are worth re-implementing for ourselves with 2D data.

I really wish we could find a transformer with a required y that has to have at least two features to serve as inspiration. I haven't come across it yet.

This would be great. I'll do some looking too - maybe there's a relevant 3rd party package...

aazuspan commented 1 year ago

Is the multioutput_only tag essentially what we're looking for? I came across that while looking through the _MultiOutputEstimator class.

Adding that tag definitely doesn't fix the checks, but it does replace most of the "axis 1 is out of bounds" errors with "row sums must be greater than 0" errors, so maybe that's progress...

grovduck commented 1 year ago

Is the multioutput_only tag essentially what we're looking for? I came across that while looking through the _MultiOutputEstimator class.

@aazuspan, I saw this as well. To me, it felt disingenuous to use this because the tag implies that the output of this transformer would be multi-output y values rather than transformed X scores. But I can definitely appreciate that this would solve some of the checks. I suppose that a comment explaining why we chose this tag would be helpful.

aazuspan commented 1 year ago

Yeah, I see your point - it's definitely not using the tag as intended. If it did solve some checks, I agree that we should at least add a clarifying comment, but honestly I'm not sure it will since it seems to just introduce a new issue with the row sums.

I'm starting to accept that _xfail_checks might be unavoidable, at least for CCA. If our transformers have inherent constraints that the sklearn checks weren't designed to handle, there's not much we can do about that. I'm also noticing that sklearn has its own share of hacky special cases to avoid problematic checks...

grovduck commented 1 year ago

I'm also noticing that sklearn has its own share of hacky special cases to avoid problematic checks...

Funny! Yeah, that looks like something I would do 😉.

In the interest of not getting completely high-centered, I'm OK moving forward with the _xfail_checks plan with CCA. I think I prefer that approach to including the multioutput_only tag. I'll let you make the final decision, but it feels like we've invested enough time trying to figure this out that it's time to move on.

aazuspan commented 1 year ago

In the interest of not getting completely high-centered, I'm OK moving forward with the _xfail_checks plan with CCA. I think I prefer that approach to including the multioutput_only tag. I'll let you make the final decision, but it feels like we've invested enough time trying to figure this out that it's time to move on.

Agreed, moving forward with _xfail_checks for CCA sounds like a plan. If we have a brainwave on how to deal with this later on we can always revisit it.

I can add those xfail tags and remove NaN support as discussed, which should hopefully wrap up that transformer.

aazuspan commented 1 year ago

CCorA: Both X and Y can be n x m arrays with m >= 1.

@grovduck I'm trying to wrap up this PR and running into a complication with CCorA that suggests y must be 2D here as well. The offending line is here:

https://github.com/lemma-osu/scikit-learn-knn-regression/blob/0878eed7c03ad490cf782e78c5b08f1cdc936ee6/src/sknnr/transformers/_ccora_transformer.py#L13

As we found with CCA, most of the checks use a 1D y array, but attempting to fit a StandardScaler with a 1D array fails because its fit calls _validate_data, which calls check_array, which by default requires a 2D array.

That can be reproduced just by running:

import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.random.randint(low=0, high=4, size=100)
StandardScaler().fit(y) # ValueError: Expected 2D array, got 1D array instead

If y does in fact need to be 2D we can take the same approach we took with CCA of xfailing those checks, but if that's not an inherent restriction of CCorA, maybe there's a workaround? Let me know if you have any thoughts on this!

grovduck commented 1 year ago

If y does in fact need to be 2D we can take the same approach we took with CCA of xfailing those checks, but if that's not an inherent restriction of CCorA, maybe there's a workaround? Let me know if you have any thoughts on this!

@aazuspan, I need to verify this, but I believe CCorA can use 1-d arrays but they need to be formatted as n x 1 2-d arrays. If you look at the code associated with sklearn's CCA fit method, they call check_array with ensure_2d=False and then reshape the 1-D array to a 2-D array if necessary.

https://github.com/scikit-learn/scikit-learn/blob/7f9bad99d6e0a3e8ddf92a7e5561245224dab102/sklearn/cross_decomposition/_pls.py#L237-L241

I'm guessing we should do the same with CCorA, but like I said, I haven't yet tested this. I can try to do so this afternoon.

EDIT: I've tested this and we should be able to input a 1-d y array with this change.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted
from sklearn.utils import check_array

from . import ComponentReducerMixin, StandardScalerWithDOF
from ._ccora import CCorA

class CCorATransformer(ComponentReducerMixin, TransformerMixin, BaseEstimator):
    def fit(self, X, y):
        self._validate_data(X, reset=True)
        self.scaler_ = StandardScalerWithDOF(ddof=1).fit(X)
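        # Validate y without forcing 2D, then promote a 1D y to a single-column 2D array (mirrors sklearn's CCA)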
        y = check_array(y, input_name="Y", dtype=np.float64, ensure_2d=False)
        if y.ndim == 1:
            y = y.reshape(-1, 1)
        y = StandardScalerWithDOF(ddof=1).fit_transform(y)
        ...

Are you able to test if this works? The number of xpassed checks went from 253 to 300 when I made this change. Sorry that I don't have a cleverer way than that to figure out if it's working correctly.

EDIT 2: I re-read my own notes earlier in the thread and tested this with the @pytest.mark.xfail(reason="Incrementally fixing.") commented out. Just one check still fails for CCorATransformer, which is check_estimators_nan_inf. I assume you fix this with _more_tags?

aazuspan commented 1 year ago

Great find! That seems to resolve the 1D y issue which, as you found, was responsible for all but one of the failed checks.

Just one check still fails for CCorATransformer, which is check_estimators_nan_inf. I assume you fix this with _more_tags?

In this case we can fix that by adding self._validate_data(X, force_all_finite=True) to the transform method, which just throws the error that the check wants to see when a NaN or Inf is given. I'm assuming that's expected behavior?
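Concretely, I'm picturing something like this (a sketch; the rest of the transform body is elided, and I've also included reset=False, which transform methods typically use so n_features_in_ isn't overwritten):

def transform(self, X):
    check_is_fitted(self)
    # force_all_finite=True raises the ValueError that check_estimators_nan_inf expects to see
    X = self._validate_data(X, reset=False, force_all_finite=True)
    ...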

The only outstanding issue with CCorATransformer now is one zero-division warning that arises from this line:

https://github.com/lemma-osu/scikit-learn-knn-regression/blob/0878eed7c03ad490cf782e78c5b08f1cdc936ee6/src/sknnr/transformers/_ccora.py#L37

I'm thinking about just using pytest.mark.filterwarnings to specifically ignore that, unless you think there's some other way we can handle it? I guess I'm not sure what the circumstances are where that would occur with real data, but I'm assuming it's pretty rare if it only happens once in all our test cases.
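e.g. something like this on the relevant parametrized check test (a sketch - the test name and the exact warning message to match are illustrative):

import pytest

@pytest.mark.filterwarnings("ignore:divide by zero encountered:RuntimeWarning")
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)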

EDIT: Sorry, this warning actually arises from MSNRegressor, but is caused by CCorATransformer.


grovduck commented 1 year ago

I'm thinking about just using pytest.mark.filterwarnings to specifically ignore that, unless you think there's some other way we can handle it? I guess I'm not sure what the circumstances are where that would occur with real data, but I'm assuming it's pretty rare if it only happens once in all our test cases.

That seems like a good decision to me. I'm guessing it must be a case where no component is significant (i.e. mask = t > 0), but I'm not sure it's worth the effort tracking down when that happens.

aazuspan commented 1 year ago

Down to only a few failing checks!

I'm currently working on MSNRegressor()-check_n_features_in_, which is failing because the n_features_in_ attribute does not match the number of features that were used to fit the model. This is essentially the same problem we faced with feature_names_in_, where the model stores metadata about the features after transformation, when it should instead be storing data on the original, untransformed features.

We can solve this in pretty much the same way we solved that, linked below. Basically, just add an n_features_in_ property and an empty _check_n_features method. The self.transform_._validate_data check in _apply_transform should catch any mismatch in the number of features.

https://github.com/lemma-osu/scikit-learn-knn-regression/blob/87747de75220d05a074767692616a58cd51d0cf8/src/sknnr/_base.py#L46-L57
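i.e. roughly this in _base.py (a sketch - the class name here is just shorthand for wherever our feature_names_in_ handling already lives):

class TransformedKNeighborsMixin:
    @property
    def n_features_in_(self):
        # report the number of original, untransformed features seen during fit
        return self.transform_.n_features_in_

    def _check_n_features(self, X, reset):
        # no-op: the self.transform_._validate_data call in _apply_transform
        # will raise if X has the wrong number of features
        pass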

Just wanted to run that by you and make sure that seems like the right approach! We could just xfail the check, but modifying the estimator to return the original number of training features rather than the number of transformed features seems more consistent with how we're handling feature_names_in_.

grovduck commented 1 year ago

Just wanted to run that by you and make sure that seems like the right approach!

Yep, I agree that's the right approach (assuming it was the right approach for _check_feature_names :smile:). I don't love the fact that we're overriding methods just to silence issues, but I think you've chosen the least disruptive path. Thanks for pushing on these last annoying checks!

aazuspan commented 1 year ago

Yeah, I feel the same way. I don't love the solution, but the only other option I can come up with would be to totally rewrite fit and predict, which I like even less.

I just wrapped up the last couple MSNRegressor checks, so all of our estimators are now officially passing! I think the only remaining issues to solve are the StandardScalerWithDOF questions at the top, which I think I'll need to defer to you on. The end of this PR is finally in sight!

grovduck commented 1 year ago

Sorry, I dropped the ball on the StandardScalerWithDOF issues (I missed them at the top!)

  • Sparse support and with_ parameters: Do we want to support sparse data and/or the with_mean and with_std initialization parameters? StandardScaler only accepts sparse data if with_mean=False. I lean towards keeping things simple by dropping sparse support and not adding the with_ parameters since those are presumably incompatible with our estimators and very rare use cases, but let me know if you disagree

Yes, let's drop sparse support. I can't say that I've ever used sparse matrices, so let's keep things simple. Is that equivalent to self._validate_data(accept_sparse=False)?

For the with_ parameters, we don't need to do anything because they will default to True on the call to super, correct?

  • check_sample_weights_invariance(kind=zeros): This check ensures that sample weights of 0 are equivalent to removing the samples. I think that by recalculating scale_ without considering sample weights, we're violating this check. The easy solution would be to remove the sample_weights parameter, as that would disable the check and prevent the error. If we do want to support sample weights, I think we'll need a trained mathematician to figure this out

I agree with you on this as well. We're really only using this class in the most standard way possible and adjusting the degrees of freedom. So making it as restrictive as possible makes sense to me.
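For concreteness, here's roughly what I'm picturing for the slimmed-down scaler (a sketch only - the real class may differ, e.g. around zero-variance columns):

import numpy as np
from sklearn.preprocessing import StandardScaler

class StandardScalerWithDOF(StandardScaler):
    def __init__(self, ddof=0):
        super().__init__()  # with_mean/with_std stay at their defaults (True)
        self.ddof = ddof

    def fit(self, X, y=None):
        # explicitly reject sparse input (this is also the check_array default)
        X = self._validate_data(X, accept_sparse=False)
        # no sample_weight parameter, so check_sample_weights_invariance no longer applies
        super().fit(X, y)
        # recompute scale_ with the requested degrees of freedom
        self.scale_ = np.std(np.asarray(X), axis=0, ddof=self.ddof)
        return self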

Making those changes looks to fix the failing test. You've done it!!!!

aazuspan commented 1 year ago

No problem! I had forgotten about them until now, too!

Is that equivalent to self._validate_data(accept_sparse=False)?

Yeah, I think that should do it!

For the with_ parameters, we don't need to do anything because they will default to True on the call to super, correct?

Correct!

We're really only using this class in the most standard way possible and adjusting the degrees of freedom. So making it as restrictive as possible makes sense to me.

Makes perfect sense!

Making those changes looks to fix the failing test. You've done it!!!!

Heck yeah! Definitely a group effort!! Feel free to commit those changes and remove the pytest xfail. I think it'll be worth doing a final pass to make sure all the tags and validate calls are consistent and complete across estimators and transformers before merging, but we're definitely at the finish line 🎉

grovduck commented 1 year ago

@aazuspan, I took one more look at the xfails on CCATransformer to see if there was anything that could be done. I revisited this comment and more fully understood it this time:

Is the multioutput_only tag essentially what we're looking for? I came across that while looking through the _MultiOutputEstimator class.

Adding that tag definitely doesn't fix the checks, but it does replace most of the "axis 1 is out of bounds" errors with "row sums must be greater than 0" errors, so maybe that's progress...

My initial reaction to that was that these transformers are not y multi-output and that it was disingenuous to treat them as such, but in looking more at how the tag is used (at least in the transformer context), it just asserts that y is at least 2-D, which is obviously what we want. But then, as you noticed, we get zero row sums errors instead. Even though it's not strictly applicable, I tried using requires_positive_y=True and multioutput_only=True together, and many of the tests actually passed.

One set that didn't, however, was the check_transformer_* tests, but strangely those tests are not calling _enforce_estimator_tags_y, so the tags are being ignored in this instance.

The upshot for me is that our xfails are unsatisfying, but I think the checks are trying to serve both transformers and estimators, and there are enough inconsistencies between the two that some xfails are probably unavoidable.

Other than that, I don't have more of a formal review, because we've been beating this one endlessly. I'm good to merge whenever you are. Thanks for your great work on this one.

aazuspan commented 1 year ago

Thanks for revisiting those checks! So it sounds like we could fix some of the xfails, but it would require using tags that aren't strictly accurate and wouldn't solve everything anyway? Let me know if I misunderstood your conclusion, but I guess I reluctantly lean towards accepting all the xfails for now while keeping an eye out for how we might remove them down the line, whether that's making our estimators more flexible, finding a better way to trick sklearn into passing them, or rewriting the estimator checks.

If you're ready to merge, I am too! Great to finally get this one wrapped up!!