lemma-osu / sknnr

scikit-learn compatible estimators for various kNN imputation methods
https://sknnr.readthedocs.io

Scikit-learn 1.6 compatibility #79

Open aazuspan opened 4 days ago

aazuspan commented 4 days ago

Closes #78 by introducing compatibility fixes for breaking changes in sklearn >= 1.6, described below.

As for our downstream packages, anything that relied on feature_names_in_ or n_features_in_ would need to be updated, but I did a quick search and didn't see either show up in the code of sknnr-spatial, synthetic-knn, or gee-knn-python.

xfail estimator checks

Instead of marking our xfailing estimator checks in the estimator tags, they're now passed as arguments to parametrize_with_checks. I'm very happy with this change since it gets a lot of testing details out of the public API, but it does mean our tests aren't backwards compatible, so I've pinned the test environment to sklearn 1.6. It would be nice to be able to test against older versions, but I couldn't think of a good way to allow that. Might be worth a future PR, though.
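For reference, the new-style configuration looks roughly like this; the estimator and the expected failure listed below are placeholders, not the ones actually configured in our test suite:

```python
# Minimal sketch of passing expected failures to parametrize_with_checks
# (requires sklearn >= 1.6). The estimator and check name are illustrative.
from sklearn.utils.estimator_checks import parametrize_with_checks

from sknnr import GNNRegressor


def expected_failed_checks(estimator):
    # Map check names to the reason each check is expected to fail.
    return {
        "check_fit2d_1sample": "kNN estimators need more than one sample",
    }


@parametrize_with_checks(
    [GNNRegressor()],
    expected_failed_checks=expected_failed_checks,
)
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)
```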

Tags

Tags are now defined in __sklearn_tags__ using dataclasses instead of as a dictionary returned by _more_tags. This was a pretty simple switch, although I re-evaluated the tags while making it to make sure they were still correct and not redundant with inherited tags. Those might be worth a closer look.
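As a rough sketch of what the switch looks like (the specific tag values below are hypothetical, not the exact ones set on our estimators):

```python
from sklearn.base import BaseEstimator, RegressorMixin


class ExampleRegressor(RegressorMixin, BaseEstimator):
    # Old style (pre-1.6): return a dict of tag overrides.
    # def _more_tags(self):
    #     return {"multioutput": True}

    # New style (1.6+): update the inherited Tags dataclass.
    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.target_tags.multi_output = True
        tags.input_tags.allow_nan = False
        return tags
```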

We could keep _more_tags for full backwards compatibility, but I don't think those tags are likely to be touched by users, so I'm not too worried about breaking that.

validate_data

This changed from an estimator method to a utility function in sklearn 1.6. For backwards compatibility, I wrapped it in a helper that dispatches to whichever is available.
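Roughly, the shim looks something like this (the names and signature here are illustrative, not the exact helper in the PR):

```python
try:
    # sklearn >= 1.6: validate_data is a standalone utility function.
    from sklearn.utils.validation import validate_data as _sklearn_validate_data

    def validate_data(estimator, X="no_validation", y="no_validation", **kwargs):
        return _sklearn_validate_data(estimator, X=X, y=y, **kwargs)

except ImportError:
    # sklearn < 1.6: fall back to the private estimator method.
    def validate_data(estimator, X="no_validation", y="no_validation", **kwargs):
        return estimator._validate_data(X=X, y=y, **kwargs)
```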

n_features_in_ and feature_names_in_

This is the big change that I could really use some feedback on. A while back we decided that these attributes on the transformed estimators should correspond to the raw features that were passed in by the user, rather than the transformed features that are actually used to fit the estimator. In other words, if you fit a GNNRegressor with 20 features, it might actually be fit with 5 features in CCA space, but we would show the user the original 20 features rather than exposing the implementation.

I don't think that was the wrong choice, but it did add complexity, since the base classes we inherit from run internal checks against those attributes. To work around that, we were overriding _check_feature_names and _check_n_features to skip the checks and indirectly deferring to the corresponding checks on the transformers, but with the 1.6 switch from estimator method checks to utility function checks, that workaround broke.
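For context, the old workaround looked roughly like this (the mixin name is illustrative):

```python
class TransformedFeatureMixin:
    # Pre-1.6 workaround: skip the inherited feature checks so that
    # n_features_in_ / feature_names_in_ could reflect the raw inputs,
    # deferring validation to the fitted transformer instead.
    def _check_feature_names(self, X, *, reset):
        pass

    def _check_n_features(self, X, reset):
        pass
```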

After spending some time thinking about this and trying to come up with new workarounds, the conclusion I reached is that we're fighting a losing battle by trying to trick sklearn and change the meaning of those attributes. I'm suggesting (and have implemented here) that we just fall back to the default behavior, where our estimators store the transformed features they were actually fit with and their transformers store the raw features they were fit with.
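Concretely, the defaults would look something like this (values are hypothetical, assuming a GNNRegressor fit on 20 raw features reduced to 5 CCA components; the parameter name shown is illustrative):

```python
est = GNNRegressor(n_components=5).fit(X_train, y_train)  # X_train has 20 columns

est.n_features_in_               # 5  -> transformed features the estimator was fit with
est.transformer_.n_features_in_  # 20 -> raw features passed by the user
```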

If we think this is still a concern for users, we could expose a new attribute raw_features_in_ or point users to estimator.transformer_.n_features_in_, but my suspicion is that it would be rare for users to rely on these attributes.