lemma-osu / sknnr

scikit-learn compatible estimators for various kNN imputation methods
https://sknnr.readthedocs.io

Independent predict and score methods for KNeighborsRegressor derived estimators #45

Closed: grovduck closed this issue 1 year ago

grovduck commented 1 year ago

The predict method for KNeighborsRegressor uses an X matrix to calculate a prediction for each row in X. If the model was fit with this same X matrix and k is set to 1, predict will (by definition) return the y matrix used in fitting the model. This is because each point's nearest neighbor is always itself (barring duplicate values), and the y values associated with this neighbor are returned, i.e.

>>> import numpy as np
>>> from sklearn.neighbors import KNeighborsRegressor
>>> from sklearn.datasets import load_linnerud
>>> X, y = load_linnerud(return_X_y=True)
>>> est = KNeighborsRegressor(n_neighbors=1).fit(X, y)
>>> print(np.all(est.predict(X) == y))
True

The kneighbors method, on the other hand, allows the option of not passing the X matrix in the call, in which case "neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor".
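As a quick sanity check (a minimal sketch using the same linnerud data as above, not sknnr code), the two calling conventions can be compared directly:

```python
# Compare kneighbors with and without X. Assumes scikit-learn is
# installed and that X contains no duplicate rows (true for linnerud).
import numpy as np
from sklearn.datasets import load_linnerud
from sklearn.neighbors import KNeighborsRegressor

X, y = load_linnerud(return_X_y=True)
est = KNeighborsRegressor(n_neighbors=1).fit(X, y)

# Passing the training X: each point finds itself at distance zero.
self_ind = est.kneighbors(X, return_distance=False)
print(np.array_equal(self_ind.ravel(), np.arange(len(X))))  # True

# Passing no X: "the query point is not considered its own neighbor".
indep_ind = est.kneighbors(return_distance=False)
print((indep_ind.ravel() == np.arange(len(X))).any())  # False
```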

For kNN methods, it is common to use the "second-nearest neighbor" (truly, the first independent neighbor) for independent accuracy assessment. This is very similar to a leave-one-out (LOO) analysis, except that the model is fit once with all plots rather than being fit n times with n-1 plots and predicting the left-out plot.

To accomplish this, we can:

  1. Call kneighbors with no X array to get independent neighbors for each row
  2. Continue with the same logic as in predict to get the predicted values for each column in X, but with independent neighbors

Currently, all of our estimators are using the predict and score methods from KNeighborsRegressor (i.e. these methods have not been overridden). We will want to retain predict and score as-is for when we send new X data to an already fit model. But we also need to provide additional methods for independent prediction and scoring, in the least disruptive way.

I don't see a particularly elegant way to do this, other than repeating most of what is already in KNeighborsRegressor.predict. In essence,

def predict_independent(self):
    # Note that kneighbors is called without `X` parameter
    if self.weights == "uniform":
        neigh_ind = self.kneighbors(return_distance=False)
        neigh_dist = None
    else:
        neigh_dist, neigh_ind = self.kneighbors()
    # _get_weights is sklearn's private helper from sklearn.neighbors._base
    weights = _get_weights(neigh_dist, self.weights)
    # Remainder of predict method

def score_independent(self, y, sample_weight=None):
    # Using root mean squared error (sklearn.metrics.mean_squared_error)
    # instead of the R^2 that is returned from `score`
    y_pred = self.predict_independent()
    return mean_squared_error(
        y,
        y_pred,
        sample_weight=sample_weight,
        squared=False,
        multioutput="raw_values",
    )

These would likely be new methods in an overridden KNeighborsRegressor or possibly as a separate mix-in. @aazuspan, I'm looking for feedback on the best way to bring this functionality in if you have any thoughts. In early testing, I'm getting the correct values (based on a comparison with yaImpute), but am uneasy about the duplication of code.

aazuspan commented 1 year ago

@grovduck thanks for the detailed write-up! It took me a while to wrap my head around this, but I think I'm up to speed now.

This is a bit hacky since it's obviously not the intended behavior, but can we accomplish these independent methods by calling the standard predict and score methods with X=None? If you think this is a common use case, that might be too weird of an API to expect users to use, but we could potentially simplify the predict_independent and score_independent methods to use the signature you suggested and simply call their corresponding methods with X=None, e.g.

def predict_independent(self):
    return self.predict(X=None)

EDIT: Alternatively, if we wanted predict and score to act more like kneighbors and just default to running independently unless an X is provided (rather than having separate independent methods), we could do:

def predict(self, X=None):
    return super().predict(X)

We will want to retain predict and score as-is for when we send new X data to an already fit model.

So if the user chose to score a sample that was in the training set (e.g. a pixel that had a corresponding plot, leading to an identical X matrix), we're not currently planning to provide a "second-nearest neighbor" in that case, right? I know we've discussed that and ran into a lot of complexity around excluding neighborhoods of plot pixels, but just wanted to make sure I'm on the same page.

grovduck commented 1 year ago

This is a bit hacky since it's obviously not the intended behavior, but can we accomplish these independent methods by calling the standard predict and score methods with X=None? If you think this is a common use case, that might be too weird of an API to expect users to use, but we could potentially simplify the predict_independent and score_independent methods to use the signature you suggested and simply call their corresponding methods with X=None

Yes, good thinking! I had thought of this design as well, but wondered if the optional X would raise any issues when doing estimator checks. I just did a quick search of scikit-learn and only see one estimator where they use this pattern (i.e. def predict(self, X=None)), and it looks to be in an outlier detection estimator. I think that scared me off, but if folks are used to passing an X then maybe just an explanation in the docstring might be sufficient?

EDIT: Alternatively, if we wanted predict and score to act more like kneighbors and just default to running independently unless an X is provided (rather than having separate independent methods), we could do:

def predict(self, X=None):
    return super().predict(X)

I think I like this pattern better for sure.

So if the user chose to score a sample that was in the training set (e.g. a pixel that had a corresponding plot, leading to an identical X matrix), we're not currently planning to provide a "second-nearest neighbor" in that case, right? I know we've discussed that and ran into a lot of complexity around excluding neighborhoods of plot pixels, but just wanted to make sure I'm on the same page.

I think there are two issues here:

  1. In the case where we have two samples that are identical (and thus identical X vectors) but only one was used in fitting, then the sample that was not used in fitting does not need an independent neighbor and would use the identical sample as its first nearest neighbor (the "identical-twin" case).
  2. In the case where a pixel is part of the "footprint" of a sample that was used in fitting, then it is no longer independent and needs to pick a different sample. (Note that there is the added complexity that a pixel in a footprint often has a different X than the average signature of the footprint itself.) But I think we discussed that this is really outside the scope of the initial sknnr package because we aren't even yet suggesting that these estimators are used in a spatial context. In a wrapping package that deals with accuracy assessment at the pixel scale in plot footprints, my current thought is that we would:
    1. Associate pixels with their plot ID (or sample index from fit)
    2. Run kneighbors(X) with all pixels and oversample k
    3. Filter the list to exclude any neighbors matching the plot ID
    4. Cut down to the desired k
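Steps 1-4 could be sketched roughly as follows (all names here are hypothetical; this is a wrapper-package idea, not sknnr code):

```python
# Oversample k, drop neighbors that share the pixel's own plot ID, then
# trim back to the desired k.
import numpy as np


def independent_pixel_neighbors(neigh_ind, pixel_plot_ids, plot_ids, k):
    """neigh_ind: (n_pixels, oversampled_k) indices into training samples,
    e.g. from kneighbors(X_pixels); pixel_plot_ids: (n_pixels,) plot ID of
    each pixel; plot_ids: (n_samples,) plot ID of each training sample."""
    rows = []
    for row, own_id in zip(neigh_ind, pixel_plot_ids):
        keep = row[plot_ids[row] != own_id]  # step 3: exclude own plot
        rows.append(keep[:k])                # step 4: cut down to k
    return np.array(rows)


# Toy example: 2 pixels, 4 training samples, oversampled k=3, desired k=2
neigh_ind = np.array([[0, 1, 2], [2, 3, 0]])
pixel_plot_ids = np.array([10, 12])    # pixel 0 belongs to plot 10, etc.
plot_ids = np.array([10, 11, 12, 13])  # sample i comes from plot_ids[i]
print(independent_pixel_neighbors(neigh_ind, pixel_plot_ids, plot_ids, k=2))
# → [[1 2]
#    [3 0]]
```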

Note that for map prediction, pixels are allowed to use their own samples/plots as neighbors, so the above just applies to independent AA. Have I totally confused things?

aazuspan commented 1 year ago

I had thought of this design as well, but wondered if the optional X would raise any issues when doing estimator checks

Good thought! Nothing specific comes to mind about why it would cause issues, but it's possible. I would lean towards using the design that makes the most sense, and dealing with incompatible estimator checks as needed, even if we have to xfail them. I'm starting to wonder how much value the estimator checks are actually adding, given the complexity of working around incompatible checks, but that's a discussion for #44...

I think that scared me off, but if folks are used to passing an X then maybe just an explanation in the docstring might be sufficient?

Yeah, I would think so. Another option I just thought of while trying to come up with similar use cases in sklearn is something more like random forest OOB error. In that case, the score and predictions are set as fit attributes oob_score_ and oob_predictions_ (assuming you instantiate the estimator with oob_score=True). What do you think about using attributes to store the independent accuracy score and predictions, rather than modifying the methods? I'm not sure what my preference is, between the two options...
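A very rough sketch of what the fit-attribute pattern could look like (the attribute names below are invented here, not sknnr API, and only uniform weights are handled for brevity):

```python
# Compute the self-excluded prediction and score once, at fit time,
# and store them as fit attributes in the OOB style.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor


class AttributeScoredKNN(KNeighborsRegressor):
    def fit(self, X, y):
        super().fit(X, y)
        # Self-excluded neighbors, as kneighbors() with no X provides
        neigh_ind = self.kneighbors(return_distance=False)
        y = np.asarray(y)
        # Uniform-weight prediction: average y over independent neighbors
        self.independent_prediction_ = y[neigh_ind].mean(axis=1)
        # Root mean squared error per output column
        err = y - self.independent_prediction_
        self.independent_score_ = np.sqrt((err**2).mean(axis=0))
        return self
```

After fit, the estimator would carry `independent_prediction_` and `independent_score_` alongside the usual fitted attributes, so no extra method call is needed.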

In the case where we have two samples that are identical (and thus identical X vectors) but only one was used in fitting, then the sample that was not used in fitting does not need an independent neighbor and would use the identical sample as its first nearest neighbor (the "identical-twin" case).

Good point! I had a vague thought that we could exclude zero-distance neighbors, but of course there is a possibility for legitimately identical samples, especially with a small number of covariates.

But I think we discussed that this is really outside the scope of the initial sknnr package because we aren't even yet suggesting that these estimators are used in a spatial context. In a wrapping package that deals with accuracy assessment at the pixel scale in plot footprints, my current thought is that we would:

Yeah, I thought that's where we'd landed. Your outline for how to solve this in the wrapper package sounds like a good approach!

Note that for map prediction, pixels are allowed to use their own samples/plots as neighbors, so the above just applies to independent AA. Have I totally confused things?

There are definitely a lot of little corner cases and gotchas to watch out for, but I think you've laid them out as clearly as possible! I appreciate how you spaced them out rather than laying them all on me at the outset of the project, which would have been pretty overwhelming!

grovduck commented 1 year ago

Another option I just thought of while trying to come up with similar use cases in sklearn is something more like random forest OOB error. In that case, the score and predictions are set as fit attributes oob_score_ and oob_predictions_ (assuming you instantiate the estimator with oob_score=True). What do you think about using attributes to store the independent accuracy score and predictions, rather than modifying the methods? I'm not sure what my preference is, between the two options...

How interesting! I hadn't even considered this approach. It really feels like these developers found themselves in a similar situation as we are now and had to come up with a solution :smile:. I think I lean toward the predict(self, X=None) call a bit more, but can be convinced otherwise. For now, I'll plan to go with this approach and see what issues it brings up.

There are definitely a lot of little corner cases and gotchas to watch out for ...

It feels like my life is full of gotchas!

aazuspan commented 1 year ago

I think I lean toward the predict(self, X=None) call a bit more, but can be convinced otherwise. For now, I'll plan to go with this approach and see what issues it brings up.

Sounds good to me!

grovduck commented 1 year ago

@aazuspan, I've made some headway on this issue, mostly in terms of creating data from yaImpute to support the tests, so that I can verify this code is working. But I'm currently blocked on a couple of issues that I could use your help with.

First, I was leaning toward the implementation like you suggested:

def predict(self, X=None):
    return super().predict(X)

but recognizing that I need to do the same with score, I come up against providing a keyword argument X before the required argument y, i.e.

def score(self, X=None, y):  # SyntaxError: non-default argument follows default argument
    return super().score(X, y)

I can flip the arguments, but that causes many checks to fail. I'm not sure there is a good way to do this without violating the familiar API? For now, I went back to naming these predict_independent and score_independent and implementing them as:

class IndependentPredictionMixin:
    """
    Mixin for KNeighborsRegressor derived classes that return predictions
    that don't include itself in the nearest neighbors.
    """

    def predict_independent(self) -> NDArray:
        return super().predict(X=None)

    def score_independent(self, y) -> float:
        return super().score(X=None, y=y)

We can also revisit storing these as fit attributes like your other suggestion.

This design, however, brings up an additional issue. I wanted to add this as a mix-in, but mypy is complaining that these two methods are undefined in superclass.

src\sknnr\_base.py:95: error: "predict" undefined in superclass  [misc]
src\sknnr\_base.py:98: error: "score" undefined in superclass  [misc]

This actually makes sense to me as IndependentPredictionMixin isn't a subclass of KNeighborsRegressor, although all our estimators are. But what doesn't make sense to me is why mypy isn't complaining about similar use of super() in KNeighborsDFIndexCrosswalkMixin and TransformedKNeighborsMixin which also aren't subclasses of KNeighborsRegressor. I feel like I'm missing something obvious, but I'm not seeing it.

EDIT: Ah, finally realized it was because I'm giving return type information to these methods, which is causing mypy to actually check them. I'm guessing we'll run into this issue down the road with #7. For now, I'm going to leave off the typing information and get a draft PR up so you can see it.

aazuspan commented 1 year ago

I come up against providing a keyword argument X before the required argument y ... I can flip the arguments, but that causes many checks to fail. I'm not sure there is a good way to do this without violating the familiar API?

Yeah, I didn't think about that! I think you're right that there's no good workaround here that would allow us to override score. Reversing the order of X and y would be very weird, even if we could get the tests to pass. We could make y "optional" and raise an error if it's not provided, but that also seems too odd.

We can also revisit storing these as fit attributes like your other suggestion.

I have a slight preference for this just because there's a precedent for it with the OOB attributes, but if you think the independent methods are clearer or would have some practical advantages, I'm happy to go with those instead.

Ah, finally realized it was because I'm giving return type information to these methods, which is causing mypy to actually check them. I'm guessing we'll run into this issue down the road with https://github.com/lemma-osu/scikit-learn-knn-regression/issues/7.

Good diagnosing! I didn't realize mypy would ignore methods without type info. That's definitely going to bite us once we start getting serious about type annotations... I think your solution of skipping those mypy checks for now is good, but we will need to give this some thought down the road.

The fact that the mixins refer to their superclass methods without actually having a superclass has always felt a little weird. Thinking out loud with untested code, I wonder if something like this would make mypy happy (and be a little more explicit):

    def score_independent(self, y) -> float:
        sup: KNeighborsRegressor = super()
        return sup.score(X=None, y=y)
grovduck commented 1 year ago

Thinking out loud with untested code, I wonder if something like this would make mypy happy (and be a little more explicit):

    def score_independent(self, y) -> float:
        sup: KNeighborsRegressor = super()
        return sup.score(X=None, y=y)

Beautiful, that actually does it (at least in this method)! I'm guessing we'll want to do this on all mixins, so perhaps hold off on this issue/PR, but implement as part of #7?

aazuspan commented 1 year ago

Cool! There may be a better approach, but it's good to know there's an easy workaround.

I'm guessing we'll want to do this on all mixins, so perhaps hold off on this issue/PR, but implement as part of https://github.com/lemma-osu/scikit-learn-knn-regression/issues/7?

Agreed!

grovduck commented 1 year ago

Closed via #50