@grovduck thanks for the detailed write-up! It took me a while to wrap my head around this, but I think I'm up to speed now.
This is a bit hacky since it's obviously not the intended behavior, but can we accomplish these independent methods by calling the standard `predict` and `score` methods with `X=None`? If you think this is a common use case, that might be too weird of an API to expect users to use, but we could potentially simplify the `predict_independent` and `score_independent` methods to use the signature you suggested and simply call their corresponding methods with `X=None`, e.g.

```python
def predict_independent(self):
    return self.predict(X=None)
```
EDIT: Alternatively, if we wanted `predict` and `score` to act more like `kneighbors` and just default to running independently unless an `X` is provided (rather than having separate independent methods), we could do:

```python
def predict(self, X=None):
    return super().predict(X)
```
> We will want to retain `predict` and `score` as-is for when we send new X data to an already fit model.
So if the user chose to score a sample that was in the training set (e.g. a pixel that had a corresponding plot, leading to an identical X matrix), we're not currently planning to provide a "second-nearest neighbor" in that case, right? I know we've discussed that and ran into a lot of complexity around excluding neighborhoods of plot pixels, but just wanted to make sure I'm on the same page.
> This is a bit hacky since it's obviously not the intended behavior, but can we accomplish these independent methods by calling the standard `predict` and `score` methods with `X=None`? If you think this is a common use case, that might be too weird of an API to expect users to use, but we could potentially simplify the `predict_independent` and `score_independent` methods to use the signature you suggested and simply call their corresponding methods with `X=None`
Yes, good thinking! I had thought of this design as well, but wondered if the optional `X` would raise any issues when doing estimator checks. I just did a quick search of scikit-learn and only see one estimator where they use this pattern (e.g. `def predict(self, X=None)`), and it looks to be in an outlier detection estimator. I think that scared me off, but if folks are used to passing an `X`, then maybe just an explanation in the docstring might be sufficient?
> EDIT: Alternatively, if we wanted `predict` and `score` to act more like `kneighbors` and just default to running independently unless an X is provided (rather than having separate independent methods), we could do:
>
> ```python
> def predict(self, X=None):
>     return super().predict(X)
> ```
I think I like this pattern better for sure.
> So if the user chose to score a sample that was in the training set (e.g. a pixel that had a corresponding plot, leading to an identical X matrix), we're not currently planning to provide a "second-nearest neighbor" in that case, right? I know we've discussed that and ran into a lot of complexity around excluding neighborhoods of plot pixels, but just wanted to make sure I'm on the same page.
I think there are two issues here:

1. In the case where we have two samples that are identical (and thus identical `X` vectors) but only one was used in fitting, then the sample that was not used in fitting does not need an independent neighbor and would use the identical sample as its first nearest neighbor (the "identical-twin" case).
2. … `X` … than the average signature of the footprint itself.) But I think we discussed that this is really outside the scope of the initial sknnr package because we aren't even yet suggesting that these estimators are used in a spatial context.

In a wrapping package that deals with accuracy assessment at the pixel scale in plot footprints, my current thought is that we would (see the sketch below):

1. fit the model with all plots (`fit`)
2. run `kneighbors(X)` with all pixels and oversample `k`
3. drop dependent neighbors and trim back to `k`
Note that for map prediction, pixels are allowed to use their own samples/plots as neighbors, so the above just applies to independent AA. Have I totally confused things?
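To make that oversampling idea concrete, here is a minimal sketch under assumptions not stated in the thread: the plot/pixel arrays are made-up toy data, and filtering via a `plot_ids` lookup is just one way the exclusion could work.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Toy setup: 10 plots, each contributing one training sample.
plot_X = rng.normal(size=(10, 3))
plot_y = rng.normal(size=10)
plot_ids = np.arange(10)  # plot ID for each training sample

# 20 pixels, each known to fall inside one of the plots.
pixel_X = rng.normal(size=(20, 3))
pixel_plot_ids = rng.integers(0, 10, size=20)

k = 3
est = KNeighborsRegressor(n_neighbors=k).fit(plot_X, plot_y)

# Oversample neighbors, then drop any neighbor drawn from the pixel's
# own plot and trim back to the first k independent neighbors.
_, idx = est.kneighbors(pixel_X, n_neighbors=k + 3)
independent = np.array(
    [[j for j in row if plot_ids[j] != own][:k]
     for row, own in zip(idx, pixel_plot_ids)]
)
print(independent.shape)  # (20, 3)
```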
> I had thought of this design as well, but wondered if the optional X would raise any issues when doing estimator checks
Good thought! Nothing specific comes to mind about why it would cause issues, but it's possible. I would lean towards using the design that makes the most sense, and dealing with incompatible estimator checks as needed, even if we have to xfail them. I'm starting to wonder how much value the estimator checks are actually adding, given the complexity of working around incompatible checks, but that's a discussion for #44...
> I think that scared me off, but if folks are used to passing an X then maybe just an explanation in the docstring might be sufficient?
Yeah, I would think so. Another option I just thought of while trying to come up with similar use cases in sklearn is something more like random forest OOB error. In that case, the score and predictions are set as fit attributes `oob_score_` and `oob_prediction_` (assuming you instantiate the estimator with `oob_score=True`). What do you think about using attributes to store the independent accuracy score and predictions, rather than modifying the methods? I'm not sure what my preference is, between the two options...
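A rough sketch of what that attribute-based approach could look like here — the class and attribute names (`independent_prediction_`, `independent_score_`) are placeholders, and it assumes uniform weights and relies on the private `_y` attribute that KNeighborsRegressor stores at fit time:

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor


class IndependentScoreRegressor(KNeighborsRegressor):
    """Hypothetical estimator that stores independent results at fit time."""

    def fit(self, X, y):
        super().fit(X, y)
        # With no X, kneighbors() excludes each sample from its own
        # neighborhood, yielding the first *independent* neighbors.
        neigh_ind = self.kneighbors(return_distance=False)
        # Uniform-weights aggregation over the independent neighbors'
        # training targets, mirroring predict().
        self.independent_prediction_ = np.mean(self._y[neigh_ind], axis=1)
        self.independent_score_ = r2_score(y, self.independent_prediction_)
        return self
```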
> In the case where we have two samples that are identical (and thus identical X vectors) but only one was used in fitting, then the sample that was not used in fitting does not need an independent neighbor and would use the identical sample as its first nearest neighbor (the "identical-twin" case).
Good point! I had a vague thought that we could exclude zero-distance neighbors, but of course there is a possibility for legitimately identical samples, especially with a small number of covariates.
> But I think we discussed that this is really outside the scope of the initial sknnr package because we aren't even yet suggesting that these estimators are used in a spatial context. In a wrapping package that deals with accuracy assessment at the pixel scale in plot footprints, my current thought is that we would:
Yeah, I thought that's where we'd landed. Your outline for how to solve this in the wrapper package sounds like a good approach!
> Note that for map prediction, pixels are allowed to use their own samples/plots as neighbors, so the above just applies to independent AA. Have I totally confused things?
There are definitely a lot of little corner cases and gotchas to watch out for, but I think you've laid them out as clearly as possible! I appreciate how you spaced them out rather than laying them all on me at the outset of the project, which would have been pretty overwhelming!
> Another option I just thought of while trying to come up with similar use cases in sklearn is something more like random forest OOB error. In that case, the score and predictions are set as fit attributes `oob_score_` and `oob_prediction_` (assuming you instantiate the estimator with `oob_score=True`). What do you think about using attributes to store the independent accuracy score and predictions, rather than modifying the methods? I'm not sure what my preference is, between the two options...
How interesting! I hadn't even considered this approach. It really feels like these developers found themselves in a similar situation as we are now and had to come up with a solution :smile:. I think I lean toward the `predict(self, X=None)` call a bit more, but can be convinced otherwise. For now, I'll plan to go with this approach and see what issues it brings up.
> There are definitely a lot of little corner cases and gotchas to watch out for ...
It feels like my life is full of gotchas!
> I think I lean toward the `predict(self, X=None)` call a bit more, but can be convinced otherwise. For now, I'll plan to go with this approach and see what issues it brings up.
Sounds good to me!
@aazuspan, I've made some headway with this issue, mostly in terms of creating data from yaImpute to support the tests, such that I can see that this code is working. But I'm currently blocked on a couple of issues that I could use your help with.
First, I was leaning toward an implementation like you suggested:

```python
def predict(self, X=None):
    return super().predict(X)
```
but recognizing that I need to do the same with `score`, I come up against providing a keyword argument `X` before the required argument `y`, i.e.

```python
# Invalid: a parameter without a default can't follow one with a
# default, so this signature raises a SyntaxError.
def score(self, X=None, y):
    return super().score(X, y)
```
I can flip the arguments, but that causes many checks to fail. I'm not sure there is a good way to do this without violating the familiar API? For now, I went back to naming these `predict_independent` and `score_independent` and implementing them as:
```python
from numpy.typing import NDArray


class IndependentPredictionMixin:
    """
    Mixin for KNeighborsRegressor-derived classes that return predictions
    that don't include the sample itself in the nearest neighbors.
    """

    def predict_independent(self) -> NDArray:
        return super().predict(X=None)

    def score_independent(self, y) -> float:
        return super().score(X=None, y=y)
```
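Hypothetically, usage on an estimator that composes this mixin might look like the following (the class name and toy data are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor


# Hypothetical composed estimator; the mixin comes first in the MRO so
# that super() inside it resolves to KNeighborsRegressor.
class IndependentKNNRegressor(IndependentPredictionMixin, KNeighborsRegressor):
    pass


X_train = np.random.rand(20, 3)
y_train = np.random.rand(20)

est = IndependentKNNRegressor(n_neighbors=5).fit(X_train, y_train)
y_ind = est.predict_independent()        # predictions from independent neighbors
r2_ind = est.score_independent(y_train)  # R^2 against the training targets
```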
We can also revisit storing these as fit attributes like your other suggestion.
This design, however, brings up an additional issue. I wanted to add this as a mix-in, but mypy is complaining that these two methods are undefined in superclass.
```
src\sknnr\_base.py:95: error: "predict" undefined in superclass [misc]
src\sknnr\_base.py:98: error: "score" undefined in superclass [misc]
```
This actually makes sense to me, as `IndependentPredictionMixin` isn't a subclass of `KNeighborsRegressor`, although all our estimators are. But what doesn't make sense to me is why mypy isn't complaining about similar use of `super()` in `KNeighborsDFIndexCrosswalkMixin` and `TransformedKNeighborsMixin`, which also aren't subclasses of `KNeighborsRegressor`. I feel like I'm missing something obvious, but I'm not seeing it.
EDIT: Ah, finally realized it was because I'm giving return type information to these methods, which is causing mypy to actually check them. I'm guessing we'll run into this issue down the road with #7. For now, I'm going to leave off the typing information and get a draft PR up so you can see it.
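For reference, mypy does skip the bodies of unannotated functions by default (unless run with `--check-untyped-defs`), which is consistent with the behavior described above. A contrived illustration:

```python
class Demo:
    def unchecked(self):
        return undefined_name  # mypy ignores this body: no annotations

    def checked(self) -> int:
        return undefined_name  # mypy flags this: the function is annotated
```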
> I come up against providing a keyword argument X before the required argument y ... I can flip the arguments, but that causes many checks to fail. I'm not sure there is a good way to do this without violating the familiar API?
Yeah, I didn't think about that! I think you're right that there's no good workaround here that would allow us to override `score`. Reversing the order of X and y would be very weird, even if we could get the tests to pass. We could make y "optional" and raise an error if it's not provided, but that also seems too odd.
> We can also revisit storing these as fit attributes like your other suggestion.
I have a slight preference for this just because there's a precedent for it with the OOB attributes, but if you think the independent methods are clearer or would have some practical advantages, I'm happy to go with those instead.
> Ah, finally realized it was because I'm giving return type information to these methods, which is causing mypy to actually check them. I'm guessing we'll run into this issue down the road with https://github.com/lemma-osu/scikit-learn-knn-regression/issues/7.
Good diagnosing! I didn't realize mypy would ignore methods without type info. That's definitely going to bite us once we start getting serious about type annotations... I think your solution of skipping those mypy checks for now is good, but we will need to give this some thought down the road.
The fact that the mixins refer to their superclass methods without actually having a superclass has always felt a little weird. Thinking out loud with untested code, I wonder if something like this would make mypy happy (and be a little more explicit):
```python
def score_independent(self, y) -> float:
    sup: KNeighborsRegressor = super()
    return sup.score(X=None, y=y)
```
> Thinking out loud with untested code, I wonder if something like this would make mypy happy (and be a little more explicit):
>
> ```python
> def score_independent(self, y) -> float:
>     sup: KNeighborsRegressor = super()
>     return sup.score(X=None, y=y)
> ```
Beautiful, that actually does it (at least in this method)! I'm guessing we'll want to do this on all mixins, so perhaps hold off on this issue/PR, but implement as part of #7?
Cool! There may be a better approach, but it's good to know there's an easy workaround.
> I'm guessing we'll want to do this on all mixins, so perhaps hold off on this issue/PR, but implement as part of https://github.com/lemma-osu/scikit-learn-knn-regression/issues/7?
Agreed!
Closed via #50
The `predict` method for KNeighborsRegressor uses an `X` matrix to calculate a predicted score for each row in X. If the model was fit from this same `X` matrix and `k` is set to 1, `predict` will (by definition) return the `y` matrix used in fitting the model. This is because the nearest neighbor is always itself (ignoring duplicate values) and the `y` values associated with this neighbor are returned.

The `kneighbors` method, on the other hand, allows the option of not passing the `X` matrix in the call, in which case "neighbors of each indexed point are returned. In this case, the query point is not considered its own neighbor".

For kNN methods, it is common to use the "second-nearest neighbor" (truly, the first independent neighbor) in independent accuracy assessment. This is very similar to a leave-one-out (LOO) analysis, except that the model is fit with all plots rather than being fit n times with n-1 plots and predicting the left-out plot.
To accomplish this, we can:

1. call `kneighbors` with no `X` array to get independent neighbors for each row
2. mimic `predict` to get the predicted values for each column in `X`, but with independent neighbors

Currently, all of our estimators are using the `predict` and `score` methods from KNeighborsRegressor (i.e. these methods have not been overridden). We will want to retain `predict` and `score` as-is for when we send new `X` data to an already fit model. But we also need to provide additional methods for independent prediction and scoring, in the least disruptive way.

I don't see a particularly elegant way to do this, other than repeating most of what is already in KNeighborsRegressor.predict.
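A rough sketch of what that repetition might look like, assuming uniform weights, a single output, and use of the private `_y` attribute (none of which is settled in this issue):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor


class IndependentKNeighborsRegressor(KNeighborsRegressor):
    """Hypothetical override adding an independent-prediction method."""

    def predict_independent(self):
        # With no X, each training point is excluded from its own
        # neighborhood, so these are the independent neighbors.
        neigh_ind = self.kneighbors(return_distance=False)

        # Average the training targets of those neighbors, mirroring
        # KNeighborsRegressor.predict with uniform weights.
        return np.mean(self._y[neigh_ind], axis=1)
```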
These would likely be new methods in an overridden `KNeighborsRegressor` or possibly as a separate mix-in. @aazuspan, I'm looking for feedback on the best way to bring this functionality in if you have any thoughts. In early testing, I'm getting the correct values (based on a comparison with yaImpute), but am uneasy about the duplication of code.