lemma-osu / sknnr

scikit-learn compatible estimators for various kNN imputation methods
https://sknnr.readthedocs.io

Decide how to handle fitting, prediction, and ID data #2

Closed aazuspan closed 1 year ago

aazuspan commented 1 year ago

See discussion in #1.

Some estimators can be fit with different data than they will predict (e.g. fitting with species composition data and predicting structural attributes using GNN), so we need to be able to set that data (tentatively called y_predict) somehow. We also need the flexibility to set different predicted data without requiring refitting.

A few options discussed:

  1. Accept y_predict as a parameter of fit to store for later use. If this is the only way to pass y_predict, it would require refitting when switching predicted attributes, so this alone is not a complete solution.
  2. Use a set_target_data method on estimators after fitting. This gives the flexibility of switching data without refitting, but would interrupt the normal fit → predict → score workflow.
  3. Accept y_predict as a parameter of predict. Need to think this through further.
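To make the tradeoffs concrete, here is a toy sketch of option 2 (all names here are illustrative, not a proposed API): the neighbor search is fit once on X, and the target matrix used for prediction can be swapped without refitting.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


class TargetSwappingEstimator:
    """Toy sketch of option 2: fit neighbors once, swap targets freely."""

    def __init__(self, n_neighbors=2):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # The fit only depends on X; y is just stored for prediction
        self._nn = NearestNeighbors(n_neighbors=self.n_neighbors).fit(X)
        self.y_ = np.asarray(y)
        return self

    def set_target_data(self, y_predict):
        # Swapping targets does not touch the fitted neighbor index
        self.y_ = np.asarray(y_predict)
        return self

    def predict(self, X):
        ind = self._nn.kneighbors(X, return_distance=False)
        return self.y_[ind].mean(axis=1)


X = [[0, 0], [1, 1], [4, 4]]
y_species = [[1.0], [2.0], [3.0]]       # attributes used at fit time
y_structure = [[10.0], [20.0], [30.0]]  # attributes to predict later

est = TargetSwappingEstimator(n_neighbors=2).fit(X, y_species)
print(est.predict([[0, 0]]))  # [[1.5]]
est.set_target_data(y_structure)        # no refit needed
print(est.predict([[0, 0]]))  # [[15.]]
```

The downside noted above still holds: set_target_data breaks the usual fit/predict/score flow, since sklearn tooling like pipelines and cross-validation won't know to call it.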
grovduck commented 1 year ago

Hey @aazuspan, I've been thinking about this and how it relates to #3. My thinking is still a bit muddy, but here is where I am currently.

In order for predict and score to work somewhat seamlessly for these estimators without fundamentally changing the signatures of those methods, I keep coming back to the thought that the y matrix passed to fit should include the attributes we want to predict (what we've called y_predict above), rather than passing them as a separate keyword argument of fit. Looking at the linnerud multi-output data has changed my mind a bit on this. Here's a toy example mimicking the format of what the linnerud data would give you:

# Assume the y are the attributes we want to predict, not the same as the species matrix
>>> from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
>>> X = [[1, 2, 3], [1, 2, 4], [2, 2, 3]]   # like load_linnerud().data
>>> y = [[5, 10, 15], [3, 6, 9], [4, 8, 12]]  # like load_linnerud().target
>>> X_trg = [[2, 2, 3.5]]

# With KNeighborsClassifier, this is the same as k=1
>>> clf = KNeighborsClassifier(n_neighbors=2, weights="distance").fit(X, y)
>>> clf.kneighbors(X_trg)
(array([[0.5       , 1.11803399]]), array([[2, 1]]))
>>> clf.predict(X_trg)
array([[ 4,  8, 12]])

# With KNeighborsRegressor, this is the same as k > 1 (in this case using inverse distance as weights)
>>> clf = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)
>>> clf.kneighbors(X_trg)
(array([[0.5       , 1.11803399]]), array([[2, 1]]))
>>> clf.predict(X_trg)
array([[ 3.69098301,  7.38196601, 11.07294902]])

So, more or less, this is what we want for output from these calls. Importantly, for KNeighborsRegressor, the same weights are applied to all attributes of y (although we'd still have to cleverly distinguish what set of attributes we would want reported as k=1 vs k>1).
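To double-check that claim, the regressor's output above can be reproduced by hand with a single inverse-distance weight per neighbor, applied to every column of y:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = [[1, 2, 3], [1, 2, 4], [2, 2, 3]]
y = np.array([[5, 10, 15], [3, 6, 9], [4, 8, 12]])
X_trg = [[2, 2, 3.5]]

est = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)
dist, ind = est.kneighbors(X_trg)

# One weight per neighbor, shared across all attribute columns
w = 1.0 / dist[0]
manual = (w[:, None] * y[ind[0]]).sum(axis=0) / w.sum()
print(manual)  # matches est.predict(X_trg)[0]
```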

So, that leaves us with two design challenges as I see it:

  1. We still need to somehow distinguish IDs. My current thought is that we pass IDs as just another column in the y matrix, with a keyword argument specifying the id_column (defaulting to 0). I'm not necessarily sold on this idea, but I think it works in most use cases (in fact, a user would otherwise need to remember to strip off the ID column before passing to fit). We could verify that this column is distinct and of integer data type, and then strip it off the array within fit.
  2. As for my concern about the flexibility of passing new attributes to a fitted estimator, I'm starting to change my mind as well and am willing to just refit with a new y_predict array. Yes, it would be nice to not have to run fit again since new y_predict data wouldn't change the fit at all, but 1) it's very rare that I would pass new attributes in practice (although I do pass new targets); 2) fitting is really fast for most of these methods; and 3) this follows what I assume you would need to do in the above example when providing new y data as well.
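For what it's worth, point 1 could be sketched roughly like this (split_ids is a hypothetical helper for illustration, not a proposed API):

```python
import numpy as np


def split_ids(y, id_column=0):
    """Split a combined [IDs | attributes] matrix into IDs and attributes,
    validating that the ID column is integer-valued and distinct."""
    y = np.asarray(y)
    ids = y[:, id_column]
    if not np.all(ids == ids.astype(int)):
        raise ValueError("ID column must be integer-valued.")
    if len(np.unique(ids)) != len(ids):
        raise ValueError("ID column values must be distinct.")
    attrs = np.delete(y, id_column, axis=1)
    return ids.astype(int), attrs


ids, attrs = split_ids([[1, 5.0, 10.0], [2, 3.0, 6.0]])
print(ids)    # [1 2]
print(attrs)  # [[ 5. 10.]
              #  [ 3.  6.]]
```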

So, I guess I'm leaning toward option 1 above as it seems to disrupt the normal fit-predict-score workflow the least. The tweaks would be two additional keyword arguments on fit, e.g.:

def fit(self, X, y, spp=None, id_column=0):

I feel like we might need to resolve this choice before thinking about the load functionality. I'd love to hear your thoughts.

aazuspan commented 1 year ago

I feel like we might need to resolve this choice before thinking about the load functionality.

Agreed, thanks for bringing this up so we don't have to change that later on!

I keep coming back to the thought that y matrix passed to fit should include the attributes we want to predict

I think you're right, generating predictions on something other than the y data would be too surprising of a change. I'm fully in favor of removing the idea of y_predict.

I'm starting to change my mind on that as well and am willing to just refit with a new y_predict array

Yeah, I think you're 100% right. There is a little bit of internal redundancy in having to re-run fit when the model won't technically be re-fit, but if we think of it more as re-setting the fit attributes which include y, it makes the most sense. I also think that redundancy wouldn't be very apparent to the user since, as you mentioned, you would need to refit any sklearn estimator if you wanted to add new y data. The only qualm I have is that I'm not convinced about using spp for the modeling attributes given the concerns with generalizing to other use cases.

My current thought is that we pass IDs as just another column in the y matrix and have a keyword argument specifying the id_column

I do like that this approach wouldn't modify the required arguments for fitting, predicting, and scoring. I think there's a tradeoff in that it would be easier to accidentally misuse as opposed to a separate ids parameter, but validating the selected column like you suggested should remove most of that risk. I can't say I love this approach since there's some added complexity both in the implementation and in the mental model for users because we're treating different columns differently, but I definitely prefer this over separating y and y_predict, and I think I prefer it over a separate required ids argument, too.

The more I'm thinking about this, the more I'm convincing myself that IDs should be optional. If people are coming to this package looking for a multi-output estimator, most will not have any kind of ID associated with their observations, and having to add an ID column will just be an obstacle with no benefit, since we can of course use array indices to internally identify neighbors without ever exposing that to users. I know there are use cases where a user would need plot IDs as outputs, e.g. outputting a nearest neighbor grid to use elsewhere, but what if we were to handle that as more of a special use case? In fact, if we accepted an optional ids parameter to fit, would it work to just use the kneighbor_ids method to expose the nearest neighbor IDs for a given set of environmental variables? I haven't totally worked that through in my head, but if so then IDs would just be an internal detail as far as predict is concerned. It's possible I'm misunderstanding the typical use case or missing an obvious reason why this wouldn't work, so let me know if I'm way off base!

grovduck commented 1 year ago

Excellent conversation! Thanks for your great thoughts.

The only qualm I have is that I'm not convinced about using spp for the modeling attributes given the concerns with generalizing to other use cases.

Just so I'm clear, you're not convinced about using spp as a keyword argument to pass a separate array, correct? Would the alternative be to pass those attributes as part of the y matrix, such that the y matrix would be both the attributes used to create the ordination and those that you want to predict? I've thought about this as well, but I haven't sorted out how best to specify what subset of attributes would be used for the ordination. I don't think it's a problem to calculate predictions for those attributes (even though I don't tend to report those out), but it would be a problem to use all attributes to create the ordination. I might be missing your point here, though.

I can't say I love this approach since there's some added complexity both in the implementation and in the mental model for users because we're treating different columns differently, but I definitely prefer this over separating y and y_predict, and I think I prefer it over a separate required ids argument, too.

Yep, that's exactly the feedback I needed. I agree that I don't love this approach either!

The more I'm thinking about this, the more I'm convincing myself that IDs should be optional. If people are coming to this package looking for a multi-output estimator, most will not have any kind of ID associated with their observations, and having to add an ID column will just be an obstacle with no benefit, since we can of course use array indices to internally identify neighbors without ever exposing that to users. I know there are use cases where a user would need plot IDs as outputs, e.g. outputting a nearest neighbor grid to use elsewhere, but what if we were to handle that as more of a special use case? In fact, if we accepted an optional ids parameter to fit, would it work to just use the kneighbor_ids method to expose the nearest neighbor IDs for a given set of environmental variables? I haven't totally worked that through in my head, but if so then IDs would just be an internal detail as far as predict is concerned. It's possible I'm misunderstanding the typical use case or missing an obvious reason why this wouldn't work, so let me know if I'm way off base!

Bingo! I think you're absolutely right. This is absolutely one of those cases where I've been doing it the same way for so long that I'm not seeing it clearly. Yes, IDs are essential in the use cases you mention (e.g. neighbor raster generation and some other files that I use but haven't yet brought up), but for prediction and scoring, we don't need them. Riffing on your idea a bit, do we really even need to supply ids as an optional parameter to fit, or can that be a (required) parameter to kneighbors_ids? I think that might be the only time we would need to crosswalk between array indices and IDs?

I hesitate to add one more bit of complexity, but I think it's the right time to bring it up if I haven't already. We often run models where we use two (or more) plots that are measured at different times but at the same plot locations. Other researchers use FIA subplots as their imputation "unit", so that any of the four subplots in a plot could be used as candidate neighbors. This is all fine except when it comes to assessing accuracy: plots that are "co-located" should not be able to impute to one another. In pynnmap, I create a dictionary of plot ID to location ID and pass that to the function that does independent accuracy assessment; in essence, it filters the NN list for any neighbors that share a location ID, subsets to the desired k, and then calculates a (weighted) mean of the independent neighbors for each continuous attribute. I only bring this up because this would be another place where we would at least have to track shared locations among IDs, which would impact predict and score. In sklearn, the behavior of kneighbors when called with no arguments is to exclude the neighbor itself, which is great, but we would need to consider this additional requirement.
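As a rough sketch of that filtering step (hypothetical names throughout; this assumes kneighbors was asked for extra candidate neighbors to spare):

```python
import numpy as np


def filter_colocated(neigh_ind, loc_ids, target_loc_ids, k):
    """Drop candidate neighbors that share a location ID with the target,
    then keep the first k independent neighbors per row."""
    out = np.empty((len(target_loc_ids), k), dtype=int)
    for i, row in enumerate(neigh_ind):
        independent = row[loc_ids[row] != target_loc_ids[i]]
        out[i] = independent[:k]
    return out


loc_ids = np.array([10, 10, 20, 30])  # location ID for each fitted plot
neigh_ind = np.array([[0, 1, 2, 3]])  # candidate neighbors for one target
print(filter_colocated(neigh_ind, loc_ids, np.array([10]), k=2))  # [[2 3]]
```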

At some point, if it's easier to talk this through, that would be fine as well.

aazuspan commented 1 year ago

Just so I'm clear, you're not convinced about using spp as a keyword argument to pass a separate array, correct?

Sorry, I should've elaborated! I only meant the name spp as opposed to something more general like y_fit (or similar), with the thought that this should be usable outside of an ecological context.

do we really even need to supply ids as an optional parameter to fit or can that be a (required) parameter to kneighbors_ids? I think that might need to be the only time that we would need to crosswalk between array indices and IDs?

Yeah, I think you're right. My only concern with that was that the IDs corresponded with the arrays we pass in to fit rather than the array we pass in to kneighbor_ids which could cause a little confusion for users, but I think that's probably outweighed by a) avoiding an extra parameter that most users won't touch and b) not having to pass in IDs before we actually use them. A third option would be to ignore plot IDs entirely and leave it up to the user to crosswalk the array indices returned by kneighbor_ids back to their plot IDs. We could provide an example to demonstrate that, but it's easy enough that I doubt it would stump many users.
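For reference, the crosswalk in option 3 is a one-liner with numpy fancy indexing (the plot IDs below are made up):

```python
import numpy as np

# Plot IDs aligned with the rows that were passed to fit (user-maintained)
plot_ids = np.array([4401, 4402, 4403, 4404])

# Array indices as returned by kneighbors for two target rows
neigh_ind = np.array([[2, 0], [3, 1]])

# Fancy indexing maps array indices back to plot IDs
neigh_plot_ids = plot_ids[neigh_ind]
print(neigh_plot_ids)  # [[4403 4401]
                       #  [4404 4402]]
```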

This is all fine except when it comes to assessing accuracy and those plots that are "co-located" should not be able to impute to one another.

Ah, interesting. So it's effectively equivalent to leaking "training data" into the test set? My first thought from an sklearn perspective is that that's up to the user to avoid, and shouldn't be a problem outside of an FIA context, but I think it's definitely worth some more consideration so we don't shoot ourselves in the foot when we need to deal with it later. Do you have any thoughts on how we could handle that with the setup we're narrowing in on? The only thing I can think of is some kind of utility function that would remove rows from one array by comparing the location IDs with another array.

At some point, if it's easier to talk this through, that would be fine as well.

Thanks! These topics sometimes take me a few hours to think through and digest, so I would probably be a lot less coherent in real time. But I would like to hear some more of those GNN war stories, so we'll have to find an excuse to Zoom every now and then :)

grovduck commented 1 year ago

Sorry, I should've elaborated! I only meant the name spp as opposed to something more general like y_fit (or similar), with the thought that this should be usable outside of an ecological context.

👍

Yeah, I think you're right. My only concern with that was that the IDs corresponded with the arrays we pass in to fit rather than the array we pass in to kneighbor_ids which could cause a little confusion for users, but I think that's probably outweighed by a) avoiding an extra parameter that most users won't touch and b) not having to pass in IDs before we actually use them. A third option would be to ignore plot IDs entirely and leave it up to the user to crosswalk the array indices returned by kneighbor_ids back to their plot IDs. We could provide an example to demonstrate that, but it's easy enough that I doubt it would stump many users.

Again, you're too kind. My suggestion definitely doesn't work, because you could be passing an unrelated X to kneighbor_ids while the IDs belong to the X and y you provided to fit. Scratch that.

I think I like option 3. Can we think of the IDs in the same way we would think about feature_names and target_names from the load... function, in that we won't be using them directly in the estimators themselves, but they will be useful to have around to associate with the output of the estimators? I could see using pandas functionality in convenience functions around these, but I agree to keep them out of the estimators themselves. I think we both agree that the closer we stay to the sklearn workflow, the better.

Ah, interesting. So it's effectively equivalent to leaking "training data" into the test set?

That's an interesting perspective. I hadn't thought about it in that way, but I think it works as an analogy.

My first thought from an sklearn perspective is that that's up to the user to avoid, and shouldn't be a problem outside of an FIA context.

I agree with you (I need to trust the user more!). And perhaps this nuance stays outside of this package completely.

To me, there seems to be an evolving theme that we need to handle some of the subtleties of ecological forest mapping in a package that wraps this one (e.g. IDs, independent accuracy assessment, specialized AA like area estimation). Does that make sense to you? It seems like some of these issues could be handled by decorators around or subclasses from the estimators we're introducing here?

Thanks! These topics sometimes take me a few hours to think through and digest, so I would probably be a lot less coherent in real time. But I would like to hear some more of those GNN war stories, so we'll have to find an excuse to Zoom every now and then :)

I'm with you on this. Even after thinking about things, I second-guess myself quite a bit.

aazuspan commented 1 year ago

Can we think of the IDs in the same way we would think about feature_names and target_names from the load... function

I don't want to keep dragging this out indefinitely, especially when we had just about reached a consensus, but this suggestion (which I like) made me consider another possibility. When you fit an sklearn estimator with a dataframe instead of an array, it will store the feature names in feature_names_in_, like below:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

X = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
y = [1, 2, 3]
clf = KNeighborsClassifier().fit(X, y)

clf.feature_names_in_
# array(['a', 'b'], dtype=object)

If an IDNeighborsClassifier estimator is fit with a dataframe, should we think about storing the index of that dataframe as a feature_index_in_ or something equivalent? We could then provide something like a return_dataframe_index=False option in kneighbor_ids to return those instead of array indices. The user would still be responsible for setting their plot IDs as the dataframe index, but we could provide some convenience by storing and crosswalking IDs, and it would be completely transparent for users who don't care about IDs. The downside of course is implementation and maintenance complexity on our end.

I'm not sold on this option and could happily be talked back into just keeping it simple, so I'm curious what you think.

To me, there seems to be an evolving theme that we need to handle some of the subtleties of ecological forest mapping in a package that wraps this one (e.g. IDs, independent accuracy assessment, specialized AA like area estimation). Does that make sense to you? It seems like some of these issues could be handled by decorators around or subclasses from the estimators we're introducing here?

I think that would be my approach. There is definitely some maintenance overhead for each package we add, so we may not want to get too granular. Maybe these could be integrated into the raster imputation package, which would have a scope similar to yaImpute?

grovduck commented 1 year ago

If a IDNeighborsClassifier estimator is fit with a dataframe, should we think about storing the index of that dataframe as a feature_index_in_ or something equivalent? We could then provide something like a return_dataframe_index=False option in kneighbor_ids to return those instead of array indices.

I think this could work, but I want to be explicit to make sure I'm tracking all the changes here. I was reading through the earlier comments on this issue and we were talking about returning array indexes from kneighbors_ids. But I think the whole reason for the IDNeighborsClassifier subclass was to provide the kneighbors_id method, and if we had come to the (tentative) decision to exclude IDs altogether, there wouldn't be a need for the IDNeighborsClassifier subclass any more because we would just be returning indexes, correct?

So by passing a dataframe where the IDs are stored as the index, the option to return_dataframe_index would now be on just KNeighborsClassifier.kneighbors() instead (ignoring for a minute the issue you raise in #13), right? If arrays are passed to fit instead of a dataframe, but return_dataframe_index was set to True in kneighbors, would that return None or raise an error?

If you think you're willing, I would find it helpful to see code that would implement this. It seems that we have a lot of inter-related issues right now that we need to untangle, but it feels like we need to first tackle this and #13, because those are both impacting the fundamental class hierarchy.

aazuspan commented 1 year ago

the whole reason for the IDNeighborsClassifier subclass was to provide the kneighbors_id method and if we had come to the (tentative) decision to exclude IDs altogether, there wouldn't have been a need for the IDNeighborsClassifier subclass any more because we would just be returning indexes, correct?

This is a good point that I hadn't fully thought through yet. Yes, I think if we fully exclude IDs, we can do away with the IDNeighborsClassifier and subclass directly from the appropriate sklearn estimator. On the other hand, if we go with the idea of using optional dataframe indexes, we'll need a class where we can override kneighbors with the new functionality, and IDNeighbors still seems like a reasonably accurate name for that. In either case, I agree that kneighbor_ids is no longer needed.

If arrays are passed to fit instead of a dataframe, but return_dataframe_index was set to True in kneighbors, would that return None or raise an error?

I would definitely raise an error. I don't love the idea of having an argument that is only usable under some conditions, and I'm not sure if there's any precedent for that already in sklearn, so that scenario you mentioned is my biggest reservation about this approach.

If you think you're willing, I would find it helpful to see code that would implement this. It seems that we have a lot of inter-related issues right now that we need to untangle, but it feels like we need to first tackle this and https://github.com/lemma-osu/scikit-learn-knn/issues/13

This is a good idea, and would help solidify my thinking here too. I think trying to tackle #13 before this probably makes more sense than vice versa, although neither is perfect, and trying to do both at once might turn into a monster. If you're good with me moving ahead there, I propose that we merge that, then take a closer look at what it would take to get this up and running.

aazuspan commented 1 year ago

Hey @grovduck, I think we might have finally cleared up enough issues that we can reconsider the idea of using dataframe indexes in kneighbors. Obviously we've made a lot of progress on the package since we last discussed this, so I'm curious if you have any new thoughts on the topic.

I put together a quick implementation below that would optionally swap out the array indexes for dataframe indexes if they were seen during fitting, just to see what it would potentially look like.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils.validation import check_is_fitted


class IDNeighborsRegressor(KNeighborsRegressor):
    """
    A KNeighborsRegressor subclass that supports dataframe indexes.
    """

    def fit(self, X, y):
        if hasattr(X, "index"):
            self.dataframe_index_in_ = np.asarray(X.index)

        return super().fit(X, y)

    def kneighbors(
        self,
        X=None,
        n_neighbors=None,
        return_distance=True,
        return_dataframe_index=False,
    ):
        neigh_dist, neigh_ind = super().kneighbors(
            X=X, n_neighbors=n_neighbors, return_distance=True
        )

        if return_dataframe_index:
            msg = "Dataframe indexes can only be returned when fitted with a dataframe."
            check_is_fitted(self, "dataframe_index_in_", msg=msg)
            neigh_ind = self.dataframe_index_in_[neigh_ind]

        if return_distance:
            return neigh_dist, neigh_ind
        return neigh_ind
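For reference, here's roughly how that would look in use; this re-declares the class so the snippet runs standalone, with made-up plot IDs as the dataframe index.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils.validation import check_is_fitted


class IDNeighborsRegressor(KNeighborsRegressor):
    """Condensed copy of the class above, for a runnable demo."""

    def fit(self, X, y):
        if hasattr(X, "index"):
            self.dataframe_index_in_ = np.asarray(X.index)
        return super().fit(X, y)

    def kneighbors(self, X=None, n_neighbors=None, return_distance=True,
                   return_dataframe_index=False):
        neigh_dist, neigh_ind = super().kneighbors(
            X=X, n_neighbors=n_neighbors, return_distance=True)
        if return_dataframe_index:
            msg = "Dataframe indexes can only be returned when fitted with a dataframe."
            check_is_fitted(self, "dataframe_index_in_", msg=msg)
            neigh_ind = self.dataframe_index_in_[neigh_ind]
        if return_distance:
            return neigh_dist, neigh_ind
        return neigh_ind


# Plot IDs (made up) live in the dataframe index
X = pd.DataFrame([[1, 2, 3], [1, 2, 4], [2, 2, 3]], index=[4401, 4402, 4403])
y = [[5, 10, 15], [3, 6, 9], [4, 8, 12]]
est = IDNeighborsRegressor(n_neighbors=2).fit(X, y)

X_trg = pd.DataFrame([[2, 2, 3.9]])
_, idx = est.kneighbors(X_trg)                               # array indices
_, ids = est.kneighbors(X_trg, return_dataframe_index=True)  # plot IDs
print(idx)  # [[2 1]]
print(ids)  # [[4403 4402]]
```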
grovduck commented 1 year ago

Hey @aazuspan, thanks for getting back to this issue. I had intended to figure out what decisions still needed to be made on this issue because we've changed things around since the last time we visited this.

This implementation looks awesome. I can't say that I fully understand it :wink:.

I just read again through the whole issue and I think this is a really elegant solution that you've fleshed out. I didn't see anything else outstanding in this issue other than this IDs decision (and a few details on truly independent neighbor finding which we agreed to shunt to a wrapping project), so if this implementation works and tests pass, I'm happy to move forward.

It's so satisfying that the hard work you put into the guts of dataframe integration leads to such an elegant outcome!

aazuspan commented 1 year ago

Great! One more question before I move ahead with this. Do you think we should keep IDNeighborsRegressor as a subclass or turn it into a mixin? I don't think there would be any impact other than readability, so it's probably just a question of which method for defining our estimator classes is more intuitive.

# IDNeighbors as a subclass
class GNNRegressor(IDNeighborsRegressor, TransformedKNeighborsMixin):
    ...

# IDNeighbors as a mixin
class GNNRegressor(IDNeighborsMixin, TransformedKNeighborsMixin, KNeighborsRegressor):
    ...

I sort of like that the second option explicitly marks them as KNeighborsRegressor subclasses, but since we'll be using the IDNeighborsMixin for all our estimators, maybe that's just unnecessary complexity?

EDIT: I guess there's also a question of whether IDNeighbors... is still the best name. We could make it more explicitly related to dataframes, e.g. DataframeIndexKNeighbors..., but that does start to get a little wordy.
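To sketch the mixin route (TransformedKNeighborsMixin omitted and all names hypothetical), the ID-handling piece would only need to hook fit:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor


class IDNeighborsMixin:
    """Mixin form of IDNeighborsRegressor: store dataframe indexes at fit."""

    def fit(self, X, y):
        if hasattr(X, "index"):
            self.dataframe_index_in_ = np.asarray(X.index)
        return super().fit(X, y)


# Mixin first, so its fit() runs before deferring to the estimator
class GNNRegressor(IDNeighborsMixin, KNeighborsRegressor):
    pass


X = pd.DataFrame([[0, 1], [1, 0], [1, 1]], index=[101, 102, 103])
est = GNNRegressor(n_neighbors=1).fit(X, [1.0, 2.0, 3.0])
print(est.dataframe_index_in_)  # [101 102 103]
```

The MRO does the work here: GNNRegressor.fit resolves to the mixin, whose super().fit lands on KNeighborsRegressor.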

grovduck commented 1 year ago

I sort of like that the second option explicitly marks them as KNeighborsRegressor subclasses, but since we'll be using the IDNeighborsMixin for all our estimators, maybe that's just unnecessary complexity?

I agree with you on the second option and marking them specifically as KNeighborsRegressor is nice. And the whole idea of mixins is becoming more intuitive to me rather than creating these deep hierarchies. I don't necessarily see it as complexity if you view it through the lens that mixins provide additional functionality that the base class wouldn't have.

I guess there's also a question of whether IDNeighbors... is still the best name

For me, the heart of the mixin (if we go that route) is to facilitate crosswalking between array indices and whatever else, so does something like KNeighborsCrosswalkMixin get at that? Still a bit wordy. I like your nod to dataframes to make it explicit that this functionality relies on that data structure as well. Not much help.

aazuspan commented 1 year ago

I agree with you on the second option and marking them specifically as KNeighborsRegressor is nice. And the whole idea of mixins is becoming more intuitive to me rather than creating these deep hierarchies. I don't necessarily see it as complexity if you view it through the lens that mixins provide additional functionality that the base class wouldn't have.

Great, I agree with this and will move ahead with the mixin route!

For me, the heart of the mixin (if we go that route) is to facilitate crosswalking between array indices and whatever, so does something like KNeighborsCrosswalkMixin get to that?

This is a helpful way of thinking about it, and I like the idea of including crosswalk in the name. The fact that this is specifically for dataframes makes me think we should reference that somehow to make it clear what we're crosswalking to, but I'm also struggling to come up with something that captures all the important aspects without turning into word salad. I guess we could shorten KNeighbors to KNN (but I prefer to stay consistent with the estimator name) or dataframe to DF...

KNeighborsDFIndexCrosswalkMixin? I don't love it, but it hits all the points and is technically the same length as sklearn's longest mixin, ClassNamePrefixFeaturesOutMixin.

Or we could go all in on acronyms with KNNDFIDCWMI 😉

grovduck commented 1 year ago

KNeighborsDFIndexCrosswalkMixin?

Honestly, I think that's just fine ;) Other than the totally intuitive KNNDFIDCWMI, which should be our new package name!

aazuspan commented 1 year ago

Resolved by #37