lemma-osu / sknnr

scikit-learn compatible estimators for various kNN imputation methods
https://sknnr.readthedocs.io

Support crosswalking dataframe indexes in `kneighbors` #37

Closed by aazuspan 1 year ago

aazuspan commented 1 year ago

This would close #2 by implementing the dataframe index crosswalk mixin for KNeighbors estimators.

To do that, we swap the placeholder `IDNeighborsRegressor` with the new `KNeighborsDFIndexCrosswalkMixin`, which will attempt to store dataframe indexes from `X` in the `dataframe_index_in_` attribute during fitting. Then, a `return_dataframe_index` parameter is added to the `kneighbors` method, allowing it to return the indexes associated with the training dataframe rather than just the array indices. This will allow users to get plot IDs out just by setting them as the index of their training dataframe.
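For anyone following along, here's a minimal sketch of the shape of that mixin, assuming it composes with scikit-learn's `KNeighborsRegressor`. The `dataframe_index_in_` attribute and `return_dataframe_index` parameter come from the description above; the validation details and the `RawKNNRegressor` composition are illustrative, not the actual implementation:

```python
# Illustrative sketch only, not the actual sknnr code.
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor


class KNeighborsDFIndexCrosswalkMixin:
    """Store dataframe indexes at fit time and optionally return them
    from `kneighbors` in place of positional array indices."""

    def fit(self, X, y):
        # Keep the dataframe index (e.g. plot IDs) if X is a dataframe.
        self.dataframe_index_in_ = (
            np.asarray(X.index) if isinstance(X, pd.DataFrame) else None
        )
        return super().fit(X, y)

    def kneighbors(self, X=None, n_neighbors=None, return_distance=True,
                   return_dataframe_index=False):
        result = super().kneighbors(X, n_neighbors, return_distance)
        if not return_dataframe_index:
            return result
        if self.dataframe_index_in_ is None:
            raise ValueError("Estimator was not fit with a dataframe X.")
        if return_distance:
            dist, idx = result
            # Crosswalk positional indices to the stored dataframe index.
            return dist, self.dataframe_index_in_[idx]
        return self.dataframe_index_in_[result]


class RawKNNRegressor(KNeighborsDFIndexCrosswalkMixin, KNeighborsRegressor):
    """Illustrative composition: the mixin plus a kNN regressor."""
```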

A few other notable changes:

grovduck commented 1 year ago

> I removed the disabled neighbor index checks in `test_port.py`. I think we were keeping those in until we decided how we were handling indexes, but correct me if I'm wrong @grovduck or if you think those still have value.

I think they have value in the short term (while we're porting), because we can explicitly test the `k=5` neighbor IDs (both references and targets). The issue is that we're currently passing arrays into `test_port.py`, so we can't use `return_dataframe_index=True`. I changed `conftest.py` around a bit so that we're passing `X` and `y` as dataframes and using the first column as an index:

```python
# In the fixture setup in conftest.py: load X and y as dataframes with the
# first column (the plot ID) as the index.
self.X = pd.read_csv(f"./tests/data/{project}_env.csv", index_col=0)
self.y = pd.read_csv(f"./tests/data/{project}_spp.csv", index_col=0)  # index unneeded here, but consistent
# No longer need the local vars env_df, spp_df or the attribute self.ids
```

Then in `test_port.py`, we can do:

```python
from numpy.testing import assert_array_almost_equal, assert_array_equal

# With return_dataframe_index=True, nn holds plot IDs from the training
# dataframe index; the IDs are exact, so no decimal tolerance is needed for them.
dist, nn = clf.kneighbors(return_dataframe_index=True)
assert_array_equal(nn, moscow_euclidean.ref_neighbors)
assert_array_almost_equal(dist, moscow_euclidean.ref_distances, decimal=3)

dist, nn = clf.kneighbors(X_test, return_dataframe_index=True)
assert_array_equal(nn, moscow_euclidean.trg_neighbors)
assert_array_almost_equal(dist, moscow_euclidean.trg_distances, decimal=3)
```

Unfortunately, I have to get out of here today, but testing this made all kinds of things break. So I hope you can get the gist of what I'm thinking about, but also let me know if I'm thinking about this wrong.

aazuspan commented 1 year ago

> So I hope you can get the gist of what I'm thinking about, but also let me know if I'm thinking about this wrong.

Yes, I follow your thinking and I'm on board!

I haven't totally thought it through yet, but #3 may simplify this as well, since we'll be able to get dataframe `X` and `y` with indexes directly using `load_moscow_stjoes(as_frame=True)`. Maybe it's worth pushing that through before this? I think we were holding off on that until we decided how to handle the different dataset attributes and IDs, but it seems like we've got a pretty good handle on that now, and there are a few things in this PR that would be simplified by it.
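As a rough sketch of what that simplification could look like (the module path, the `as_frame` option, and the `data`/`target` attributes follow scikit-learn's loader conventions and are assumptions here, as is reusing the `RawKNNRegressor` sketch from above):

```python
# Hypothetical usage once #3 lands; the names below are assumptions, not the final API.
from sknnr.datasets import load_moscow_stjoes

moscow = load_moscow_stjoes(as_frame=True)
X, y = moscow.data, moscow.target  # dataframes indexed by plot ID

clf = RawKNNRegressor(n_neighbors=5).fit(X, y)
dist, nn = clf.kneighbors(return_dataframe_index=True)  # nn holds plot IDs
```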

Enjoy the weekend, and we can decide how best to proceed next week!

grovduck commented 1 year ago

> I haven't totally thought it through yet, but https://github.com/lemma-osu/scikit-learn-knn-regression/issues/3 may simplify this as well since we'll be able to get dataframe `X` and `y` with indexes directly using `load_moscow_stjoes(as_frame=True)`. Maybe it's worth pushing that through before this?

That would be awesome! I agree that would make this easier to test for the IDs. I'm fully aware that you're doing all the heavy lifting right now, so please let me know if there is an obvious place I can help (other than review).

aazuspan commented 1 year ago

I'm not 100% sure I handled that merge in the cleanest way possible, but I think all the changes from the dataset branch are now merged and the neighbor ID checks are reactivated and passing!

Assuming I didn't accidentally remove any changes we wanted to keep, I think that wraps this up?

grovduck commented 1 year ago

Awesome, @aazuspan. Just reviewed the changes and I couldn't find anything to quibble with! LGTM!