Closed aazuspan closed 1 year ago
I removed the disabled neighbor index checks in test_port.py. I think we were keeping those in until we decided how we were handling indexes, but correct me if I'm wrong @grovduck or if you think those still have value.
I think they have value in the short term (while we're porting), because we can explicitly test the k=5 neighbor IDs (both references and targets). The issue is that we're currently passing arrays into test_port.py
so we can't use return_dataframe_index=True
. I changed conftest.py
around a bit so that we're passing X
and y
as dataframes and using the first column as an index:
self.X = pd.read_csv(f"./tests/data/{project}_env.csv", index_col=0)
self.y = pd.read_csv(f"./tests/data/{project}_spp.csv", index_col=0) # unneeded, but consistent
# No longer need local vars env_df, spp_df or attribute self.ids
Then in test_port.py
, we can do:
dist, nn = clf.kneighbors(return_dataframe_index=True)
assert_array_equal(nn, moscow_euclidean.ref_neighbors, decimal=3)
assert_array_almost_equal(dist, moscow_euclidean.ref_distances, decimal=3)
dist, nn = clf.kneighbors(X_test, return_dataframe_index=True)
assert_array_equal(nn, moscow_euclidean.trg_neighbors, decimal=3)
assert_array_almost_equal(dist, moscow_euclidean.trg_distances, decimal=3)
BUG: Unfortunately, I have to get out of here today, but testing this made all kinds of things break. So I hope you can get the gist of what I'm thinking about, but also let me know if I'm thinking about this wrong.
So I hope you can get the gist of what I'm thinking about, but also let me know if I'm thinking about this wrong.
Yes, I follow your thinking and I'm on board!
I haven't totally thought it through yet, but #3 may simplify this as well since we'll be able to get dataframe X and y's with indexes directly using load_moscow_stjoes(as_frame=True)
. Maybe it's worth pushing that through before this? I think we were holding off on that until we decided how to handle the different dataset attributes and IDs, but it seems like we've got a pretty good handle on that now, and there are a few things in this PR that would be simplified by that.
Enjoy the weekend, and we can decide how best to proceed next week!
I haven't totally thought it through yet, but https://github.com/lemma-osu/scikit-learn-knn-regression/issues/3 may simplify this as well since we'll be able to get dataframe X and y's with indexes directly using
load_moscow_stjoes(as_frame=True)
. Maybe it's worth pushing that through before this?
That would be awesome! I agree that would make this easier to test for the IDs. I'm fully aware that you're doing all the heavy lifting right now, so please let me know if there is an obvious place I can help (other than review).
I'm not 100% sure I handled that merge in the cleanest way possible, but I think all the changes from the dataset
branch are now merged and the neighbor ID checks are reactivated and passing!
Assuming I didn't accidentally remove any changes we wanted to keep, I think that wraps this up?
Awesome, @aazuspan. Just reviewed the changes and I couldn't find anything to quibble with! LGTM!
This would close #2 by implementing the dataframe index crosswalk mixin for
KNeighbors
estimators.To do that, we swap the placeholder
IDNeighborsRegressor
with the newKNeighborsDFIndexCrosswalkMixin
, which will attempt to store dataframe indices fromX
in thedataframe_index_in_
attribute during fitting. Then, areturn_dataframe_index
parameter is added to thekneighbors
method, allowing it to return the indices associated with the training dataframe rather than just the array indices. This will allow users to get plot IDs out just by setting them as the index of their training dataframe.A few other notable changes:
KNeighborsRegressor
directly, rather than through the crosswalk mixin.TransformedKNeighborsMixin
no longer inherits fromKNeighborsRegressor
. If mixins and estimators all inherit from this class, it causes a circular MRO that breaks Python. This seems to be consistent with sklearn, where mixins rarely inherit from anything, except when there's a hierarchy of related mixins.test_port.py
. I think we were keeping those in until we decided how we were handling indexes, but correct me if I'm wrong @grovduck or if you think those still have value.