lemma-osu / sknnr

scikit-learn compatible estimators for various kNN imputation methods
https://sknnr.readthedocs.io
0 stars 1 forks source link

`list.index` accidentally stored as `dataframe_index_in_` #55

Closed aazuspan closed 1 year ago

aazuspan commented 1 year ago

I just came across scikit-learn/scikit-learn#27037 that points out that using hasattr or getattr to retrieve a DataFrame index will unintentionally retrieve the list.index method. For us, this means that an estimator fitted with a list will store that method as dataframe_index_in_. Then, if you used return_dataframe_index=True with kneighbors, you'd get a weird IndexError.

from sknnr import EuclideanKNNRegressor
from sknnr.datasets import load_swo_ecoplot

X, y = load_swo_ecoplot(return_X_y=True)
X = X.tolist()

est = EuclideanKNNRegressor().fit(X, y)
print(est.dataframe_index_in_)  # <built-in method index of list object at 0x000001E8BB330BC0>
est.kneighbors(return_dataframe_index=True)  # IndexError: too many indices for array

This is the offending line:

https://github.com/lemma-osu/scikit-learn-knn-regression/blob/96a4230cbbd113136aa619f869afabe1d648e054/src/sknnr/_base.py#L14-L15

They solve this in scikit-learn/scikit-learn#27044 by using a new _is_pandas_df helper, but as far I can tell that's not released yet, so I think it'll be a while before we want to rely on that.

We could add a similar function to sknnr, but I wonder if we might break some implicit duck-typed support for DataFrame-like objects by adding an explicit class check here. An alternative would just be to check that index is an array using _is_arraylike, which would prevent this bug and should cover us for just about any other case.

grovduck commented 1 year ago

Not sure HOW you came across this bug, but great explanation of the issue. I like your solution because we really only care about X's index rather than checking whether X is a dataframe. Checking that it's array-like makes sense to me.

aazuspan commented 1 year ago

You haven't read every scikit-learn issue?! Just kidding, I got lucky when their latest changelog showed up in my Github feed 😉

Sounds like a plan!

aazuspan commented 1 year ago

Resolved by #56