lemma-osu / sknnr

scikit-learn compatible estimators for various kNN imputation methods
https://sknnr.readthedocs.io
0 stars 1 forks source link

Finalize naming conventions for estimators and package #6

Closed grovduck closed 1 year ago

grovduck commented 1 year ago

Currently, our estimators are named to follow method names introduced in R's yaImpute package. We likely want to conform to scikit-learn's model of being more explicit about whether estimators are classifiers or regressors. kNN methods fit into a funny space here - most often they are thought of as classifiers as they use the k nearest neighbors in prediction. However, some attributes are continuous and are calculated as (weighted) means from the k neighbors attributes, which is more similar to KNeighborsRegressor.

Also need to decide whether it makes sense to stick with scikit-learn-knn as a package name as knn is already a meaningful term in scikit-learn. We could also decide to "pay homage" to yaImpute which is serving as the inspiration for this package.

aazuspan commented 1 year ago

Regarding estimator names, what do you think about using Imputer as opposed to Classifier or Regressor? To me this seems like the most technically accurate term, but I'm guessing you shied away from it to avoid confusion with the imputers (including KNNImputer) built in to sklearn? The tricky thing seems to be that our estimators actually very closely match the functionality of the KNNImputer, so we will need to be careful to distinguish how this package differs from the built-in features of sklearn...

grovduck commented 1 year ago

@aazuspan, honestly I hadn't come across the Imputer classes in sklearn until a few weeks ago when we were diving in. I definitely agree with you that our estimators basically do both straight-forward and fancy imputation and really closely mimic the behavior of KNNImputer (which is basically our Raw estimator).

The only reason I've shied away from using this terminology on our estimators is because they seem to treat imputation as a way to fill missing values on the way to transformation/classification/regression. That is, it's a necessary pre-filling step such that other estimators down the line don't choke.

The expectation for our estimators is that the X we pass to fit is complete whereas in KNNImputer, the expectation is that the X passed to fit will be transformed to get rid of missing values.

But recognize that I haven't yet come up with a better term yet ... how about GNNClassifierAndRegressorAndReallyAnImputer :smile:.

I'm willing to go with Imputer if you think we can effectively explain the differences.

aazuspan commented 1 year ago

GNNClassifierAndRegressorAndReallyAnImputer

Perfect, I'll just close this issue as resolved 😆

they seem to treat imputation as a way to fill missing values on the way to transformation/classification/regression

I completely agree. It's a challenging distinction to make because they function almost identically, but have very different purposes.

I'm willing to go with Imputer if you think we can effectively explain the differences.

No, I think your instincts were right. The association between imputation and filling missing data is just too strong for sklearn users, even if it's technically an accurate description of what we're doing.

Back to the drawing board!

grovduck commented 1 year ago

Based on discussions in #2 and #13, starting to learn toward using Regressor here to describe these custom estimators. If we do indeed subclass our estimators from KNearestRegressor, logical names would be:

RawRegressor
EuclideanRegressor
MahalanobisRegressor
MSNRegressor
GNNRegressor

I might be tempted to give a nod to being explicit that these are KNearest regressors, so we could consider names like:

RawKnnRegressor
...

It gets a bit awkward/redundant with MSN (most similar neighbor) and GNN (gradient nearest neighbor) because they already have NN implied. So, my current thinking is:

RawKnnRegressor
EuclideanKnnRegressor
MahalanobisKnnRegressor
MSNRegressor
GNNRegressor
(and eventually RFNNRegressor)

Does that make scikit-learn-knn-regression (package) and sklearn-knn-regression (imported module) too unwieldy as a names?

aazuspan commented 1 year ago

So, my current thinking is:

I'm happy with these names, and I'm also in favor of being more explicit by including Knn while avoiding the redundancy with the MSN and GNN estimators.

In the interest of thinking through every option, would there be any advantage to combining similar estimators into a single estimator that could be selected with a parameter like method? For example, say a KNNRegressor where method could be raw, Euclidean, or Mahalanobis? I'm not sure I like this idea, but I'm curious if you have any inclination.

Does that make scikit-learn-knn-regression (package) and sklearn-knn-regression (imported module) too unwieldy as a names?

They are a little wordy... I think the package name would be okay since it's only used once or twice, but the module name is definitely a mouthful to import every time. I don't want to criticize without tossing out some alternatives, so just to put a few ideas on the board for consideration:

grovduck commented 1 year ago

In the interest of thinking through every option, would there be any advantage to combining similar estimators into a single estimator that could be selected with a parameter like method? For example, say a KNNRegressor where method could be raw, Euclidean, or Mahalanobis? I'm not sure I like this idea, but I'm curious if you have any inclination.

I definitely agree that there is very little difference between the estimators we've introduced and that they likely could be combined into a single estimator. My aversion to this is mainly based on trying to decipher the yai function in yaImpute which takes exactly this approach (method is a parameter). The complexity of that function is quite high with a lot of conditionals (granted it has even more parameters than just method that causes this complexity). So it's entirely possible that I've gone overboard (too many distinct classes) when porting these estimators over. I'm curious what you feel is the right balance.

I think one goal that I have in mind is inter-comparison between the methods along with their hyperparameters through GridSearchCV or RandomizedSearchCV. Perhaps having method as a hyperparameter for a single estimator versus different estimators makes this comparison easier? Is that what you were thinking?

I don't want to criticize without tossing out some alternatives, so just to put a few ideas on the board for consideration

I agree we'd probably want to avoid an import name that has been used before (unfortunately "nearest-neighbors" and "neural-networks" share the same abbreviation). All of your suggestions are better than mine, but thought I'd throw one more out - sknnr (aka "skinner" - a nod to the Simpsons principal). It looks like there is a package on PyPi called "skinner", but not under active development. I like "skneighbors" as well.

aazuspan commented 1 year ago

My aversion to this is mainly based on trying to decipher the yai function in yaImpute which takes exactly this approach (method is a parameter). The complexity of that function is quite high with a lot of conditionals

Yes, I think that kind of parameter spaghetti is definitely something we should try to avoid. The sklearn framework gives us an advantage there since it forces separation of the model configuration from the fitting, predicting, and scoring parameters, but I still think you're right to err on the side of atomic over configurable.

I'm curious what you feel is the right balance.

I think either approach is reasonable, but I'm leaning towards the way you have it set up now. I agree that simpler functions are better, and in particular I'm not a fan of string params with predefined options because they're harder for users to remember and force us to handle invalid inputs. In any case, I don't think we're locking ourselves into anything and can always reconsider later. I imagine if we wanted to switch to a method param, we would probably leave the estimators as is and just add a KNNRegressor function that dispatches to the appropriate estimator, so no big code changes needed.

I think one goal that I have in mind is inter-comparison between the methods along with their hyperparameters through GridSearchCV or RandomizedSearchCV. Perhaps having method as a hyperparameter for a single estimator versus different estimators makes this comparison easier?

I don't have a ton of experience with those, but the conclusion I came to was that a method parameter would only help for comparing within that single estimator, and might make it harder to compare between other estimators. This is a good thing to keep in mind though.

All of your suggestions are better than mine, but thought I'd throw one more out - sknnr (aka "skinner" - a nod to the Simpsons principal). It looks like there is a package on PyPi called "skinner", but not under active development. I like "skneighbors" as well.

I like sknnr! That matches the proposed package name better and looks like it should be more searchable than skneighbors, which Google thinks is a typo for kneighbors. That's at the top of my list.

grovduck commented 1 year ago

All sounds good to me, @aazuspan. I agree that we can revisit the decision of the method parameter, but let's stick with what we have now. Are we ready to implement this? I'm happy to take a cut at implementing it, although I could use some advice on the cleanest way to rename the package without really messing things up. I guess that's probably a simple rename on Github and it shouldn't impact any local repos (because you can give the local repo whatever name you want, right?)

aazuspan commented 1 year ago

Are we ready to implement this?

I think so! It will be a little painful to merge that change in with any branches we have ongoing since I think git will consider all the old src files deleted, but we'll have to deal with it at some point, so I'm not sure we gain anything by waiting.

I could use some advice on the cleanest way to rename the package without really messing things up

I think locally it should be as simple as a find-replace in VS Code and renaming the module folder. There could be some other snags I'm not remembering, but I've done this a few times and can't remember any major issues. It's also possible we'll have to rebuild our hatch environments using hatch env prune, but I'm not sure.

I guess that's probably a simple rename on Github and it shouldn't impact any local repos (because you can give the local repo whatever name you want, right?)

Yeah, that should be pretty painless. You can continue to push to the old repo name and Github will redirect it, although it might raise a warning. I think the long-term fix is to run git remote set-url origin ... with the new URL.

grovduck commented 1 year ago

It will be a little painful to merge that change in with any branches we have ongoing since I think git will consider all the old src files deleted, but we'll have to deal with it at some point, so I'm not sure we gain anything by waiting.

I may be showing my ignorance here, but considering our flow to this point, it seems like we only have one or maybe two active feature branches going and right now they are branching off fb_add_estimators (itself supposedly a feature branch). I don't necessarily think we need to retain a develop branch based on how we've been working (likely more trouble than its worth). If we make the naming change now, continue to develop on fb_add_estimators until we've got the first five estimators implemented and then rebase onto main, do we avoid the problem you mention?

grovduck commented 1 year ago

I propose renaming MyStandardScaler to StandardScalerWithDOF as part of this as well. What do you think of this implementation?

import numpy as np

class StandardScalerWithDOF(StandardScaler):
    def __init__(self, ddof=0):
        super().__init__()
        self.ddof_ = ddof
    def fit(self, X, y=None, sample_weight=None):
        scaler = super().fit(X, y, sample_weight)
        scaler.scale_ = np.std(X, axis=0, ddof=self.ddof_)
        return scaler
aazuspan commented 1 year ago

If we make the naming change now, continue to develop on fb_add_estimators until we've got the first five estimators implemented and then rebase onto main, do we avoid the problem you mention?

The problem I'm thinking of is with unpushed local branches. For example, if you're working on src/sklearn_knn/msn.py and we rename that to src/sknnr/msn.py, I think that merging those changes back to your local branch will require manually pulling code out of the "deleted" original file and into the renamed file. Does that sound right to you?

I don't necessarily think we need to retain a develop branch based on how we've been working (likely more trouble than its worth).

I've never worked with a develop branch before, so unless you think there would be some advantages once we start making releases, I'm happy to do without it.

I propose renaming MyStandardScaler to StandardScalerWithDOF as part of this as well. What do you think of this implementation?

I like the name and the added flexibility with the implementation. My only concern would be how that fits in with #15. If it's a quick fix to make it dataframe compatible, maybe we could tackle two issues with one commit, so to speak.

I was also wondering if that should be moved out of _base and into the transformers module?

grovduck commented 1 year ago

The problem I'm thinking of is with unpushed local branches

Ah, yep, I totally agree with this. At this time, I don't have any unpushed local branches (I stashed the work on MSN and deleted that branch before bringing in your work on use_regressors. You're right that I'll still need to do some manual transfer after popping the stash, but I think it should be fairly painless at this stage. And, like you say, better now than later.

I've never worked with a develop branch before, so unless you think there would be some advantages once we start making releases, I'm happy to do without it.

Again, we can always introduce it down the line if we find that it becomes necessary.

My only concern would be how that fits in with https://github.com/lemma-osu/scikit-learn-knn/issues/15. If it's a quick fix to make it dataframe compatible, maybe we could tackle two issues with one commit, so to speak. I was also wondering if that should be moved out of _base and into the transformers module?

Good thoughts for both of these. I'm leaning toward a separate PR once we do the renaming. I'll do the renaming now.

aazuspan commented 1 year ago

I'm leaning toward a separate PR once we do the renaming. I'll do the renaming now.

Sounds good to me!

grovduck commented 1 year ago

Resolved all but StandardScaler via #18