I am open to this actually and am happy to hear you've considered this project. @MBrouns, thoughts? There are a few thoughts on my side.
@koaning very cool! kk.
By docs I mean a short thing in a jupyter notebook. If you look at, for example, our fairness documentation or our meta-model stuff you might get an impression of what we're going for. One possibility is to demonstrate how it differs from a GMM approach and how the choice of kernel influences the prediction. It would also be nice if the docstring explained what bandwidth does.
Since you're open to helping maintain the feature I certainly wouldn't mind seeing a PR for this, but I'll check with @MBrouns in person tomorrow to make sure he's on board.
One thing to keep an eye on is the way we test. We have loads of standard tests and these fail unless check_X_y is used in .fit() and check_is_fitted is used in .predict_proba(). You should see plenty of examples in our library; it's a scikit-learn internal thing. It would also be good to add simple tests that use easy/obvious datasets.
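To make that pattern concrete, here is a minimal skeleton (mine, not from the thread; the class name is made up) showing where check_X_y, check_array, and check_is_fitted typically sit in a custom scikit-learn estimator:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


class SketchClassifier(BaseEstimator, ClassifierMixin):
    """Minimal skeleton showing where the sklearn validation helpers go."""

    def fit(self, X, y):
        # validate the input before doing anything else
        X, y = check_X_y(X, y)
        self.classes_ = np.unique(y)  # fitted state ends in an underscore
        return self

    def predict_proba(self, X):
        # raises NotFittedError if fit() has not set any trailing-underscore attributes
        check_is_fitted(self)
        X = check_array(X)
        # dummy uniform probabilities, just to keep the skeleton runnable
        return np.full((X.shape[0], len(self.classes_)), 1 / len(self.classes_))

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```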
Another thought just popped up. Could the usage of the prior be something that can be turned off or set manually? There are moments when a prior from domain knowledge is better than what the dataset can provide.
There's an experimental meta-estimator with this idea here.
I think I’m open to this as well. I would like to see a notebook though comparing it to the gmm version as I’m having a hard time reasoning through the differences in behavior atm.
Also, there’s some sklearn specific things that you currently don’t adhere to. I’ve written down my thoughts on this here: https://www.mbrouns.com/posts/2020-01-26-sklearns-init/. It might be good to read through that and adjust the implementation accordingly.
@koaning 100% the prior could be made into any arbitrary function and set manually.
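For illustration only, a user-supplied prior could be wired in roughly like the hypothetical helper below, falling back to the empirical class frequencies when no prior is given; none of this is from the eventual implementation, the name and signature are assumptions:

```python
import numpy as np


def _fit_log_priors(y, classes, priors=None):
    """Hypothetical helper: use user-supplied class priors if given,
    otherwise fall back to the empirical class frequencies."""
    if priors is not None:
        priors = np.asarray(priors, dtype=float)
        if len(priors) != len(classes) or not np.isclose(priors.sum(), 1.0):
            raise ValueError("priors must have one entry per class and sum to 1")
        return np.log(priors)
    counts = np.array([np.sum(y == c) for c in classes])
    return np.log(counts / counts.sum())
```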
@MBrouns and @koaning I'll whip up a notebook by end of day today. The differences are subtle but clear depending on the situation, so I'll make sure to demonstrate that.
Haven't heard back on this. No rush. Just wanted to check :)
Whoops sorry.
https://colab.research.google.com/drive/12z28LCt2Y76smB01w2QizK3_cCo-6y2G
I never noticed how bad the scaling is on anything larger than academic datasets, and that has actually been a big thing. I've used it on 2D species distribution data (I'm trying to see if I am allowed to use that dataset for the notebook).
Your notebook states that the time complexity for training the GMM is in fact O(p^3), where p is the number of dimensions. I don't know how sklearn does it internally, but I wonder how the EM algorithm gets that bound. Got a reference for it? Mostly out of my own curiosity.
I don't know precisely how sklearn does it either, but I figured that since the most expensive operation in the expectation step is a matrix product, that would be O(p^3) (and apparently ~p^2.3 in practice). It should also scale linearly with the number of clusters.
(Not an academic level citation I know but https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra)
Strange. I copied your notebook and seem to have a significantly different result at the bottom. The colab can be found here. I seem to get roughly the same scores across all approaches. I increased the granularity of the plot a bit.
That is extremely strange. I've requested access to your colab notebook, but I'll rerun it to see what is going on. This makes more sense anyway, because when I was rerunning it again and again I couldn't figure out why everything was losing performance at very large sample sizes.
EM typically has a stochastic initialisation, so there's a risk of getting stuck in a bad local optimum. That's my 5 cents.
I didn't see an access request. Maybe this link?
That link worked. While that could explain the GMM results I observed, it wouldn't explain the KDE results.
Update: I was able to get similar results to you on a fresh start of my notebook this time. I guess it was a strange Heisenbug.
https://colab.research.google.com/drive/12z28LCt2Y76smB01w2QizK3_cCo-6y2G
I added a species example. It turns out sklearn has a species dataset, so you can see a realistic expected use case.
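This is presumably sklearn's fetch_species_distributions dataset; a rough sketch of turning it into an (X, y) pair could look like the following (the field names follow sklearn's species-KDE example and should be double-checked):

```python
import numpy as np
from sklearn.datasets import fetch_species_distributions

# downloads the species distribution data on first use
data = fetch_species_distributions()

# each training record carries a species name plus a latitude/longitude coordinate
X = np.vstack([data.train['dd lat'], data.train['dd long']]).T
y = np.array(
    [rec.decode('ascii').startswith('micro') for rec in data.train['species']],
    dtype=int,
)
```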
At the end all algorithms seem to drop again with a large 'n' ... maybe increase the number of iterations to be sure it converges properly?
I finally found the real bug causing the problem. It's fixed now tho.
The problem was that, when subsetting the data, a line effectively like the following was run: x_subset = x[:len(x)]
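To spell the bug out: x[:len(x)] slices the first len(x) rows, i.e. it always returns the whole array, so the data was never actually subsampled. A tiny illustration (the variable names here are mine):

```python
import numpy as np

x = np.arange(10)
subset_size = 5

x_bug = x[:len(x)]       # always the full array -- no subsetting happens
x_fix = x[:subset_size]  # presumably what was intended: keep only the first subset_size rows

assert len(x_bug) == len(x)
assert len(x_fix) == subset_size
```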
Also, @MBrouns, which part of sklearn style specifically does the class not adhere to? I assume something is wrong with the __init__ method, but after comparing it to a few sklego classes I don't see what you're seeing.
PS: I do know I have to add all the fancy check_array() calls and whatnot.
It's quite subtle, but in the __init__ the classes_, models_, and priors_logp_ attributes are initialized. This causes check_is_fitted to think that the model has already been fitted (because there exist attributes with a trailing underscore on the class).
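A quick way to see that behaviour in isolation (a small demo of check_is_fitted, not code from the thread; both class names are made up):

```python
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class LooksFitted(BaseEstimator):
    """Sets a trailing-underscore attribute in __init__ (the problematic pattern)."""

    def __init__(self):
        self.classes_ = None  # looks like fitted state even though fit() never ran

    def fit(self, X, y):
        return self


class NotYetFitted(BaseEstimator):
    """Sets no attributes in __init__ (the pattern sklearn expects)."""

    def fit(self, X, y):
        self.classes_ = None
        return self


check_is_fitted(LooksFitted())  # passes silently because classes_ already exists

try:
    check_is_fitted(NotYetFitted())
except NotFittedError:
    print("correctly reported as not fitted")
```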
@arose13 Feel free to start working on an implementation and PR.
As for the docs, it would be preferable if you could condense them, mostly in terms of compute time: we might want to auto-generate these docs and we prefer the docs to build in mere minutes.
As for tests, feel free to look at how we test the GMM-style classifiers. If you do something similar then it's fine.
Quick review of the implementation:
```python
# imports implied by the snippet
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity


class BayesianKernelDensityClassifier(BaseEstimator, ClassifierMixin):
    """
    NOTE: This class is still being developed to make it work well with all the sklearn components.

    Bayesian classifier that uses kernel density estimations to generate the joint distribution.

    Parameters:
    - bandwidth: float
    - kernel: kernel name passed to scikit-learn's KernelDensity
    """
    def __init__(self, bandwidth=0.2, kernel='gaussian'):
        self.classes_, self.models_, self.priors_logp_ = [None] * 3
        self.bandwidth = bandwidth
        self.kernel = kernel

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        training_sets = [X[y == yi] for yi in self.classes_]
        self.models_ = [KernelDensity(bandwidth=self.bandwidth, kernel=self.kernel).fit(x_subset)
                        for x_subset in training_sets]
        self.priors_logp_ = [np.log(x_subset.shape[0] / X.shape[0]) for x_subset in training_sets]
        return self

    def predict_proba(self, X):
        logp = np.array([model.score_samples(X) for model in self.models_]).T
        result = np.exp(logp + self.priors_logp_)
        return result / result.sum(1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), 1)]
```
Looking at this implementation:
- Attributes in __init__ should not have a trailing underscore.
- __init__ parameters should be verified (and perhaps raise ValueError) in the .fit() method.
- Attributes set in the .fit() method need to end in an underscore (like self.models_); note that if you add something in .fit(), we should not have it added in __init__ as None.
- Attributes set in .fit() need to be checked if they exist in the predict methods with check_is_fitted.
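Putting those points together, a sketch of the class restructured along those lines could look as follows; this only illustrates the review feedback and is not necessarily the code that was eventually merged:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


class BayesianKernelDensityClassifier(BaseEstimator, ClassifierMixin):
    """Bayesian classifier that fits one kernel density estimate per class."""

    def __init__(self, bandwidth=0.2, kernel='gaussian'):
        # __init__ only stores the hyperparameters, nothing else
        self.bandwidth = bandwidth
        self.kernel = kernel

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        # verify the hyperparameters here rather than in __init__
        if self.bandwidth <= 0:
            raise ValueError("bandwidth must be positive")

        # everything learned from the data ends in a trailing underscore
        self.classes_ = np.sort(np.unique(y))
        training_sets = [X[y == yi] for yi in self.classes_]
        self.models_ = [
            KernelDensity(bandwidth=self.bandwidth, kernel=self.kernel).fit(x_subset)
            for x_subset in training_sets
        ]
        self.priors_logp_ = [
            np.log(x_subset.shape[0] / X.shape[0]) for x_subset in training_sets
        ]
        return self

    def predict_proba(self, X):
        check_is_fitted(self)
        X = check_array(X)
        logp = np.array([model.score_samples(X) for model in self.models_]).T
        result = np.exp(logp + self.priors_logp_)
        return result / result.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```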
Okay cool, I'll open a work-in-progress pull request. I'm not sure where to put it actually. I was going to put it in the naive_bayes submodule since similar classifiers are there, but neither this nor any of the classes there implements a particularly 'naive' Bayes. But that's small potatoes. I could also make a new submodule called neighbours.
And now for something completely different:
@arose13 if there are any lessons you've learned while implementing this feature, feel free to add them here:
https://github.com/koaning/scikit-lego/issues/106
This library offers a poem with words of wisdom that will appear once you type:
from sklego import this
When you make a meaningful contribution to our library, you're also invited to add something to the poem.
lol! That's very cute. I'm not much of a poet, so I'll have to think very hard about this one ^_^
This is now merged.
The prior is $P(y=0)$. I primarily use it for spatial problems.
It is similar to the GMM Classifier with only 2 caveats I can think of.