I am open to this actually and am happy to hear you've considered this project. @MBrouns, thoughts? There are a few thoughts on my side.
@koaning very cool! kk.
By docs I mean a short thing in a jupyter notebook. If you look at, for example, our fairness documentation or our meta-model stuff you might get an impression of what we're going for. One possibility is to demonstrate how it differs from a GMM approach and how the choice of kernel influences the prediction. It would also be nice if the docstring explained what bandwidth does.
Since you're open to helping maintain the feature I certainly wouldn't mind seeing a PR for this, but I'll check with @MBrouns in person tomorrow to make sure he's on board.
One thing to keep an eye on is the way we test. We have loads of standard tests and these fail unless check_X_y is used in .fit() and check_is_fitted is used in .predict_proba(). You should see plenty of examples in our library; it's a scikit-learn internal thing. It would also be good to add simple tests that use easy/obvious datasets.
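To make that pattern concrete, here is a minimal skeleton (mine, not from the thread; the class name is made up) showing where check_X_y, check_array, and check_is_fitted typically sit in a custom scikit-learn estimator:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


class SketchClassifier(BaseEstimator, ClassifierMixin):
    """Minimal skeleton showing where the sklearn validation helpers go."""

    def fit(self, X, y):
        # validate the input before doing anything else
        X, y = check_X_y(X, y)
        self.classes_ = np.unique(y)  # fitted state ends in an underscore
        return self

    def predict_proba(self, X):
        # raises NotFittedError if fit() has not set any trailing-underscore attributes
        check_is_fitted(self)
        X = check_array(X)
        # dummy uniform probabilities, just to keep the skeleton runnable
        return np.full((X.shape[0], len(self.classes_)), 1 / len(self.classes_))

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```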
Another thought just popped up. Could the usage of the prior be something that can be turned off or set manually? There are moments when a prior from domain knowledge is better than what the dataset can provide.
There's an experimental meta-estimator with this idea here.
I think I’m open to this as well. I would like to see a notebook though comparing it to the gmm version as I’m having a hard time reasoning through the differences in behavior atm.
Also, there’s some sklearn specific things that you currently don’t adhere to. I’ve written down my thoughts on this here: https://www.mbrouns.com/posts/2020-01-26-sklearns-init/. It might be good to read through that and adjust the implementation accordingly.
@koaning 100% the prior could be made into any arbitrary function and set manually.
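For illustration only, a user-supplied prior could be wired in roughly like the hypothetical helper below, falling back to the empirical class frequencies when no prior is given; none of this is from the eventual implementation, the name and signature are assumptions:

```python
import numpy as np


def _fit_log_priors(y, classes, priors=None):
    """Hypothetical helper: use user-supplied class priors if given,
    otherwise fall back to the empirical class frequencies."""
    if priors is not None:
        priors = np.asarray(priors, dtype=float)
        if len(priors) != len(classes) or not np.isclose(priors.sum(), 1.0):
            raise ValueError("priors must have one entry per class and sum to 1")
        return np.log(priors)
    counts = np.array([np.sum(y == c) for c in classes])
    return np.log(counts / counts.sum())
```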
@MBrouns and @koaning I'll whip up a notebook by end of day today. The differences are subtle but clear depending on the situation, so I'll make sure to demonstrate that.
Haven't heard back on this. No rush. Just wanted to check :)
Whoops sorry.
https://colab.research.google.com/drive/12z28LCt2Y76smB01w2QizK3_cCo-6y2G
I never noticed how bad the scaling is on anything larger than academic datasets, and that has actually been a big thing. I've used it on 2D species distribution data (I'm trying to see if I am allowed to use that dataset for the notebook).
Your notebook states that the time complexity for training the GMM is in fact O(p^3), where p is the number of dimensions. I don't know how sklearn does it internally, but I wonder how the EM algorithm gets that bound. Got a reference for it? Mostly out of my own curiosity.
I don't know precisely how sklearn does it either, but I figured that since the most expensive operation in the expectation step is a matrix product, that would be O(p^3) (and apparently ~p^2.3 in practice). It should also scale linearly with the number of clusters.
(Not an academic level citation I know but https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra)
Strange. I copied your notebook and seem to have a significantly different result at the bottom. The colab can be found here. I seem to get roughly the same scores across all approaches. I increased the granularity of the plot a bit.
That is extremely strange. I've requested access to your colab notebook, but I'll rerun it to see what is going on. This makes more sense anyway, because when I was rerunning it again and again I couldn't figure out why everything was losing performance at very large sample sizes.
EM typically has a stochastic initialisation, so there's a risk of getting stuck in a bad local optimum. That's my 5 cents.
I didn't see an access request. Maybe this link?
That link worked. While that could explain the GMM results I observed, it wouldn't explain the KDE results.
Update: I was able to get similar results to you on a fresh start of my notebook this time. I guess it was a strange Heisenbug.
https://colab.research.google.com/drive/12z28LCt2Y76smB01w2QizK3_cCo-6y2G
I added a species example. It turns out sklearn has a species dataset, so you can see a realistic expected use case.
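This is presumably sklearn's fetch_species_distributions dataset; a rough sketch of turning it into an (X, y) pair could look like the following (the field names follow sklearn's species-KDE example and should be double-checked):

```python
import numpy as np
from sklearn.datasets import fetch_species_distributions

# downloads the species distribution data on first use
data = fetch_species_distributions()

# each training record carries a species name plus a latitude/longitude coordinate
X = np.vstack([data.train['dd lat'], data.train['dd long']]).T
y = np.array(
    [rec.decode('ascii').startswith('micro') for rec in data.train['species']],
    dtype=int,
)
```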
At the end all algorithms seem to drop again with a large 'n' ... maybe increase the number of iterations to be sure it converges properly?
I finally found the real bug causing the problem. It's fixed now tho.
The problem was that, when subsetting the data, a line effectively like the following was run: x_subset = x[:len(x)]
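To spell the bug out: x[:len(x)] slices the first len(x) rows, i.e. it always returns the whole array, so the data was never actually subsampled. A tiny illustration (the variable names here are mine):

```python
import numpy as np

x = np.arange(10)
subset_size = 5

x_bug = x[:len(x)]       # always the full array -- no subsetting happens
x_fix = x[:subset_size]  # presumably what was intended: keep only the first subset_size rows

assert len(x_bug) == len(x)
assert len(x_fix) == subset_size
```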
Also, @MBrouns, which part of sklearn style specifically does the class not adhere to? I assume something is wrong with the __init__ method, but after comparing it to a few sklego classes I don't see what you're seeing.
PS: I do know I have to add all the fancy check_array() calls and whatnot.
It's quite subtle, but in the __init__ the classes_, models_, and priors_logp_ attributes are initialized. This causes check_is_fitted to think that the model has already been fitted (because there exist attributes with a trailing underscore on the class).
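A quick way to see that behaviour in isolation (a small demo of check_is_fitted, not code from the thread; both class names are made up):

```python
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class LooksFitted(BaseEstimator):
    """Sets a trailing-underscore attribute in __init__ (the problematic pattern)."""

    def __init__(self):
        self.classes_ = None  # looks like fitted state even though fit() never ran

    def fit(self, X, y):
        return self


class NotYetFitted(BaseEstimator):
    """Sets no attributes in __init__ (the pattern sklearn expects)."""

    def fit(self, X, y):
        self.classes_ = None
        return self


check_is_fitted(LooksFitted())  # passes silently because classes_ already exists

try:
    check_is_fitted(NotYetFitted())
except NotFittedError:
    print("correctly reported as not fitted")
```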
@arose13 Feel free to start working on an implementation and PR.
As for the docs, it would be preferable if you could condense them, mostly in terms of compute time: we might want to auto-generate these docs and we prefer the docs to build in mere minutes.
As for tests, feel free to look at how we test the GMM-style classifiers. If you do something similar then it's fine.
Quick review of the implementation:
```python
# imports implied by the snippet
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity


class BayesianKernelDensityClassifier(BaseEstimator, ClassifierMixin):
    """
    NOTE: This class is still being developed to make it work well with all the sklearn components.

    Bayesian classifier that uses kernel density estimations to generate the joint distribution.

    Parameters:
    - bandwidth: float
    - kernel: kernel name passed to scikit-learn's KernelDensity
    """
    def __init__(self, bandwidth=0.2, kernel='gaussian'):
        self.classes_, self.models_, self.priors_logp_ = [None] * 3
        self.bandwidth = bandwidth
        self.kernel = kernel

    def fit(self, X, y):
        self.classes_ = np.sort(np.unique(y))
        training_sets = [X[y == yi] for yi in self.classes_]
        self.models_ = [KernelDensity(bandwidth=self.bandwidth, kernel=self.kernel).fit(x_subset)
                        for x_subset in training_sets]
        self.priors_logp_ = [np.log(x_subset.shape[0] / X.shape[0]) for x_subset in training_sets]
        return self

    def predict_proba(self, X):
        logp = np.array([model.score_samples(X) for model in self.models_]).T
        result = np.exp(logp + self.priors_logp_)
        return result / result.sum(1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), 1)]
```
Looking at this implementation:
- Attributes in __init__ should not have a trailing underscore.
- __init__ parameters should be verified (and perhaps raise ValueError) in the .fit() method.
- Attributes set in the .fit() method need to end in an underscore (like self.models_); note that if you add something in .fit(), we should not have it added in __init__ as None.
- Attributes set in .fit() need to be checked if they exist in the predict methods with check_is_fitted.
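Putting those points together, a sketch of the class restructured along those lines could look as follows; this only illustrates the review feedback and is not necessarily the code that was eventually merged:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neighbors import KernelDensity
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


class BayesianKernelDensityClassifier(BaseEstimator, ClassifierMixin):
    """Bayesian classifier that fits one kernel density estimate per class."""

    def __init__(self, bandwidth=0.2, kernel='gaussian'):
        # __init__ only stores the hyperparameters, nothing else
        self.bandwidth = bandwidth
        self.kernel = kernel

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        # verify the hyperparameters here rather than in __init__
        if self.bandwidth <= 0:
            raise ValueError("bandwidth must be positive")

        # everything learned from the data ends in a trailing underscore
        self.classes_ = np.sort(np.unique(y))
        training_sets = [X[y == yi] for yi in self.classes_]
        self.models_ = [
            KernelDensity(bandwidth=self.bandwidth, kernel=self.kernel).fit(x_subset)
            for x_subset in training_sets
        ]
        self.priors_logp_ = [
            np.log(x_subset.shape[0] / X.shape[0]) for x_subset in training_sets
        ]
        return self

    def predict_proba(self, X):
        check_is_fitted(self)
        X = check_array(X)
        logp = np.array([model.score_samples(X) for model in self.models_]).T
        result = np.exp(logp + self.priors_logp_)
        return result / result.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```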
Okay cool, I'll open a work-in-progress pull request. I'm not sure where to put it actually. I was going to put it in the naive_bayes submodule since similar classifiers are there, but neither this nor any of the classes there implements a particularly 'naive' Bayes. But that's small potatoes. I could also make a new submodule called neighbours.
And now for something completely different:
@arose13 if there are any lessons you've learned while implementing this feature, feel free to add them here:
https://github.com/koaning/scikit-lego/issues/106
This library offers a poem with words of wisdom that will appear once you type:
from sklego import this
When you make a meaningful contribution to our library, you're also invited to add something to the poem.
lol! That's very cute. I'm not much of a poet, so I'll have to think very hard about this one ^_^
This is now merged.
The prior is $P(y=0)$. I primarily use it for spatial problems.
It is similar to the GMM Classifier with only 2 caveats I can think of.