mennthor / awkde

Adaptive Width KDE with Gaussian Kernels
MIT License

Does it work with 2-dimensional Kernel Density? #5

Open mauriziopinna opened 3 years ago

mauriziopinna commented 3 years ago

Hi! Does this library also work for two-dimensional data for Kernel Density Estimation? I've tried it with a dataset on which I'm also working separately with the Sklearn KDE. The awkde library leads to incredible oversmoothing. Is it possible that I'm doing something wrong?

Here is the comparison on a sample of 2000 points, done with Sklearn KDE in the first image and with awkde (oversmoothed) in the second image. I want to underline that I've used the same global bandwidth for both.

[Images: Sklearn KDE density (first) vs. awkde density, visibly oversmoothed (second)]
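
Roughly, the comparison is set up like this (minimal sketch; the generated sample, grid, and bandwidth value are illustrative stand-ins for my actual data, and I'm assuming awkde's usual fit/predict interface):

    import numpy as np
    from sklearn.neighbors import KernelDensity
    from awkde import GaussianKDE

    # Illustrative 2D spatial sample standing in for the real lon/lat data (meters)
    rng = np.random.default_rng(0)
    X = rng.normal(loc=0.0, scale=5000.0, size=(2000, 2))

    # Common evaluation grid over the data range
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 100),
                         np.linspace(X[:, 1].min(), X[:, 1].max(), 100))
    grid = np.c_[xx.ravel(), yy.ravel()]

    # scikit-learn KDE: the bandwidth is in the same units as the data
    sk = KernelDensity(kernel="gaussian", bandwidth=300.0).fit(X)
    dens_sklearn = np.exp(sk.score_samples(grid))

    # awkde: the same nominal global bandwidth passed directly
    kde = GaussianKDE(glob_bw=300.0)
    kde.fit(X)
    dens_awkde = kde.predict(grid)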

mauriziopinna commented 3 years ago

UPDATE: If I set a global bandwidth that is the bandwidth used with the scikit-learn KDE divided by 1000 (i.e. bw/1000), I can obtain a distribution that is comparable to the scikit-learn one. Why is this?

(I want to specify that the dataset is composed of spatial data, longitude and latitude expressed in meters. So a bandwidth that makes sense is about 300 meters, as I used in the scikit-learn KDE, while 0.03 meters is quite weird.)

Figure: density with awkde GaussianKDE, glob_bw = 0.03


mennthor commented 2 years ago

Hi, my best guess is that it's because the normalized (standardized) sample is used for the bandwidth calculation, but I'm not sure. I never needed the bandwidths to correspond to physical units as in your case, because I just tried many values and validated against simulated distributions until they matched closely.

        # Standardize the sample (zero mean, whitened covariance via Cholesky);
        # the bandwidths below then refer to these standardized coordinates
        self._std_X, self._mean, self._cov = standardize_nd_sample(
            X, cholesky=True, ret_stats=True, diag=self._diag_cov)

        # Get global bandwidth number
        self._glob_bw = self._get_glob_bw(self._glob_bw)

        # Build local bandwidth parameter if alpha is set
        if self._adaptive:
            self._kde_values = self._evaluate(self._std_X, adaptive=False)
            self._calc_local_bandwidth()

Would this make sense with the scales you are experiencing in your dataset?
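
If that reading of the code is right, the global bandwidth is applied to the standardized sample, so it is effectively measured in units of the sample's spread rather than in meters. A back-of-the-envelope sketch (the ~10 km spread is an assumed illustrative value; the actual standardization whitens with the full Cholesky factor, not just a per-axis std):

    import numpy as np

    # Illustrative spatial sample with ~10 km spread per coordinate (meters)
    rng = np.random.default_rng(1)
    X = rng.normal(scale=10_000.0, size=(2000, 2))

    bw_meters = 300.0                     # bandwidth that makes sense physically
    spread = X.std(axis=0).mean()         # rough per-axis scale of the sample
    bw_standardized = bw_meters / spread  # same bandwidth in standardized units

    print(spread, bw_standardized)        # ~10000 m -> glob_bw on the order of 0.03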

matfax commented 1 year ago

So the algorithm, though designed to handle multivariate data, has never been validated as such?

Or is this just a use-case-specific issue where the localized bandwidth with Scott/Silverman estimation isn't accurate?
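
(For reference, the usual d-dimensional Silverman rule of thumb gives a dimensionless number of order 0.1-1 when applied to a whitened sample; whether awkde's "silverman"/"scott" options use exactly this expression is an assumption on my part:)

    # Standard d-dimensional Silverman rule of thumb for whitened data
    def silverman_bw(n_samples, n_dim):
        return (n_samples * (n_dim + 2) / 4.0) ** (-1.0 / (n_dim + 4))

    print(silverman_bw(2000, 2))  # ~0.28 for a 2000-point 2D sample like the one in this issue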

mennthor commented 1 year ago

I'd say that's true :D I used it for a very specific 2D case I had, but I never took the time to properly test and verify everything. As said before, I just cross-validated the bandwidth and didn't bother with what the actual value was or how it matched other KDE libs.