Thanks again for a great library! We've run into an issue where the estimated PDFs are scaled incorrectly for the ISJ method under the following conditions:
The data have a narrow standard deviation
We provide a wide grid spacing relative to the observed data
This error does not occur with Silverman's bandwidth calculation.
Below I've pasted code for replicating the issue, along with the output. Note how the ISJ approach greatly overestimates the PDF when providing our own evenly-spaced grid points. This does not occur if we narrow the range of the grid points or expand the standard deviation of the random data.
I tried to figure out what might be happening in the source, but I was unable to track down the issue, but it might have something to do with the real_bw calculation giving rise to the L factor. Thus, I'm unable to provide a PR, just an example for replicating the error, which will hopefully be useful in figuring out what's happening.
Thanks!
import numpy as np
from KDEpy import FFTKDE
import matplotlib.pyplot as plt
# set up some params for the example
rmin = -2
rmax = 2
xvals = np.linspace(rmin, rmax, 512)
dat = np.random.normal(loc=0, scale=.05, size=1500)
# calculate with ISJ
isj = FFTKDE(kernel='epa', bw="ISJ").fit(np.array(dat))
# eval with xvals
yval_pdf_ISJ=isj.evaluate(xvals)
# eval with default values
gps, yval_pdf_ISJ_e = isj.evaluate()
# calc with silverman
silver = FFTKDE(kernel='epa', bw="silverman").fit(np.array(dat))
yval_pdf_silverman = silver.evaluate(xvals)
# plot the data
plt.hist(dat, density=True, bins='auto', alpha=0.5)
# plot the fits
plt.plot(xvals, yval_pdf_ISJ, alpha=0.8, label='ISJ')
plt.plot(gps, yval_pdf_ISJ_e, alpha=0.8, label='ISJe')
plt.plot(xvals, yval_pdf_silverman, alpha=0.8, label='silverman')
# set the range
plt.xlim(-0.5, .5)
# add legend
plt.legend()
Hi!
Thanks again for a great library! We've run into an issue where the estimated PDFs are scaled incorrectly for the ISJ method under the following conditions:
This error does not occur with Silverman's bandwidth calculation.
Below I've pasted code for replicating the issue, along with the output. Note how the ISJ approach greatly overestimates the PDF when providing our own evenly-spaced grid points. This does not occur if we narrow the range of the grid points or expand the standard deviation of the random data.
I tried to figure out what might be happening in the source, but I was unable to track down the issue, but it might have something to do with the
real_bw
calculation giving rise to theL
factor. Thus, I'm unable to provide a PR, just an example for replicating the error, which will hopefully be useful in figuring out what's happening.Thanks!