statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

ENH: kernel density estimation with asymmetric kernels, on R+, unit interval #7346

Open josef-pkt opened 3 years ago

josef-pkt commented 3 years ago

part of #7338

beta, gamma, invgauss and recipinvgauss kernels can be obtained through scipy's distributions, with appropriate parameterization. Birnbaum-Saunders (fatiguelife) should also be possible, but I haven't tried it yet.
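A minimal sketch of how a beta kernel can be built from scipy's beta distribution. The parameterization a = x / bw + 1, b = (1 - x) / bw + 1 is one choice from the beta-kernel literature and is an assumption here, not necessarily what the PR uses; the function name kde_beta is hypothetical.

```python
import numpy as np
from scipy import stats


def kde_beta(x, sample, bw):
    """Beta-kernel density estimate on the unit interval (illustrative only)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    # Assumed parameterization: the evaluation point x sets both shape parameters.
    a = x[:, None] / bw + 1
    b = (1 - x[:, None]) / bw + 1
    # The kernel is evaluated at the data points and averaged over the sample.
    return stats.beta.pdf(sample[None, :], a, b).mean(axis=1)


rng = np.random.default_rng(0)
sample = rng.beta(2, 5, size=1000)
x_plot = np.linspace(0, 1, 201)
dens = kde_beta(x_plot, sample, bw=0.05)
```

The estimate is nonnegative by construction but does not integrate exactly to one, so some renormalization may be wanted in practice.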

I don't know if MultivariateKDE is set up for asymmetric kernels. In contrast to symmetric kernels, the kernel does not depend on the distance t - x; it is a nonlinear function K(t, x) where x becomes a shape parameter and the bandwidth is a scale parameter.
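To illustrate that K(t, x) is not a function of t - x, here is a small sketch using a gamma kernel with shape x / bw + 1 and scale bw. This parameterization and the helper name gamma_kernel are assumptions for illustration.

```python
import numpy as np
from scipy import stats


def gamma_kernel(t, x, bw):
    """Gamma kernel K(t, x): a density in t with shape x / bw + 1 and scale bw."""
    return stats.gamma.pdf(t, x / bw + 1, scale=bw)


bw = 0.5
# Both pairs have the same difference t - x = -0.5, but the kernel values differ.
k_near_zero = gamma_kernel(0.5, 1.0, bw)
k_far = gamma_kernel(4.5, 5.0, bw)

# For fixed x the kernel is a proper density in t on R+, with its mode at t = x.
t = np.linspace(0, 60, 60001)
k = gamma_kernel(t, 1.0, bw)
mass = np.sum((k[1:] + k[:-1]) / 2 * np.diff(t))
mode = t[np.argmax(k)]
```

With this parameterization the kernel is strongly skewed near the boundary and becomes more symmetric as x moves away from zero.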

easy:

maybe easy

extras

The last part needs kernel-specific formulas, i.e. we need kernel classes with extras. Currently, I'm working with just simple functions that compute kernel-pdf or kernel-cdf.

other targets

multivariate

performance, speedups, not clear to me yet

kernels

tails

I haven't seen any references yet. But the kernels should imply different tail behaviors, e.g. can we get heavy tails? I guess we might want to choose kernels depending on behavior around x=0 boundary and on the behavior in tails.

status

Currently I mainly have the scipy distribution parameterization of the kernels, which gives pdf and cdf (and maybe rvs). They work if I choose the bandwidth by visual inspection.

I would like to park those functions and leave the rest for another year.

(list of references coming later)

josef-pkt commented 3 years ago

found an R package with BS, Gamma, Erlang and LN kernels while looking for Birnbaum-Saunders kernel articles https://cran.r-project.org/web/packages/DELTD/index.html

I haven't done a systematic search for functions in R yet.

josef-pkt commented 3 years ago

browsing current nonparametric kde code

MultivariateKDE does not assume symmetric distance kernels K(x - xi); for example, aitchison_aitken for categorical data does not use a simple distance. The univariate kde and the kernels defined in sandbox use the distance measure |x - xi| and so will not work for asymmetric kernels.

That means that it should be possible to add asymmetric kernels and additional data types to MultivariateKDE (e.g. "u" for unit interval and "p" for R+).

josef-pkt commented 3 years ago

binning again

I ran my notebook with all kernels using histogram binning. With 50 bins, several kernels show spikes; gamma and beta look fine in the example. With 100 bins, all kde and kernel-cdf estimates look good; weibull has some wiggles and might need a larger bw than in my original example.

The examples use nobs=1000 and the following binning function, where rvs_ is the original random sample:

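import numpy as np

def get_bins(rvs, bins=100):
    # Histogram counts and bin edges of the raw sample.
    count, edges = np.histogram(rvs, bins=bins)
    # Bin midpoints act as the reduced sample.
    center = edges[:-1] + np.diff(edges) / 2
    # Relative frequencies act as weights and sum to 1.
    probs = count / count.sum()
    return center, probs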

rvs, weights = get_bins(rvs_)
kde = kern.pdf_kernel_asym(x_plot, rvs, bw, "gamma2", weights=weights)
kce = kern.cdf_kernel_asym(x_plot, rvs, bw, "gamma2", weights=weights)
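
A self-contained sketch of the binned versus unbinned comparison. Since the PR functions are not reproduced here, pdf_gamma_kernel is a hypothetical stand-in that uses a gamma kernel with shape x / bw + 1 and scale bw; the sample distribution, grid and bandwidth are also just illustrative choices.

```python
import numpy as np
from scipy import stats


def pdf_gamma_kernel(x, sample, bw, weights=None):
    """Hypothetical stand-in for a weighted asymmetric-kernel density."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    if weights is None:
        # Equal weights reproduce the plain sample average.
        weights = np.full(sample.shape[0], 1 / sample.shape[0])
    kvals = stats.gamma.pdf(sample[None, :], x[:, None] / bw + 1, scale=bw)
    return kvals @ weights


def get_bins(rvs, bins=100):
    count, edges = np.histogram(rvs, bins=bins)
    center = edges[:-1] + np.diff(edges) / 2
    return center, count / count.sum()


rng = np.random.default_rng(12345)
rvs_ = rng.gamma(3.0, scale=1.0, size=1000)
x_plot = np.linspace(0, 12, 121)
bw = 0.2

# Estimate from the full sample versus from bin centers with bin probabilities.
dens_full = pdf_gamma_kernel(x_plot, rvs_, bw)
center, probs = get_bins(rvs_, bins=100)
dens_binned = pdf_gamma_kernel(x_plot, center, bw, weights=probs)
max_abs_diff = np.max(np.abs(dens_full - dens_binned))
```

Binning replaces nobs kernel evaluations per grid point with one per bin, which is where a speedup for large samples would come from; max_abs_diff gives a rough check of how much accuracy is lost for a given number of bins.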

(this comment was supposed to be in the PR, but ok here)

josef-pkt commented 3 years ago

Some kernels require that the density at zero is zero, f(0) = 0. Those kernels cannot estimate an f(0) > 0. I didn't keep track of which kernels require that (gamma, or gamma2, does not). I'm starting to add references as I see them again.
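
A rough numerical illustration of the boundary restriction, using an exponential sample whose true density has f(0) = 1. Both parameterizations are assumptions for illustration: the gamma kernel uses shape x / bw + 1 and scale bw, and the log-normal kernel uses scale x and shape bw, which is not necessarily the version in Igarashi (2016) or in the PR.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.exponential(1.0, size=1000)  # true density has f(0) = 1
bw = 0.1
x0 = 1e-10  # evaluation point right next to the boundary

# Gamma kernel: the shape goes to 1 as x -> 0, so the kernel becomes an
# exponential density in the data and the estimate stays positive.
f0_gamma = stats.gamma.pdf(sample, x0 / bw + 1, scale=bw).mean()

# Log-normal kernel (assumed scale x, shape bw): as x -> 0 the kernel mass
# collapses onto zero, so every positive observation gets essentially no weight.
f0_lognorm = stats.lognorm.pdf(sample, bw, scale=x0).mean()
```

In this sketch the log-normal estimate next to the boundary is numerically zero regardless of the data, while the gamma estimate remains clearly positive.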

kernels with f(0) = 0

log-normal: Igarashi 2016, with changes to the kernel to allow f(0) > 0
(generalized) bs: mentioned in Igarashi 2016

Gaku Igarashi (2016): Weighted log-normal kernel density estimation, Communications in Statistics - Theory and Methods, DOI: 10.1080/03610926.2014.963623

josef-pkt commented 3 years ago

followup: the R package evmix has beta1, beta2 and gamma1, gamma2 kernel estimators, and several traditional boundary correction methods for symmetric kernels

Hu, Yang, and Carl Scarrott. 2018. “Evmix: An R Package for Extreme Value Mixture Modeling, Threshold Estimation and Boundary Corrected Kernel Density Estimation.” Journal of Statistical Software 84 (1): 1–27. https://doi.org/10.18637/jss.v084.i05.

kdensity also has gamma and beta, and a kernel based on gaussian copula by Jones and Henderson (I didn't read that article) https://cran.r-project.org/web/packages/kdensity/readme/README.html