The Julia package for nonparametric kernel density estimate and regression. This package currently includes univariate kernel density estimate, local constant regression (Nadaraya-Watson regression) and local linear regression. It can also compute the Bootstrap confidence band [4].
This package provides Gamma and Beta kernel to deal with bounded density estimation and regression. These two kernels are free of boundary bias for one side and two sides bounded data respectively, see [2, 3]. In particular, this package provide least square cross validation (LSCV) bandwidth selection functions for Gamma and Beta kernels.
Bandwidth selection is critical in kernel estimation. LSCV is always recommended. Likelihood cross validation is provided but should be avoided because of known drawbacks. For regression problem, the bandwidth of local constant regression is selected using LSCV while that for local linear regression is chosen by AIC [6].
To install and use this package in Julia,
Pkg.add("KernelEstimator")
using KernelEstimator
See usage under examples/
directory.
This package calculate densities via direct approach, i.e. adding kernel functions together. To define new kernel, you need to define a new function takes the same arguments as gaussiankernel
and output the kernel weights at given point. If no bandwidth selection function is provided, lscv with numeric integration will be used by default.
This package provides two major functions, kerneldensity
for kernel density estimation and npr
for nonparametric regression. For kernel density, you can simply use
xdata = randn(1000)
kerneldensity(xdata)
or specify some options
xeval = linspace(-3, 3, 100)
bw = bwlscv(xdata, gaussiankernel)
kerneldensity(xdata, xeval=xeval, lb=-Inf, ub=Inf, kernel=gaussiankernel,h = bw)
xeval
specifies the position you want to evaluate the density at. Default to be the same as xdata
. lb
and ub
means lower bound and upper bound of the data. If you specify either of them to be some finite value, user choice of kernel function will be suppressed and gammakernel
will be used with a warning. If you specify both, betakernel
is used with a warning if user's choice is different.
For kernel regression, you can use
x = rand(Beta(4,2), 500) * 10
y=2 .* x.^2 + x .* rand(Normal(0, 5), 500)
npr(x, y)
or change the default by
npr(x, y, xeval=x, reg=locallinear, kernel=betakernel,lb=0.0, ub=10.0)
reg
specifies the order of local polynomial regression. You can choose localconstant
, local constant regression or locallinear
, local linear regression. locallinear
has better theoretical properties in prediction y
and is used by default but is more computing intensive.
There is also a function computing the bootstrap confidence interval for regression.
bootstrapCB(x, y; xeval=x, B=500, reg=locallinear, lb=-Inf, ub=Inf, kernel=gaussiankernel)
B
specifies the number of bootstrap sampling.
The following functions are also provided:
localconstant(xdata::RealVector, ydata::RealVector; xeval::RealVector=xdata, kernel::Function=gaussiankernel, h::Real=bwlocalconstant(xdata,ydata,kernel))
, local constant regression (or Nadaraya-Watson)
locallinear(xdata::RealVector, ydata::RealVector; xeval::RealVector=xdata, kernel::Function=gaussiankernel, h::Real=bwlocalconstant(xdata,ydata,kernel))
, local linear regression
and bandwidth selection functions:
bwnormal(xdata::Vector)
, bandwidth selection for density estimate by referencing to normal distribution
bwlscv(xdata::RealVector, kernel::Function)
, bandwidth selection for density estimate by least square cross validation
bwlcv(xdata::RealVector, kernel::Function)
, bandwidth selection for density estimate by likelihood cross validation
bwlocalconstant(xdata, ydata::Vector, kernel)
, bandwidth selection for local constant regression using LSCV
bwlocallinear(xdata, ydata::Vector, kernel)
, bandwidth selection for local linear regression using corrected AIC. See reference [6]
The meaning of arguments:
xeval
is the point(s) where the density or fitted value is calculated
xdata
is the input X
ydata
is the response vector y; should have same length with xdata
reg
is the regression function, localconstant
or locallinear
kernel
defaults to be gaussiankernel
; should be a function
h
is the bandwidth, should be a real scalar; If negative, the default bandwidth selection method will be used to find the bandwidth and replace it
lb
, ub
are the boundary for x. Must provide if use Beta or Gamma kernel
Kernel density estimate
using Distributions
x=rand(Normal(), 500)
xeval=linspace(minimum(x), maximum(x), 100)
den=kerneldensity(x, xeval=xeval)
Local regression
y=2 .* x.^2 + rand(Normal(), 500)
yfit0=localconstant(x, y, xeval=xeval)
yfit1=locallinear(x, y, xeval=xeval)
yfit0=npr(x, y, xeval=xeval, reg=localconstant)
yfit1=npr(x, y, xeval=xeval, reg=locallinear)
Confidence Band
cb=bootstrapCB(x, y, xeval=xeval)
using Gadfly
plot(layer(x=x, y=y, Geom.point), layer(x=xeval, y=yfit1, Geom.line, Theme(default_color=color("black"))),
layer(x = xeval, y = cb[1,:], Geom.line, Theme(default_color=color("red"))),
layer(x=xeval, y=cb[2,:], Geom.line, Theme(default_color=color("red"))))
[1] Lecture notes from Dr. Song Xi Chen
[2] Chen, Song Xi. "Beta kernel estimators for density functions." Computational Statistics & Data Analysis 31, no. 2 (1999): 131-145.
[3] Chen, Song Xi. "Probability density function estimation using gamma kernels." Annals of the Institute of Statistical Mathematics 52, no. 3 (2000): 471-480.
[4] W. Hardle and J. S. Marron (1991). Bootstrap Simultaneous Error Bars for Nonparametric Regression. The Annals of Statistics. Vol. 19, No. 2 (Jun., 1991), pp. 778-796
[5] W.Hardle and E. Mammen (1993). Comparing Nonparametric Versus Parametric Regression Fits. The Annals of Statistics. Vol. 21, No. 4 (Dec., 1993), pp. 1926-1947
[6] Clifford M. Hurvich, Jeffrey S. Simonoff and Chih-Ling Tsai (1998). Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion. Journal of the Royal Statistical Society. Series B (Statistical Methodology), Vol. 60, No. 2 (1998), pp. 271-293