markrogoyski / math-php

Powerful modern math library for PHP: Features descriptive statistics and regressions; Continuous and discrete probability distributions; Linear algebra with matrices and vectors, Numerical analysis; special mathematical functions; Algebra
MIT License

Kernel Density Estimation: Improved Sheather-Jones algorithm #457

Open orlandothoeny opened 2 years ago

orlandothoeny commented 2 years ago

The KernelDensityEstimation class currently falls back to the normal distribution approximation bandwidth estimator (see KernelDensityEstimation::getDefaultBandwidth()) when no bandwidth is passed to the constructor.
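
For reference, the normal distribution approximation is usually written in the textbook form below, where $\hat{\sigma}$ is the standard deviation of the data and $n$ is the number of observations. Whether the library's implementation matches this term for term (for example, sample versus population standard deviation) is an assumption here, not something stated in this thread.

```math
h = \left( \frac{4\,\hat{\sigma}^{5}}{3\,n} \right)^{1/5} \approx 1.06\,\hat{\sigma}\,n^{-1/5}
```

Because the rule is derived for normally distributed data, it tends to oversmooth multimodal or heavily skewed samples.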

It would be useful to be able to choose the Improved Sheather-Jones algorithm as the bandwidth function, especially when working with datasets that are not normally distributed.

Some resources about Sheather-Jones:

markrogoyski commented 2 years ago

Hi @orlandothoeny,

Thanks for your interest in MathPHP.

Thanks for suggesting a new kernel function as a feature improvement. We'll look into it and see if this is something we can add.

In the meantime, you are able to add your own custom kernel function by supplying a PHP callable to the setKernelFunction method of a KernelDensityEstimation object.

Mark

Beakerboy commented 2 years ago

@markrogoyski I believe this request is referring to the bandwidth, not the kernel. Currently the object accepts a float or null. If null, a default bandwidth is calculated and used.

@orlandothoeny if you are able to implement the calculation, we could easily add it as a static method, such that a user could call something like:

$bandwidth = KernelDensityEstimation::ISJBandwidth($data);
$kde->setBandwidth($bandwidth);

This would be the most backward-compatible strategy.

orlandothoeny commented 2 years ago

@Beakerboy Yes, that's correct. This would be an additional method for calculating the bandwidth.

That would be one option regarding backward compatibility; another would be to allow callables as an additional type for the $bandwidth parameter. But the option you described is probably simpler.

I'd have to brush up on my math a bit to implement it myself; a few years have passed since I last used that stuff :) I'm not sure I have the time to do that, though.

I understand that it's an open-source project, so no pressure on you guys. It's your free time. But if someone wants to implement it, I'd be grateful.

markrogoyski commented 2 years ago

@orlandothoeny,

What could help speed up an implementation is providing test data to write unit tests against.

For example:

Having data to write unit tests against allows us to be confident we are building the right calculation.

Another option is to research and provide instructions on how to produce test data using a trustworthy tool such as R or NumPy.
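
As a rough illustration of what such a reference-data script could look like, here is a minimal Python sketch that uses only the standard library. It does not implement Improved Sheather-Jones. The `bandwidth_fn` argument is a hypothetical hook where a trusted ISJ implementation would be plugged in, and the normal distribution approximation stands in for it so the script runs as written. All function names, the use of the sample standard deviation, and the five-point dataset are assumptions made for illustration, not values from this thread.

```python
import json
import math
import statistics


def normal_reference_bandwidth(data):
    """Normal distribution approximation: h = (4 s^5 / (3 n))^(1/5).

    Uses the sample standard deviation, which is an assumption about
    what the PHP side should be compared against.
    """
    n = len(data)
    s = statistics.stdev(data)
    return (4 * s ** 5 / (3 * n)) ** 0.2


def gaussian_kde(data, h, x):
    """Kernel density estimate at x with a standard normal kernel."""
    n = len(data)
    total = sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
    return total / (n * h * math.sqrt(2 * math.pi))


def build_fixture(name, data, bandwidth_fn, points):
    """Bundle a dataset with its expected bandwidth and densities."""
    h = bandwidth_fn(data)
    return {
        "name": name,
        "data": data,
        "bandwidth": h,
        "density": {str(x): gaussian_kde(data, h, x) for x in points},
    }


# Illustrative dataset; swap in real samples and a trusted ISJ
# implementation for bandwidth_fn to produce ISJ reference values.
fixture = build_fixture(
    "five_points",
    [1.0, 2.0, 3.0, 4.0, 5.0],
    normal_reference_bandwidth,
    [1.0, 3.0, 5.0],
)
print(json.dumps(fixture, indent=2))
```

The JSON output can be pasted into a unit test as expected values, ideally alongside the exact tool and version that produced them so the numbers can be regenerated later.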