scipy / scipy

SciPy library main repository
https://scipy.org
BSD 3-Clause "New" or "Revised" License
13.05k stars 5.18k forks source link

ENH: Custom PDF analytical integral though CDF rather than PPF #14702

Closed acampove closed 3 years ago

acampove commented 3 years ago

Is your feature request related to a problem? Please describe.

When implementing a custom PDF in order to draw random samples like in:

class pos_tanh(st.rv_continuous):
    def __init__(self, m=0, s=1, *args, **kwargs):
        self.m = m 
        self.s = s 
        super().__init__(*args, **kwargs)

        upper = self.__integral(self.b)
        lower = self.__integral(self.a)
        normalization = upper - lower

        if normalization <= 0:
            log.error('Found normalization {} = {:.3e} - {:.3e}, in range ({:.3e},{:.3e})'.format(normalization, upper, lower, self.a, self.b))
            raise
        else:
            self.normalization = normalization 
    #--------------------------
    def __integral(self, x): 
        u = self.s * (x - self.m)

        return x + numpy.log(numpy.cosh(u)) / self.s
    #--------------------------
    def _cdf(self, x): 
        area = self.__integral(x) - self.__integral(self.a)

        cdf = area / self.normalization 

        return cdf 

I can calculate myself the integral and implement _cdf. However the generation of samples is extremely slow. Apparently I need _ppf also implemented, which I cannot do easily given that the integral is not invertible:

        u = self.s * (x - self.m)

        return x + numpy.log(numpy.cosh(u)) / self.s

Describe the solution you'd like.

The integral should be all that is needed in the form of a CDF. Why do I have to invert the CDF myself? At the end the only thing you need is the normalization.

Describe alternatives you've considered.

The alternative seems to either invert in an approximate way, which brings numerical precision problems or do the integral numerically, which will be terribly inneficient and slow.

Additional context (e.g. screenshots, GIFs)

No response

tirthasheshpatel commented 3 years ago

Hi @acampove,

If you are using SciPy >= 1.7.0, you can use the new fast numerical inversion method NumericalInverseHermite to draw samples from your custom distribution.

It approximates the PPF using cubic Hermite splines and provides a very fast sampling method. All you need to do is implement the CDF and optionally the PDF method.

import numpy as np
from scipy.stats import NumericalInverseHermite, rv_continuous
from scipy import stats

class MyDistribution(rv_continuous):
    def _pdf(self, x, shape_parameters):
        # implement the pdf here
        return pdf
    def _cdf(self, x, shape_parameters):
        # implement the cdf here
        return cdf

distribution = MyDistribution(name="distribution")
dist = distribution(shape_parameters)  # freeze the distribution

# calculate the PPF
fni = NumericalInverseHermite(dist)

# sample from the distribution
rvs = fni.rvs(100000)

# we can also calculate the approximated PPF
ppf = fni.ppf(np.linspace(0.01, 0.99, num=100000))

# you can also sample quasi monte carlo samples from the given distribution
qrvs = fni.qrvs(100_000, qmc_engine=stats.qmc.Halton(2))
andyfaff commented 3 years ago

I'm looking forward to using NumericalInverseHermite.

To elaborate on your original post:

To generate random numbers from an rv_continuous you should at least be defining _pdf and _cdf, you didn't define _pdf above.

In order to generate random variables from any rv_continuous distribution a random number is generated in the interval [0, 1). This is then given to rv_continuous.ppf (inverse of _cdf) to obtain the random variate. You don't need to override the _ppf method yourself. However, if you don't override it then the inverse CDF has to be calculated in rv_continuous using a numerical root finding algorithm, which can take a long time. This is just the way of the world, there's no way around it. Or you can use @tirthasheshpatel approach.

mdhaber commented 3 years ago

NumericalInverseHermite currently needs a ppf method and isf method. Each is evaluated at exactly one point, just to find practical boundaries of the support. To get around this, you may be interested in: https://github.com/scipy/scipy/issues/14568#issuecomment-906602863

The alternative seems to either invert in an approximate way, which brings numerical precision problems or do the integral numerically, which will be terribly inneficient and slow.

Yes, those are your main options if you don't define a _ppf or rvs method yourself. The approximate approach is what NumericslInverseHermite does (but this can often be accurate almost to machine precision), and the slow, numerical inversion approach is what rv_continuous does by default. If you just need random variates, you can also look at rvs_ratio_uniforms.

@tirthasheshpatel has been working on a series of PRs that will add several more options here in 1.8.0.