pymc-devs / pymc-experimental

https://pymc-experimental.readthedocs.io

Add support for long tails (and other distributions) to prior_from_idata #330

Open perrette opened 5 months ago

perrette commented 5 months ago

Hi, I wrote a small function that does a job similar to prior_from_idata, but can also handle log-normal distributions. @ricardoV94 suggested it could find its place in pymc-experimental, or at least contribute to the discussion.

import numpy as np
from scipy.stats import norm, lognorm
import pymc as pm
import arviz

def define_params_from_trace(trace, names, label, log_params=()):
    """Define prior parameters from existing trace

    (must be called within a pymc.Model context)

    Parameters
    ----------
    trace: pymc.InferenceData from a previous sampling
    names: parameter names to be redefined in a correlated manner
    label: used to add a "{label}_params" dimension to the model, and name a new "{label}_iid" i.i.d. Normal distribution used to generate the prior
    log_params: optional, list of parameter names (subset of `names`);
        these parameters are assumed to have a long tail and are normalized under a log-normal assumption

    Returns
    -------
    list of named tensors defined in the function

    Aim
    ---
    Conserve both the marginal distribution and the covariance across parameters

    Method
    ------
    The specified parameters are extracted from the trace and are fitted
    with a scipy.stats.norm or scipy.stats.lognorm distribution (as specified by `log_params`).
    They are then normalized so that their marginal posterior distribution follows ~N(0,1).
    Their joint posterior is approximated as a MvNormal, and the parameters are subsequently
    transformed back to their original mean and standard deviation
    (and the exponential is taken in the case of a log-normal parameter).
    """

    # normalize the parameters so that their marginal distribution follows N(0, 1)
    params = arviz.extract(trace.posterior[names])
    fits = []
    for name in names:
        # find distribution parameters and normalize the variable
        x = params[name].values
        if name in log_params:
            sigma, loc, exp_mu = lognorm.fit(x)
            mu = np.log(exp_mu)
            x[:] = (np.log(x - loc) - mu) / sigma
            fits.append((mu, sigma, loc))
        else:
            mu, sigma = norm.fit(x)
            x[:] = (x - mu) / sigma
            fits.append((mu, sigma))

    # add a model dimension 
    model = pm.modelcontext(None)
    dim = f'{label}_params'
    if dim not in model.coords:
        model.add_coord(dim, names)

    # Define a MvNormal
    # mu = params.mean(axis=0) # zero mean
    cov = np.cov([params[name] for name in names], rowvar=True)
    chol = np.linalg.cholesky(cov)
    # MvNormal may lead to convergence issues
    # mv = pm.MvNormal(label+'_mv', mu=np.zeros(len(names)), chol=chol, dims=dim)
    # Better to sample from an i.i.d R.V.
    coparams = pm.Normal(label+'_iid', dims=dim)
    # ... and introduce correlations by multiplication with the cholesky factor
    mv = chol @ coparams

    # Transform back to parameters with proper mu, scale and possibly log-normal
    named_tensors = []
    for i, (name, fitted) in enumerate(zip(names, fits)):
        if name in log_params:
            mu, sigma, loc = fitted
            tensor = pm.math.exp(mv[i] * sigma + mu) + loc
        else:
            mu, sigma = fitted
            tensor = mv[i] * sigma + mu
        named_tensors.append(pm.Deterministic(name, tensor))

    return named_tensors
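
For context, a minimal usage sketch (the trace `idata` and the variable names "a" and "b" are hypothetical, standing in for whatever was sampled previously):

import pymc as pm

with pm.Model() as model:
    # re-create correlated priors from the previous posterior; "b" gets the log-normal treatment
    a, b = define_params_from_trace(idata, names=["a", "b"], label="prior", log_params=["b"])
    # ... build the rest of the model on top of the re-created priors and call pm.sample() ...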

As I discuss here (and the copula references therein), the function could be extended to support any distribution, by first transforming the trace posterior to a uniform distribution, and then from uniform to normal. The approach could be applied to any distribution for which pymc implements the inverse CDF method. However, I don't know how that could (negatively) affect convergence. In particular, if Interpolated implemented the inverse CDF method (it does not currently, but that is probably not a big deal to implement?), the approach could handle any distribution empirically, while taking into account the correlations across parameters (see the copula contributions here and there).
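
To make that concrete, here is a rough single-variable illustration of the transform chain (plain NumPy/SciPy, assuming the samples have no ties); a model-level implementation follows in my next comment:

import numpy as np
from scipy.interpolate import interp1d
from scipy.stats import norm

x = np.random.default_rng(0).gamma(2.0, size=1000)   # stand-in for a long-tailed posterior sample
sorted_x = np.sort(x)
probs = np.arange(0.5, len(sorted_x)) / len(sorted_x)

u = interp1d(sorted_x, probs)(x)                  # empirical CDF: data -> U(0,1)
z = norm.ppf(u)                                   # U(0,1) -> N(0,1), ready for the MvNormal step
x_back = interp1d(probs, sorted_x)(norm.cdf(z))   # inverse path recovers the original marginal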

perrette commented 3 months ago

And here the more general case:


import numpy as np
import arviz
from scipy.interpolate import interp1d, InterpolatedUnivariateSpline
from scipy.stats import norm, lognorm
import pytensor.tensor as pt
import pymc as pm
from pymc.distributions.continuous import Interpolated, SplineWrapper

def compute_empirical_cdf(data):
    """Compute the empirical CDF of the given data
    """
    sorted_data = np.sort(data)
    cdf = np.arange(.5, len(sorted_data)) / len(sorted_data)
    # TODO: add the end points 0 and 1
    return sorted_data, cdf

def define_empirical_params_from_trace(trace, names, label):
    """Define prior parameters from existing trace

    (must be called from pymc.Model context)

    Parameters
    ----------
    trace: pymc.InferenceData from a previous sampling
    names: parameter names to be redefined in a correlated manner
    label: used to add a "{label}_coparams" dimension to the model, and name a new "{label}_iid" i.i.d. Normal distribution used to generate the prior

    Returns
    -------
    list of named tensors defined in the function

    Aim
    ---
    Conserve both the marginal distribution and the covariance across parameters

    Method
    ------
    The specified parameters are extracted from the trace and resampled using normalization 
    of the marginal distribution to N(0,1) and a MvNormal distribution for the joint posterior.

    The parameters are subsequently transformed back:
    - Normal's cdf to transform from N(0,1) to U(0,1)
    - the inverse CDF of the interpolated distribution is applied from U(0, 1) to the original (empirical) distribution
    """

    # normalize the parameters so that their marginal distribution follows N(0, 1)
    params = arviz.extract(trace.posterior[names])

    fits = []

    for name in names:
        # find distribution parameters and normalize the variable
        x = params[name].values

        # First transform the marginal distribution to a standard normal
        sorted_data, sorted_cdf = compute_empirical_cdf(x)

        # Interpolate the inverse CDF
        # out-of-bounds values are clamped to the CDF end values
        cdf = interp1d(sorted_data, sorted_cdf, bounds_error=False, fill_value=(sorted_cdf[0], sorted_cdf[-1]))

        # Transform the data to a standard normal 
        x_uniform = cdf(x) # U(0,1) from the empirical distribution
        x_normal = norm.ppf(x_uniform) # N(0,1)

        # Replace the original value (in-place)
        x[:] = x_normal

        # Create the inverse CDF function with tensor values (thanks to SplineWrapper)
        icdf = SplineWrapper(InterpolatedUnivariateSpline(sorted_cdf, sorted_data, k=1, ext="const"))

        fits.append(icdf)

    # add a model dimension 
    model = pm.modelcontext(None)
    dim = f'{label}_coparams'
    if dim not in model.coords:
        model.add_coord(dim, names)

    # Define a MvNormal
    cov = np.cov([params[name] for name in names], rowvar=True)
    chol = np.linalg.cholesky(cov)
    coparams = pm.Normal(label+'_iid', dims=dim)
    mv = chol @ coparams

    # Transform back to the empirical distribution
    named_tensors = []
    for i, (name, fitted) in enumerate(zip(names, fits)):
        icdf = fitted
        uniform_tensor = pm.math.exp(pm.logcdf(pm.Normal.dist(), mv[i]))  # Normal's cdf (IS THERE ANYTHING BETTER THAN THIS?)
        tensor = icdf(uniform_tensor)
        named_tensors.append(pm.Deterministic(name, tensor))

    return named_tensors

This seems to work, and I believe it can be very useful to make sequential Bayesian constraining a reality in pymc, for any posterior distribution and accounting for cross-correlations. However, improvements for accuracy (add the 0 and 1 end points to the empirical quantiles) and possible optimizations may be needed (exp(logcdf(...)) is not very elegant).
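
As a possible fix for the end points, a sketch (extending the sample range by a small `eps` is just one choice, not something settled here):

import numpy as np

def compute_empirical_cdf_with_endpoints(data, eps=1e-9):
    """Empirical CDF anchored at exactly 0 and 1 just outside the sample range,
    so the interpolated inverse CDF is defined on the whole (0, 1) interval."""
    sorted_data = np.sort(data)
    cdf = np.arange(0.5, len(sorted_data)) / len(sorted_data)
    span = sorted_data[-1] - sorted_data[0]
    lo = sorted_data[0] - eps * span   # slightly below the smallest sample
    hi = sorted_data[-1] + eps * span  # slightly above the largest sample
    return (np.concatenate([[lo], sorted_data, [hi]]),
            np.concatenate([[0.0], cdf, [1.0]]))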

ferrine commented 3 months ago

The approach looks very interesting. It resonates with the idea of having an additional postprocessing step for the trace before the MvNormal distribution is approximated.

after the trace has been transformed and flattened

https://github.com/pymc-devs/pymc-experimental/blob/946d855211291ecb60ab252d4a263b32b8687888/pymc_experimental/utils/prior.py#L104

Indeed, normalization of the marginals makes a lot of sense, since we are going to compute the Cholesky factor of the covariance for the MvNormal.

The transform of the posterior as described in this issue can happen here,

https://github.com/pymc-devs/pymc-experimental/blob/946d855211291ecb60ab252d4a263b32b8687888/pymc_experimental/utils/prior.py#L197-L198

right after the flattening step. Some additional information may need to be stored to make the backward transform possible. As pseudocode, I imagine it could look like this:

def prior_from_idata(
    idata: arviz.InferenceData,
    name="trace_prior_",
    *,
    var_names: Sequence[str] = (),
    **kwargs: Union[ParamCfg, Transform, str, Tuple]
) -> Dict[str, pt.TensorVariable]:
    param_cfg = _parse_args(var_names=var_names, **kwargs)
    if not param_cfg:
        return {}
    flat_info = _flatten(idata, **param_cfg)
    normalized_flat_info = ???
    return _mvn_prior_from_normalized_flat_info(name, normalized_flat_info)

However, this is somewhat in conflict with what Transform is doing there.

So far the transform there is simple, but a learnable one might be better suited.

https://github.com/pymc-devs/pymc-experimental/blob/946d855211291ecb60ab252d4a263b32b8687888/pymc_experimental/utils/prior.py#L93-L95

The code here is very simple and assumes you are passing transform bijectors from pymc. https://github.com/pymc-devs/pymc-experimental/blob/946d855211291ecb60ab252d4a263b32b8687888/pymc_experimental/utils/prior.py#L22C1-L22C46

If a stateful transform is an option, it could look like this:

import abc

import numpy as np
import pymc

class StatefulTransform(pymc.logprob.transforms.Transform):
    @abc.abstractmethod
    def fit(self, data: np.ndarray):
        ...

This could initialize some constants to be used in forward and backward:

if cfg["transform"] is not None:
    # some transforms need original shape
    if hasattr(cfg["transform"], "fit"):
        cfg["transform"].fit(data)
    data = cfg["transform"].forward(data).eval()

The same transform is reused in the backward pass, so you can safely assume the correct state is used to create the prior from the posterior. Only minor changes are needed to the existing codebase; new transforms that have this feature would be a nice addition.
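
For what it's worth, a rough sketch of what such a stateful transform could look like for the empirical-CDF idea above (class and attribute names are made up, and the exact forward/backward signatures should be double-checked against the Transform class in the pymc version in use):

import numpy as np
import pytensor.tensor as pt
from scipy.interpolate import InterpolatedUnivariateSpline
from pymc.distributions.continuous import SplineWrapper
from pymc.logprob.transforms import Transform

class EmpiricalCopulaTransform(Transform):
    """Map data to N(0, 1) through its empirical CDF (forward) and back (backward)."""

    name = "empirical_copula"

    def fit(self, data: np.ndarray):
        # remember the empirical quantiles seen during fitting
        sorted_data = np.sort(np.asarray(data).ravel())
        cdf = np.arange(0.5, len(sorted_data)) / len(sorted_data)
        self._cdf = SplineWrapper(InterpolatedUnivariateSpline(sorted_data, cdf, k=1, ext="const"))
        self._icdf = SplineWrapper(InterpolatedUnivariateSpline(cdf, sorted_data, k=1, ext="const"))

    def forward(self, value, *inputs):
        # data -> U(0, 1) via the fitted empirical CDF, then -> N(0, 1) via the normal quantile function
        u = self._cdf(value)
        return pt.sqrt(2.0) * pt.erfinv(2.0 * u - 1.0)

    def backward(self, value, *inputs):
        # N(0, 1) -> U(0, 1) via the normal CDF, then back through the empirical inverse CDF
        u = 0.5 * (1.0 + pt.erf(value / pt.sqrt(2.0)))
        return self._icdf(u)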

perrette commented 3 months ago

Thanks for paying attention and for laying out where things could go. Unfortunately, at the moment I don't have the resources (time) to invest in a proper implementation (I deleted my message from yesterday saying that by mistake).

One more thing: I have now performed a real-world experiment, and it took so long to run that it makes the approach useless for me (it took much longer than doing everything in one go, despite the multi-step approach requiring fewer calculations in total in my application). I'll run another experiment with more resources, at least for the sake of validating the results, but the performance issue is crippling (timeout at 11 hours with 5 h left to go, versus 5 h 30 for the original one-step approach; that's surprising because when I used the log-normal approximation posted some time ago it took less time, about 3 h). I'm not sure whether I'll dedicate time to investigating what's going on, but I wanted to stress the possible performance issues. What I envisage here is to carefully select which parameter needs which approximation: e.g. use the normal distribution by default, parametric distributions for things that look like them (for now I only implemented log-normal, as originally posted, because of its proximity to the normal distribution), and empirical distributions as above when there is no good match.
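
In code, the kind of per-parameter selection I have in mind could look something like this (purely hypothetical signature, nothing of the sort is implemented):

# hypothetical: pick an approximation per parameter, defaulting to a plain normal fit
named = define_params_from_trace(
    trace, names=["a", "b", "c"], label="prior",
    dist_map={"b": "lognormal", "c": "empirical"},
)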

perrette commented 3 months ago

I could check in my real-world use case that the approach works, which is the satisfying part. Unfortunately, further checks on better machines still yield a longer sampling time for the sequential sampling than for sampling everything together. There was a speed-up (but inaccurate results) when using the normal and log-normal approximations, so the following lines (a rewrite of the code above) must be responsible for the slow sampling:

normal_tensor = mv[i]
uniform_tensor = pm.math.exp(pm.logcdf(pm.Normal.dist(), normal_tensor))
tensor = SplineWrapper(InterpolatedUnivariateSpline(sorted_cdf, sorted_data, k=1, ext="const"))(uniform_tensor)

Perhaps the underlying gradient calculation is not great?

I wonder if a Kernel Density Estimate (which is smoother) could yield faster results. That would be easier to calculate the inverse cdf from, wouldn't it? @ricardoV94 @ferrine

perrette commented 3 months ago

The screenshot is from the KDE article on Wikipedia:

[screenshot: kernel density estimation illustration from Wikipedia]

One way forward would be to use statsmodels (or another tool such as scipy, I don't know) to fit the posterior from the trace with a normal kernel, then retrieve the individual kernels and build a custom distribution in pymc that is the sum of a set of Normal distributions. The missing link here is the inverse CDF. It's straightforward for a Normal distribution, but is it for a sum of Normals? The whole point of the approach would be to have something more efficient than the code above with SplineWrapper.
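
For reference, a sketch of how the mixture CDF could be tabulated from a Gaussian KDE (here scipy's gaussian_kde; statsmodels would be similar). There is no closed-form inverse for a sum of Normals, but the CDF is monotonic, so a tabulated/spline inverse like the one above still works; near-flat CDF values in the tails may need deduplication before spline fitting:

import numpy as np
from scipy.stats import gaussian_kde, norm

def kde_cdf_table(x, gridsize=512):
    """Tabulate the CDF of a Gaussian KDE fitted to `x` on a regular grid.

    The mixture CDF is the mean of the kernels' normal CDFs; the returned
    (cdf, grid) pair can feed the same spline-based inverse CDF as above,
    e.g. InterpolatedUnivariateSpline(cdf, grid, k=1, ext="const").
    """
    kde = gaussian_kde(x)
    bw = np.sqrt(kde.covariance[0, 0])                 # kernel standard deviation
    grid = np.linspace(x.min() - 4 * bw, x.max() + 4 * bw, gridsize)
    cdf = norm.cdf((grid[:, None] - kde.dataset[0]) / bw).mean(axis=1)
    return cdf, grid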

ferrine commented 3 months ago

I would also look into the Box-Cox transform and related ones.

ricardoV94 commented 3 months ago

Can the spline be tweaked to have fewer knots? We can probably just write it directly in PyTensor without a wrapper Op, which should be faster. Same for the KDE.
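
One cheap option would be to build the spline from a fixed set of quantiles of the trace instead of from every sample, something like this (a sketch; the number of knots is arbitrary):

import numpy as np

def thin_empirical_cdf(x, n_knots=50):
    """Evenly spaced quantiles to use as spline knots instead of every sample."""
    probs = np.linspace(0.0, 1.0, n_knots)
    knots = np.quantile(x, probs)
    # feed these to InterpolatedUnivariateSpline(probs, knots, k=1, ext="const")
    return probs, knots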

ferrine commented 2 months ago

There is already one implemented https://github.com/pymc-devs/pymc-experimental/blob/main/pymc_experimental/utils/spline.py