sktime / skpro

A unified framework for tabular probabilistic regression and probability distributions in python
https://skpro.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License
231 stars 45 forks source link

Intervals/quantiles can be negative for models that can only make non-negative predictions #422

Open fsaforo1 opened 1 month ago

fsaforo1 commented 1 month ago

For models with log-link like Tweedie, Poisson, etc., that are constrained to only make non-negative predictions, one would expect the intervals to be non-negative as well. Below is a reproducible example.

#pip install tweedie

from tweedie import tweedie
import numpy as np
import pandas as pd

TWEEDIE_VARIANCE_POWER = 1.5
TWEEDIE_DISPERSION_PARAMETER = 20

def simulate_tweedie_glm_data():
    # Number of parameters for model
    P = 10

    N = 5000

    SEED = 2022
    rng = np.random.RandomState(SEED)
    X = rng.rand(N, P - 1)
    X = np.hstack((np.ones((N, 1)), X))
    beta = np.concatenate(([370], rng.randint(-10, 200, P - 1))) / 100
    eta = np.dot(X, beta)
    mu = np.exp(eta)

    tweedie_variance_power = TWEEDIE_VARIANCE_POWER
    dispersion_parameter = TWEEDIE_DISPERSION_PARAMETER

    y = tweedie(mu=mu, p=tweedie_variance_power, phi=dispersion_parameter).rvs(N, random_state = SEED)
    X = pd.DataFrame(X)
    X.columns = ['intcpt' if i ==0 else 'feat_' + str(i) for i in range(P)]
    X.drop('intcpt', axis=1, inplace=True)

    return X, y

X, y = simulate_tweedie_glm_data()

(y<0).mean()
# 0.0

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import TweedieRegressor
from skpro.regression.residual import ResidualDouble

reg_mean = TweedieRegressor(power=TWEEDIE_VARIANCE_POWER)
reg_resid = RandomForestRegressor()
reg_proba = ResidualDouble(reg_mean, reg_resid)
reg_proba.fit(X, y)

y_pred_quantiles = reg_proba.predict_quantiles(X, alpha=[0.05, 0.5, 0.95])

(y_pred_quantiles < 0).mean(axis=0)

#array([0.1868, 0.    , 0.    ])
fkiraly commented 1 month ago

@fsaforo1, this is expected because the model you are constructing actually has a normal predictive distribution, with reg_mean predicting the variance, and reg_resid the residual, so it wil not have non-negative support.

If you want the predictive distribution to be Tweedie, and hence non-negative, the two closest things are:

  1. using the GLMRegressor, this has links which are sub-cases of Tweedie (but not arbitrary Tweedie)

  2. using a Tweedie distribution as the base distribution, for distr_type

A general Tweedie distribution is not yet implemented, but if you'd want to make a PR, that would be appreciated! I can then check that it works with the ResidualDouble strategy. This PR is adding another distribution, you could take it as a template: https://github.com/sktime/skpro/pull/421

Also, if you can describe the model that you would like to fit, I can also see whether ResidualDouble needs to be extended, e.g., with parameter transformations.

fkiraly commented 1 month ago

I'm also a bit confused by the sklearn Tweedie regressor.

How does one retrieve the two parameters from the fitted Tweedie distribution? Namely, mean and scale.

fkiraly commented 1 month ago

I opened an issue for interfacing the Tweedie regressor here: https://github.com/sktime/skpro/issues/423

I looked at it briefly, and do not know how to get the variance from it. Do you have an idea, @fsaforo1? If so, a reply on #423 would be appreciated!