statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

ENH: conformal, distribution-free, nonparametric prediction intervals #9005

Open josef-pkt opened 11 months ago

josef-pkt commented 11 months ago

This looks like a recent hot topic mainly for machine learning.

Basic idea: use calibration data, separate from estimation/training data, to estimate quantiles and prediction sets or intervals for new observations.

simplest case: In regression with an additive residual, we can add/subtract a quantile of the calibration residuals to get the conformal prediction interval (several alternatives for splitting the data: ex-ante split, jackknife/LOO, k-fold cross-validation, ...)
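The split-conformal recipe above can be sketched with plain numpy (a minimal illustration on simulated data; the finite-sample adjustment follows the standard ceil((n_cal + 1)(1 - alpha)) rule for the calibration quantile):

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate y = 1 + 2*x + noise and split into training and calibration halves
n = 400
x = rng.uniform(0, 1, n)
y = 1 + 2 * x + rng.normal(scale=0.5, size=n)
x_train, x_cal = x[:200], x[200:]
y_train, y_cal = y[:200], y[200:]

# fit on the training half (plain least squares)
beta = np.polyfit(x_train, y_train, 1)
resid_cal = y_cal - np.polyval(beta, x_cal)

# split-conformal correction: empirical quantile of |calibration residuals|
# at the finite-sample-adjusted level ceil((n_cal + 1)(1 - alpha)) / n_cal
alpha = 0.1
n_cal = len(resid_cal)
k = int(np.ceil((n_cal + 1) * (1 - alpha)))
q = np.sort(np.abs(resid_cal))[k - 1]

# symmetric prediction interval for a new observation x_new
x_new = 0.5
y_hat = np.polyval(beta, x_new)
lower, upper = y_hat - q, y_hat + q
```

The guarantee attached to this interval is the marginal one discussed below: coverage on average over new (y, x) draws, not conditional on x.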

mapie seems to be the main Python package for this: https://mapie.readthedocs.io/en/latest/theoretical_description_regression.html

Lei, Jing, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. “Distribution-Free Predictive Inference for Regression.” Journal of the American Statistical Association 113, no. 523 (July 3, 2018): 1094–1111. https://doi.org/10.1080/01621459.2017.1307116.

large number of references (too many to figure out for now which would be useful for statsmodels)

limitation: It looks like the coverage "guarantee" holds only on average, i.e. marginal coverage (over the sample of exog, similar to ATE); there is no distribution-free "guarantee" for coverage conditional on a specific exog value.

Foygel Barber, Rina, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. “The Limits of Distribution-Free Conditional Predictive Inference.” Information and Inference: A Journal of the IMA 10, no. 2 (June 15, 2021): 455–82. https://doi.org/10.1093/imaiai/iaaa017.

Related: I saw some articles (abstracts) that include heteroscedasticity (variance as a function of exog). More of the literature looks at time series, i.e. without independence.

Count data: I did not find any references for conformal prediction intervals with count data (based on a Google Scholar search). This might be related to the above "limits" of conformal prediction intervals: we only have coverage with respect to the marginal distribution. I guess the main problem in getting a prediction interval conditional on x would be finding a ranking statistic (prediction "score" function) that is distribution-free, i.e. does not depend on higher moments that vary with exog through, e.g., the mean.

I guess we can still get conditional (on x) prediction intervals under stronger assumptions, e.g. assuming a regression model with additive i.i.d. residuals. That would still be nonparametric in the residual distribution and better than the current prediction interval based on normally or t-distributed residuals.

Aside: There is a literature for adaptive local conformal prediction (or something like that) but I did not even skim those.

Related: diagnostics for this? For example:

- homogeneity, i.i.d. assumption (e.g. subsample homogeneity, Hosmer-Lemeshow type of test)
- does the calibration set exog reflect the population or the sample exog?
- quantile regression on residuals (to see whether the quantiles depend on exog)
- compare the conformal prediction interval with the parametric interval to see whether the latter is "good enough"

clarification: in the (general) regression setting, (y, x) is assumed i.i.d. and the coverage statement is over the probability space of (y, x), not for the conditional distribution (y | x).

So far we want the much stronger conditional prediction interval for y | x.

josef-pkt commented 11 months ago

This reference might be useful: mentions PIT (probability integral transform) in the abstract

Chernozhukov, Victor, Kaspar Wüthrich, and Yinchu Zhu. “Distributional Conformal Prediction.” Proceedings of the National Academy of Sciences 118, no. 48 (November 30, 2021): e2107794118. https://doi.org/10.1073/pnas.2107794118.

Aside: for basic control charts we assume constant parameters (mean, ...). If that holds, the i.i.d. assumption works and we can use the standard conformal prediction interval for control limits. (That's similar to in-sample nonparametric quantiles of residuals, but here computed on the calibration sample.)

Aside 2: conformal prediction for subsets of observations (no references, just analogy). For count diagnostics and results interpretation, I split the sample by a binary variable (e.g. gender) to check whether the predictive distribution varies with the levels of the categorical variable. (Maybe this does not really help for conformal prediction intervals with count data, except possibly as a diagnostic. If the (categorical) variable has an effect on the mean, then it will also affect the conditional prediction interval limits.)

josef-pkt commented 11 months ago

another possibility: jackknife+ as in MAPIE.

We can get the leave-one-out (LOO) residuals from the outlier-influence classes without an explicit loop. For linear models we get them directly; for nonlinear models we have the one-step approximation to d_params as a faster version (although, as above, our discrete nonlinear models do not have an i.i.d. error term / residuals).

Barber, Rina Foygel, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. “Predictive Inference with the Jackknife+.” The Annals of Statistics 49, no. 1 (February 2021): 486–507. https://doi.org/10.1214/20-AOS1965.

josef-pkt commented 11 months ago

link to treatment effects: individual (conditional) treatment effect (ITE) versus average (marginal) effects (ATE, ATT)

Lei, Lihua, and Emmanuel J. Candès. “Conformal Inference of Counterfactuals and Individual Treatment Effects.” Journal of the Royal Statistical Society Series B: Statistical Methodology 83, no. 5 (November 1, 2021): 911–38. https://doi.org/10.1111/rssb.12445.

Yin, Mingzhang, Claudia Shi, Yixin Wang, and David M. Blei. “Conformal Sensitivity Analysis for Individual Treatment Effects.” Journal of the American Statistical Association 0, no. 0 (2022): 1–14. https://doi.org/10.1080/01621459.2022.2102503.

josef-pkt commented 11 months ago

calibrated conditional prediction intervals:

Lei, Jing, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. “Distribution-Free Predictive Inference for Regression.” Journal of the American Statistical Association 113, no. 523 (July 3, 2018): 1094–1111. (reference in initial comment above)

Section 5.2, locally weighted conformal inference, looks at heteroscedasticity where the scale is a function of x. It uses scaled (Pearson) residuals as the score function, with the scale function estimated by MAD. This provides prediction intervals that vary with x, but the guarantee is still for average coverage (over the probability space of the sample and the new (y, x)).

This means we can get better conditional prediction intervals (y | x) and still have correct average coverage. (*1) We can calibrate a prediction interval function for correct average coverage even if we don't have the "true" score function and the "true" shape of the prediction intervals for conditional prediction (y | x). But we do get the correct intervals if the underlying model specification and distributional assumption are correct.

e.g. GLM, Poisson with large(r) mean: use scaled residuals (Pearson residuals divided by the scale, i.e. including excess dispersion, or an empirical variance function) in the calibration score function. This would correct for the variance as a function of the mean, but not for higher (asymmetric?) moments such as skew.

This would be like making prediction intervals robust to model misspecification, while still having good properties if the model is correctly specified.

related: asymmetry, skew. Barber, Rina Foygel, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. “Predictive Inference with the Jackknife+.” The Annals of Statistics 49, no. 1 (February 2021): 486–507. (reference in 3rd comment above)

Appendix A: asymmetric prediction intervals with jackknife+ and CV+ use separate lower and upper limits/quantiles for the prediction interval instead of limits based on absolute values of residuals.
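The asymmetric correction can be sketched with separate lower and upper calibration quantiles (simulated right-skewed residuals; calibrating each tail at level alpha/2 is one possible choice, as in the equal-tail variant):

```python
import numpy as np

rng = np.random.default_rng(0)
# skewed calibration residuals (exponential, centered at its mean)
resid_cal = rng.exponential(1.0, 500) - 1.0

alpha = 0.1
n_cal = len(resid_cal)
# separate lower and upper corrections instead of one |residual| quantile,
# each tail calibrated at level alpha / 2
k = int(np.ceil((n_cal + 1) * (1 - alpha / 2)))
q_lo = np.sort(-resid_cal)[k - 1]  # correction below the point prediction
q_hi = np.sort(resid_cal)[k - 1]   # correction above the point prediction

y_hat = 3.0  # hypothetical point prediction
interval = (y_hat - q_lo, y_hat + q_hi)
```

With right-skewed residuals the upper correction q_hi comes out larger than q_lo, so the interval extends further above the point prediction than below it.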

transformation of endog: also in Barber et al. (2021), section 7.2, the application to real data uses a log(1 + y) transformation because both datasets are highly skewed. Using a nonlinear transformation of endog and transforming the interval endpoints back loses E(y | x) as the prediction target, but we may still get good quantiles (median and interval limits) in the prediction.

update (*1): Chernozhukov et al. 2021, "Distributional Conformal Prediction", establish

1) asymptotic conditional validity under consistent estimation of the conditional CDF
2) unconditional validity under model misspecification:
   - finite-sample validity with i.i.d. (or exchangeable) data
   - asymptotic validity with time series data

josef-pkt commented 11 months ago

getting closer to something useful for conditional prediction intervals (based on partial skimming)

quantiles and predictive distribution Fhat(y | x)

Romano et al., conformalized quantile regression (CQR), needs a predicted quantile function (this could be implied by a predictive distribution instead of quantile regression). It adjusts the estimated quantile function to have average coverage in the calibration sample. Reference found through https://mindfulmodeler.substack.com/p/week-3-conformal-prediction-for-regression#%C2%A7conformalized-quantile-regression

Chernozhukov et al. use the PIT, which sounds like the same idea.

Candès et al. have something similar applied to the conditional survival distribution.

Chernozhukov, Victor, Kaspar Wüthrich, and Yinchu Zhu. “Distributional Conformal Prediction.” Proceedings of the National Academy of Sciences 118, no. 48 (November 30, 2021): e2107794118. https://doi.org/10.1073/pnas.2107794118.

Romano, Yaniv, Evan Patterson, and Emmanuel Candes. “Conformalized Quantile Regression.” In Advances in Neural Information Processing Systems, Vol. 32. Curran Associates, Inc., 2019. https://proceedings.neurips.cc/paper_files/paper/2019/hash/5103c3584b063c431bd1268e9b5e76fb-Abstract.html.

Candès, Emmanuel, Lihua Lei, and Zhimei Ren. “Conformalized Survival Analysis.” Journal of the Royal Statistical Society Series B: Statistical Methodology 85, no. 1 (February 1, 2023): 24–45. https://doi.org/10.1093/jrsssb/qkac004.

related: we already have several issues and some code using the PIT for diagnostics (on the estimation sample), e.g.

#7153 quantile residuals

#7873 add which="cdf" to predict

note: the predicted cdf and quantiles ignore estimation uncertainty (confidence intervals for cdf, ppf are not, or only partially, available). Calibration with conformal prediction would improve coverage of the prediction intervals to take account of modeling uncertainty and distributional misspecification.

#6979 tolerance intervals

josef-pkt commented 11 months ago

related by topic but a separate literature (AFAIR)

calibration of predictions to a new dataset: a brief search of issues only finds it for the binary endog case, e.g. #6430 (recalibrate the predicted conditional probability of an endog event, AFAIR)

josef-pkt commented 11 months ago

rough plan (AFAIU so far)

What's the overlap with other packages like mapie? What can statsmodels do that they cannot? e.g. integrated jackknife+ without an explicit loop, PIT.

update: possible API. A results method get_conformal_predictor takes a calibration dataset (unless the method is jackknife) plus options for the method (e.g. mean, quantiles, ...) and sub-options within the main conformalizing method. It returns an instance of a class that holds the calibration values (interval limit corrections) and delegates prediction (and get_distribution) tasks to the results and model instances.

First step only for which="mean", but we could add conformalized prediction intervals also for the other which options available in predict and get_prediction.
(Maybe which, when available, needs to be a get_conformal_predictor argument, so it is fixed in the calibrated predict class.)

josef-pkt commented 11 months ago

detail: tail probabilities

Chernozhukov et al 2021 "Distributional conformal prediction" have "optimal DCP" prediction intervals, which are essentially minlike (for skewed distributions)

Their calibration score function uses the absolute value of the PIT values u = F(y | x), which are uniform on [0, 1] if F is consistently estimated ([0, 0.5] for the absolute values). The score function is |u - 0.5| in base DCP, with a shifted "center" in optimal DCP.

They do not mention or look at the asymmetric case directly, i.e. separate score functions for the lower and upper limits of the prediction interval.

My guess: for equal-tail (alpha/2) prediction intervals it would be better to use two separate, "asymmetric" corrections instead of the absolute-value score function. Under correct specification we have a consistent estimate of F(y | x) and the rank/PIT values are uniformly distributed, but under misspecification that will not be the case, and separate equal-tail interval limits should be better. related: one-sided prediction intervals (which I have not seen yet in the references I skimmed). topic: equal-tail versus minlike/shortest intervals for confidence intervals in statsmodels.stats.

aside: Chernozhukov et al. assume a continuous distribution. My guess is that it extends to discrete distributions, but we only get weak inequalities in coverage: coverage >= 1 - alpha, but often it will be strictly greater because of discreteness.

update: Romano et al. (2019), conformalized quantile regression, Theorem 2 has separate lower and upper limits in the calibration correction, i.e. two one-sided quantiles/tails.

josef-pkt commented 11 months ago

another application: prediction intervals in underdetermined models, where parameters are not identified, e.g. k_params > nobs and estimation by pinv, penalization, or feature selection (e.g. sure independence screening)

If parameters are not identified, we do not get inference on the parameters (cov_params, ...). However, all parameter vectors in the optimal set lead to the same prediction and the same predictive distribution (if we don't have or pick nuisance parameters like the scale). Using conformal prediction we can still get valid prediction intervals; using conformalized distributional prediction, we can also get approximate conditional prediction intervals.

problem: scale estimation. In-sample residuals are zero, so the in-sample scale estimate is zero. Instead we can use out-of-sample residuals, e.g. from either the jackknife or a calibration set, to estimate the scale for the predictive distribution. (In a discrete one-parameter family like Poisson we don't have the scale as a nuisance or extra distribution parameter.)

This gets closer to machine learning and the popularity of conformal prediction there.

This means we should make e.g. jackknife/LOO residuals and scale (based on the one-step approximation or linear dparams) a more prominent feature in the results classes.