statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

SUMM: prediction results (outside tsa) #2150

Open josef-pkt opened 9 years ago

josef-pkt commented 9 years ago

Bringing together notes for different cases of prediction results. This does not include conditional forecasting as in TSA, but includes non-linear and non-normal models.

Example, the simplest case, OLS: predict the mean, a confidence interval for the predicted mean, and a confidence interval for predicting a new observation, see #719. This extends to WLS with weights or an estimated variance function, HetGLS or a (non-existing) HetMLE.

issue for GLM #938

models, cases

problem: convolution

We have two sources of uncertainty, one coming from parameter uncertainty, one coming from the assumed distribution of endog or residual. It will be relatively straightforward to treat them separately, but the convolution is difficult to compute except in the simple normal model with additive error.

This will cause a difference in the implementation/API pattern between the linear model and all other models: the RegressionResults prediction interval for new observations is calculated as a convolution, but we won't have it in other models. We can get it by Monte Carlo or bootstrap for other models, but those should be separate methods since they are not cheap. (I ran some experimental functions for this for the Poisson case.)

Implementation

for some interface comment see #719

In terms of internal implementation, confidence intervals for new observations are different; inference and confidence intervals for the mean prediction are the same as testing, e.g. t_test can calculate everything for predict_mean.

Stata has model-specific predict, plus lincom, nlcom and test, testnl.

In terms of algorithm, we have three cases for inference on a function of the parameters

(2) is the main tool for prediction in GLM and discrete models with a one-parameter LEF (if the variance is estimated, it is still asymptotically uncorrelated with the mean, block-diagonal information matrix; I guess NB1 and BetaRegression won't fit into this.)

other

in-sample versus out-of-sample: the main difference here is that we have a corrected residual estimate; in-sample is currently mostly covered by influence_outliers

josef-pkt commented 9 years ago

Kerby's comment on extensions and requirements for prediction https://github.com/statsmodels/statsmodels/pull/2151#issuecomment-73418785

josef-pkt commented 9 years ago

One of Kerby's comments

" Multiple testing: this is basically a convenience method for doing lots of tests/intervals, so simultaneity issues are relevant. Someone could just take the p-values and feed them through a multitest procedure. But since the tests/intervals all come from the same model they are likely to be quite correlated, so approaches like Scheffe (when applicable) should be less conservative. "

While looking at the difference between confidence interval for mean and confidence interval for observations, I think I figured out what was bothering me in #2172 about multiple testing and Scheffe:

In a parametric setting with a fixed finite number of parameters, the distributions of the mean predictions are perfectly correlated: they are all just functions of the same few parameters. So, I think we don't need a multiple testing correction for each prediction. If we want to predict observations, then each observation includes separate noise, which is often independent across observations, and in that case we would have a largely or partially uncorrelated prediction/hypothesis for each additional observation.
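The trade-off behind the Scheffe remark can be made concrete by comparing critical values. Scheffe's simultaneous band depends only on the number of parameters k, not on how many points are tested, while Bonferroni grows with the number m of tests, so for many correlated mean predictions Scheffe can be less conservative (the numbers k, df, m below are arbitrary illustration choices):

```python
import numpy as np
from scipy import stats

k, df, alpha, m = 2, 50, 0.05, 20
t_point = stats.t.ppf(1 - alpha / 2, df)              # pointwise interval
t_bonf = stats.t.ppf(1 - alpha / (2 * m), df)         # Bonferroni for m intervals
scheffe = np.sqrt(k * stats.f.ppf(1 - alpha, k, df))  # simultaneous, any number of points

# simultaneity always costs something over a pointwise interval,
# but with many tests Scheffe undercuts Bonferroni
assert t_point < scheffe < t_bonf
```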

example: I was thinking about adding a test for the prediction of a new observation in #2151, which would be similar to the outlier test for in-sample observations, where we do need the multiple testing correction.

(I still haven't read the articles mentioned in #2172, but I was partially catching up on Scheffe in general for multiple testing of many parameters.)

josef-pkt commented 9 years ago

and maybe I'm still wrong and need to "debug" my intuition: all pairwise comparisons and similar procedures use a multiple testing correction based on the number of comparisons, not on the number of underlying parameters.

josef-pkt commented 9 years ago

Ok, I was wrong. The analogy is to supremum tests, where we check the worst case: if we don't reject the worst case, then we also don't reject any of the other cases, which limits the familywise type 1 error rate. I'm still a bit vague on this, and I skipped the literature on gate-keeping multiple testing procedures.

josef-pkt commented 5 years ago

random find

https://www.reddit.com/r/datascience/comments/bu3kr3/from_academia_to_real_world_how_do_you_present/

"If you have a full model, I will often make hypothetical people and show how the predicted risk changes. Eg if we have two people one who is 50 and one who is 65 (and the same on all other variables), the first person might have a predicted risk of 50% and the second person have a predicted risk of 90%. This puts everything into very intuitive numbers" comment by jlienert