statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

data handling for multiple equation models #940

Open josef-pkt opened 11 years ago

josef-pkt commented 11 years ago

The data handling in the __init__ of base.Model is oriented entirely towards single-equation models: it assumes a single endog and a single exog.

System-of-equations models, and versions of discrete choice models that use a different exog for each choice, are better served by multiple exog and, in the former case, possibly by multiple endog.

Possibilities, including ones that avoid rewriting handle_data to allow for lists/tuples of endog and exog:

0) rewrite handle_data to handle tuples of exog
1) call handle_data multiple times and store the results as a list in the model.data attribute
2) instantiate several base model classes that work as stub models, mainly to hold the data and meta-information across equations
3) possible in some cases: merge all exog into one big exog (same for endog if it applies; not sure if handle_data can handle 2d endog)
4) ignore exog and use data (argument) directly, and map equations to data columns by a dictionary (this seems to be what I had in mind for nested logit). Or maybe better, use the exog argument, but just as a collection of data variables without any direct relation to the actual design matrices for each equation.

Option 3) is the simplest solution where it can be used, for example with panel data.
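A minimal sketch of what merging could look like, assuming all exog have the same number of rows. `merge_exogs` is a hypothetical helper, not anything in statsmodels; it stacks the per-equation exog column-wise and keeps one column slice per equation, so the model can recover each design matrix after the single-exog data handling has run.

```python
import numpy as np

def merge_exogs(exogs):
    """Merge per-equation exog arrays column-wise into one 2d array.

    Hypothetical helper (not part of statsmodels). Returns the merged
    array and one column slice per equation, so that each equation's
    design matrix can be recovered from the merged array.
    """
    exogs = [np.asarray(x, dtype=float) for x in exogs]
    nobs = exogs[0].shape[0]
    if any(x.ndim != 2 or x.shape[0] != nobs for x in exogs):
        raise ValueError("all exog must be 2d with the same number of rows")
    slices, start = [], 0
    for x in exogs:
        stop = start + x.shape[1]
        slices.append(slice(start, stop))
        start = stop
    return np.hstack(exogs), slices

# Two equations with a different number of regressors each.
rng = np.random.default_rng(0)
exog_eq1 = np.column_stack([np.ones(5), rng.normal(size=5)])
exog_eq2 = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])

big_exog, eq_slices = merge_exogs([exog_eq1, exog_eq2])

# The model would split the merged array again when it needs
# the per-equation design matrices.
recovered = [big_exog[:, s] for s in eq_slices]
```

The merged array is what would be passed as the single exog; the slices are extra bookkeeping that the model itself would have to carry around.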

1) Calling handle_data several times: this looks pretty flexible. It requires including endog each time, which would be redundant in the conditional logit case. Also, none of the generic support code would know how to handle several data elements in a model.

The model would be fully responsible for handling the list of data instances for supporting code.

One extra problem is that missing-value handling would in most cases need to drop observations across calls to handle_data. (It might be possible by giving all other exog to the first call to handle_data as extra arrays.)
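To illustrate the problem, here is a rough sketch of joint missing-value handling done by hand, outside of handle_data. `drop_missing_jointly` is a made-up name; the point is only that the row mask has to be computed over all arrays before any of them is trimmed.

```python
import numpy as np

def drop_missing_jointly(*arrays):
    """Drop every row that has a NaN in any of the given arrays.

    Hypothetical helper (not part of statsmodels). All arrays must
    have the same number of rows; 1d arrays are treated as one column.
    """
    arrays = [np.asarray(a, dtype=float) for a in arrays]
    keep = np.ones(arrays[0].shape[0], dtype=bool)
    for a in arrays:
        # Flatten trailing dimensions so 1d and 2d arrays are handled alike.
        keep &= ~np.isnan(a.reshape(a.shape[0], -1)).any(axis=1)
    return [a[keep] for a in arrays], keep

endog = np.array([1.0, 2.0, np.nan, 4.0])
exog1 = np.array([[1.0, 0.5], [1.0, np.nan], [1.0, 0.1], [1.0, 0.2]])
exog2 = np.array([[3.0], [4.0], [5.0], [6.0]])

# Row 1 is dropped because of exog1, row 2 because of endog,
# and both rows are removed from exog2 as well.
(endog_c, exog1_c, exog2_c), keep = drop_missing_jointly(endog, exog1, exog2)
```

Calling the missing-value handling separately per exog would instead drop different rows from different arrays, and the equations would no longer line up.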

4) Dictionary plus data: this might actually be the easiest for the user, especially if the dictionary can be replaced by a formula. As in case 1), we would have to call handle_data or handle_formula_data several times, building the design matrices internally, similar to from_formula.
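A sketch of the dictionary idea, assuming the data is a plain mapping from variable name to a 1d array. The layout of `equations` and the function `build_designs` are invented for illustration and do not reflect any existing statsmodels API.

```python
import numpy as np

# All variables live in one data container.
data = {
    "y1": np.array([1.0, 2.0, 3.0, 4.0]),
    "y2": np.array([0.0, 1.0, 0.0, 1.0]),
    "x1": np.array([0.1, 0.2, 0.3, 0.4]),
    "x2": np.array([5.0, 6.0, 7.0, 8.0]),
    "x3": np.array([9.0, 8.0, 7.0, 6.0]),
}

# Hypothetical specification: each equation names its endog and exog columns.
equations = {
    "eq1": {"endog": "y1", "exog": ["x1", "x2"]},
    "eq2": {"endog": "y2", "exog": ["x2", "x3"]},
}

def build_designs(data, equations, add_constant=True):
    """Build (endog, exog, exog_names) per equation from a column mapping."""
    designs = {}
    for name, spec in equations.items():
        endog = np.asarray(data[spec["endog"]], dtype=float)
        cols = [np.asarray(data[c], dtype=float) for c in spec["exog"]]
        names = list(spec["exog"])
        if add_constant:
            cols.insert(0, np.ones(endog.shape[0]))
            names.insert(0, "const")
        designs[name] = (endog, np.column_stack(cols), names)
    return designs

designs = build_designs(data, equations)
endog_eq2, exog_eq2, names_eq2 = designs["eq2"]
```

Note that x2 appears in both equations but is stored only once in data, which is the main attraction compared to passing a separate exog per equation.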

One follow-up problem, that the number of variables differs from the number of parameters, already exists in MNLogit and in ARMAX (and in VAR/SVAR to some extent), but these cases have more structure than general multi-equation models.

josef-pkt commented 11 years ago

Sysreg in PR #361 subclasses LikelihoodModel, but it does some data handling in __init__ without calling super(...).__init__ and doesn't use the generic data handling (and has no formula support).