The data handling in the `__init__` of `base.Model` is completely oriented towards single-equation models: it assumes a single endog and a single exog.
System-of-equations models, and versions of discrete choice models that work with a different exog for each choice, are better served with multiple exog, and in the former case possibly with multiple endog.
Possibilities (options 1 to 4 avoid rewriting `handle_data` to allow for lists/tuples of endog and exog):
0) rewrite `handle_data` to handle tuples of exog
1) call `handle_data` multiple times and store the results as a list in the `model.data` attribute
2) instantiate several base model classes that work as stub models, mainly to hold the data and meta-information across equations
3) possible in some cases: merge all exog into one big exog (same for endog if it applies; not sure whether `handle_data` can handle 2d endog)
4) ignore exog and use data (argument) directly, and map equations to data columns by a dictionary (this seems to be what I had in mind for nested logit).
Or maybe better, use the exog argument, but just as a collection of data variables without any direct relation to the actual design matrices for each equation.
Option 3) is the simplest solution where it can be used, for example with panel data.
Panel, longitudinal data: can be stacked vertically for unbalanced panels and horizontally for SUR-type balanced panels.
sysreg: standard simultaneous equation models; stacking doesn't work, internally we have a possibly sparse block structure.
conditional logit (different exog per choice): could be horizontally stacked, or use long format, where however several rows would correspond to one equation (Greene's data csv file format).
Horizontal stacking could require inefficient array copying since we always need to break up the exog into the individual exogs for each choice. But horizontal stacking might be worth a try. Supporting code like k_vars and exog_names would still be wrong without splitting.
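A minimal sketch of the horizontal stacking idea, using plain NumPy rather than any statsmodels code. The array sizes and the names `exog_list`, `bounds`, and `blocks` are made up for illustration. It shows that the merge itself copies, while splitting back by column slices gives views, and that `k_vars` of the wide array differs from the per-choice counts.

```python
import numpy as np

# Hypothetical per-choice design matrices: 3 choices, 5 observations each.
rng = np.random.default_rng(0)
exog_list = [rng.normal(size=(5, k)) for k in (2, 3, 1)]

# Merge into one wide exog and remember where each block starts and stops.
k_vars_list = [x.shape[1] for x in exog_list]
bounds = np.cumsum([0] + k_vars_list)
exog_wide = np.hstack(exog_list)  # this step copies the data once

# Split back with basic slicing; each block is a view on exog_wide.
blocks = [exog_wide[:, lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

# The wide array reports 6 columns, the per-choice blocks report 2, 3 and 1.
k_vars_wide = exog_wide.shape[1]
```

The views are not contiguous in memory, so some downstream operations may still copy them; whether that matters in practice would need to be measured.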
1) calling handle_data several times:
This looks pretty flexible. It requires including endog each time, which would be redundant in the conditional logit case. Also, none of the generic support code would know how to handle several data elements in a model.
The model would be fully responsible for handling the list of `data` instances for the supporting code.
One extra problem is that `missing` handling would in most cases need to drop observations across calls to `handle_data`. (It might be possible by giving all other exog to the first call to `handle_data` as extra arrays.)
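To illustrate the missing-value problem, here is a rough sketch in plain NumPy of dropping rows jointly across endog and several exog. `drop_missing_jointly` is a hypothetical helper, not an existing statsmodels function; it only shows the kind of common row mask that separate `handle_data` calls would not produce on their own.

```python
import numpy as np

def drop_missing_jointly(endog, exog_list):
    """Hypothetical helper: keep rows that are complete in endog and every exog."""
    endog = np.asarray(endog, dtype=float)
    exog_list = [np.asarray(x, dtype=float) for x in exog_list]
    keep = np.ones(endog.shape[0], dtype=bool)
    for arr in [endog] + exog_list:
        # Treat each array as 2d so the row-wise check also works for 1d input.
        keep &= ~np.isnan(arr.reshape(arr.shape[0], -1)).any(axis=1)
    return endog[keep], [x[keep] for x in exog_list], keep

endog = np.array([1.0, 2.0, np.nan, 4.0])
exog_a = np.array([[1.0, 0.5], [1.0, np.nan], [1.0, 0.1], [1.0, 0.2]])
exog_b = np.array([[3.0], [2.0], [1.0], [0.0]])

# Row 1 is dropped because of exog_a, row 2 because of endog.
endog_c, (exog_a_c, exog_b_c), keep = drop_missing_jointly(endog, [exog_a, exog_b])
```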
4) dictionary plus data
This might actually be the easiest for the user, especially if the dictionary can be replaced by a formula. Similar to case 1), we would have to call `handle_data` or `handle_formula_data` several times, building the design matrices internally, similar to `from_formula`.
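A small sketch of what the dictionary mapping could look like, again in plain NumPy. The layout of `data`, the `equations` spec, and the helper `build_designs` are all assumptions made for illustration; nothing here is an existing statsmodels interface.

```python
import numpy as np

# Hypothetical data container: column name -> 1d array, all of equal length.
data = {
    "y1": np.array([1.0, 2.0, 3.0, 4.0]),
    "y2": np.array([0.5, 0.1, 0.9, 0.3]),
    "x1": np.array([1.0, 0.0, 1.0, 0.0]),
    "x2": np.array([2.0, 3.0, 5.0, 7.0]),
    "x3": np.array([0.2, 0.4, 0.6, 0.8]),
}

# Hypothetical equation spec: equation name -> (endog column, exog columns).
equations = {
    "eq1": ("y1", ["x1", "x2"]),
    "eq2": ("y2", ["x2", "x3"]),
}

def build_designs(data, equations):
    """Build one endog/exog pair per equation from the shared data columns."""
    designs = {}
    for name, (endog_name, exog_names) in equations.items():
        exog = np.column_stack([data[col] for col in exog_names])
        designs[name] = {
            "endog": data[endog_name],
            "exog": exog,
            "exog_names": list(exog_names),
        }
    return designs

designs = build_designs(data, equations)
```

Each equation keeps its own `exog_names`, so supporting code that needs names or column counts could work per equation instead of on one merged exog.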
One follow-up problem, that the number of variables differs from the number of parameters, already exists in MNLogit and in ARMAX (and in VAR/SVAR to some extent), but these cases have more structure than general multi-equation models.
Sysreg in PR #361 subclasses LikelihoodModel, but does some data handling in `__init__` without calling `super(...).__init__`, and doesn't use the generic data handling (and no formulas).