statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

ENH/Design handle flexible number of args in multi-part models #7576

Open josef-pkt opened 3 years ago

josef-pkt commented 3 years ago

3903 multi-part models

We currently have two-part models in the ZeroInflated models, with BetaModel coming, and in other models like Mixed.

For two-part models with a single endog, we have explicit exog and exog_xxx, and we need other keywords: link, link_xxx, offset, offset_xxx, and possibly exposure.

For general cases, we might have more parameters of the underlying distribution and need exog and all the other keywords for each of them. E.g. a copula for the bivariate case needs two marginal models; in the multivariate case there can be more marginal models, so it needs to be flexible.

3-parameter distributions, or zero-modified 2-parameter distributions, need 3 exog and links, although a parameter might not be specified as a regression and might instead be assumed/hardcoded to be constant across observations, e.g. df in the t-distribution.

Options are to switch to lists or tuples (that's what I started in copulas) or to add a flexible number of kwargs in the generic case.
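To make the list/tuple option concrete, here is a minimal sketch of how per-parameter inputs could be normalized. `PartSpec` and `build_parts` are hypothetical names used only for illustration; they are not existing statsmodels API, and the string links are placeholders.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class PartSpec:
    """Inputs for one distribution parameter (hypothetical container)."""
    exog: Any = None    # design matrix, or None if the parameter is constant
    link: Any = None    # link function, or None for a model default
    offset: Any = None  # offset, or None for no offset


def _as_list(value, k_parts, name):
    """Expand None to k_parts Nones; otherwise require exactly k_parts items."""
    if value is None:
        return [None] * k_parts
    value = list(value)
    if len(value) != k_parts:
        raise ValueError(f"{name} needs {k_parts} entries, got {len(value)}")
    return value


def build_parts(exog, link=None, offset=None):
    """Zip per-parameter lists into one PartSpec per distribution parameter."""
    exog = list(exog)
    k_parts = len(exog)
    links = _as_list(link, k_parts, "link")
    offsets = _as_list(offset, k_parts, "offset")
    return [PartSpec(x, l, o) for x, l, o in zip(exog, links, offsets)]


# Three parameters; the third has no exog, i.e. it is constant.
parts = build_parts([[1.0, 2.0], [3.0, 4.0], None], link=["log", "logit", None])
```

With lists, the number of parts is just `len(exog)`, so the same signature would cover two-part models, 3-parameter distributions, and an arbitrary number of copula marginals.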

One possibility for formulas is to start with multipart formulas separated by | as in R (and Kevin's linearmodels), but that wouldn't affect lists of links and offsets.
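A rough sketch of what splitting such a formula could look like. `split_multipart_formula` is a hypothetical helper, not part of statsmodels, patsy, or linearmodels, and the naive split assumes that `|` never appears inside a term (e.g. inside a function call).

```python
def split_multipart_formula(formula):
    """Split 'y ~ a + b | c | d' into the endog name and a list of right-hand sides."""
    lhs, sep, rhs = formula.partition("~")
    if not sep:
        raise ValueError("formula needs a '~'")
    # Naive split: assumes '|' is only used as the part separator.
    parts = [p.strip() for p in rhs.split("|")]
    if any(not p for p in parts):
        raise ValueError("formula contains an empty part")
    return lhs.strip(), parts


endog_name, rhs_parts = split_multipart_formula("y ~ x1 + x2 | z1 | 1")
```

Each entry of `rhs_parts` could then be passed to the formula machinery separately to build one exog per part; links and offsets would still need their own lists.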

(For now I stick with two-exog models, which are still missing some support.)

josef-pkt commented 3 years ago

reminder: I just remembered that GAM is also a multi-part model

We need to check what the current and possible code sharing is. Most of it is specific to the penalized spline part, which is not a standard extra exog_xxx in the model __init__, but we have the basis function array as extra exog.

josef-pkt commented 3 years ago

If the extra param is constant, i.e. exog_extra is None, then we have two options: treat the param as a scalar, or always use exog_extra (a column of ones by default).

We don't have any example yet that allows a scalar parameter in models where the param can have exog. A similar case is offset, which can be scalar 0 or None. AFAIR, we always force weights to be 1-dim, ones if not provided.

Always using exog_extra (ones by default) is simpler to implement, but computationally less efficient than using a scalar. (E.g. scipy distributions recently switched to allowing scalars and broadcasting instead of expanding to 1-d arrays by default.)
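A small numpy sketch of the two variants, only to illustrate the trade-off. All names and numbers here are made up for the example, and the extra parameter is assumed to act like a scale.

```python
import numpy as np

nobs = 4
exog = np.column_stack([np.ones(nobs), np.arange(nobs, dtype=float)])
params_mean = np.array([1.0, 0.5])
param_extra = 2.0  # constant extra parameter (assumed to be a scale here)

linpred_mean = exog @ params_mean  # shape (nobs,)

# Variant 1: exog_extra defaults to a column of ones,
# so the extra parameter is expanded to a 1-d array of length nobs.
exog_extra = np.ones((nobs, 1))
extra_1d = exog_extra @ np.array([param_extra])

# Variant 2: keep the extra parameter as a scalar and let numpy broadcast.
extra_scalar = param_extra

standardized_1d = linpred_mean / extra_1d
standardized_scalar = linpred_mean / extra_scalar
```

Both variants give the same result; the first needs an extra `(nobs, 1)` array and a matrix product, while the second skips both but requires every downstream method to handle a scalar as well as a 1-d array.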

I'm currently undecided while working on #7778