pydata / patsy

Describing statistical models in Python using symbolic formulas
Other
947 stars 103 forks source link

Specificing linear constraints in design matrix #185

Closed jeskowagner closed 2 years ago

jeskowagner commented 2 years ago

Hi,

Thanks for developing patsy! It looks to be very useful for my project.

Currently, I struggle to incorporate linear constraints the design matrix. To give you an idea of what I am looking to do, here an example in R:

n=6
g=factor(rep(1:2,each=n/2))
model.matrix(~g, contrasts.arg = list(g="contr.sum"))
#   (Intercept)  g1
# 1        1      1
# 2        1      1
# 3        1      1
# 4        1     -1
# 5        1     -1
# 6        1     -1

The "contr.sum" constraint should be equivalent to patsy's di.linear_constraint({"g": 0}) (as in the documentation). However, I am unable to incorporate the information of that section into the process of creating my design matrix.

g=["1"]*3 + ["2"]*3
constraint={"g":0}

# Option 1, straight to dmatrix
patsy.dmatrix("~g", linear_constraint=constraint)
# -> dmatrix does not accept linear_constraint as argument, fair enough

# Option 2, using DesignInfo
di = patsy.DesignInfo("g")
di.linear_constraint(constraint)
patsy.dmatrix(di) # -> AttributeError: 'NoneType' object has no attribute 'values'

# Option 3, modifying dmatrix after creation
d = patsy.dmatrix("~g")
d.design_info.linear_constraint(constraint)
# -> ValueError: unrecognized variable name/index 'g'

Could you please tell me how I can incorporate linear constraints into the design matrix?

Thanks! Best,

Jesko

matthewwardrop commented 2 years ago

Hi Jesko! The linear constraints are not intended to be part of the design matrix; but rather, a matrix that operates on the design matrix to enforce the constraint during (e.g.) regression.

Thus, if you have a constraint {"g": 0}, then there should be a column in your design matrix of that name. Let's suppose it was the second of three columns. This constraint would look like: constraint_matrix=[[0, 1, 0]]; constraint_values: [0]. The constraint could be evaluated for each row in the design matrix during regression using: design_matrix @ constraint_matrix.T ; and action taken accordingly.

Does that make sense?

Formulaic fixed up a few small details in the linear constraint generation, and I'm more familiar with that code; so here's how it would work there:

>>> import pandas
>>> from formulaic import Formula
>>> model_matrix = Formula("a + b + c").get_model_matrix(pandas.DataFrame({"a": [1,2,3], "b": [4,5,6], "c": [7,8,9]}))
>>> linear_constraints = model_matrix.model_spec.get_linear_constraints({"b": 0, "a + c": 1})
>>> print(model_matrix.to_string())
   Intercept  a  b  c
0        1.0  1  4  7
1        1.0  2  5  8
2        1.0  3  6  9
>>> linear_constraints.constraint_matrix, linear_constraints.constraint_values
(array([[0., 0., 1., 0.],
        [0., 1., 0., 1.]]),
 array([0, 1]))
>>> deviations = (model_matrix.values @ linear_constraints.constraint_matrix.T) - linear_constraints.constraint_values
>>> deviations
array([[ 4.,  7.],
       [ 5.,  9.],
       [ 6., 11.]])
jeskowagner commented 2 years ago

Hi Mathew! Thank you for your helpful response! Clearly I had some lingering confusions between design matrix, linear constraints, and how they are used in regression. Also thanks for the pointer to Formulaic, I had not seen that before!