pydata / patsy

Describing statistical models in Python using symbolic formulas

Unintuitive patsy crash if "C = " is used elsewhere in the code #174

Closed epierson9 closed 3 years ago

epierson9 commented 3 years ago

Patsy (and libraries that rely on it) crashes with an unintuitive error message if the variable name "C" is defined elsewhere in the code, as the following MWE shows:

import statsmodels.api as sm
import pandas as pd

d = pd.DataFrame({'a':range(5), 'b':[1, 0, 0, 0, 1]})
sm.OLS.from_formula("b ~ C(a)", data=d).fit()
print("first call works fine")
C = 5  # shadows patsy's built-in C() in the calling namespace
d = pd.DataFrame({'a':range(5), 'b':[1, 0, 0, 0, 1]})
sm.OLS.from_formula("b ~ C(a)", data=d).fit()  # raises patsy.PatsyError
print("second call works fine")  # never reached

The first call works fine, but the second raises an error. The final error message is:

patsy.PatsyError: Error evaluating factor: TypeError: 'int' object is not callable
    b ~ C(a)
        ^^^^

This is quite hard to debug if the C variable is set far from the patsy call. (In my case, the error came up because I was using sklearn, where C is the name of one of the regularization parameters.) Perhaps there's a way to log a more intuitive error message, or to avoid this error entirely?

matthewwardrop commented 3 years ago

Thanks for getting in touch!

This issue occurs because patsy honours the surrounding context when computing formulas. This is convenient because you can define ad hoc functions in the local namespace, and have them be usable in the formula. You can override this behaviour by passing eval_env to the .from_formula method as described in: help(sm.OLS.from_formula) and help(patsy.dmatrix).
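
To make the lookup behaviour concrete, here is a simplified model of it using only the standard library. It is not patsy's actual implementation: `eval_factor`, `categorical`, and `BUILTIN_HELPERS` are hypothetical names, and the search order (data, then the caller's scope, then built-in helpers) is an assumption chosen to match the behaviour reported above.

```python
from collections import ChainMap


def categorical(values):
    """Hypothetical stand-in for a built-in formula helper such as C()."""
    return [str(v) for v in values]


# Hypothetical table of built-in helpers; searched last.
BUILTIN_HELPERS = {"C": categorical}


def eval_factor(expr, data, caller_scope):
    """Evaluate expr, resolving names in data, then caller_scope, then helpers."""
    namespace = ChainMap(dict(data), dict(caller_scope), BUILTIN_HELPERS)
    # Python builtins are disabled so that only the three layers are searched.
    return eval(expr, {"__builtins__": {}}, namespace)


data = {"a": [0, 1, 2]}

# Nothing in the caller's scope is named C, so the helper is found.
print(eval_factor("C(a)", data, caller_scope={}))  # ['0', '1', '2']

# A caller-level C = 5 is found before the helper, so the call fails.
try:
    eval_factor("C(a)", data, caller_scope={"C": 5})
except TypeError as exc:
    print("shadowed:", exc)
```

In this model, an empty `caller_scope` plays the role of a clean evaluation environment: nothing the caller defines can shadow the helper.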

I understand that this behaviour could be confusing/annoying at times. However, since there are ways to work around it and this project is more-or-less feature-frozen, I'm going to close this one out for now.
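
One such workaround, sketched with patsy directly: pass an empty `patsy.EvalEnvironment` as `eval_env`, so the formula sees only the data and patsy's own built-ins. The data and the shadowing variable mirror the MWE above. This assumes patsy is installed and is meant as an illustration, not the only approach.

```python
import patsy

data = {"a": [0, 1, 2, 3, 4]}
C = 5  # would normally shadow patsy's built-in C() in this scope

# An environment with no namespaces: variables from the calling code
# are not visible to the formula, so C resolves to patsy's built-in.
clean_env = patsy.EvalEnvironment([])

design = patsy.dmatrix("C(a)", data, eval_env=clean_env)
print(design.design_info.column_names)
```

The trade-off is that ad hoc functions defined in the calling code are no longer visible to the formula either. statsmodels' formula documentation also describes passing `eval_env=-1` to `from_formula` for a clean namespace; it is worth checking that against the version you have installed.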

epierson9 commented 3 years ago

Thank you for the additional context!
