pydata / patsy

Describing statistical models in Python using symbolic formulas

Maximum recursion depth error for formulas with more than 485 terms #18

Open szs8 opened 11 years ago

szs8 commented 11 years ago

I am working with a dataframe which has 7000 columns and it turns out that once you go beyond 485 terms, patsy throws a recursion error when going from a formula to a design matrix. Is there a better way of doing this?

Thanks!

In [282]: df = pd.DataFrame(dict(('a' + str(i), np.random.randn(5)) for i in xrange(500)))

In [283]: formula = " + ".join(df.columns)

In [284]: dmatrices(formula, df)

....

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in eval(self, tree, require_evalexpr)
    452                                 "'%s' operator" % (tree.type,),
    453                                 tree.token)
--> 454         result = self._evaluators[key](self, tree)
    455         if require_evalexpr and not isinstance(result, IntermediateExpr):
    456             if isinstance(result, ModelDesc):

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in _eval_binary_plus(evaluator, tree)
    283
    284 def _eval_binary_plus(evaluator, tree):
--> 285     left_expr = evaluator.eval(tree.args[0])
    286     if tree.args[1].type == "ZERO":
    287         return IntermediateExpr(False, None, True, left_expr.terms)

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in eval(self, tree, require_evalexpr)
    452                                 "'%s' operator" % (tree.type,),
    453                                 tree.token)
--> 454         result = self._evaluators[key](self, tree)
    455         if require_evalexpr and not isinstance(result, IntermediateExpr):
    456             if isinstance(result, ModelDesc):

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in _eval_binary_plus(evaluator, tree)
    283
    284 def _eval_binary_plus(evaluator, tree):
--> 285     left_expr = evaluator.eval(tree.args[0])
    286     if tree.args[1].type == "ZERO":
    287         return IntermediateExpr(False, None, True, left_expr.terms)

/Users/xxx/lib/python2.7/site-packages/patsy-0.1.0_dev-py2.7.egg/patsy/desc.pyc in eval(self, tree, require_evalexpr)
    448         assert isinstance(tree, ParseNode)
    449         key = (tree.type, len(tree.args))
--> 450         if key not in self._evaluators:
    451             raise PatsyError("I don't know how to evaluate this "
    452                                 "'%s' operator" % (tree.type,),

RuntimeError: maximum recursion depth exceeded in cmp

szs8 commented 11 years ago

I guess I can use ModelDesc etc. https://patsy.readthedocs.org/en/latest/expert-model-specification.html

But in any case it might make sense to fail gracefully here.

njsmith commented 11 years ago

Huh, fair enough, the parse evaluator does recurse over the parse tree. It hadn't occurred to me that people would want to parse strings with hundreds of terms :-).
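To illustrate why a long chain of `+` terms hits the limit, here is a minimal sketch using a toy tree. The tuple-based tree and the helper names `parse_sum` and `eval_recursive` are assumptions made for illustration, not patsy's actual `ParseNode` or evaluator. A left-associative `+` produces a tree nested as deeply as there are terms, so a recursive evaluator needs at least one stack frame per term.

```python
import sys

# Toy stand-in for a parse tree (hypothetical, not patsy's ParseNode):
# a leaf is a string, a "+" node is a (left, right) tuple.
def parse_sum(names):
    """Build the left-nested tree ((a0 + a1) + a2) + ... without recursion."""
    tree = names[0]
    for name in names[1:]:
        tree = (tree, name)
    return tree

def eval_recursive(tree):
    """Collect leaf names by recursing into the left operand first."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return eval_recursive(left) + eval_recursive(right)

# Use more terms than the interpreter allows stack frames.
names = ["a%d" % i for i in range(sys.getrecursionlimit() * 2)]
tree = parse_sum(names)

try:
    eval_recursive(tree)
    overflowed = False
except RecursionError:  # Python 3 name; Python 2 raised a plain RuntimeError
    overflowed = True

print("recursion limit:", sys.getrecursionlimit())
print("overflowed:", overflowed)
```

In the traceback above each `+` shows up as two frames (`eval` and `_eval_binary_plus`), so CPython's default limit of 1000 frames would be used up after a few hundred terms, which is consistent with the 485 reported here. The exact figure presumably depends on how many frames the caller already occupies.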

I'll think about how fixable that is. In the meantime you may prefer in any case to use the programmatic interface for constructing formulas, which bypasses the string parser entirely. See http://patsy.readthedocs.org/en/latest/expert-model-specification.html and in particular the paragraph starting "However, there is also a middle ground...".

In your case I'd do something like

from patsy import ModelDesc, Term, LookupFactor, dmatrix

# Term expects a list of factors, so each column gets a one-factor term
my_formula = ModelDesc([], [Term([LookupFactor(c)]) for c in df.columns])
dmatrix(my_formula, df)

Let me know how it goes, there might be other places where I didn't think scaling through far enough...

On Fri, Apr 12, 2013 at 2:04 PM, NaN notifications@github.com wrote:

I am working with a dataframe which has 7000 columns and it turns out that once you go beyond 485 terms, patsy throws a recursion error when going from a formula to a design matrix. Is there a better way of doing this?

Thanks!


szs8 commented 11 years ago

Thanks, I was just about to do that and was trying to figure out how categorical columns would be handled especially if they are integers.

I guess I should never have attempted to build such a huge formula anyway, but sometimes you are pig-headed and just want to plough forward!

jm-contreras-zz commented 10 years ago

@signalseeker, I recently ran into the same error using statsmodels to build a logistic regression with more than 485 predictors. The data I'm working with has a very large predictor space and, unfortunately, there is nothing to be done about it. Thanks for looking into this, @njsmith.

DSLituiev commented 8 years ago

+1. Trying to run an interaction model '(a1+ a2+ ... a360) * (b1+...b40)' works, but '(a1+ a2+ ... a500) * (b1+...b40)' breaks :-(

Have to resort to sklearn.preprocessing.PolynomialFeatures

njsmith commented 8 years ago

@DSLituiev: so as noted upthread, you can use something like (untested)

from patsy import ModelDesc, Term, LookupFactor, dmatrix
terms = []
for i in range(1, 501):
    for j in range(1, 41):
        # Add an interaction between a{i} and b{j}, like a10:b12
        terms.append(Term((LookupFactor("a" + str(i)), LookupFactor("b" + str(j)))))
preparsed_formula = ModelDesc([], terms)
dmatrix(preparsed_formula, dataframe)

This gives you the same interaction terms as the patsy formulas you wrote above (to match `*` exactly you'd also append the single-factor terms and patsy's INTERCEPT); it's just that instead of having to generate a big string and then have patsy parse it, you can go directly to patsy's high-level representation of your data structures.

(And if you want to transform individual items before passing them in, you can replace LookupFactor(...) with something like EvalFactor("np.log(x)") or EvalFactor("C(a10)"), or you can even define a custom factor class -- mostly you just need to implement an eval method that takes a dataframe and returns your factor's values.)
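As a rough sketch of the `EvalFactor` suggestion, the following mixes an integer-coded categorical column with a plain numeric one. The data frame and the column names `g` and `x` are made up for illustration; only `ModelDesc`, `Term`, `LookupFactor`, `EvalFactor` and `dmatrix` come from patsy.

```python
import pandas as pd
from patsy import ModelDesc, Term, LookupFactor, EvalFactor, dmatrix

# Hypothetical data: "g" is an integer-coded category, "x" is numeric.
df = pd.DataFrame({
    "g": [1, 2, 3, 1, 2, 3],
    "x": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
})

terms = [
    Term([EvalFactor("C(g)")]),  # C(...) tells patsy to treat the integers as levels
    Term([LookupFactor("x")]),   # looked up as-is, so it stays numeric
]

# No intercept term is listed, so none is added to the design matrix.
design = dmatrix(ModelDesc([], terms), df)
print(design.design_info.column_names)
```

Wrapping the column in `C(...)` is what keeps the integers from being read as a numeric predictor; a bare `LookupFactor("g")` would be treated like `x`.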

njsmith commented 8 years ago

That said, I'm not likely to find the time to fix this soon, but it certainly is fixable by replacing the current recursive loop with an equivalent non-recursive loop, and I'd be happy to accept a patch if anyone wants to make one.
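A minimal sketch of what "non-recursive" could look like, again on a toy tuple-based tree rather than patsy's real parse tree (the names `parse_sum` and `eval_iterative` are hypothetical). The evaluator keeps its own explicit stack, so the depth of the tree no longer consumes Python stack frames.

```python
import sys

# Toy stand-in for a parse tree (hypothetical, not patsy's ParseNode):
# a leaf is a string, a "+" node is a (left, right) tuple.
def parse_sum(names):
    """Build the left-nested tree ((a0 + a1) + a2) + ... without recursion."""
    tree = names[0]
    for name in names[1:]:
        tree = (tree, name)
    return tree

def eval_iterative(tree):
    """Collect leaf names left to right using an explicit stack."""
    leaves = []
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            leaves.append(node)
        else:
            left, right = node
            # Push right first so that left is popped and handled first.
            stack.append(right)
            stack.append(left)
    return leaves

# Far more terms than the recursion limit would allow a recursive walk.
names = ["a%d" % i for i in range(sys.getrecursionlimit() * 5)]
result = eval_iterative(parse_sum(names))
print(len(result) == len(names))
```

A real patch would also have to combine intermediate results per operator rather than just collecting leaves, so this only shows the traversal part of the change.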

DSLituiev commented 8 years ago

Thank you! This looks like enough for my application, and I am afraid I am not sufficiently equipped to tinker with the source for now.

On Thu, May 19, 2016 at 7:20 PM, Nathaniel J. Smith < notifications@github.com> wrote:

That said, I'm not likely to find the time to fix this soon, but it certainly is fixable by replacing the current recursive loop with an equivalent non-recursive loop, and I'd be happy to accept a patch if anyone wants to make one.


jolespin commented 6 years ago

Was this ever fixed?

njsmith commented 6 years ago

@jolespin I don't think so.

jolespin commented 6 years ago

I tried doing a mixed effects model with 4000 attributes and it kind of just got stuck and my computer stopped sounding like it was computing anything. Is there a maximum number of attributes that can go in a linear model?

njsmith commented 6 years ago

@jolespin This issue is about formulas with lots of terms, like "y ~ x1 + x2 + x3 + x4 + x5 + x6 + ........ + x3999 + x4000", and it causes crashes, not freezes. Your issue sounds like something you should report to the package you're using to do mixed effect models (maybe statsmodels?)

Hoeze commented 1 year ago

I got the same issue. Any update on this?

matthewwardrop commented 1 year ago

At this point, we are unlikely to fix this in patsy. The issue is resolved in Formulaic, however.