pydata / patsy

Describing statistical models in Python using symbolic formulas

How to do a polynomial? #20

Closed jseabold closed 2 years ago

jseabold commented 11 years ago

I've seen https://groups.google.com/forum/#!topic/pystatsmodels/96cMRgFXBaA, but why doesn't something like this work?

import numpy as np
from patsy import dmatrices

data = dict(y=range(1, 11), x1=range(21, 31), x2=range(11, 21))

dmatrices("y ~ x1 + x2**2", data)

or

dmatrices("y ~ x1 + I(x2**2)", data)

This works

dmatrices("y ~ x1 + np.power(x2, 2)", data)
njsmith commented 11 years ago

The reason x2**2 doesn't work is that x2**2 is shorthand for x2 * x2, which is shorthand for x2 + x2 + x2:x2, which collapses down to just plain x2, because self-interactions don't make sense. See https://patsy.readthedocs.org/en/latest/formulas.html#the-formula-language

There's an argument that for numerical predictors, x2:x2 should be the same as np.multiply(x2, x2), just like x1:x2 is the same as np.multiply(x1, x2). But even then x2**2 would mean x2 + np.multiply(x2, x2).

In your example, I(x2**2) doesn't work because x2 is a python list, not a numpy array. patsy has no crystal ball that tells it that lists should be coerced to arrays while interpreting arbitrary python code. Use DataFrame() instead of dict() and it will just work...
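
A minimal sketch of that fix (my example, not code from the thread): put the data in a DataFrame so x2 is a real column, and I(x2**2) then squares it elementwise.

import pandas as pd
from patsy import dmatrices

data = pd.DataFrame(dict(y=range(1, 11), x1=range(21, 31), x2=range(11, 21)))
# Inside I(...), ** is plain Python exponentiation on the Series,
# not the formula-language interaction operator.
y, X = dmatrices("y ~ x1 + I(x2**2)", data)
print(X.design_info.column_names)  # e.g. ['Intercept', 'x1', 'I(x2 ** 2)']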


jseabold commented 11 years ago

Is it possible/desirable, if the RHS of ** is an integer (or a numeric term), to not do the interaction expansion, because it's surely not what someone wants?

njsmith commented 11 years ago

I don't understand. The only legal RHS for ** is an integer; anything else raises an error in the formula evaluator...?

The intended usage is to say things like "give me all 1, 2, or 3-way interactions, but not anything higher than that": (a + b + c + d + e + f)**3
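
A small illustration of that expansion (my own, not from the thread):

import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({"a": [1., 2., 3., 4.], "b": [2., 4., 1., 3.], "c": [5., 1., 2., 2.]})
# ** bounds the interaction order: all main effects plus all 2-way terms.
X = dmatrix("(a + b + c)**2", df)
print(X.design_info.term_names)
# ['Intercept', 'a', 'b', 'c', 'a:b', 'a:c', 'b:c']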


jseabold commented 11 years ago

Yeah I was thinking of *. Hmm. Oh well.

njsmith commented 11 years ago

It's better to use something like poly(x, 3) anyway, right? B/c that can produce orthogonalized columns? The people I know are always excited about testing for whether the higher order terms are significant, and that's most easily done if you use orthogonalized columns as your polynomial basis. (Of course this requires we implement poly() but that's not hard, just, no-one has gotten around to it.)
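
To make that point concrete (my illustration; patsy had no poly() at this point in the thread): with an orthonormal basis, each coefficient is an independent projection, so the lower-order estimates don't move when a higher-order term is added, which is what makes testing the higher-order terms easy.

import numpy as np

x = np.linspace(-1, 1, 40)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * np.sin(5 * x)

# Orthonormal polynomial basis via QR of the Vandermonde matrix.
Q = np.linalg.qr(np.vander(x, 4, increasing=True))[0]
b_cubic = np.linalg.lstsq(Q, y, rcond=None)[0]       # fit degrees 0..3
b_quad = np.linalg.lstsq(Q[:, :3], y, rcond=None)[0]  # fit degrees 0..2
print(np.allclose(b_cubic[:3], b_quad))  # True: lower-order terms unchanged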


jseabold commented 11 years ago

I suppose, if for ease of inference, though I rarely see it in practice. Might just be where I hang out. Different strokes.

jseabold commented 11 years ago

Translation of Stata's orthpoly.ado, so likely not fit for inclusion anywhere, but I'll leave it here. // cc @josef-pkt

https://gist.github.com/jseabold/5859121

njsmith commented 11 years ago

I think those are orthogonal polynomials, which are different. Orthogonal polynomials are ones where if you evaluate them at every point in their domain, the resulting vectors are orthogonal. For regression, we want polynomials such that if you evaluate them at every point where there is data, then the resulting vectors are orthogonal.

There's already code for doing this using QR in the patsy Poly class for categorical polynomial coding, but it hasn't been adapted to be a numerical stateful transform.


josef-pkt commented 11 years ago

I find it confusing that polynomials should be QR-orthogonalized. I think the main polynomials should be the numpy.polynomial xxxvander ones.

It's a question of choosing names so that most users (and not just R users) understand what the different versions mean.
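
For readers unfamiliar with the reference (my gloss, not Josef's code): the numpy.polynomial modules each expose a Vandermonde-style builder, e.g. legvander for the Legendre basis.

import numpy as np
from numpy.polynomial.legendre import legvander

x = np.linspace(-1, 1, 5)
# Columns are P0(x), P1(x), P2(x) evaluated at the sample points.
print(legvander(x, 2))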

jseabold commented 11 years ago

AFAICT, the Poly class is only intended for equally spaced categorical variables. At least, I couldn't figure out how to replicate what orthpoly gives with Poly for a continuous regressor, but I just wanted to work on something else for an hour.

I'm not sure I understand your distinction between what's orthogonal to what. I think orthpoly is like a scaled version of legvander that Josef was after.

import numpy as np
from statsmodels.tools.tools import webuse

# orthpoly here is the function from the gist linked above
dta = webuse("auto")
weight = dta["weight"].values.astype(float)
orth = orthpoly(weight, 4)
np.allclose(np.dot(orth[:, 0], orth[:, 1]), 0)  # True: the columns are orthogonal

http://www.stata.com/help11.cgi?orthpoly

njsmith commented 11 years ago

I only looked at your code for a few seconds, so I could be confused :-)

This looks like a good skimmable paper on why orthogonal is better than vander in many cases (and I'm not aware of any situation where it's worse): http://www.jstor.org/stable/1403204

The confusion is that the term "orthogonal polynomial" almost always refers to something else, that afaict is totally irrelevant to linear regression. In ordinary mathematical usage, two polynomials f1 and f2 are orthogonal iff the integral of f1(x)f2(x)dx = 0, evaluated over the whole domain.

What we want are polynomials such that the sum of f1(xi)f2(xi) over our data points equals 0.

Poly() implements the latter (and only uses it for categorical variables treated as integers, but the code would work just as well for non-integer data).
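
A quick numeric sketch of this distinction (an illustration, not code from the thread): Legendre polynomials are orthogonal as functions over [-1, 1], but evaluating them at an arbitrary sample does not give orthogonal vectors; QR-orthogonalized Vandermonde columns do, by construction.

import numpy as np
from numpy.polynomial.legendre import legvander

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)  # irregular sample points

L = legvander(x, 3)  # mathematically orthogonal polynomials, evaluated at x
G = L.T @ L
print(np.allclose(G, np.diag(np.diag(G))))  # False: not orthogonal as vectors

Q = np.linalg.qr(np.vander(x, 4, increasing=True))[0]
print(np.allclose(Q.T @ Q, np.eye(4)))  # True: orthogonal over the data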

I don't know what Stata does. The manual page you pointed me to strongly implies that it produces the second kind of orthogonal polynomials (when it says "orthog and orthpoly create a set of variables such that the 'effects' of all the preceding variables have been removed from each variable"). But AFAICT the Christoffel-Darboux recurrence is for producing the first kind of orthogonal polynomials.

Are the predictor vectors that you get out of orthpoly actually orthogonal as vectors?


jseabold commented 9 years ago

Just parking this one-liner here, since I came back around to this. AFAIK it's equivalent to R's poly for continuous variables, though not really numerically sound.

import numpy as np

# QR-orthogonalize the Vandermonde columns [1, x, ..., x**degree], then drop the constant column
poly = lambda x, degree: np.linalg.qr(np.vander(x, degree + 1)[:, ::-1])[0][:, 1:]
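
A quick sanity check of the one-liner (my own, not from the thread; assumes the poly definition above):

import numpy as np

x = np.linspace(0.0, 9.0, 10)
P = poly(x, 3)
print(P.shape)                          # (10, 3): degrees 1..3, intercept dropped
print(np.allclose(P.T @ P, np.eye(3)))  # True: columns are orthonormal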

matthewwardrop commented 2 years ago

I'm going to close this one out since a PR exists that adds a poly implementation (#92), and there's no further action to be taken.