hongooi73 / glmnetUtils

Utilities for glmnet

add interaction terms #11

Closed seanv507 closed 7 years ago

seanv507 commented 7 years ago

It would be great to add support for interaction terms of the form ... + length:width ...

Are there any problems with implementing this? I would be happy to do it myself.

hongooi73 commented 7 years ago

You can get interaction terms by setting use.model.frame=TRUE. Is this sufficient?

seanv507 commented 7 years ago

No, for precisely the reasons you give for why use.model.frame doesn't work well. That's what I want to work on.

hongooi73 commented 7 years ago

I think it would be a pretty complicated task, since you'd have to parse the formula and figure out what each combination of terms means. Eg how would you handle ~ x1 + x2:(x3 + x4) + x5*x6*x7^2?

Bear in mind as well that : and * can mean different things when variables are factor vs numeric.

Happy to accept a pull request -- I just think that it might be a bigger task than it looks at first.
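
To make the parsing concern above concrete, here is a small base R sketch (not glmnetUtils code). It shows how terms() already expands the example formula into individual term labels, and how model.matrix() treats `:` differently for numeric and factor variables. The data frames and their column names are made up for illustration.

```r
# Expand a compound formula into its individual terms using base R.
fml <- ~ x1 + x2:(x3 + x4) + x5*x6*x7^2
labs <- attr(terms(fml), "term.labels")
print(labs)

# ':' between two numeric variables yields a single product column.
num <- data.frame(u = c(1, 2, 3, 4), v = c(2, 0, 1, 5))
mm_num <- model.matrix(~ u:v, data = num)
print(colnames(mm_num))

# ':' between two factors yields a set of indicator columns built from
# combinations of their levels (made-up factors with 2 and 3 levels).
fac <- data.frame(
  f = factor(c("a", "b", "a", "b", "a", "b")),
  g = factor(c("p", "q", "r", "p", "q", "r"))
)
mm_fac <- model.matrix(~ f:g, data = fac)
print(colnames(mm_fac))
```

Any code that parses formulas by hand would need to reproduce this expansion and the type-dependent column generation, which is where most of the complexity lies.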

seanv507 commented 7 years ago

I was going to limit it to just simple products, i.e. "x1 + x2:x3 + x3:x4:x5" (I am predominantly working with factor variables).

Thanks for the heads up about factor vs numeric - useful to bear in mind.
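
A minimal sketch of what handling such simple factor products might look like, using only base R. The helper name `interaction_columns` and the example data are hypothetical and are not part of glmnetUtils; the sketch also builds a dense matrix, so it does not address memory use.

```r
# Hypothetical helper (not part of glmnetUtils): build indicator columns
# for a pure product of factors such as "x2:x3" or "x3:x4:x5".
interaction_columns <- function(term, data) {
  vars <- strsplit(term, ":", fixed = TRUE)[[1]]
  stopifnot(all(vars %in% names(data)),
            all(vapply(data[vars], is.factor, logical(1))))
  # One combined factor whose levels are all combinations of the inputs.
  combined <- interaction(data[vars], sep = ":")
  # Without an intercept, every level gets its own indicator column.
  m <- model.matrix(~ 0 + combined)
  colnames(m) <- levels(combined)
  m
}

# Made-up example data with two 2-level factors.
df <- data.frame(
  x2 = factor(c("a", "b", "a", "b")),
  x3 = factor(c("p", "q", "q", "p"))
)
m <- interaction_columns("x2:x3", df)
print(m)
```

Each row has exactly one non-zero entry, marking the level combination observed in that row. A memory-conscious version would presumably build a sparse matrix instead.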

seanv507 commented 7 years ago

So I believe your great code makes it easier than you think :) .

rhsTerms <- strsplit(deparse(rhs), ' + ', fixed = TRUE)[[1]]
rhsVars <- all.vars(rhs)

....
matrs <- sapply(rhsTerms, function(x) {

i.e. since you are already using formulas to create the model variables, any formula of the form f(x) + f(y) will work.

Two issues I have come across: I now need to handle '~ . + x:y', and I don't understand the terms object or what it is for.

hongooi73 commented 7 years ago

draft implementation: 221ed1b6733d79a07cec63c714b50780890eafa2

hongooi73 commented 7 years ago

There is a question over handling of the dot in formulas like ~ . + a:b or ~ . + sin(x). Should . expand to include main effects that are present in other terms?

How model.matrix/model.frame handles it:

Current implementation will include main effects, but not handle aliasing. This means that ~ . + a:b will output more columns than model.matrix does.
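
The difference in column counts can be illustrated with base R. This sketch uses made-up data and assumes that "not handling aliasing" means emitting one indicator column per factor level and per level combination; that assumption is mine and is not confirmed by the thread.

```r
# Made-up data: two 2-level factors and one numeric column.
df <- data.frame(
  a = factor(c("p", "q", "p", "q")),
  b = factor(c("u", "u", "v", "v")),
  x = c(0.5, 1.5, 2.5, 3.5)
)

# model.matrix expands '.' to a + b + x and, by default, codes factors
# with treatment contrasts, which avoids redundant (aliased) columns.
mm <- model.matrix(~ . + a:b, data = df)
print(colnames(mm))

# Assumed alternative: one column per level of a and b, one for x, and
# one per combination of levels for a:b, with no contrasts applied.
n_full <- nlevels(df$a) + nlevels(df$b) + 1 + nlevels(df$a) * nlevels(df$b)
print(c(model.matrix = ncol(mm), full_indicator = n_full))
```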

hongooi73 commented 7 years ago

@seanv507 would you have any thoughts on this?

seanv507 commented 7 years ago

Unfortunately I am travelling tomorrow and will have infrequent internet access (and I am in between jobs, so not working on this right now). I presume you mean a*b, not a:b? If so, I think that is fine [... for now]. I would be happy to put interaction terms in explicitly (also because the aim is to deal with the memory issues of using model.matrix). Aliasing would be a problem! I personally would not use a*b because of that (i.e. knowing I have to merge the coefficient for a from ~ and for a from a:b).

hongooi73 commented 7 years ago

No, I mean a:b. I can either make it so that . expands to include all variables that are in the formula, or exclude them. Assuming a, b and x are the only variables in the data, the former would be:

~ . + a:b + sin(x) --> a + b + x + a:b + sin(x)

Excluding would be:

~ . + a:b + sin(x) --> a:b + sin(x)

The former seems to be more consistent with the R default. The main issue is that you have to do more work when interpreting interaction coefficients, especially since aliased columns aren't removed. The latter is probably closer to what you want, but less convenient when doing, eg, polynomial regression.

Best of luck with job hunting, if you haven't got one already!
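
The two options can be sketched in base R as follows. The helper `dot_terms` and its `include_used` argument are hypothetical names used only for illustration and are not the glmnetUtils implementation; the data frame assumes that a, b and x are the only columns, as in the example above.

```r
# Hypothetical helper (not glmnetUtils code): list the terms that
# '~ . + ...' would produce under each of the two options.
dot_terms <- function(formula, data, include_used = TRUE) {
  # terms() expands '.' to every column of 'data', as R does by default.
  labs <- attr(terms(formula, data = data), "term.labels")
  if (include_used) return(labs)
  # Otherwise drop main effects for variables named elsewhere in the
  # formula. Caveat: this also drops main effects written explicitly.
  used <- setdiff(all.vars(formula), ".")
  setdiff(labs, used)
}

# Made-up data in which a, b and x are the only variables.
df <- data.frame(a = 1:3, b = 4:6, x = c(0.1, 0.2, 0.3))
fml <- ~ . + a:b + sin(x)
print(dot_terms(fml, df, include_used = TRUE))
print(dot_terms(fml, df, include_used = FALSE))
```

With `include_used = TRUE` the result contains the main effects a, b and x alongside a:b and sin(x); with `include_used = FALSE` only a:b and sin(x) remain.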

seanv507 commented 7 years ago

Personally I would prefer the former. Sean
