hongooi73 / glmnetUtils

Utilities for glmnet

Error on syntactically illegal names #5

Closed Deleetdk closed 8 years ago

Deleetdk commented 8 years ago

Using a formula with syntactically illegal names gives an error, even though this works fine with e.g. lm.

Simple example:

> test_iris = iris
> test_iris$`in` = rnorm(150)
> f = formula(Sepal.Length ~ `in`)
> lm(f, test_iris)

Call:
lm(formula = f, data = test_iris)

Coefficients:
(Intercept)         `in`  
      5.857        0.108 
> glmnet(f, data = test_iris)
Error in parse(text = paste("~", paste(rhsVars, collapse = "+"))) : 
  <text>:1:3: unexpected 'in'
1: ~ in
      ^

Some solutions to this pesky bug:

1) Use valid names. Not always practical.

2) Use backticks. See here.

My use case here is that I want to regress the social status of names on n-grams of length 1-3. So I have tons of variables (about 300), some of which are reserved keywords like `in`. Now I could rename all the n-gram variables to add a prefix like "ngram_", but this is somewhat ugly for the output.

hongooi73 commented 8 years ago

Thanks for reporting this bug (and the one on missing values). Until I fix this, a workaround is to set use.model.frame=TRUE and glmnetUtils will use the standard R method for building model matrices. 300 variables isn't actually that big; you shouldn't see a huge performance drop.
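Concretely, applied to the example above, the workaround would look roughly like this (a sketch assuming glmnetUtils is installed; `use.model.frame` is the argument mentioned above):

```r
library(glmnetUtils)

test_iris <- iris
test_iris$`in` <- rnorm(150)

# use.model.frame = TRUE routes model-matrix construction through R's
# standard model.frame machinery, which copes with backtick-quoted names
mod <- glmnet(Sepal.Length ~ `in` + Petal.Width, data = test_iris,
              use.model.frame = TRUE)
```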

Deleetdk commented 8 years ago

I used an ad hoc solution described in the link above together with my own convenience wrapper. We have some duplicate effort. Check my function here:

https://github.com/Deleetdk/kirkegaard/blob/master/R/modeling.R#L478

It doesn't use the formula interface and it cannot handle interactions in a smart way (one has to create them manually first), but it streamlines the use of LASSO by automatically calling cv.glmnet to find the optimal shrinkage value. It then uses this to find the betas and saves them. It repeats this a number of times, so that one can easily summarize the output. It's necessary to repeat the process because the CV folds are assigned randomly, hence there is a random component.
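The routine described above could be sketched roughly like this (a sketch under stated assumptions, not the actual function linked above; `repeated_lasso` and `runs` are hypothetical names, and it assumes a numeric predictor matrix `x` and numeric response `y`):

```r
library(glmnet)

# Rough sketch of the repeated-CV routine: pick the shrinkage value by
# cross-validation, extract the betas at that value, and repeat to
# average out the randomness of the fold assignment.
repeated_lasso <- function(x, y, runs = 100) {
  betas <- sapply(seq_len(runs), function(i) {
    cv <- cv.glmnet(x, y, alpha = 1)        # alpha = 1 is the LASSO
    as.numeric(coef(cv, s = "lambda.min"))  # betas at the chosen lambda
  })
  rownames(betas) <- c("(Intercept)", colnames(x))
  betas  # one column per run; summarize e.g. with rowMeans(betas)
}
```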

Regarding the size of the data: yes, this is a small-scale project compared to, say, cancer genomics or a GWAS. I have a list of 1900 first names and I'm creating a bunch of features based on them, primarily via n-grams. So e.g. does the name contain `hdk`, yes/no. Since most of these possible features do not appear in the names (e.g. no name begins with `xdh`), I subset to the ones that appear at least n times and throw away the rest (using n = 5, currently). This gives me an analysis problem of the type 1900 x ~300 (n x p). Not suitable for OLS, but not large enough that I've even thought about optimizing for speed. I'm just looking for coding convenience.
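The filtering step can be sketched in base R like this (toy data and threshold for illustration; the helper `ngrams` is a hypothetical name, and the real data uses n = 5 over 1900 names):

```r
# Toy name list standing in for the real data set of 1900 names
names_vec <- c("anna", "hanna", "johan", "hans", "jan")

# All substrings of length k from a string
ngrams <- function(s, k) {
  if (nchar(s) < k) return(character(0))
  substring(s, 1:(nchar(s) - k + 1), k:nchar(s))
}

# Unique n-grams (lengths 1-3) per name, then count how many names
# each feature occurs in
feats  <- lapply(names_vec, function(nm) unique(unlist(lapply(1:3, ngrams, s = nm))))
counts <- table(unlist(feats))

# Keep only features appearing in at least n names (n = 2 for this toy data)
keep <- names(counts[counts >= 2])
```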

The reason I sought your package was that I was looking for a way to use categorical predictors with LASSO. glmnet assumes all the predictors are numeric-continuous as far as I can understand. I've still not found a good way to use my above routine with support for proper categorical predictors. The outcome is numeric-continuous, so that part is easy.
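For what it's worth, the standard way to hand categorical predictors to base glmnet is to expand factors into 0/1 dummy columns yourself with model.matrix (a sketch using iris, where Species is a factor; a formula interface like glmnetUtils's is meant to automate this kind of expansion):

```r
library(glmnet)

df <- iris  # Species is a factor with 3 levels
# model.matrix turns factors into 0/1 dummy columns; drop the intercept
# column, since glmnet fits its own intercept
x <- model.matrix(Sepal.Length ~ ., data = df)[, -1]
y <- df$Sepal.Length

fit <- cv.glmnet(x, y)  # LASSO with dummy-coded Species columns
coef(fit, s = "lambda.min")
```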

Here's a current version of my Rmd:

http://rpubs.com/EmilOWK/228495

hongooi73 commented 8 years ago

fixed in 493081eaa62f59f4e0d054f7b137e047754c2b6a

hongooi73 commented 8 years ago

Both bugs should now be fixed. Drop me a line if you still encounter errors.