umich-cphds / miselect

Variable Selection for Multiply Imputed Data

penalty factor = zero gives error #1

Open marinkakoenis opened 4 years ago

marinkakoenis commented 4 years ago

Dear Creators,

When trying to force a variable into the model during selection by giving it a penalty factor of zero (while leaving the rest of the variables at a penalty factor of 1), the following error occurs:

Error in seq.default(log(lambda.max), log(lambda.max * lambda.min.ratio), : 'from' must be a finite number

With the miselect.df example data, following the vignette, choosing a penalty factor close to zero (1e-9) makes the error go away, but the lambda values explode (to around 1e8). With the penalty factor above 0.001, the lambdas look reasonable (<2).

I have run glmnet with penalty.factor = c(0, rep(1, nvar)) without problems and with reasonable results on real as well as simulated data (55 obs, 312 variables), so I used it the same way for miselect.
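A rough sketch of the setup that triggers the error, based on my reading of the vignette (the miselect.df column names X1-X20/Y, the number of imputations, the uniform weights, and family = "binomial" are illustrative assumptions, not my actual data):

# install.packages(c("miselect", "mice"))
library(miselect)
library(mice)

# Impute the package's example data and pull out the completed datasets
mids <- mice(miselect.df, m = 5, printFlag = FALSE)
dfs  <- lapply(1:5, function(i) complete(mids, action = i))

# Lists of imputed design matrices and responses, one element per imputation
x <- lapply(dfs, function(d) as.matrix(d[, paste0("X", 1:20)]))
y <- lapply(dfs, function(d) d$Y)

weights  <- rep(1, nrow(miselect.df))  # uniform observation weights
adWeight <- rep(1, 20)                 # no adaptive weighting

# Penalty factor of 0 on the first variable to force it into the model,
# analogous to glmnet's penalty.factor = c(0, rep(1, nvar))
pf <- c(0, rep(1, 19))

# This call is where "'from' must be a finite number" appears
fit <- cv.saenet(x, y, pf, adWeight, weights, family = "binomial")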

My questions:

Thanks!

tzimiskes commented 4 years ago

Hello!

It sounds like you have found a bug. I will investigate and get back to you shortly.

tzimiskes commented 4 years ago

Hi Marinka,

I have updated the master branch of the GitHub repository with a possible fix for your problem. You can install the updated code from GitHub in R via

#install.packages("devtools")
devtools::install_github("umich-cphds/miselect")

Note that the above code will not generate the vignette for you. If you want the vignette, you can type

# install.packages("devtools")
devtools::install_github("umich-cphds/miselect", build_opts = c())

instead. It may take a few minutes to generate the vignette.
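Once the build finishes, you can list and open the package's vignettes from within R:

browseVignettes("miselect")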

Test it out and please let me know how it goes. Also, feel free to post here or email me if you have any other issues, comments, or suggestions.

marinkakoenis commented 4 years ago

Hey Alexander,

Thanks for the quick fix, it works so far!

I have a question about the optimal alpha and lambda: would you expect them to be the same for the “original dataset” as for the 10 mice-imputed datasets? For my dataset they seem to differ quite a lot: with alpha = 1, lambda went from 0.25 to <0.10 (with the caveat that I haven't run enough cross validations to have a proper idea of stability).

A general suggestion: the runtime of cv.saenet is really long. I can run 500 cv.glmnet iterations in 10-15 minutes, whereas cv.saenet takes 1 hour for 10 iterations (10 imputed sets/chains of the same dataset, 55 obs, 312 vars). This makes proper repeated cross validation (preferably over a series of simulated datasets) a bit infeasible. Is there room for improvement on computation time?

Thanks!

Best, Marinka

tzimiskes commented 4 years ago

Hi Marinka,

I have a question about the optimal alpha and lambda: would you expect them to be the same for the “original dataset” as for the 10 mice-imputed datasets?

Do you mean comparing the optimal lambda from glmnet using the original data and the optimal lambda from galasso using the imputed data? I don't think these would be the same as the methods are different. If the missing data pattern is simple, I'd expect the optimal models to have similar coefficient estimates. I'll look more into this.

A general suggestion: the runtime of cv.saenet is really long. I can run 500 cv.glmnet iterations in 10-15 minutes, whereas cv.saenet takes 1 hour for 10 iterations (10 imputed sets/chains of the same dataset, 55 obs, 312 vars).

As it currently stands, the methods are slow because they are written in R rather than Fortran like glmnet, but in my experience saenet should be quite a bit faster than galasso. How many alpha values are you cross-validating over?

marinkakoenis commented 4 years ago

Do you mean comparing the optimal lambda from glmnet using the original data and the optimal lambda from galasso using the imputed data? I don't think these would be the same as the methods are different. If the missing data pattern is simple, I'd expect the optimal models to have similar coefficient estimates. I'll look more into this.

Yes, glmnet vs saenet, sorry for being unclear. glmnet showed that alpha = 1 with lambda = 0.25 would give ~5-10 selected variables (looping over alpha = seq(0.1, 1, 0.1), based on 50 repeated cross validations and 100 simulated datasets). The structure of the dataset and the missingness (with mean imputation) did not change that estimate. Hence I'm a bit surprised that with cv.saenet, alpha = 0.1 with lambda = 1.3 gives the lowest 1-SE CVM, but with only 2 selected variables. It does seem that alpha doesn't make much of a difference, and with alpha = 1, lambda needs to be < 0.10 to get 5-10 selected variables (again based on simulated data).

As it currently stands, the methods are slow because they are written in R rather than Fortran like glmnet, but in my experience saenet should be quite a bit faster than galasso. How many alpha values are you cross-validating over?

I want to do a repeated (50 repeats) cross validation while looping over alpha = seq(0.1, 1, 0.1), and do that for 100 different simulated datasets (I don't want to run this on the real data to avoid bias). One repeat (i.e., one dataset over the 10 alphas) takes about an hour with cv.saenet.
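For concreteness, a rough sketch of the workflow I have in mind (x, y, pf, adWeight, and weights would be rebuilt for each simulated dataset along the lines of the vignette; family = "binomial" is just a placeholder here):

alphas <- seq(0.1, 1, 0.1)
nrep   <- 50

# One repeat: cross-validate each alpha on one set of imputed data;
# this is the part that currently takes about an hour
one_repeat <- function(x, y, pf, adWeight, weights) {
  lapply(alphas, function(a)
    cv.saenet(x, y, pf, adWeight, weights, family = "binomial", alpha = a))
}

# 50 repeats per dataset, for 100 simulated datasets, would mean
# 50 * 10 * 100 = 50,000 cv.saenet fits in total
results <- replicate(nrep, one_repeat(x, y, pf, adWeight, weights),
                     simplify = FALSE)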

Thanks again and sorry for spamming this thread.

tzimiskes commented 4 years ago

No worries, I think this is worth investigating further. If it is possible, could you send me (alexrix@umich.edu) one of the simulated datasets (the complete data and a version with missing data would be ideal) so I can run the methods and discuss with my co-authors whether or not the results make sense?

The package is still new, so I'd like to make sure that the code is working correctly; it's possible our tests missed something.