Closed — prateeksasan1 closed this issue 3 years ago
"lambda is quite small": This is the biggest problem. Everything about the algorithm is designed around the idea of starting at a large value of lambda and gradually relaxing the penalization. Trying to fit the model at only one very small value of lambda is always a bad idea, and in your particular situation it likely results in an undefined solution. Try something like setting `lambda.min=0.6` and `nlambda=10` and see whether that works.
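That suggestion as code (a sketch; `X.bm`, `y`, and the `alpha` value are placeholders, not from the thread):

```r
library(biglasso)

## X.bm: a big.matrix attached via bigmemory::attach.big.matrix()
## y: numeric response vector
## Fit a short path of 10 lambda values, stopping at 0.6 * lambda.max,
## instead of jumping straight to one very small lambda.
fit <- biglasso(X.bm, y,
                penalty    = "enet",  # elastic net penalty
                alpha      = 0.5,     # illustrative mixing parameter
                lambda.min = 0.6,     # smallest lambda, as a fraction of lambda.max
                nlambda    = 10)      # number of lambda values on the path
```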
Hi,
We ran the elastic net on a smaller n and saw that the optimal lambda was close to 0.01; that is what we were trying to test. It took almost 34 hours to solve for 0.01 with the full dataset, and we have to repeat this 28 times with the same X but different y. We were hoping to reduce the 34 hours and were wondering whether loading the data into memory would help.
Thanks Prateek Sasan
OK, makes sense.
Yes, loading the data set into memory will make it faster -- reading from memory is much faster than reading from disk. How much faster? Hard to say; it depends on a number of things, but we would be very curious to hear what your experience is. We've not tested biglasso with data sets as large as what you're describing.
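One way to do this (a sketch; the descriptor path is a placeholder, and it assumes the node has enough RAM — `bigmemory::deepcopy()` with no backing file creates an in-memory copy of a filebacked `big.matrix`):

```r
library(bigmemory)

## Attach the filebacked matrix first (descriptor path is a placeholder)
X.fb <- attach.big.matrix("X.desc")

## Copy it into RAM: with no backingfile argument, deepcopy() creates
## an in-memory big.matrix, so subsequent reads avoid disk I/O.
X.mem <- deepcopy(X.fb)
```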
Out of curiosity, is this linear regression or logistic?
Just a few comments here. You can also set `dfmax` to e.g. `20e3` to stop early if there are too many variables entering the model.
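The `dfmax` suggestion as code (a sketch; `X.bm`, `y`, and `alpha` are placeholders):

```r
library(biglasso)

## Stop the lambda path early once more than ~20,000 variables have
## entered the model, rather than fitting all ~205k predictors.
fit <- biglasso(X.bm, y,
                penalty = "enet",
                alpha   = 0.5,   # illustrative mixing parameter
                dfmax   = 20e3)  # early stop: max number of nonzero coefficients
```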
Thank you for the suggestions and sorry for the late response.
@pbreheny It is a linear regression with around 480k samples and around 205k predictors (approx. 750 GB).
We are working on OSC Pitzer, which has one huge-memory node (3 TB): https://www.osc.edu/resources/technical_support/supercomputers/pitzer
It took approx 4 hours to run the elastic net with the following settings.
Thank you for the help. One of the big reasons it was running slowly was that I had not specified `ncores` earlier, so it took a lot of time. The timing dropped by 80% once I specified `ncores`, and by another 25% once the data was read into memory.
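The `ncores` change might look like this (a sketch; `X.bm`, `y`, `alpha`, and the core count are placeholders):

```r
library(biglasso)

## biglasso defaults to ncores = 1, so leaving it unset forfeits the
## parallel speedup reported above. Match ncores to the cores available
## on the node.
fit <- biglasso(X.bm, y,
                penalty = "enet",
                alpha   = 0.5,  # illustrative mixing parameter
                ncores  = 40)   # placeholder core count for the node
```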
It would also be great if the package could be expanded to other families, like the multivariate Gaussian.
Thank you very much for this information. I'm going to close this issue, as I don't see that there are any specific bugs / issues to fix.
Currently, multivariate Gaussian models can be fit in `grpreg`, a separate package, but that doesn't support larger-than-memory data. I agree that this would be nice, but it depends on finding a graduate student to help me with the work :slightly_smiling_face:
Hi,
I have a very large dataset: the X matrix has 480k rows and around 205k cols. I have saved it as a `filebacked.big.matrix` and I am using `attach.big.matrix` to "load" it at the time of performing the elastic net regression. Sadly, this is running really slowly and took almost 2 days for 1 lambda (lambda is quite small). On this, I have 2 questions:
Thanks Prateek Sasan