pbreheny / biglasso

biglasso: Extending Lasso Model Fitting to Big Data in R
http://pbreheny.github.io/biglasso/

Dealing with Extremely large datasets (750 GB) #31

Closed · prateeksasan1 closed this issue 3 years ago

prateeksasan1 commented 3 years ago

Hi,

I have a very large dataset. The X matrix has 480k rows and around 205k columns. I have saved it as a filebacked.big.matrix and I am using attach.big.matrix to "load" it when performing the elastic net regression (a sketch of this setup is included below the questions). Sadly, this is running really slowly and took almost 2 days for a single lambda (the lambda is quite small). On this, I have a few questions:

  1. Will it be faster if I load the data into memory? I can ask for up to 2-3 TB on the server.
  2. If yes, how can I load the filebacked matrix into memory? I tried read.big.matrix, deepcopy, etc., but they don't seem to work.
  3. Any other suggestions for dealing with such a large dataset?
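For reference, a minimal sketch of the setup described above, assuming bigmemory's filebacked.big.matrix / attach.big.matrix workflow; the file names are placeholders, while the dimensions match the figures quoted above:

```r
library(bigmemory)

# Create the filebacked matrix once (file names are placeholders).
X <- filebacked.big.matrix(nrow = 480000, ncol = 205000, type = "double",
                           backingfile = "X.bin", descriptorfile = "X.desc")

# ... fill X with the raw data ...

# In later sessions, re-attach the matrix from its descriptor file
# instead of re-reading the raw data.
X <- attach.big.matrix("X.desc")
```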

Thanks,
Prateek Sasan

pbreheny commented 3 years ago

"lambda is quite small": This is the biggest problem. Everything about the algorithm is designed around the idea of starting at a large value of lambda and gradually relaxing the penalization. Trying to fit a model to only one, very small value of lambda is always a bad idea and in your particular situation, likely results in an undefined solution. Try something like setting lambda.min=0.6 and nlambda=10 and see whether that works.

prateeksasan1 commented 3 years ago

Hi,

We ran the elastic net on a smaller n and saw that the optimal lambda was close to 0.01; that is what we were trying to test. It took almost 34 hours to solve for lambda = 0.01 with the full dataset, and we have to repeat this 28 times with the same X but different y. We were hoping to reduce the 34 hours and were wondering whether loading the data into memory would help.

Thanks,
Prateek Sasan

pbreheny commented 3 years ago

OK, makes sense.

Yes, loading the data set into memory will make it faster -- reading from memory is much faster than reading from disk. How much faster? Hard to say; it depends on a number of things, but we would be very curious to hear what your experience is. We've not tested biglasso with data sets as large as what you're describing.
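One possible way to do the in-memory load, sketched here as an assumption rather than an official recommendation: bigmemory::deepcopy() with no backingfile returns an ordinary in-memory big.matrix, which biglasso() can use directly, provided the node has enough RAM for the roughly 750 GB copy. The descriptor file name is a placeholder:

```r
library(bigmemory)

X_disk <- attach.big.matrix("X.desc")  # filebacked matrix (placeholder descriptor)
X_ram  <- deepcopy(X_disk)             # no backingfile => the copy lives in RAM
```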

Out of curiosity, is this linear regression or logistic?

privefl commented 3 years ago

Just a few comments here.

prateeksasan1 commented 3 years ago

Hi,

Thank you for the suggestions and sorry for the late response.

@pbreheny It is a linear regression with around 480k samples and around 205k predictors (approx. 750 GB).

We are working on OSC Pitzer, which has one huge-memory node (3 TB): https://www.osc.edu/resources/technical_support/supercomputers/pitzer

It took approximately 4 hours to run the elastic net with the following settings:

  1. Dataset read into memory
  2. ncores set to the maximum available
  3. nlambda = 10
  4. lambda.min = 0.01

Thank you for the help. One of the big reasons it was running slowly was that I had not specified ncores earlier, which cost a lot of time. The runtime dropped by about 80% once I specified ncores, and by another 25% once the data was read into memory.
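Putting those settings together, a hedged sketch of what the final run might look like; the descriptor file, response vector y, and mixing parameter alpha are placeholders rather than details taken from the thread:

```r
library(biglasso)
library(bigmemory)

X <- deepcopy(attach.big.matrix("X.desc"))       # in-memory copy; the node has 3 TB of RAM
fit <- biglasso(X, y,
                family     = "gaussian",
                penalty    = "enet",
                alpha      = 0.5,                      # assumed mixing parameter
                nlambda    = 10,
                lambda.min = 0.01,
                ncores     = parallel::detectCores())  # use all available cores
```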

It would also be great if the package could be extended to other families, such as multivariate Gaussian.

pbreheny commented 3 years ago

Thank you very much for this information. I'm going to close this issue, as I don't see that there are any specific bugs / issues to fix.

Currently, multivariate Gaussian models can be fit in grpreg, a separate package, but that doesn't support data-larger-than-memory cases. I agree this would be nice, but it depends on finding a graduate student to help me with the work :slightly_smiling_face: