as.big.matrix() pegs spinny little disk for hours despite terabytes of RAM available

kaneplusplus / bigmemory


as.big.matrix() pegs spinny little disk for hours despite terabytes of RAM available #97

Closed GabeAl closed 4 years ago

GabeAl commented 5 years ago

as.big.matrix(M) on a large matrix M starts spinning my disk. This is normal, you might say. Sadly, it is quite undesirable in my case, and I am wondering whether I can turn off the paging feature.

I have ~200 GB of highly fragmented free disk space but 1.5 TB of nearly empty memory. Even so, this function spends over an hour trying to map my in-RAM matrix to my hard drive. For context, I routinely work with matrices taking up hundreds of GB of RAM and never run into issues with base R or sparse matrices (dgCMatrix objects).

I need to use "big.matrix" because the biglasso package requires all data to be cast to this format.

Is there a way to simply turn off paging/files/disk I/O? I have much more RAM than disk space, and RAM is much faster than disk. :)

Update: setting "shared=F" does not fix my issue, because it causes an error in DescribeBigMatrix(x), which is used by the downstream package (biglasso).

privefl commented 5 years ago

First, try to be a bit nicer when you ask for help using open-source software.

Then, try to use as.big.matrix(M, shared = FALSE). From what I remember, this is the closest thing to an ordinary in-RAM matrix.
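For readers following along, here is a minimal sketch of that suggestion. It assumes bigmemory is installed; the small matrix `M` is a hypothetical stand-in for the real data, and the checks only confirm what kind of object comes back.

```r
library(bigmemory)

# Small stand-in for the real matrix M (hypothetical data).
M <- matrix(rnorm(20), nrow = 4, ncol = 5)

# Request a non-shared big.matrix, as suggested above.
x <- as.big.matrix(M, shared = FALSE)

# Inspect the result: is it shared, and is it backed by a file?
shared_flag     <- is.shared(x)
filebacked_flag <- is.filebacked(x)
print(c(shared = shared_flag, filebacked = filebacked_flag))

# The values should round-trip unchanged.
same_values <- isTRUE(all.equal(x[, ], M))
print(same_values)
```

If the default should apply to every call in a session, `options(bigmemory.default.shared = FALSE)` sets the value that the `shared` argument falls back to.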

Third, if that does not work, please give more information (e.g. size and type of data).

GabeAl commented 5 years ago

Thanks @privefl. I'll give it a try and let you know how it goes. I'm glad an option exists. At first glance through the docs, "shared" didn't stand out to me as having anything to do with my issue, but I see now that this is memory-mapping lingo.

Thanks also for being blunt about my tone. I had learned to dramatize my issues online and by phone to clarify my situation and avoid being bounced. Perhaps it is time I unlearned this (mis-)behavior in the era of open source, where the currency is increasingly academic and my audience far more empowered. 👍

GabeAl commented 5 years ago

Unfortunately, this doesn't work for my use case. I get the following error: Error in DescribeBigMatrix(x) : you can't describe a non-shared big.matrix.

I'm using this in the context of the biglasso package, so I have not called this "DescribeBigMatrix" function directly. I observe paging behavior with a numeric matrix as small as 100 x 5,000,000.
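To put that size in perspective, here is a quick back-of-the-envelope calculation. It assumes the matrix is stored as double-precision values (8 bytes each), which is the default for an R numeric matrix; the variable names are illustrative only.

```r
# Dimensions reported above.
n_rows <- 100
n_cols <- 5e6

# An R numeric (double) value occupies 8 bytes.
bytes_per_double <- 8

total_bytes <- n_rows * n_cols * bytes_per_double
size_gb  <- total_bytes / 1e9    # decimal gigabytes
size_gib <- total_bytes / 2^30   # binary gibibytes

print(size_gb)              # 4
print(round(size_gib, 2))   # 3.73
```

So the smallest matrix that shows the behavior is about 4 GB, far below both the free disk space and the available RAM described in this thread.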

privefl commented 5 years ago

Yes, if using shared = FALSE, then you can't use parallelism in {biglasso}.
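The error reported above can be reproduced outside biglasso with a small sketch. It assumes bigmemory is installed and that shared memory is available on the platform; `describe_ok` is a hypothetical helper written for this illustration, not part of either package.

```r
library(bigmemory)

M <- matrix(0, nrow = 2, ncol = 2)

local_x  <- as.big.matrix(M, shared = FALSE)  # non-shared
shared_x <- as.big.matrix(M, shared = TRUE)   # shared

# Hypothetical helper: TRUE if describe() succeeds, FALSE if it errors.
describe_ok <- function(bm) {
  tryCatch({
    describe(bm)
    TRUE
  }, error = function(e) {
    message("describe() failed: ", conditionMessage(e))
    FALSE
  })
}

local_ok  <- describe_ok(local_x)
shared_ok <- describe_ok(shared_x)
print(c(local = local_ok, shared = shared_ok))
```

A descriptor is what lets another R process attach to the same matrix, so code that calls describe() needs a shared or file-backed big.matrix.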

An alternative would be to try to specify a backingfile.
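A minimal sketch of what specifying a backing file might look like, assuming bigmemory is installed. The directory and the file names `M.bin` and `M.desc` are hypothetical choices for illustration; in practice the path would point wherever the backing file should live.

```r
library(bigmemory)

# Small stand-in for the real matrix M (hypothetical data).
M <- matrix(rnorm(20), nrow = 4, ncol = 5)

# Hypothetical location: a fresh directory so no old backing file is in the way.
backing_dir <- tempfile("bigmem_backing_")
dir.create(backing_dir)

# Create a file-backed big.matrix from the in-RAM matrix.
x <- as.big.matrix(M,
                   backingfile    = "M.bin",
                   backingpath    = backing_dir,
                   descriptorfile = "M.desc")

filebacked_flag <- is.filebacked(x)
backing_exists  <- file.exists(file.path(backing_dir, "M.bin"))
print(c(filebacked = filebacked_flag, backing_exists = backing_exists))

# The data are stored in the backing file, so its size follows the matrix size.
print(file.size(file.path(backing_dir, "M.bin")))
```

Because the data themselves are written to the backing file, its location determines which device absorbs the I/O.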

GabeAl commented 5 years ago

Hm, interesting idea. Could you elaborate a bit on this? I see documentation about loading data from a file (where that file would naturally become the backing file), but could I specify a backing file of arbitrary size (like 0 bytes) for an in-RAM matrix, which would make bigmemory fall back on a shared-memory representation?