kaneplusplus / bigmemory


Support for Long Vectors? #91

Open jonpeake opened 5 years ago

jonpeake commented 5 years ago

I'm working with data outside the realm of 2^31-1 elements, so I'm having issues converting matrices to big.matrix format using as.big.matrix (due to the inability to pass long vectors to .C or .Fortran, I'm assuming, but I'm not savvy enough to find where this occurs in the source code). I've found a work-around in creating a big.matrix and then assigning data column-wise from my original matrix, but this seems inefficient. I was wondering if there is a way to allow for conversion of matrices >2^31-1 elements, maybe using the dotCall64 package as a dependency? I've used this package in editing R base functions that call .C and .Fortran and it seems to work well.
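For reference, a minimal sketch of the column-wise workaround described above (toy sizes, and only my reading of the workaround, not code from the package itself):

library(bigmemory)

# Pre-create the big.matrix, then copy one column at a time so that no
# single assignment moves more than 2^31 - 1 elements at once.
x <- matrix(rnorm(10), 5, 2)   # stand-in for the real (much larger) data
X <- big.matrix(nrow(x), ncol(x), type = "double")
for (j in seq_len(ncol(x))) X[, j] <- x[, j]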

privefl commented 5 years ago

I'm able to run

x <- matrix(1L, 2^16, 2^16)
X <- bigmemory::as.big.matrix(x)
jonpeake commented 5 years ago

> I'm able to run
>
> x <- matrix(1L, 2^16, 2^16)
> X <- bigmemory::as.big.matrix(x)

I'm trying to do it for a file-backed big.matrix. If I try to do it as a non-filebacked big.matrix, it just crashes my OS (running Ubuntu 18.10).
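For context, the file-backed variant meant here would look something like this (argument names taken from as.big.matrix's documented signature; file names are just placeholders):

x <- matrix(1L, 2^16, 2^16)
X <- bigmemory::as.big.matrix(x, backingfile = "x.bk",
                              descriptorfile = "x.desc",
                              backingpath = tempdir())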

kaneplusplus commented 5 years ago

Does it really crash the operating system or does it crash R? If the latter, is there an error message?

jonpeake commented 5 years ago

I actually ended up clean installing Ubuntu 18.04 because I was having bugs with 18.10. After re-installing, I am able to use non-filebacked as.big.matrix. However, this doesn't solve my original problem of trying to use the filebacked as.big.matrix (which is what I really need). I still get the long-vector error, stemming from the SetMatrixElements sub-function.

privefl commented 5 years ago

What are the dimensions of your data? And it does not work with filebacked but works with non-filebacked?

jonpeake commented 5 years ago

My data are 50490 x 50490, of type double. And correct, it works with non-filebacked but does not work with filebacked.
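For scale (a quick back-of-the-envelope check, not from the thread): that matrix has more elements than a single R integer index can reach, and it is sizable even as a dense double matrix.

50490^2                  # 2,549,240,100 elements
.Machine$integer.max     # 2,147,483,647 = 2^31 - 1
50490^2 * 8 / 1024^3     # roughly 19 GB stored as doubles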

privefl commented 5 years ago

Please try:

# devtools::install_github("privefl/bigstatsr")
X <- bigstatsr::as_FBM(x, backingfile = "data/test")
(desc <- sub("\\.bk$", ".desc", X$backingfile))
dput(X$bm.desc(), desc)

library(bigmemory)
X.bm <- attach.big.matrix(desc)
jonpeake commented 5 years ago

I tried that and it worked, but I would also like to be able to use the biganalytics and bigalgebra (cdeterman fork) functions, which also run into the problem of not supporting long vectors. How did you get around the long-vector problem in bigstatsr? I'm wondering if your method can be easily ported to bigmemory (specifically the SetMatrixElements function, where the error ultimately occurs; see the code below for the error message).

>Y=X.bm*X.bm
Error in SetMatrixElements(x@address, as.double(j), as.double(i), as.double(value)) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:519
Error during wrapup: long vectors not supported yet: ../../src/include/Rinlinedfuns.h:519
privefl commented 5 years ago

I think the main difference is that I'm relying heavily on Rcpp to link R and C++. This keeps things really simple for me. Yet, at the time bigmemory was developed, Rcpp was not as mature as it is now, I guess.

jonpeake commented 5 years ago

@cdeterman @kaneplusplus Any chance you could look into using Rcpp for bigmemory to support long vectors?

cdeterman commented 5 years ago

If I recall, the main limitation with long vectors had to do with the BLAS backends. Maybe certain BLAS support this and the installation can be conditional upon that? The C++ should be relatively straightforward to write but the BLAS backends I believe are the main factor. @kaneplusplus anything to confirm in that regard?

kaneplusplus commented 5 years ago

@cdeterman I think that's correct. At one point we linked bigmemory to the 64-bit API of the MKL in bigalgebra, but since each of the BLAS setups was slightly different, it was difficult to do in general, and we weren't seeing a lot of interest at the time.

cdeterman commented 5 years ago

Perhaps that can be the approach then. If we can set up some sort of configuration at compile time to detect which BLAS backend is in use, we could have the C++ conditionally (e.g. using an ifdef) support long vectors, and otherwise have the R side throw an error stating that the BLAS backend does not support them. Not sure exactly how at the moment, but perhaps some food for thought.
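A rough sketch of the R-side half of that idea (the function name is hypothetical, and relying on extSoftVersion() reporting the linked BLAS is my assumption, not existing bigmemory code):

# Hypothetical guard: refuse long-vector operations unless the linked BLAS
# looks like a 64-bit-integer (ILP64) build such as MKL's.
assert_long_vector_blas <- function() {
  blas <- tryCatch(extSoftVersion()[["BLAS"]], error = function(e) "")
  if (!grepl("mkl", blas, ignore.case = TRUE))
    stop("long vectors are not supported with the current BLAS backend: ", blas)
  invisible(TRUE)
}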

jonpeake commented 5 years ago

Just came across a similar but different bug as well. When subsetting by a vector (i.e., I have a vector of indices that I want to use to set a value or vector of values in a big.matrix), the SetIndivVectorElements.bm sub-function coerces the index vector with as.integer. This causes a bug where indices are coerced to NA if the original big.matrix has more than 2^31-1 elements, since that is the maximum value of an integer type in R. I found a workaround by calling the bigmemory:::SetIndivVectorMatrixElements function directly, without the as.integer in my call, since for my purposes all of my indices are "integers" in the broader sense of the word. Another workaround could be to use the bit64 package if you still want to coerce to integers.
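A quick illustration of why that coercion bites (this is just base R behavior):

as.integer(.Machine$integer.max)       # 2147483647, the 2^31 - 1 ceiling
as.integer(.Machine$integer.max + 1)   # NA, with an "NAs introduced by coercion to integer range" warning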

privefl commented 5 years ago

X <- big.matrix(3, 3); X[] <- 1:9

ktoij <- function(k, X) {
  k <- k - 1
  n <- nrow(X)
  cbind(row = k %% n, col = k %/% n) + 1
}

vec <- c(2, 4, 9)

(ind <- ktoij(vec, X))
#      row col
# [1,]   2   1
# [2,]   1   2
# [3,]   3   3

X[ind] <- 0

X[]
#      [,1] [,2] [,3]
# [1,]    1    0    7
# [2,]    0    5    8
# [3,]    3    6    0

jonpeake commented 5 years ago

I tried initially to use the two-column approach, but unfortunately the which function also runs into the dreaded 2^31-1 problem. It seems that although base R in general supports long vectors, a lot of the base functions have not been updated to reflect this support. I ended up doing a workaround where I coerce my logical index matrix into a one-column big.matrix, then use the mwhich function provided in bigmemory to get the element-number-based index vector.
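A toy-sized sketch of that mwhich workaround as I understand it (the real case involves a matrix too large to pull into RAM, which this tiny example obviously sidesteps):

library(bigmemory)

# A small matrix with two nonzero entries standing in for the real data.
X <- big.matrix(4, 3, type = "double", init = 0)
X[2, 2] <- 5
X[4, 3] <- 7

# Flatten the logical condition into a one-column big.matrix of 0/1.
flat <- big.matrix(nrow(X) * ncol(X), 1, type = "double")
flat[, 1] <- as.numeric(X[] != 0)

# Per the workaround above, mwhich() then yields the element-number index
# vector for the matching entries.
idx <- mwhich(flat, 1, 1, "eq")
idx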

prateeksasan1 commented 4 years ago

Hi,

I am facing the same problem. I am getting the following error.

Error in SetMatrixElements(x@address, as.double(j), as.double(i), as.double(value)) : long vectors not supported yet: ../../src/include/Rinlinedfuns.h:535

Is there a resolution to this?

Thanks

kaneplusplus commented 4 years ago

Can you tell us which version of R you are using?

prateeksasan1 commented 4 years ago

R/4.0.2

It's on my university's server.

kaneplusplus commented 4 years ago

Thanks for the extra information. Can you check the values you are sending to the assignment where this is happening? The error looks like it's coming from R rather than from bigmemory's C code. Can you call as.double on the values? My hunch is that an easy fix would be to break the assignment into a few smaller assignments, but let's see if we can do a better job of tracking down the problem.

elenabernabeu commented 2 years ago

Hi,

Following on from other people in this thread, we are also running into the long-vector issue when using as.big.matrix directly on our data. Specifically, we are getting the same error as @prateeksasan1:

Error in SetMatrixElements(x@address, as.double(j), as.double(i), as.double(value)) : long vectors not supported yet: ../../src/include/Rinlinedfuns.h:535

Some background info:

Wondering if you had any insights on this?

Thanks!

scottgigante-immunai commented 1 year ago

+1 to the need to store big matrices in memory with more than 2**32 entries. Is this planned at any point?

kaneplusplus commented 1 year ago

So, bigmemory supports more than 2**32 entries. It looks like as.big.matrix() doesn't because of R. My guess is that if you pre-create the big.matrix and copy your in-memory matrix to the big.matrix object in pieces, it will work fine.

scottgigante-immunai commented 1 year ago

Would you be able to build this workaround into as.big.matrix?

kaneplusplus commented 1 year ago

bigmemory is mostly in maintenance mode these days. I haven't had a lot of time to devote to it. I would happily take a pull request.