privefl / bigstatsr

R package for statistical tools with big matrices stored on disk.
https://privefl.github.io/bigstatsr/

How to cbind/rbind two big matrices that have the same dimension #176

Closed minhnd212 closed 2 months ago

minhnd212 commented 7 months ago

Hi, I have two big matrices, both 30 x 2^29 (30 rows, 2^29 columns), say bm1 and bm2. I would like to "cbind" the two matrices, i.e. place the content of bm2 to the right of bm1, which creates a new big matrix that is 30 x 2^30.

I attempted to do that with the following code:

Step 1: Expand bm1 by adding x columns where x is the current number of columns of bm1

bm1$add_columns(bm1$ncol)

Step 2: Place the content of bm2 to the right half of the expanded bm1

bm1[,(bm1$ncol/2 + 1):bm1$ncol] <- bm2[]

I received the "cannot allocate vector of size 60.0 GB" error. Upon further examination, I understand that it is because these two objects, bm1[,(bm1$ncol/2 + 1):bm1$ncol] and bm2[], which are both 30 x 2^29, cannot be materialized in memory. I also understand that when working with FBM objects, it is generally not a good idea to work with the in-memory matrix representations of these objects directly. I have looked at the list of available functions in the bigstatsr package, but I couldn't find one that seems to solve my problem, which is to place the content of one big matrix inside another big matrix.

Could you give me a suggestion on how I can go about solving this problem? Thank you.

privefl commented 7 months ago

Please handle your open issues before opening too many new ones.

For this particular issue, I would just create the new FBM and fill it with big_apply(). If running time is a problem, I would implement this in Rcpp.
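
A minimal sketch of that suggestion for the cbind case in this thread (assuming bm1 and bm2 are double FBMs with the same number of rows; merged and its temporary backingfile are placeholder names, and the sketch is untested at this scale):

library(bigstatsr)

## new FBM with room for the columns of both matrices
## (default type is "double"; pass type = ... if bm1/bm2 use another type)
merged <- FBM(nrow(bm1), ncol(bm1) + ncol(bm2), backingfile = tempfile())

## copy bm1 into the left half, one block of columns at a time
big_apply(bm1, function(X, ind, res) {
  res[, ind] <- X[, ind, drop = FALSE]
  NULL
}, res = merged)

## copy bm2 into the right half, shifting the column indices by ncol(bm1)
big_apply(bm2, function(X, ind, res, shift) {
  res[, ind + shift] <- X[, ind, drop = FALSE]
  NULL
}, res = merged, shift = ncol(bm1))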

minhnd212 commented 7 months ago

I apologize for the late response. I have been trying out some ideas based on your suggestion, which took a lot of time, but I don't think I got it right. I tried to bypass the "cannot allocate vector of size 60.0 GB" error by, instead of filling the right half of bm1 with bm2 all at once, filling each column of bm1 with the corresponding column of bm2 in a for loop (both with and without big_apply()). It ran for 24 hours and then the computer crashed.

So, I don't think the way I approached big_apply(), by writing the a.FUN with a for loop inside, is correct. But I have not been able to find a creative (or correct) way of using big_apply() for what I am trying to do yet.

If it is not too much trouble, could you give me some more details on how you would write the big_apply() call in this case? Thank you.

Yes, once this step is fixed, i.e. once I can place the content of one big matrix inside another, my next step is to implement this in Rcpp to reduce the computation time.

privefl commented 7 months ago

Please share what you've tried with big_apply(), and we'll go from there.

privefl commented 6 months ago

Any update on this?

RdeBiotec commented 5 months ago

If it's of use, I did this for "rbind":

m <- ncol(list_FBMs[[1]])  ## assuming that all have the same number of columns
n <- sum(sapply(list_FBMs, nrow))

merged_bm <- FBM(n, m, backingfile = file.path(filepath, "Merged_bm"),
                 is_read_only = FALSE)

offset <- 0
for (i in seq_along(list_FBMs)) {
  ## copy the rows of the i-th FBM into the corresponding rows of merged_bm
  offset2 <- big_apply(X = list_FBMs[[i]], function(X, ind, offset, Xm) {
    Xm[ind + offset, ] <- X[ind, , drop = FALSE]
    ind[length(ind)]
  }, a.combine = "c",
     ncores = if (nrow(list_FBMs[[i]]) > 10000) nb_cores() else 1,
     ind = rows_along(list_FBMs[[i]]), offset = offset, Xm = merged_bm)
  offset <- offset + offset2[length(offset2)]
}

But even if I wanted to (it would be very useful), I have no idea how to do this in Rcpp.

privefl commented 5 months ago

Can you remind me what the issue is here? Your code works, right? I think you can optimize it a bit by doing something like this:

all_nrow <- sapply(list_FBMs, nrow)
offsets <- c(0, cumsum(head(all_nrow, -1)))

Then I would put the for-loop inside big_apply() so that you maximize the accesses to the same columns, and also the parallelization.
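
A sketch of that suggestion, reusing list_FBMs, merged_bm and the offsets above (only a sketch, not tested): the loop over the source FBMs moves inside the a.FUN, so the blocks are split over the columns and each block of columns of merged_bm is written only once.

big_apply(merged_bm, function(X, ind, list_fbm, offsets) {
  ## for each block of columns, copy the matching columns of every source FBM
  ## into its band of rows in the merged FBM
  for (i in seq_along(list_fbm)) {
    X[offsets[i] + rows_along(list_fbm[[i]]), ind] <-
      list_fbm[[i]][, ind, drop = FALSE]
  }
  NULL
}, list_fbm = list_FBMs, offsets = offsets, ncores = nb_cores())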

RdeBiotec commented 5 months ago

Oh, I just wanted to contribute to the discussion (how to do cbind/rbind in FBMs).

My code works, yes. It's kind of slow with big FBMs, so I will try your proposal, thanks!

privefl commented 5 months ago

And you (almost always) want to split the blocks over the columns, not the rows; the data of an FBM is stored column by column on disk, so accessing blocks of columns is sequential.

privefl commented 5 months ago

If you don't want to manipulate offsets, you can even do something like this:

big_apply(merged_bm, function(X, ind, list_fbm) {
  X[, ind] <- do.call("rbind", lapply(list_fbm, function(fbm) fbm[, ind, drop = FALSE]))
  NULL
}, list_fbm = list_FBMs, a.combine  = "c", ncores = nb_cores())

RdeBiotec commented 4 months ago

It works much, much faster. Thanks a lot.

privefl commented 4 months ago

Should this example be added to the vignette on big_apply()? If so, does someone want to try a pull request?

dramanica commented 4 months ago

Is appending the backing files cheating? Admittedly, not an elegant big* solution, but it's two lines and pretty fast...

library(bigstatsr)
X1 <- as_FBM(matrix(1:4, 2), backingfile = tempfile())
X2 <- as_FBM(matrix(5:8, 2), backingfile = tempfile())
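# note (added for clarity): the backing file of an FBM stores the data column
# by column, so appending X2's file to X1's file effectively cbinds the two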
file.append(X1$backingfile, X2$backingfile)
# and amend the new number of columns
X1$ncol <- X1$ncol + X2$ncol
X1[]

privefl commented 4 months ago

This hacky solution might work, but it is much more restrictive: it only works for cbind, needs the same types, overwrites X1, and needs to resave the RDS if one exists.
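
For that last caveat, if X1 had previously been saved alongside its backing file (e.g. with X1$save()), a hedged guess at the extra step would be to re-save it so the stored object reflects the new dimensions:

# re-save the FBM object so the RDS matches the appended backing file
X1$save()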