privefl / bigstatsr

R package for statistical tools with big matrices stored on disk.
https://privefl.github.io/bigstatsr/

fast method to write sparse matrix to file #23

Closed: ChiWPak closed this issue 6 years ago

ChiWPak commented 6 years ago

bigstatsr is super convenient if a large matrix is already on file, but this is often not the case for me. I typically want to use bigstatsr after I've created a larger-than-memory matrix in R, which is usually a very sparse matrix (>95% sparsity). Writing to disk is a bottleneck because I have to iterate through the sparse matrix in row chunks (using lapply or equivalent), convert each chunk to a dense matrix, and write it to file with append = TRUE, as sketched below. A faster method for doing this would be really helpful. Thanks!
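
For reference, the kind of loop I currently use looks roughly like this (a rough sketch; the chunk size, the write_sparse_by_chunks helper, and data.table::fwrite are just what I happen to use, nothing specific to bigstatsr):

library(Matrix)
library(data.table)

# write a sparse matrix to a flat file, one dense row-chunk at a time
write_sparse_by_chunks <- function(spMat, file, chunk_size = 10e3) {
  starts <- seq(1, nrow(spMat), by = chunk_size)
  for (start in starts) {
    end <- min(start + chunk_size - 1, nrow(spMat))
    dense_chunk <- as.matrix(spMat[start:end, , drop = FALSE])  # densify this chunk only
    fwrite(as.data.table(dense_chunk), file,
           append = (start > 1), col.names = FALSE)
  }
}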

privefl commented 6 years ago

Basically, if I understand correctly, you would like an efficient way to write from a sparse (column-oriented?) matrix to an FBM on disk? What is the size of your data?

I wonder if big_copy (or even big_apply) already works for this. I could implement an efficient solution; I think it is a problem I already tried to solve when I worked with {bigmemory}.

ChiWPak commented 6 years ago

> Basically, if I understand correctly, you would like an efficient way to write from a sparse (column-oriented?) matrix to an FBM on disk? What is the size of your data?

Yes! The matrices I come across are usually around 1M x 20-50K. It'd be great if the solution were general enough to deal with different sparse matrix classes, such as those used in text analysis (quanteda's dfm or tm's DocumentTermMatrix); see the sketch below for how I currently normalize them.
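
For what it's worth, this is roughly how I normalize those classes to a common dgCMatrix first (a sketch; to_dgCMatrix is my own helper, not from any package):

library(Matrix)

to_dgCMatrix <- function(m) {
  if (inherits(m, "simple_triplet_matrix")) {
    # tm's DocumentTermMatrix stores i/j/v triplets
    sparseMatrix(i = m$i, j = m$j, x = m$v, dims = dim(m))
  } else {
    # quanteda's dfm extends Matrix's sparse classes, so direct coercion works for me
    as(m, "dgCMatrix")
  }
}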

privefl commented 6 years ago

Basically, you could do it directly in R:

# Packages
library(Matrix)
library(bigstatsr)

# Data
spMat <- sparseMatrix(i = integer(), j = integer(), dims = c(1e6, 2e3))
N <- 1e8
x <- runif(N)
i <- sample(nrow(spMat), N, replace = TRUE)
j <- sample(ncol(spMat), N, replace = TRUE)
spMat[cbind(i, j)] <- x

# Solutions
system.time(
  fbm <- FBM(nrow(spMat), ncol(spMat), init = 0)
) # 13 sec
file.size(fbm$backingfile) / 1024^3  # 15 GB

# All at once
system.time({
  ind_nozero <- which(spMat != 0, arr.ind = TRUE)
  fbm[ind_nozero] <- spMat[ind_nozero]
}) # 37 sec

# By blocks to use less memory
system.time(
  big_apply(fbm, a.FUN = function(X, ind, spMat) {
    offset <- min(ind) - 1
    ind_nozero <- which(spMat[, ind] != 0, arr.ind = TRUE)
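    # shift block-relative column indices back to columns of the full matrix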
    ind_nozero[, 2] <- ind_nozero[, 2] + offset
    X[ind_nozero] <- spMat[ind_nozero]
    NULL
  }, a.combine = 'c', spMat = spMat, block.size = 1e3)
) # 40 sec

Here, I tested with only 2000 columns. The backing file is already 15 GB on disk, and it takes 13 seconds just to initialize all values to 0. It then takes about 40 seconds to transfer the non-zero values to the FBM.

ChiWPak commented 6 years ago

When I try

fbm <- FBM(nrow(dtmat_tfidf), ncol(dtmat_tfidf), init = 0)
# Error in getXPtrFBM(.self$backingfile, .self$nrow, .self$ncol, .self$type) :
  # Invalid argument

nrow(dtmat_tfidf)
# [1] 1110015
ncol(dtmat_tfidf)
# [1] 20473

I get an error. dtmat_tfidf is a sparse document-feature matrix (dfm) from quanteda:

dtmat_tfidf
# Document-feature matrix of: 1,110,015 documents, 20,473 features (99.8% sparse).
class(dtmat_tfidf)
# [1] "dfm"
# attr(,"package")
# [1] "quanteda"

Doing the same with spMat

spMat <- sparseMatrix(i = integer(), j = integer(), dims = c(1e6, 2e3))
fbm <- FBM(nrow(spMat), ncol(spMat), init = 0)

freezes my laptop (16 GB RAM).

privefl commented 6 years ago

Weird. Windows?

ChiWPak commented 6 years ago

Yeah, Windows... technically Ubuntu on Windows 10:

Distributor ID: Ubuntu
Description:    Ubuntu 16.04.4 LTS
Release:        16.04
Codename:       xenial

privefl commented 6 years ago

Can you try using a filebacked.big.matrix from package {bigmemory}?
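
Something along these lines, with the same dimensions as the example spMat above (a sketch; the file names are placeholders):

library(bigmemory)
bm <- filebacked.big.matrix(nrow = 1e6, ncol = 2e3, type = "double",
                            init = 0,
                            backingfile = "test.bin",
                            descriptorfile = "test.desc")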

ChiWPak commented 6 years ago

Sorry, could you be more explicit about what you'd like me to test?

privefl commented 6 years ago

Just tested with two computers:

It doesn't freeze on Linux, but it does on Windows. The same happens with package {bigmemory}.

Not sure how to help you here.

What is the application for which you need to write a 99.8% sparse matrix to disk?

ChiWPak commented 6 years ago

> It doesn't freeze on Linux, but it does on Windows.

Hmm... that's not good news. I want to run big_SVD() on document-feature matrices, to see whether documents grouped by a meta-keyword also cluster together based on their actual word content. After a very long time, I was able to write the matrix out to a file: a whopping 46 GB.

When I try to read that file with big_read(), I still get the following error:

Error in getXPtrFBM(.self$backingfile, .self$nrow, .self$ncol, .self$type) :
  Invalid argument

Any idea what the error is related to?

privefl commented 6 years ago

big_read() is meant to read text files, and is not very efficient.

Here, if you want a partial SVD, you should really use package {RSpectra} directly: svds(spMat, k = 10)
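
Something like this, as a sketch (assuming your dfm can be coerced to a dgCMatrix):

library(Matrix)
library(RSpectra)

spMat <- as(dtmat_tfidf, "dgCMatrix")  # coerce the quanteda dfm
svd10 <- svds(spMat, k = 10)           # 10 leading singular triplets, computed on the sparse matrix
str(svd10$u)  # left singular vectors: one row per document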

ChiWPak commented 6 years ago

Thanks! That worked well. Appreciate the walkthrough.