ChiWPak closed this issue 6 years ago
Basically, if I understand correctly, you would like an efficient way to write from a sparse (column-oriented?) matrix to an FBM on disk? What is the size of your data?
I wonder if big_copy (or even big_apply) is already working for this.
I could make some efficient solution for this. I think it is already a problem I tried to solve when I worked with {bigmemory}.
> Basically, if I understand correctly, you would like an efficient way to write from a sparse (column-oriented?) matrix to an FBM on disk? What is the size of your data?
Yes! The matrices I come across are usually 1M x 20-50K. It'd be great if the solution were general enough to deal with different sparse matrix classes such as those used in text analysis: quanteda's dfm or tm's dtm class.
Basically, you could do it directly in R:
# Packages
library(Matrix)
library(bigstatsr)
# Data
spMat <- sparseMatrix(i = integer(), j = integer(), dims = c(1e6, 2e3))
N <- 1e8
x <- runif(N)
i <- sample(nrow(spMat), N, replace = TRUE)
j <- sample(ncol(spMat), N, replace = TRUE)
spMat[cbind(i, j)] <- x
# Solutions
system.time(
  fbm <- FBM(nrow(spMat), ncol(spMat), init = 0)
)  # 13 sec
file.size(fbm$backingfile) / 1024^3 # 15 GB
# All at once
system.time({
  ind_nozero <- which(spMat != 0, arr.ind = TRUE)
  fbm[ind_nozero] <- spMat[ind_nozero]
})  # 37 sec
# By blocks to use less memory
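system.time(
  big_apply(fbm, a.FUN = function(X, ind, spMat) {
    # `ind` holds the column indices of the current block
    offset <- min(ind) - 1
    # Positions of non-zero values, with block-local column indices
    ind_nozero <- which(spMat[, ind] != 0, arr.ind = TRUE)
    # Shift column indices back to positions in the full matrix
    ind_nozero[, 2] <- ind_nozero[, 2] + offset
    X[ind_nozero] <- spMat[ind_nozero]
    NULL
  }, a.combine = 'c', spMat = spMat, block.size = 1e3)
)  # 40 sec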
Here, I tested with only 2000 columns. The data is already 15GB on disk and takes 13 seconds just to initialize values to 0. Then, it takes 40 seconds to transfer non-zero values to the FBM.
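To put the 15 GB figure in context, here is a minimal sketch of the size arithmetic, assuming the FBM stores doubles (8 bytes per value) as in the example above. The helper name dense_size_gib is made up for illustration and is not part of {bigstatsr}.

```r
# Hypothetical helper: on-disk size (in GiB) of a dense matrix.
# as.numeric() avoids integer overflow when the dimensions are integers.
dense_size_gib <- function(n_row, n_col, bytes_per_value = 8) {
  as.numeric(n_row) * as.numeric(n_col) * bytes_per_value / 1024^3
}

dense_size_gib(1e6, 2e3)  # the 2,000-column test above: about 15 GiB
dense_size_gib(1e6, 2e4)  # 1M x 20K: about 149 GiB
dense_size_gib(1e6, 5e4)  # 1M x 50K: about 373 GiB
```

So a dense copy of a 1M x 20-50K matrix would need hundreds of GB on disk, whatever the transfer method.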
When I try
fbm <- FBM(nrow(dtmat_tfidf), ncol(dtmat_tfidf), init = 0)
# Error in getXPtrFBM(.self$backingfile, .self$nrow, .self$ncol, .self$type) :
# Invalid argument
nrow(dtmat_tfidf)
# [1] 1110015
ncol(dtmat_tfidf)
# [1] 20473
I get an error. dtmat_tfidf is a sparse quanteda document-feature matrix (dfm):
dtmat_tfidf
# Document-feature matrix of: 1,110,015 documents, 20,473 features (99.8% sparse).
class(dtmat_tfidf)
# [1] "dfm"
# attr(,"package")
# [1] "quanteda"
Doing the same with spMat
spMat <- sparseMatrix(i = integer(), j = integer(), dims = c(1e6, 2e3))
fbm <- FBM(nrow(spMat), ncol(spMat), init = 0)
freezes my laptop (16 GB RAM)
Weird. Windows?
Yeah, Windows...technically Ubuntu on Windows 10.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.4 LTS
Release: 16.04
Codename: xenial
Can you try using a filebacked.big.matrix from package {bigmemory}?
Sorry, could you be more explicit about what you'd like me to test?
Just tested with two computers:
It doesn't freeze with Linux, but does with Windows. Same with package {bigmemory}.
Not sure how to help you here.
What is the application you want to do that you need to write a 99.8% sparse matrix to disk?
> It doesn't freeze with Linux, but does with Windows.
Hmm...that's not good news. I want to perform big_SVD() on document-feature matrices. I want to see whether documents that are grouped by a meta-keyword cluster together based on their actual word content. After a very long time, I was able to write the matrix out to file: a whopping 46 GB.
When I try to read the file with big_read(), I still get the following error:
Error in getXPtrFBM(.self$backingfile, .self$nrow, .self$ncol, .self$type) :
Invalid argument
Any idea what the error is related to?
big_read() is meant to read text files, and is not very efficient.
Here, if you want a partial SVD, you should really use package {RSpectra} directly:
svds(spMat, k = 10)
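As a slightly fuller sketch of that suggestion, the snippet below runs a partial SVD on a small random sparse matrix and derives document coordinates from it. The matrix sp_mat and its dimensions are made up for illustration; a real document-feature matrix may first need converting to a class that svds() accepts.

```r
library(Matrix)
library(RSpectra)

set.seed(1)
# Hypothetical stand-in for a documents x features sparse matrix
sp_mat <- rsparsematrix(nrow = 2000, ncol = 300, density = 0.01)

# Partial SVD computed on the sparse matrix itself, without a dense copy
res <- svds(sp_mat, k = 10)

# Document coordinates on the first 10 components, i.e. U %*% diag(d)
doc_scores <- sweep(res$u, 2, res$d, `*`)
dim(doc_scores)
```

The rows of doc_scores can then be plotted or clustered to check whether documents sharing a meta-keyword end up close together.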
Thanks! - that worked well. Appreciate the walkthrough.
bigstatsr is super convenient if a large matrix is already on file, but this is often not the case for me. I typically want to use bigstatsr after I've created a larger-than-memory matrix in R, which is usually in the form of a very sparse matrix (>95% sparsity). Writing to disk is a bottleneck because I have to iterate through the sparse matrix by row-chunks (using lapply or equivalent), convert each chunk to a non-sparse matrix, and write it to file using append=TRUE. A faster method for doing this would be really helpful. Thanks!
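For reference, the row-chunk workflow described above might look roughly like the sketch below. The matrix, chunk size, and output file are assumptions made for illustration; the actual code used was not posted.

```r
library(Matrix)

set.seed(1)
# Hypothetical small stand-in for a large, very sparse matrix
sp_mat <- rsparsematrix(nrow = 1000, ncol = 50, density = 0.02)

out_file <- tempfile(fileext = ".txt")
chunk_size <- 250  # number of rows converted to dense at a time

for (start in seq(1, nrow(sp_mat), by = chunk_size)) {
  rows <- start:min(start + chunk_size - 1, nrow(sp_mat))
  # Only this chunk is ever held in memory as a dense matrix
  dense_chunk <- as.matrix(sp_mat[rows, , drop = FALSE])
  # The first chunk creates the file, later chunks are appended
  write.table(dense_chunk, file = out_file, append = (start > 1),
              row.names = FALSE, col.names = FALSE)
}
```

Every zero is written out as text, which is why the file grows so large and the write takes so long for a matrix that is more than 95% sparse.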