koheiw / proxyC

R package for large-scale similarity/distance computation
GNU General Public License v3.0

Subset error #37

Closed qzhang503 closed 10 months ago

qzhang503 commented 1 year ago
library(Matrix)

nrow <- 193490
ncol <- 46373

set.seed(123)
ps <- sample(nrow * ncol, 638609)
js <- ps %% ncol
js <- ifelse(js == 0, ncol, js)
is <- ceiling(ps/ncol)
M <- sparseMatrix(x = TRUE, i = is, j = js)

D <- proxyC::simil(M, margin = 2)

# subset error
Z <- D[1:100, 1:1000]

Error in subCsp_ij(x, i, j, drop = drop) : Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 89
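
A rough size check may help when reading this error. The sketch below only does arithmetic on the dimensions from the reproducer; it assumes that D keeps one stored entry per pair of columns when no threshold is set, which is an assumption and not something the error message states. Whether the resulting count is what actually triggers the Cholmod error is not confirmed in this thread.

```r
# Number of columns of M, taken from the reproducer above.
n_cols <- 46373

# With margin = 2, D has one row and one column per column of M.
n_cells <- n_cols^2

# Number of entries D would hold if every pair of columns is stored (assumption).
n_cells

# Compare against the largest value a 32-bit integer index can hold.
n_cells > .Machine$integer.max

# Approximate size in GiB of that many doubles (8 bytes each), values only.
n_cells * 8 / 2^30
```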

koheiw commented 1 year ago

You need more RAM. If you cannot add any, make D sparse by setting a threshold.

D <- proxyC::simil(M, margin = 2, min_simil = 0.5)

By the way, rsparsematrix() is the best way to create a random matrix, M.
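
As a minimal sketch of that suggestion, the random input could be built with Matrix::rsparsematrix() instead of sampling positions by hand. The dimensions and density below are scaled-down, hypothetical values chosen so the example runs quickly; they are not the ones from the report.

```r
library(Matrix)

set.seed(123)

# Hypothetical, scaled-down dimensions; the report uses 193490 x 46373.
# rand.x = NULL asks for a pattern matrix (positions only, no stored values).
M <- rsparsematrix(nrow = 2000, ncol = 500, density = 0.001, rand.x = NULL)

dim(M)      # rows and columns
nnzero(M)   # number of non-zero positions

# Numeric copy, in case a downstream function expects numeric values.
M_num <- M * 1
```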

qzhang503 commented 1 year ago

Hi Kohei,

Thank you for getting back to me so quickly, and for the note on creating a random sparse matrix. My system has 512 GB of RAM. I am in no position to make suggestions, but I wonder whether a subset of D could be "calculated out" without copying, perhaps in a way similar to subsetting the upstream sparse matrix, M, where subsetting seems to work fine.

The min_simil setting does not suit my case. My application needs every similarity that is above zero; for example, my next step is the conversion D <- D > 0. As an aside, simil() is overkill for my problem (for now). I need a custom distance function, namely any(A & B) between logical vectors A and B, which does not yet seem to be an option in the current method parameter. I recall seeing some notes on creating a custom distance function and will need to set aside some time to learn that.

Best, Qiang Zhang
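
For what it is worth, any(A & B) over all pairs of columns can also be sketched with a sparse cross-product from the Matrix package, without going through simil(). This is not a proxyC feature; the toy matrix and the names M_num, overlap and any_shared are made up for illustration.

```r
library(Matrix)

# Hypothetical toy data: 5 rows, 3 logical columns stored sparsely.
# Column 1 is TRUE in rows 1-2, column 2 in rows 2-3, column 3 in row 5.
M <- sparseMatrix(i = c(1, 2, 2, 3, 5), j = c(1, 1, 2, 2, 3),
                  x = TRUE, dims = c(5, 3))

# Convert to numeric so crossprod() counts shared TRUE rows.
M_num <- M * 1

# overlap[i, j] = number of rows in which columns i and j are both TRUE.
overlap <- crossprod(M_num)

# any(A & B) for every pair of columns; zero counts stay implicit,
# so the result remains sparse.
any_shared <- overlap > 0
any_shared
```

Here columns 1 and 2 share row 2, while column 3 overlaps only with itself.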


kasperwelbers commented 1 year ago

Hi @qzhang503,

I'm not sure I get what you mean by calculating out, but the problem is in the calculation of the similarity matrix (D), not in the subsetting of this matrix. What you could do is subset M, and only calculate the similarity of the columns you're interested in.

library(Matrix)
set.seed(1)
M = rsparsematrix(1000, 400, 0.1)
D = proxyC::simil(M, margin = 2) 
D[1:2, 1:4]

## same results, but only calculating 2*4 similarities instead of 400^2
proxyC::simil(M[,1:2], M[,1:4], margin=2)

About calculating any(A & B): there are several measures that are never negative and equal 0 only when the two vectors share nothing. Jaccard, for instance, is 0 only if there are no intersections. You can then just dichotomize the results.

a = c(T,T,F,F,F)
b = c(T,T,T,F,F)
c = c(F,F,F,T,T)
d = c(F,F,F,T,T)
m = matrix(c(a,b,c,d), nrow=5)
d = proxyC::simil(m, method='jaccard', margin=2) 
d     ## intersections divided by unions
d > 0 ## no intersection or any intersection

qzhang503 commented 1 year ago

Sorry for the confusion. The original problem is with the subsetting of D. In my application, the input M has to be taken as a whole to get D. It was then memory-expensive, and prone to failure, to run (1) m <- as.matrix(D) -> (2) d <- as.dist(m) -> (3) h <- hclust(d). To avoid as.matrix() (which I was told makes four copies of the data as it runs, and that seemed to be the case), my workaround prior to R 4.2.1 was to subset D step by step to fill a preallocated, empty m.

The good news: after running into the new subsetting problem, I tested as.matrix(D) again. It is now memory-efficient, so I no longer have to subset D as I did in my earlier code. Thank you very much again for your time and comments.
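
The block-wise workaround described above might look roughly like the sketch below. It uses a small random sparse matrix as a stand-in for D, and block_size is a hypothetical tuning value; the original code is not shown in this thread, so this is an illustration of the idea only.

```r
library(Matrix)

set.seed(1)

# Small stand-in for the similarity matrix D (hypothetical example data).
D <- rsparsematrix(300, 300, density = 0.2)

# Preallocate the dense target once.
m <- matrix(0, nrow(D), ncol(D))

# Copy D into m one block of columns at a time, so that only a small
# dense slice exists in addition to m at any moment.
block_size <- 50  # hypothetical; choose to fit available memory
for (start in seq(1, ncol(D), by = block_size)) {
  cols <- start:min(start + block_size - 1, ncol(D))
  m[, cols] <- as.matrix(D[, cols, drop = FALSE])
}
```

On a matrix this small, as.matrix(D) gives the same result directly; the loop only matters when the one-step conversion is too costly.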


koheiw commented 1 year ago

I see. simil() returns a dgTMatrix, but that format is not efficient when neither min_simil nor rank is used. I think I should change that to dgeMatrix (a dense format). This is what you did manually. Thanks for the feedback.
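
To illustrate why a triplet format is costly when almost every cell is stored, here is a small sketch using only the Matrix package. It compares the memory footprint of the same fully populated matrix held as a base dense matrix and as a triplet sparse matrix; the size n is arbitrary and the objects are made up for the comparison, not produced by simil().

```r
library(Matrix)

set.seed(1)
n <- 200

# A matrix in which every entry is non-zero, held as a base dense matrix.
dense <- matrix(runif(n * n), n, n)

# The same values in triplet form: each entry carries a row index,
# a column index and the value itself.
trip <- as(Matrix(dense, sparse = TRUE), "TsparseMatrix")

# Memory footprint in bytes of each representation.
sizes <- c(dense = as.numeric(object.size(dense)),
           triplet = as.numeric(object.size(trip)))
sizes
```

With no zeros to skip, the triplet form pays for two integer indices per entry on top of the value, so it comes out larger than the dense form.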

qzhang503 commented 1 year ago

Awesome, thanks!


-- Qiang Zhang https://github.com/qzhang503/proteoM https://github.com/qzhang503/proteoQ