koheiw / proxyC

R package for large-scale similarity/distance computation
GNU General Public License v3.0

Compute distance between 1 column and all other columns in a matrix? #19

Closed b-tierney closed 3 years ago

b-tierney commented 3 years ago

Hi -- For a simple canopy clustering algorithm, I need to compute Jaccard similarity between an X and a Y, where X is a single column from a large sparse matrix and Y is that same sparse matrix. I was hoping to use proxyC for it, because R's other implementations tend to be slow/opaque when working with sparse data.

Is it possible to use/adapt proxyC for this use case? I have found thus far that the simil() function seems to want either 1 sparse matrix or 2 equal-sized sparse matrices. Thank you!

koheiw commented 3 years ago

Hi @b-tierney, x and y can have different numbers of columns, so there is no problem passing a single-column matrix to x and the full matrix to y. The only constraint is that the number of rows of the matrices must be the same.

b-tierney commented 3 years ago

Ah gotchya, thank you! I'm sure I'm being dense or formatting the matrices incorrectly, but I'm getting this error re: column number, which is what led me to ping you here. Can provide data if needed, it's just a simulated sparse matrix though.

[Screenshot of the error message: Screen Shot 2021-08-12 at 12 28 59 PM]

koheiw commented 3 years ago

You should set margin = 2 for column-wise similarity, and use sparse_testdata[, 1, drop = FALSE] so that the single column is kept as a matrix.
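For anyone landing here later, a minimal sketch of that call. The small matrix m below is made up for illustration and stands in for sparse_testdata, which is not shown in this thread.

```r
library(Matrix)
library(proxyC)

# Hypothetical stand-in for sparse_testdata: 4 rows (features), 3 columns (items).
# Column 1 has entries in rows 1-3, column 2 in rows 1-2, column 3 in rows 3-4.
m <- sparseMatrix(
  i = c(1, 2, 3, 1, 2, 3, 4),
  j = c(1, 1, 1, 2, 2, 3, 3),
  x = 1,
  dims = c(4, 3)
)

# drop = FALSE keeps the single column as a 4 x 1 sparse matrix
# instead of collapsing it to a plain numeric vector.
x <- m[, 1, drop = FALSE]

# margin = 2 compares columns, so x and m only need the same number of rows.
s <- simil(x, m, margin = 2, method = "jaccard")

dim(s)  # one row (the selected column) by ncol(m) columns
```

The result has one row per column of x and one column per column of m, so each entry is the similarity between the selected column and one column of the full matrix.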

b-tierney commented 3 years ago

Ah I see! Apologies for being dense, and thank you so much for the help.