ropensci / textreuse

Detect text reuse and document similarity
https://docs.ropensci.org/textreuse

Question: Reuse minhash functions #72

Closed iainmwallace closed 7 years ago

iainmwallace commented 7 years ago

Hi,

Very nice package!

Would it be possible to re-use the LSH/minhash functionalities for a different use case? Specifically, is there a way to use it if I have a series of binary vectors and I want to approximate the Jaccard distance for one vector against all others?

d <- matrix(rnorm(1000), nrow = 100, ncol = 10)
d[d > 1] <- 1
d[d < 1] <- 0

These types of matrices are common in chemistry and biology, and speeding up similarity searching would be really useful.

Thanks

Iain

lmullen commented 7 years ago

It wouldn't be simple. Ideally, one day I am going to rewrite this package to take a document-term matrix instead of an NLP-derived class. Until then, you would have to pretend that your data is a text. You could do that by turning each row of your matrix into a "document" where the values are separated by spaces. (The separation by spaces is essential so that you can use a word-based tokenizer. Of course, you could write your own tokenizing function.) To convert to texts, do something like:

vector_to_text <- function(row) {
  paste(as.character(row), collapse = " ")
}

texts <- apply(d, vector_to_text, MARGIN = 1)

Then you can pass those texts to the functions in this package, write them to disk, etc.
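One caveat with writing out the literal values: a word-based tokenizer would see only the tokens "0" and "1" in every row, so set-based similarity between rows would carry very little information. A possible variation, sketched below in base R only, is to encode each row as the positions of its 1s, so that the Jaccard similarity of the token sets matches the Jaccard similarity of the binary vectors. The helper names `row_to_feature_text` and `jaccard_one_vs_all` and the `f` token prefix are made up for illustration and are not part of the package. The exact one-against-all computation is included only as a baseline for checking approximate results.

```r
# Toy binary matrix built the same way as in the question.
set.seed(42)
d <- matrix(rnorm(1000), nrow = 100, ncol = 10)
d[d > 1] <- 1
d[d < 1] <- 0

# Hypothetical helper: encode a row as the positions of its 1s,
# e.g. c(0, 1, 0, 1) becomes "f2 f4". An all-zero row becomes "".
row_to_feature_text <- function(row) {
  idx <- which(row == 1)
  if (length(idx) == 0) {
    return("")
  }
  paste(paste0("f", idx), collapse = " ")
}

texts <- apply(d, MARGIN = 1, FUN = row_to_feature_text)

# Hypothetical helper: exact Jaccard similarity of row i against every row.
# Returns NA where both rows are all zeros (the union is empty).
jaccard_one_vs_all <- function(m, i) {
  query <- m[i, ]
  intersection <- as.vector(m %*% query)          # shared 1s per row
  union <- rowSums(m) + sum(query) - intersection # 1s in either row
  ifelse(union == 0, NA_real_, intersection / union)
}

# Similarity of the first row to all rows; distance is 1 - similarity.
sims <- jaccard_one_vs_all(d, 1)
dists <- 1 - sims
```

Rows with no 1s produce empty strings, which may need to be filtered out before they are handed to a tokenizer.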

iainmwallace commented 7 years ago

Thanks for the quick response!