Closed iainmwallace closed 7 years ago
It wouldn't be simple. Ideally, one day I will rewrite this package to take a document-term matrix instead of an NLP-derived class. Until then, you would have to pretend that your data is text. You could do that by turning each row of your matrix into a "document" whose values are separated by spaces. (The spaces are essential so that you can use a word-based tokenizer, though you could also write your own tokenizing function.) To convert to texts, do something like:
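# Collapse one matrix row into a single space-separated string
vector_to_text <- function(row) {
  paste(as.character(row), collapse = " ")
}
# Apply over rows (MARGIN = 1) to get one text per row of d
texts <- apply(d, MARGIN = 1, FUN = vector_to_text)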
Then you can pass those texts to the functions in this package, write them to disk, etc.
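One caveat worth sketching, as a possible variation rather than part of the reply above: if each row holds only 0s and 1s, pasting the raw values gives every "document" the same two word tokens, `0` and `1`. A minimal sketch, assuming what matters is which columns are set, is to write out the positions of the 1s instead. The helper name `row_to_index_text` and the `c` prefix on each token are hypothetical choices made for this illustration, not anything the package defines.

```r
# Hypothetical helper (not part of the package): encode a binary row as the
# space-separated positions of its 1s, so each set member is its own token.
row_to_index_text <- function(row) {
  idx <- which(row == 1)
  # An all-zero row becomes an empty string rather than a stray "c"
  if (length(idx) == 0) {
    return("")
  }
  paste0("c", idx, collapse = " ")
}

# Small hand-made example matrix: three binary rows, four columns
d_small <- rbind(
  c(1, 0, 1, 0),
  c(0, 0, 0, 0),
  c(0, 1, 1, 1)
)

texts_small <- apply(d_small, MARGIN = 1, FUN = row_to_index_text)
print(texts_small)
```

With this encoding, two rows share a token exactly when they both have a 1 in the same column, which is the overlap that Jaccard similarity on word tokens would measure. How an empty text is handled downstream is not covered here.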
Thanks for the quick response!
Hi,
Very nice package!
Would it be possible to re-use the LSH/minhash functionality for a different use case? Specifically, is there a way to use it if I have a series of binary vectors and I want to approximate the Jaccard distance for one vector against all others?
d <- matrix(rnorm(1000), nrow = 100, ncol = 10)
d[d > 1] <- 1
d[d < 1] <- 0
These types of matrices are common in chemistry and biology, and speeding up similarity searching would be really useful.
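For reference, the exact quantity that minhash/LSH would approximate can be computed directly in base R. The following is only a sketch of that baseline, not something the package provides: the function name `jaccard_distance_to_all` is hypothetical, and treating two all-zero vectors as identical (distance 0) is an assumption made here to avoid dividing by zero.

```r
# Exact Jaccard distance between row i of a binary matrix and every row.
# Hypothetical helper for illustration; uses base R only.
jaccard_distance_to_all <- function(m, i) {
  query <- m[i, ]
  # Number of positions where both the query and each row have a 1
  intersection <- as.vector(m %*% query)
  # Number of positions where either has a 1
  union <- rowSums(m) + sum(query) - intersection
  # Assumption: two all-zero vectors count as identical (similarity 1)
  similarity <- ifelse(union == 0, 1, intersection / union)
  1 - similarity
}

# Small hand-made example: four binary vectors of length four
m <- rbind(
  c(1, 1, 0, 0),
  c(1, 0, 0, 0),
  c(0, 0, 1, 1),
  c(1, 1, 0, 0)
)

dists <- jaccard_distance_to_all(m, 1)
print(dists)
```

This does one matrix-vector product per query, so the cost grows with the number of rows times the number of columns. That linear scan is the part an approximate method would aim to shortcut.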
Thanks
Iain