ropensci / textreuse

Detect text reuse and document similarity
https://docs.ropensci.org/textreuse

Some problem with lsh() function and data_frame? #66

Closed vmustafa closed 9 years ago

vmustafa commented 9 years ago

I am trying to follow the example in "Minhash and locality-sensitive hashing" vignette by Lincoln Mullen.

Everything works as described in the vignette until I reach the step that computes the LSH buckets for the documents, i.e., when I run the following command:

buckets <- lsh(corpus, bands = 80)

This throws the following error:-

Error: data_frames can only contain 1d atomic vectors and lists

I am unable to figure out which object is being interpreted as a data_frame. My guess is that TextReuseCorpus derives from the Corpus class in the "tm" package, which may not be derived from data_frame. Or does this have something to do with the data.frame class?

Any help appreciated.

Regards

Mustafa Vadnagarwala.

lmullen commented 9 years ago

@vmustafa Can you please provide a reproducible example? I'll need to see how you constructed the corpus to see what the problem is. At the outset, though, I agree that there could be a more useful error message.

vmustafa commented 9 years ago

@lmullen

As I mentioned, I am following the example given in the vignette "Minhash and locality-sensitive hashing".

The R code is as follows:-

library(textreuse)
minhash <- minhash_generator(n = 240, seed = 3552)
dir <- system.file("extdata/ats", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          hash_func = minhash, keep_tokens = TRUE)
buckets <- lsh(corpus, bands = 80)

After the last command -- buckets ... the Error message gets displayed.

lmullen commented 9 years ago

Can you please paste the results of sessionInfo() after re-running that code to get that error message?

Lincoln Mullen, http://lincolnmullen.com Assistant Professor, Department of History & Art History George Mason University

nicmer commented 9 years ago

I can reproduce the error. session info:

R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] textreuse_0.1.0

loaded via a namespace (and not attached):
 [1] assertthat_0.1     DBI_0.3.1          dplyr_0.4.3        lazyeval_0.1.10    magrittr_1.5      
 [6] NLP_0.1-8          parallel_3.1.2     R6_2.1.1           Rcpp_0.12.1        RcppProgress_0.2.1
[11] stringi_1.0-1      stringr_1.0.0      tools_3.1.2

lmullen commented 9 years ago

You have a typo in the line where you create the corpus.

In this line

corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5, 
                          hash_func = minhash, keep_tokens = TRUE)

you have the argument hash_func = minhash, but the argument should be minhash_func = minhash.
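
For reference, here is a sketch of the corrected script. It reuses the values from the code posted above (240 minhashes, seed 3552, 80 bands) and changes only the argument name. It assumes the textreuse package and its bundled "extdata/ats" sample files are installed.

```r
library(textreuse)

# Generator for 240 minhashes per document (values taken from the thread).
minhash <- minhash_generator(n = 240, seed = 3552)
dir <- system.file("extdata/ats", package = "textreuse")

# Pass the generator as minhash_func; hash_func is left at its default.
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash, keep_tokens = TRUE)

# 240 minhashes split into 80 bands means 3 minhashes per band.
buckets <- lsh(corpus, bands = 80)
```

The number of minhashes has to divide evenly into the number of bands, which 240 and 80 do.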

I realize the terminology is a bit ambiguous. The hashes represent the tokens: a 100,000-word document will have about 100,000 hashes, and a 50-word document will have about 50. The minhashes represent the document as a whole: a document of any length will have 240 minhashes, or whatever number you passed to minhash_generator().
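
One way to see the difference is to compare the lengths of the token, hash, and minhash vectors for a single document. This is a small sketch that assumes the corrected corpus construction (minhash_func = minhash) and uses the package's tokens(), hashes(), and minhashes() accessors; the variable name doc is only for illustration.

```r
library(textreuse)

minhash <- minhash_generator(n = 240, seed = 3552)
dir <- system.file("extdata/ats", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash, keep_tokens = TRUE)

# Take the first document in the corpus.
doc <- corpus[[1]]

# One hash per token, so these two lengths grow with the document.
n_tokens <- length(tokens(doc))
n_hashes <- length(hashes(doc))

# The minhash signature has a fixed length set by minhash_generator(n = 240).
n_minhashes <- length(minhashes(doc))

print(c(tokens = n_tokens, hashes = n_hashes, minhashes = n_minhashes))
```

The first two numbers depend on the document; the third should be 240 for every document in the corpus.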

I will add a check in the next version which will return a more informative error message.

Additionally, you may wish to upgrade to version 0.1.1, which fixes the ugly progress bars in the vignette (my bad).

Feel free to reopen the issue if you have further problems. In particular I've tested this only on an English locale, and while there is no reason it shouldn't work in a German locale, I'll be glad to know if it does.

lmullen commented 9 years ago

The next CRAN version will now have a more informative error message if lsh() is used on a corpus without minhashes.