Closed vmustafa closed 9 years ago
@vmustafa Can you please provide a reproducible example? I'll need to see how you constructed the corpus to see what the problem is. At the outset, though, I agree that there could be a more useful error message.
@lmullen
As I mentioned, I am following the example given in the vignette "Minhash and locality-sensitive hashing".
The R code is as follows:-
library(textreuse)
minhash <- minhash_generator(n = 240, seed = 3552)
dir <- system.file("extdata/ats", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          hash_func = minhash, keep_tokens = TRUE)
buckets <- lsh(corpus, bands = 80)
After the last command -- buckets ... the Error message gets displayed.
Can you please paste the results of sessionInfo() after re-running that code to get that error message?
Lincoln Mullen, http://lincolnmullen.com Assistant Professor, Department of History & Art History George Mason University
I can reproduce the error. Session info:
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] textreuse_0.1.0
loaded via a namespace (and not attached):
[1] assertthat_0.1 DBI_0.3.1 dplyr_0.4.3 lazyeval_0.1.10 magrittr_1.5
[6] NLP_0.1-8 parallel_3.1.2 R6_2.1.1 Rcpp_0.12.1 RcppProgress_0.2.1
[11] stringi_1.0-1 stringr_1.0.0 tools_3.1.2
You have a typo in the line where you create the corpus.
In this line
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
hash_func = minhash, keep_tokens = TRUE)
you have the argument hash_func = minhash, but the argument should be minhash_func = minhash.
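Applied to the code from the report, the fix would look roughly like this. It is a sketch that assumes textreuse is installed along with its bundled extdata/ats sample files; the only change from the reported code is the argument name.

```r
library(textreuse)

# Generator for 240 minhashes per document, as in the vignette
minhash <- minhash_generator(n = 240, seed = 3552)
dir <- system.file("extdata/ats", package = "textreuse")

# minhash_func (not hash_func) attaches the minhash signature to each document
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash, keep_tokens = TRUE)

# lsh() needs the minhashes; 240 minhashes split into 80 bands gives 3 rows per band
buckets <- lsh(corpus, bands = 80)
```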
I realize the terminology is a bit ambiguous. The hashes represent the tokens: a 100,000-word document will have about 100,000 hashes, and a 50-word document about 50. The minhashes represent the document as a whole: a document of any length will have 240 minhashes, or whatever value you passed as n to minhash_generator().
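To make the difference in size concrete, here is a toy illustration in base R. It does not use textreuse's actual hashing: toy_hash() and toy_minhash() are hypothetical stand-ins invented for this sketch, and the only point is that the number of token hashes grows with the document while the minhash signature stays at n values.

```r
# Toy stand-ins, NOT textreuse's real hashing: names and formulas are hypothetical.
prime <- 100003

# One hash per token: the output length follows the number of tokens.
toy_hash <- function(tokens) {
  vapply(tokens, function(tok) {
    codes <- utf8ToInt(tok)
    sum(codes * seq_along(codes)) %% prime
  }, numeric(1), USE.NAMES = FALSE)
}

# One minhash per random hash function: the output length is always n.
toy_minhash <- function(hashes, n, seed) {
  set.seed(seed)
  a <- sample.int(prime - 1, n)
  b <- sample.int(prime - 1, n)
  vapply(seq_len(n), function(i) min((a[i] * hashes + b[i]) %% prime), numeric(1))
}

short_doc <- strsplit("a short document of only a few words", " ")[[1]]
long_doc <- rep(short_doc, 100)

short_hashes <- toy_hash(short_doc)
long_hashes <- toy_hash(long_doc)
short_sig <- toy_minhash(short_hashes, n = 240, seed = 3552)
long_sig <- toy_minhash(long_hashes, n = 240, seed = 3552)

length(short_hashes)  # 8 token hashes
length(long_hashes)   # 800 token hashes
length(short_sig)     # 240 minhashes
length(long_sig)      # 240 minhashes
```

Because lsh() works on these fixed-length signatures, a corpus built without minhash_func has nothing for it to band.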
I will add a check in the next version which will return a more informative error message.
Additionally, you may wish to upgrade to version 0.1.1, which fixes the ugly progress bars in the vignette (my bad).
Feel free to reopen the issue if you have further problems. In particular I've tested this only on an English locale, and while there is no reason it shouldn't work in a German locale, I'll be glad to know if it does.
The next CRAN version will now have a more informative error message if lsh() is used on a corpus without minhashes.
I am trying to follow the example in "Minhash and locality-sensitive hashing" vignette by Lincoln Mullen.
Everything seems to be working as per script till I come to the part where I need to get the buckets for LSH for the documents, i.e., when I run the following command:-
buckets <- lsh(corpus, bands = 80)
This throws the following error:-
Error: data_frames can only contain 1d atomic vectors and lists
I am unable to figure out which object is interpreted as data_frame. TextReuseCorpus is, I guess, coming from the "tm" package Corpus class which may not be derived from data_frame. Though I wonder whether it has anything to do with data.frame class?
Any help appreciated.
Regards
Mustafa Vadnagarwala.