ropensci / textreuse

Detect text reuse and document similarity
https://docs.ropensci.org/textreuse

Short documents and skip_grams assertion do not match #88

Open awagner-mainz opened 5 years ago

awagner-mainz commented 5 years ago

As I read it, the TextReuseCorpus function has a safety check so that it does not run tokenizers on documents that are too short, "too short" meaning documents too small to generate two n-grams of the requested size. In addition, the tokenizers seem to have their own assertions that prevent them from running on documents that are too short.

However, I have run into problems with skip-grams. First, the safety check in TextReuseCorpus lets documents through that the assertion in tokenize_skip_ngrams then bails out on, because the latter assumes a larger minimum document length. Second, I don't quite understand why the assertion requires this in the first place. IIUC, it is n + n * k - k <= length(words), but why would I not be able to generate skip-grams from a document of the same minimum length that the n-gram tokenizer accepts (n < length(words))?

FWIW, I am trying to build large skip-grams, say with n = 15 and k = 3.
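To make the mismatch concrete, here is the arithmetic as I understand it (the function names are mine, just for illustration): the corpus-level check only needs enough words for two n-grams, while the skip-gram assertion demands far more.

```r
# Minimum document length for two n-grams (my reading of the
# TextReuseCorpus safety check): length(words) >= n + 1
min_words_corpus <- function(n) n + 1

# Minimum length implied by the tokenize_skip_ngrams assertion:
# n + n * k - k <= length(words)
min_words_skip <- function(n, k) n + n * k - k

min_words_corpus(15)   # 16
min_words_skip(15, 3)  # 57
```

So, if I have this right, a 30-word document sails through the corpus check but trips the tokenizer's assertion.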

https://github.com/ropensci/textreuse/blob/35f8421d16ed4348d5784a2cbf4a42067e8813b2/R/tokenizers.R#L59

Thanks for any pointers or insights.

lmullen commented 5 years ago

Have you tried using the skip-gram tokenizer in the tokenizers package? Those tokenizers will eventually replace the ones in this package. Note that their output format is somewhat different, so you will have to use them by passing the simplify = TRUE argument.
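In case it helps anyone else landing here, a sketch of what I believe the suggested usage looks like (assuming the tokenizers package is installed; the toy document and parameter values are mine):

```r
library(tokenizers)

# A toy 60-word document, long enough for n = 15, k = 3 skip-grams
doc <- paste(rep(c("alpha", "beta", "gamma", "delta"), 15), collapse = " ")

# simplify = TRUE returns a plain character vector for a single input,
# closer to the output format this package's tokenizers produce
skips <- tokenize_skip_ngrams(doc, n = 15, k = 3, simplify = TRUE)
head(skips, 2)
```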

In general, this package is intended to let you drop in different tokenizers, so if the existing ones do not meet your needs, you might consider writing a custom one for your special case.
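For what it's worth, here is a minimal sketch of such a drop-in: a simplified fixed-skip tokenizer in base R (my own illustration, not the package's implementation) that only requires the last word index to fit, so shorter documents are not rejected. Note that true skip-grams also enumerate the smaller skip distances 0..k; this keeps only the maximal spacing for brevity.

```r
# A fixed-skip n-gram tokenizer: for each starting position, take n words
# spaced (k + 1) apart, requiring only that the last index fits.
tokenize_skip_ngrams_loose <- function(string, n = 15, k = 3) {
  words <- unlist(strsplit(tolower(string), "\\s+"))
  step <- k + 1
  span <- (n - 1) * step  # distance from first to last word index in a gram
  starts <- seq_len(max(length(words) - span, 0))
  vapply(starts, function(i) {
    paste(words[seq(i, by = step, length.out = n)], collapse = " ")
  }, character(1))
}
```

If I understand the design correctly, a function like this could then be passed to TextReuseCorpus via its tokenizer argument.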

awagner-mainz commented 5 years ago

Ah, I wasn't aware of that and have not tried it. Will do and report back. Thank you!