quanteda / quanteda

An R package for the Quantitative Analysis of Textual Data
https://quanteda.io
GNU General Public License v3.0

ngrams performance #149

Closed kbenoit closed 7 years ago

kbenoit commented 8 years ago

@adamobeng has proposed a simple R-based ngram former that seems to blow away the C++ code in terms of speed. @koheiw, has something gone awry with the C++ code? We should test this and figure out what, if anything, has happened.

Switch to the dev_ngrams branch and try the ngramsNew() methods and examples. The benchmark code is at the top of the new file ngramsDEV.R, which duplicates the API of the existing ngrams methods but uses the name ngramsNew to differentiate them.

Things to consider/implement:

  1. It might be inefficient, since it creates length(n) complete copies of the tokens (where n is the vector argument to ngrams()).
  2. The skip functionality is not yet implemented.

Note: The original code for ngrams, which the C++ code replaced, was based on a similar method implemented in R.
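For readers who want to see the general idea, here is a minimal sketch of a vectorized, pure-R ngram former. It is not the ngramsNew() code from the dev_ngrams branch; the function name ngrams_sketch and its arguments are hypothetical. It assumes a single character vector of tokens and builds each ngram by pasting shifted copies of that vector together, which is also why this kind of approach copies the tokens several times.

```r
# Hypothetical helper, not quanteda's ngramsNew(): form n-grams from one
# character vector of tokens by pasting shifted copies of it together.
ngrams_sketch <- function(tokens, n = 2L, concatenator = "_") {
  out <- character(0)
  for (size in n) {
    # A document shorter than the n-gram size yields nothing for that size
    if (length(tokens) < size) next
    # One shifted copy of the tokens per position in the n-gram
    shifted <- lapply(seq_len(size), function(i) {
      tokens[i:(length(tokens) - size + i)]
    })
    # paste() is vectorized, so all n-grams of this size are built in one call
    out <- c(out, do.call(paste, c(shifted, sep = concatenator)))
  }
  out
}

ngrams_sketch(c("a", "b", "c", "d"), n = 1:2)
```

To process a tokenized corpus, the same function could be applied to each document with lapply(); the vectorized paste() call is what avoids an explicit loop over token positions.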

toks <- tokenize(inaugTexts, removePunct = TRUE)

rbenchmark::benchmark(new = ngramsNew(toks, n = 1:4),
                      old = ngrams(toks, n = 1:4),
                      replications = 2)
##   test replications elapsed relative user.self sys.self user.child sys.child
## 1  new            2   0.499    1.000     0.494    0.006      0.000     0.000
## 2  old            2  76.589  153.485    67.438    8.899      0.022     0.038

kbenoit commented 8 years ago

@adamobeng if you dig through the revision history for ngrams.R you will see how it used to be done in R rather than C++. Might be worth comparing approaches, as I don't remember copying the tokens in order to paste the ngram components together.

koheiw commented 8 years ago

As Adam spotted, the old version was awfully slow, so I updated the C++ version. It is in ngrams3.cpp. For tokenizedTexts, skipgram_cppl2 should be used.

The native R version is still faster, but the C++ version has a skipgram function with UTF-8 support.

##   test replications elapsed relative user.self sys.self user.child sys.child
## 1    R          100  32.077    1.000    33.219        0          0         0
## 2  C++          100  68.789    2.144    69.072        0          0         0
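As a rough illustration of what skip functionality means here, the sketch below forms skip-bigrams in plain R. It is not skipgram_cppl2 or any quanteda function; the name skip_bigrams_sketch and its arguments are hypothetical, and it handles only bigrams. A skip of 0 gives ordinary adjacent bigrams, and a skip of k pairs each token with the one k + 1 positions later.

```r
# Hypothetical helper, not quanteda's skipgram code: pair each token with
# the token (skip + 1) positions to its right, for every value in `skip`.
skip_bigrams_sketch <- function(tokens, skip = 0L, concatenator = "_") {
  out <- character(0)
  for (k in skip) {
    gap <- k + 1L
    # Not enough tokens to form a pair at this distance
    if (length(tokens) <= gap) next
    left <- tokens[seq_len(length(tokens) - gap)]
    right <- tokens[(gap + 1L):length(tokens)]
    out <- c(out, paste(left, right, sep = concatenator))
  }
  out
}

skip_bigrams_sketch(c("a", "b", "c", "d"), skip = 0:1)
```

Because paste() in R operates on character vectors in whatever encoding they carry, this sketch does nothing special for UTF-8; it only shows the pairing logic.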

dselivanov commented 8 years ago

@kbenoit, @koheiw, @adamobeng why doesn't tokenizers work for you?

tokenizers:::generate_ngrams_batch(txt, ngram_min = 2, ngram_max = 2,
                                   stopwords = character(0), ngram_delim = ' ')

@lmullen and I started this package especially for such cases. We can expose the generate_ngrams_batch function if you need it.

kbenoit commented 8 years ago

Thanks for the suggestion, we will check it out. We have no objection in principle, if it's based on stringi to work with Unicode, and fast. We'd build wrappers, though, to maintain a consistent UI with the other quanteda functions and to implement additional features such as the ability to preserve Twitter punctuation characters (something needed by lots of quanteda users).

But your text2vec doesn't use tokenizers either, last I checked? We're very impressed by the performance of your token handling in that package, btw.

dselivanov commented 8 years ago

@kbenoit text2vec uses a similar approach to ngram generation, but at a slightly lower level. As I said, at the moment tokenize_ngrams performs both tokenization and ngram generation. We can split this into two parts.

lmullen commented 8 years ago

@kbenoit: We do use stringi as the base for the tokenizers package. And if there is anything we can do that would make it more useful for you, please do let us know.

My textreuse package doesn't use tokenizers either at the moment, but it will at the next update.

kbenoit commented 8 years ago

@dselivanov @lmullen Thanks for the offer. Makes a lot of sense to invest in just one tokeniser and base other packages on that.

@dselivanov BTW we are working on a cfm (context-feature-matrix) function in quanteda to produce the input item needed for your gloVe function. We like your use of iterators but think that some users might find them hard to understand. Have you considered encapsulating it all so the users never have to know about iterators?

dselivanov commented 8 years ago

@kbenoit I don't think iterators are a big problem, since text2vec provides constructors for common sources (including a plain list of tokens: itoken(tokens)). It is trivial to hide all actions with iterators, but I would prefer that users learn one new concept that will save them a lot (both memory and CPU time) in the future.

Can you explain what "context-feature-matrix" means? How is it different from a term co-occurrence matrix?

kbenoit commented 8 years ago

The context-feature-matrix is the same as your tcm, except that we call terms "features" and they are always the columns. (So each column is a feature as a target, and each row is the same set of features acting as context markers.) The result is still VxV, where V is the number of features.
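To make the VxV shape concrete, here is a small sketch that counts co-occurrences within a fixed window for one vector of tokens. It is neither quanteda's cfm code nor text2vec's create_tcm; the function name cfm_sketch and the window argument are hypothetical, and it uses plain unweighted counts, whereas real implementations may weight co-occurrences by distance.

```r
# Hypothetical helper: rows are features as context, columns are features
# as target, and each cell counts co-occurrences within `window` tokens.
cfm_sketch <- function(tokens, window = 2L) {
  feats <- sort(unique(tokens))
  m <- matrix(0L, nrow = length(feats), ncol = length(feats),
              dimnames = list(context = feats, feature = feats))
  idx <- match(tokens, feats)
  n <- length(idx)
  for (i in seq_len(n)) {
    lo <- max(1L, i - window)
    hi <- min(n, i + window)
    # Every other position inside the window counts once
    for (j in setdiff(lo:hi, i)) {
      m[idx[i], idx[j]] <- m[idx[i], idx[j]] + 1L
    }
  }
  m
}

cfm_sketch(c("a", "b", "a"), window = 1L)
```

The double loop in R is slow and the dense matrix grows with the square of the vocabulary size, so this is only meant to show the structure of the result, not a practical implementation for large corpora.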

dselivanov commented 8 years ago

Are you implementing this function from scratch or creating a wrapper for create_tcm? Is there any code?

kbenoit commented 8 years ago

We've implemented from scratch so far, focusing on function rather than performance for now. Nothing close in performance to that of create_tcm at the moment.

dselivanov commented 8 years ago

OK, ping me when it is close to finished and I can have a look. But I think it is hard to implement in R. Even the C++ version consumes a lot of RAM on large corpora...