Closed kbenoit closed 7 years ago
@adamobeng if you dig through the revision history for ngrams.R you will see how it used to be done in R rather than C++. Might be worth comparing approaches, as I don't remember copying the tokens in order to paste the ngram components together.
As Adam spotted, the old version was awfully slow, so I updated the C++ version. It is in ngrams3.cpp. For tokenizedTexts, skipgram_cppl2 should be used.
The native R version is still faster, but the C++ version has a skipgram function with UTF-8 support.
  test replications elapsed relative user.self sys.self user.child sys.child
1    R          100  32.077    1.000    33.219        0          0         0
2  C++          100  68.789    2.144    69.072        0          0         0
@kbenoit, @koheiw, @adamobeng why doesn't tokenizers work for you?
tokenizers:::generate_ngrams_batch(txt, ngram_min = 2, ngram_max = 2,
stopwords = character(0), ngram_delim = ' ')
@lmullen and I started this package especially for such cases.
We can expose the generate_ngrams_batch function if you need it.
Thanks for the suggestion, we will check it out. We have no objection in principle, if it's based on stringi to work with Unicode, and fast. We'd build wrappers, though, to maintain a consistent UI with the other quanteda functions and to implement additional features such as the ability to preserve Twitter punctuation characters (something needed by lots of quanteda users).
But your text2vec doesn't use tokenizers either last I checked? We're very impressed by the performance of your token handling in that package btw.
@kbenoit text2vec uses a similar approach to ngram generation, but at a slightly lower level.
As I said, at the moment tokenize_ngrams performs both tokenization and ngram generation. We can split this into 2 parts.
@kbenoit: We do use stringi as the base for the tokenizers package. And if there is anything we can do that would make it more useful for you, please do let us know.
My textreuse package doesn't use tokenizers either at the moment, but it will at the next update.
@dselivanov @lmullen Thanks for the offer. Makes a lot of sense to invest in just one tokeniser and base other packages on that.
@dselivanov BTW we are working on a cfm (context-feature-matrix) function in quanteda to produce the input item needed for your gloVe function. Like your use of iterators but think that some users might find this hard to understand. Have you considered encapsulating it all so the users never have to know about iterators?
@kbenoit I don't think iterators are a big problem, since text2vec provides constructors for common sources (including a plain list of tokens: itoken(tokens)). It is trivial to hide all actions with iterators, but I would prefer that users learn one new concept which will save them a lot (both memory and CPU time) in the future.
Can you provide details on what "context-feature-matrix" means? How is it different from a term co-occurrence matrix?
context-feature-matrix is the same as your tcm - except that we call terms "features" and they are always the column. (So the column is the feature and the row is the feature but not as a target but as a context marker.) Result is still VxV where V is the number of features.
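To make the V x V shape concrete, here is a minimal sketch of counting co-occurrences within a fixed window. It is not the quanteda cfm code or text2vec's create_tcm; the function name `count_cooccurrences`, the symmetric window and the unweighted counts are all assumptions chosen for illustration. The result is stored sparsely as (context, feature) pairs rather than as a dense matrix.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Key is (context, feature): the row and column labels of the V x V matrix.
using Pair = std::pair<std::string, std::string>;

// Hypothetical helper: count how often two tokens occur within `window`
// positions of each other. Counts are symmetric and unweighted (assumption).
std::map<Pair, int> count_cooccurrences(const std::vector<std::string>& tokens,
                                        std::size_t window) {
    std::map<Pair, int> counts;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        // Look ahead only; the mirrored entry is added explicitly below.
        for (std::size_t j = i + 1; j < tokens.size() && j - i <= window; ++j) {
            ++counts[Pair(tokens[i], tokens[j])];
            ++counts[Pair(tokens[j], tokens[i])];
        }
    }
    return counts;
}
```

Only non-zero cells are stored, which matters because a dense V x V matrix grows quadratically with the vocabulary.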
Are you implementing this function from scratch or creating a wrapper for create_tcm? Is there any code?
We've implemented from scratch so far, focusing on function rather than performance for now. Nothing close in performance to that of create_tcm at the moment.
Ok, ping me when it is close to finished. I could have a look. But I think it is hard to implement in R. Even the C++ version consumes a lot of RAM on large corpora...
@adamobeng has proposed a simple R-based ngram former that seems to blow away the C++ code in terms of speed. @koheiw has something gone awry with the C++ code? We should test this and figure out what, if anything, has happened.
Switch to the dev_ngrams branch and try the ngramsNew() methods, and examples - and the benchmark code is at the top of the new file ngramsDEV.R, which duplicates the existing ngrams methods in API, but with ngramsNew to differentiate them.
Things to consider/implement:
- length(n) (where n is the vector argument to ngrams()) copies of the tokens.
- skip functionality is not yet implemented.
Note: The original code for ngrams, which the C++ code replaced, was based on a similar method implemented in R.
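For comparison, a minimal C++ sketch of forming n-grams for several values of n from one shared token vector, so the tokens themselves are never duplicated per value of n. This is not the code in ngrams3.cpp or ngramsDEV.R; the function name `make_ngrams` and the default "_" delimiter are assumptions for illustration, and skip-grams are not covered. Joining std::string values byte-wise leaves UTF-8 tokens intact.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: build every n-gram of each size listed in `sizes`,
// reading from the same `tokens` vector (passed by const reference, so no
// copy of the tokens is made per value of n).
std::vector<std::string> make_ngrams(const std::vector<std::string>& tokens,
                                     const std::vector<std::size_t>& sizes,
                                     const std::string& delim = "_") {
    std::vector<std::string> out;
    for (std::size_t n : sizes) {
        if (n == 0 || n > tokens.size()) continue;  // no n-grams of this size
        for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
            std::string gram = tokens[i];
            for (std::size_t j = 1; j < n; ++j) {
                gram += delim;          // join components with the delimiter
                gram += tokens[i + j];
            }
            out.push_back(gram);
        }
    }
    return out;
}
```

Calling `make_ngrams(tokens, {1, 3})` returns all unigrams followed by all trigrams, in the order the sizes were given.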