ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Lower-level C++ API with external pointers #50

Closed dselivanov closed 7 years ago

dselivanov commented 7 years ago

The idea is to have an option to receive a vector<vector<string>> instead of a list of character vectors (for example, it could look like tokenize_words(..., as_xptr = TRUE)).

That way other packages can use a lower-level API. In particular, I will be able to deprecate the n-gram generation functionality in text2vec and rely solely on tokenizers (and avoid such confusion). Plus, having this will reduce memory usage and allow easy thread parallelism.

See details here - https://github.com/gagolews/stringi/issues/264
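
To make the proposal concrete, here is a minimal sketch in plain standard C++ of what downstream code could do once it holds the tokens as a vector<vector<string>>. It is only an illustration under assumptions: the names TokenizedCorpus, make_ngrams and ngrams_corpus are hypothetical and are not part of tokenizers or text2vec, and the step of handing the container to R through an external pointer is left out entirely.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical alias for the container discussed in this issue:
// one inner vector of tokens per input document.
using TokenizedCorpus = std::vector<std::vector<std::string>>;

// Build all n-grams of sizes n_min..n_max for one document, joining
// tokens with `sep`. Name and signature are illustrative assumptions.
std::vector<std::string> make_ngrams(const std::vector<std::string>& tokens,
                                     std::size_t n_min, std::size_t n_max,
                                     const std::string& sep = "_") {
  std::vector<std::string> out;
  if (n_min == 0) n_min = 1;  // treat 0 as "start from unigrams"
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    std::string gram;
    // Grow the n-gram starting at position i one token at a time.
    for (std::size_t n = 1; n <= n_max && i + n <= tokens.size(); ++n) {
      if (n > 1) gram += sep;
      gram += tokens[i + n - 1];
      if (n >= n_min) out.push_back(gram);
    }
  }
  return out;
}

// Apply make_ngrams to every document. Each iteration touches only its
// own input and output slot, so the loop could be split across threads.
TokenizedCorpus ngrams_corpus(const TokenizedCorpus& corpus,
                              std::size_t n_min, std::size_t n_max,
                              const std::string& sep = "_") {
  TokenizedCorpus out(corpus.size());
  for (std::size_t d = 0; d < corpus.size(); ++d) {
    out[d] = make_ngrams(corpus[d], n_min, n_max, sep);
  }
  return out;
}
```

Because the tokens never leave C++ here, no intermediate R character vectors are allocated between tokenization and n-gram generation, which is where the hoped-for memory savings would come from.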

lmullen commented 7 years ago

@dselivanov Sounds like a good idea.

Ironholds commented 7 years ago

So would this be dependent on stringi?

dselivanov commented 7 years ago

I think so. I will run some experiments with a simple whitespace tokenizer and post the results here.
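
For reference, a simple whitespace tokenizer of the kind mentioned could look roughly like the sketch below. It is an assumption-laden illustration, not the code used in the experiments: the function names are hypothetical, it relies on std::isspace and therefore only recognizes ASCII whitespace in the default locale, and it does not use stringi at all.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split one document on runs of whitespace (ASCII whitespace only,
// as classified by std::isspace). Hypothetical helper name.
std::vector<std::string> tokenize_whitespace(const std::string& text) {
  std::vector<std::string> tokens;
  std::string current;
  for (unsigned char ch : text) {
    if (std::isspace(ch)) {
      // A whitespace character ends the current token, if there is one.
      if (!current.empty()) {
        tokens.push_back(current);
        current.clear();
      }
    } else {
      current.push_back(static_cast<char>(ch));
    }
  }
  if (!current.empty()) tokens.push_back(current);  // flush the last token
  return tokens;
}

// Tokenize every document, returning one token vector per document.
std::vector<std::vector<std::string>> tokenize_corpus(
    const std::vector<std::string>& docs) {
  std::vector<std::vector<std::string>> out;
  out.reserve(docs.size());
  for (const std::string& doc : docs) {
    out.push_back(tokenize_whitespace(doc));
  }
  return out;
}
```

A Unicode-aware version would need a real text-boundary library, which is why the dependency on stringi comes up.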

dselivanov commented 7 years ago

I ran some experiments a couple of months ago. I don't think this will bring much of a speed-up (only 15-35%, as far as I remember). So in my opinion it isn't worth focusing on this issue. Also, this will not help with memory fragmentation (see experiments here; I personally switched to jemalloc and am very happy with it).