@dselivanov Sounds like a good idea.
So would this be dependent on stringi?
I think so. I will run some experiments with a simple whitespace tokenizer and post the results here.
I did some experiments a couple of months ago. I don't think this will bring much of a speed-up (as I remember, only 15-35%), so in my opinion it isn't worth focusing on this issue. It also won't help with memory fragmentation (see the experiments here; I personally switched to jemalloc and am very happy with it).
The idea is to have an option to receive a `vector<vector<string>>` instead of a list of character vectors (for example, it could look like `tokenize_words(..., as_xptr = TRUE)`), so that packages can use a lower-level API. In particular, I would be able to deprecate the n-gram generation functionality in text2vec and rely solely on tokenizers (and avoid such confusion). Plus, having this would reduce memory usage and allow easy thread parallelism.
See details here - https://github.com/gagolews/stringi/issues/264
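
To make the shape of that lower-level API concrete, here is a minimal Rcpp sketch of how an external-pointer return could look. Everything here is hypothetical: `whitespace_tokenize_xptr` and `tokens_as_list` are made-up names, not part of tokenizers or stringi, and the tokenization itself is just whitespace splitting for illustration.

```cpp
// Minimal sketch (hypothetical, not the tokenizers API): tokenize into a
// C++ vector<vector<string>> and hand it back to R as an external pointer,
// instead of materializing an R list of character vectors.

#include <Rcpp.h>
#include <vector>
#include <string>
#include <sstream>
using namespace Rcpp;

typedef std::vector<std::vector<std::string>> TokenCorpus;

// [[Rcpp::export]]
SEXP whitespace_tokenize_xptr(CharacterVector docs) {
  TokenCorpus* corpus = new TokenCorpus();
  corpus->reserve(docs.size());
  for (R_xlen_t i = 0; i < docs.size(); ++i) {
    std::istringstream ss(as<std::string>(docs[i]));
    std::vector<std::string> tokens;
    std::string tok;
    while (ss >> tok) tokens.push_back(tok);  // split on whitespace
    corpus->push_back(tokens);
  }
  // XPtr takes ownership; the C++ object is freed when the R handle is GC'd
  return XPtr<TokenCorpus>(corpus, true);
}

// [[Rcpp::export]]
List tokens_as_list(SEXP ptr) {
  XPtr<TokenCorpus> corpus(ptr);
  List out(corpus->size());
  for (size_t i = 0; i < corpus->size(); ++i)
    out[i] = wrap((*corpus)[i]);  // copy back to R only on demand
  return out;
}
```

In this scheme, downstream C++ code (e.g. n-gram generation in text2vec) could work directly on the `TokenCorpus` behind the pointer, possibly across threads, and `tokens_as_list` would only be called when the user actually needs the R representation.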