dselivanov closed this issue 8 years ago
Sure. That's just an n-gram tokenizer where the units are characters instead of words. Can you please try the tokenize_character_shingles() function in this branch?
https://github.com/ropensci/tokenizers/tree/shingle-characters
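For anyone following along, here is a minimal sketch of trying it out; the installation call and the n argument are assumptions on my part, so check the branch documentation:
# install the branch (assumes the remotes package is available; ref name taken from the URL above)
# remotes::install_github("ropensci/tokenizers", ref = "shingle-characters")
library(tokenizers)
# character shingles of size 3 from a toy string
tokenize_character_shingles("abcd efgh jklmn", n = 3)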
Cool, I thought it would be much slower. Interesting fact: strsplit(txt, "", FALSE) is roughly twice as fast as stri_split_boundaries(txt, type = "character"):
data("movie_review")
txt = movie_review$review %>% tolower()
system.time(chars <- strsplit(txt, "", F))
# user system elapsed
# 0.186 0.006 0.192
system.time(stringi::stri_split_boundaries(txt, type = "character"))
# user system elapsed
# 0.410 0.006 0.416
Interesting observation. Do you think the speed gain comes at the cost of correctness for strings with multibyte characters? I'd rather just stick with stringi. I tested this on a corpus with 2.5 million words and strsplit saved less than half a second, so the time savings are negligible even for a mid-sized corpus.
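To make the multibyte concern concrete, here is a small illustration of my own (not from the thread): strsplit() splits at code points, while stri_split_boundaries(type = "character") splits at grapheme boundaries, so a combining accent stays attached to its base letter.
x = "e\u0301clair"  # "é" written as a plain "e" followed by a combining acute accent
strsplit(x, "", fixed = TRUE)[[1]]
# splits into code points: the combining accent becomes its own element, detached from "e"
stringi::stri_split_boundaries(x, type = "character")[[1]]
# splits into graphemes: "é" stays together as a single element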
Are the default arguments for tokenize_character_shingles() reasonable? If you don't have any other things that you think would be worth adding, I think I'll package this up and send it to CRAN.
I don't think it's worth changing stri_split_boundaries to strsplit; as you pointed out, the difference is very small. tokenize_character_shingles looks good to me.
This function is on master now. I'll release the new version soon.
It can be useful to have a "shingle" tokenizer that works at the character level (useful for noisy/corrupted texts, especially for LSH-over-shingles techniques).
Here are examples (character shingles of size 3, first non-overlapping, then overlapping):
input = "abcd efgh jklmn"
shingle_size = 3
# non-overlapping
output = c("abc", "d e", "fgh", " jk", "lmn")
# overlapping
output = c("abc", "bcd", "cd ", "d e", " ef", "efg", "fgh", "gh ", "h j", " jk", "jkl", "klm", "lmn")
It's worth checking stringdist::qgrams. I like stringdist a lot; a brilliant package with minimal dependencies.
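As a quick reference, a hedged sketch of what that looks like (the output shape is described from memory, so verify locally):
library(stringdist)
# qgrams() tabulates character q-grams; q = 3 gives 3-character shingles
qgrams("abcd efgh jklmn", q = 3)
# returns a count matrix with one column per distinct 3-gram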