ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Character level tokenizers #22

Closed dselivanov closed 8 years ago

dselivanov commented 8 years ago

It could be useful to have a "shingle" tokenizer that works at the character level (useful for noisy/corrupted texts, and especially for LSH-over-shingles techniques).

Here is an example with input = "abcd efgh jklmn" and shingle_size = 3:

  1. Non-overlapping chunks: output = c("abc", "d e", "fgh", " jk", "lmn")
  2. Sliding window: output = c("abc", "bcd", "cd ", "d e", " ef", "efg", "fgh", "gh ", "h j", " jk", "jkl", "klm", "lmn")
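Both variants above can be sketched in a few lines of base R (illustrative only; the variable names `chunks` and `windows` are mine, not from any package):

```r
# Illustrative base-R sketch of the two character-shingle variants above.
input <- "abcd efgh jklmn"
k <- 3  # shingle size
n <- nchar(input)

# 1. Non-overlapping chunks of k characters (the last chunk may be shorter)
starts <- seq(1, n, by = k)
chunks <- substring(input, starts, pmin(starts + k - 1, n))

# 2. Sliding window: every contiguous run of k characters
first <- seq_len(n - k + 1)
windows <- substring(input, first, first + k - 1)

print(chunks)   # "abc" "d e" "fgh" " jk" "lmn"
print(windows)  # 13 overlapping trigrams, "abc" through "lmn"
```

`substring()` is vectorized over its start and end positions, which keeps both variants a one-liner after computing the offsets.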

It's worth checking stringdist::qgrams. I like stringdist a lot: a brilliant package with minimal dependencies.

lmullen commented 8 years ago

Sure. That's just an n-gram tokenizer where the units are characters instead of words. Can you please try the tokenize_character_shingles() function in this branch?

https://github.com/ropensci/tokenizers/tree/shingle-characters

dselivanov commented 8 years ago

Cool, I thought it would be much slower. An interesting fact: strsplit(txt, "", FALSE) is about twice as fast as stri_split_boundaries(txt, type = "character"):

library(text2vec)   # provides the movie_review data set
library(magrittr)   # for the %>% pipe

data("movie_review")
txt = movie_review$review %>% tolower()
system.time(chars <- strsplit(txt, "", FALSE))
#   user  system elapsed
#  0.186   0.006   0.192
system.time(stringi::stri_split_boundaries(txt, type = "character"))
#   user  system elapsed
#  0.410   0.006   0.416

lmullen commented 8 years ago

Interesting observation. Do you think the speed gain comes at the cost of correctness for strings that contain multibyte characters? I'd rather just stick with stringi. I tested this on a corpus with 2.5 million words, and strsplit saved less than half a second, so the time saved is negligible even for a mid-sized corpus.
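The correctness concern is real for grapheme clusters: strsplit(x, "") splits a string into Unicode code points, while stri_split_boundaries(type = "character") respects grapheme boundaries. A small sketch (the stringi half is guarded, since the package may not be installed):

```r
# "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme, two code points
x <- "e\u0301"

# strsplit() splits into code points, so the accent is separated from the "e"
code_points <- strsplit(x, "")[[1]]
print(length(code_points))  # 2

# stri_split_boundaries() keeps the grapheme cluster together
if (requireNamespace("stringi", quietly = TRUE)) {
  graphemes <- stringi::stri_split_boundaries(x, type = "character")[[1]]
  print(length(graphemes))  # 1
}
```

So the two functions only agree on text where every user-perceived character is a single code point; for combining marks (and similar cases such as emoji sequences) they tokenize differently.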

Are the default arguments for tokenize_character_shingles() reasonable? If you don't have anything else you think is worth adding, I'll package this up and send it to CRAN.

dselivanov commented 8 years ago

I don't think it's worth changing stri_split_boundaries to strsplit; as you pointed out, the difference is very small. tokenize_character_shingles() looks good to me.

lmullen commented 8 years ago

This function is on master now. I'll release the new version soon.