ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Rewrite tokenize_skip_ngrams to preserve order #11

Closed lmullen closed 7 years ago

lmullen commented 8 years ago

tokenize_skip_ngrams() should work the same way as tokenize_ngrams():

> tokenize_ngrams(test, n = 2, n_min = 1)
[[1]]
 [1] "one"         "one two"     "two"         "two three"   "three"       "three four" 
 [7] "four"        "four five"   "five"        "five six"    "six"         "six seven"  
[13] "seven"       "seven eight" "eight"       "eight nine"  "nine"        "nine ten"   
[19] "ten" 
> tokenize_skip_ngrams(test, n = 2, k = 1)
[[1]]
 [1] "one three"   "two four"    "three five"  "four six"    "five seven"  "six eight"  
 [7] "seven nine"  "eight ten"   "one two"     "two three"   "three four"  "four five"  
[13] "five six"    "six seven"   "seven eight" "eight nine"  "nine ten"   

It should preserve the order of the tokens in the documents.
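One way to get that ordering is to generate skip n-grams by starting position rather than by skip distance, so every gram anchored at token *i* is emitted before any gram anchored at token *i + 1*. The sketch below is not the package's C++ implementation, just a minimal illustration of that ordering (written in Python for brevity; `skip_ngrams` and its interpretation of `k` as the maximum number of skipped tokens between consecutive words are assumptions):

```python
from itertools import product

def skip_ngrams(tokens, n=2, k=1):
    """Return skip n-grams ordered by position in the document.

    Iterating over starting indices first (instead of grouping all
    k-skip grams together) keeps the output in document order.
    Assumes k is the max number of tokens skipped between
    consecutive words of a gram.
    """
    out = []
    for i in range(len(tokens)):
        # Each of the n-1 steps after position i may advance by
        # 1 (no skip) up to k+1 (k skipped tokens).
        for steps in product(range(1, k + 2), repeat=n - 1):
            indices = [i]
            for s in steps:
                indices.append(indices[-1] + s)
            if indices[-1] < len(tokens):
                out.append(" ".join(tokens[j] for j in indices))
    return out

grams = skip_ngrams("one two three four".split(), n=2, k=1)
# Grams anchored at "one" come before grams anchored at "two", etc.
```

With this ordering, `n = 2, k = 1` on `"one two three four"` yields `"one two"`, `"one three"`, `"two three"`, `"two four"`, `"three four"`, matching the positional order that `tokenize_ngrams()` produces.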

lmullen commented 7 years ago

Not going to change this because #24 is the actual problem.