ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Punctuation options #48

Closed by lmullen 7 years ago

lmullen commented 7 years ago

Some tokenizers have options to strip or retain punctuation and numbers. These options should also be present in the word and n-gram tokenizers.

lmullen commented 7 years ago

The punct-options branch now has options for preserving punctuation in tokenize_words().

@kbenoit Does this do what you expect it to do?

lmullen commented 7 years ago

I'm not going to do this for n-grams. It works for the word tokenizer.
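
For readers skimming the thread, here is a minimal sketch of what a strip-or-retain punctuation option on a word tokenizer can look like. It is written in Python purely for illustration and is not the package's R implementation: the function below, its `lowercase` and `strip_punct` parameters, and the regular expressions that decide what counts as a word or a punctuation mark are all assumptions made for this example.

```python
import re

# Hypothetical stand-in, NOT the tokenizers package's implementation.
# A "word" here is a run of word characters, optionally with internal
# apostrophes (so "it's" stays one token). This definition is an assumption.
_WORD = re.compile(r"\w+(?:'\w+)*")

# Same word pattern, plus any single non-word, non-space character
# emitted as its own punctuation token.
_WORD_OR_PUNCT = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize_words(text, lowercase=True, strip_punct=True):
    """Split text into word tokens, optionally keeping punctuation tokens."""
    if lowercase:
        text = text.lower()
    pattern = _WORD if strip_punct else _WORD_OR_PUNCT
    return pattern.findall(text)


text = "Hello, world! It's 2017."

# Punctuation stripped (the default in this sketch):
print(tokenize_words(text))
# ['hello', 'world', "it's", '2017']

# Punctuation retained as separate tokens:
print(tokenize_words(text, strip_punct=False))
# ['hello', ',', 'world', '!', "it's", '2017', '.']
```

In this sketch the option only changes which pattern is used, so the word tokens themselves are identical in both modes. The sketch does not cover n-grams, in line with the decision above to limit the option to the word tokenizer.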