ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Strip punctuation option for tokenize_ngrams #57

Closed alanault closed 6 years ago

alanault commented 6 years ago

I saw in issue #48 that the option to control punctuation removal was left out of the tokenize_ngrams function.

Is there a technical reason for this? Punctuation can carry a lot of meaning in text, especially in languages like Spanish.

For example:

We're going to the R conference.
We're going to the R conference?
We're going to the R conference???????

Each has a very different meaning. In Spanish, punctuation can be crucial for telling whether a sentence is a statement or a question.
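
To make the concern concrete, here is a minimal, language-neutral sketch in Python. It is not the tokenizers API: the ngrams helper and its strip_punct argument are hypothetical names used only for illustration, and the regular expression is a simplified stand-in for real word-boundary rules. It shows that once punctuation is stripped, the statement and the question above produce identical n-grams.

```python
import re

def ngrams(text, n=2, strip_punct=True):
    """Hypothetical helper: lowercase, tokenize, and join each run of n tokens."""
    # A token is either a word (allowing internal apostrophes, as in "we're")
    # or a single punctuation character.
    tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text.lower())
    if strip_punct:
        # Keep only tokens that start with a letter or digit.
        tokens = [t for t in tokens if t[0].isalnum()]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

statement = "We're going to the R conference."
question = "We're going to the R conference?"

# With punctuation stripped, the two sentences yield the same bigrams.
print(ngrams(statement) == ngrams(question))

# With punctuation kept, the final bigram tells them apart.
print(ngrams(statement, strip_punct=False)[-1])
print(ngrams(question, strip_punct=False)[-1])
```

Keeping punctuation as separate tokens is only one possible design. It does mean that a run like ??????? becomes several tokens, which may or may not be what a given analysis wants.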