ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Add strip_url option to tokenize_words() #85

Closed fschaffner closed 1 year ago

fschaffner commented 1 year ago

Hi, thanks for maintaining this package!

Could a strip_url = TRUE option be added to tokenize_words()? Or is there already a recommended way of removing URLs when using tokenize_words()?

lmullen commented 1 year ago

I don't think this is something I'm willing to add at this point. This use case is too specific for a general-purpose package.

tokenize_words() gives you back a list of character vectors, one per input document. You can easily filter a character vector to remove URLs, e.g. by writing a regular expression that detects them.
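
A minimal sketch of that regex approach (the url_pattern below is illustrative, not exhaustive, and is not part of the package):

```r
library(tokenizers)

# Simple, illustrative URL pattern; real-world URLs may need something more robust.
url_pattern <- "(https?://|www\\.)\\S+"

docs <- c(
  "See https://docs.ropensci.org/tokenizers for the documentation.",
  "No links in this sentence."
)

# Option 1: strip URLs from the raw text before tokenizing.
tokens_pre <- tokenize_words(gsub(url_pattern, "", docs))

# Option 2: tokenize first, then drop tokens that match the pattern.
# Note that tokenize_words() strips punctuation, so a URL may already be
# split into pieces by this point; removing URLs from the raw text first
# (Option 1) is usually the more reliable route.
tokens_post <- lapply(tokenize_words(docs), function(x) x[!grepl(url_pattern, x)])

tokens_pre
```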