Closed: kbenoit closed this pull request 7 years ago.
Merging #44 into master will increase coverage by 1.11%. The diff coverage is 100%.
    @@            Coverage Diff             @@
    ##           master      #44      +/-   ##
    ==========================================
    + Coverage   88.99%   90.11%   +1.11%
    ==========================================
      Files          12       13       +1
      Lines         309      344      +35
    ==========================================
    + Hits          275      310      +35
      Misses         34       34
| Impacted Files | Coverage Δ | |
|---|---|---|
| R/tokenize_tweets.R | 100% <100%> (ø) | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c79a17c...2ae6e8f. Read the comment docs.
@kbenoit Thanks for the PR. Looks great. I'm glad to have this in tokenizers before the next release.
Clever solutions for preserving the usernames and URLs. I wouldn't have thought of that.
Adding this to NEWS in a separate commit.
What it does: adds a `tokenize_tweets()` function, with the new arguments `strip_punctuation = TRUE` and `strip_url = FALSE`.

This partially addresses #25 and lays out new function arguments and a possible framework for implementing other aspects of the #25 wishlist. It also defines the target behaviours needed to structure a later rewrite of (parts of) this in C++ for faster handling. (Although, in my experience of doing this in C++, stringi is just about as fast!)
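For reference, a minimal sketch of how the new function might be called; the argument names are taken from the description above and the example tweets are invented, so the released signature and exact output may differ.

```r
library(tokenizers)

tweets <- c(
  "Loving the new #rstats release, thanks @kbenoit!",
  "Full details at https://example.com/tokenizers #NLP"
)

# Punctuation is stripped, but @usernames, #hashtags, and (with
# strip_url = FALSE) URLs should survive as single tokens.
tokenize_tweets(tweets, strip_punctuation = TRUE, strip_url = FALSE)
```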
Additional notes:

- The new `strip_punctuation` argument adds slightly to the complexity of basic-tokenizers. I would, however, like to see this show up eventually in `tokenize_words()`.
- I flattened the output of `stri_split_boundaries()` into a single vector, after recording the positions so I could split it back later. This avoids a lot of difficult-to-construct (and read) `Map`/`mapply` operations on lists of logicals for the index operations designed to allow special handling. A sketch of this approach follows the list.
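To illustrate the flatten-and-resplit approach in the second note, here is a hypothetical sketch, not the PR's actual code; the variable names and the filtering step are invented, and only the stringi and base R calls shown are assumed.

```r
library(stringi)

texts <- c("Try #rstats with @kbenoit!", "See https://example.com now.")

# Split each text on word boundaries, flatten to one vector, and record
# which document each piece came from so it can be split back later.
pieces <- stri_split_boundaries(texts, type = "word", skip_word_none = FALSE)
doc_id <- rep(seq_along(pieces), lengths(pieces))
flat   <- unlist(pieces)

# Vectorised special handling on the single flat vector; here it only
# drops whitespace-only pieces, whereas the real function does more.
keep   <- !stri_detect_regex(flat, "^\\s*$")
flat   <- flat[keep]
doc_id <- doc_id[keep]

# Split back into one element per original document, avoiding
# Map()/mapply() over lists of logical indices.
split(flat, factor(doc_id, levels = seq_along(texts)))
```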