ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Add tokenize_tweets() function and tests #44

Closed: kbenoit closed this pull request 7 years ago

kbenoit commented 7 years ago

What it does:

This partially addresses #25: it lays out new function arguments and a possible framework for implementing other aspects of the #25 wishlist. It also defines the target behaviours needed to structure a later rewrite of (parts of) this in C++ for faster handling. (Although, in my experience of doing this in C++, stringi is just about as fast!)

Additional notes:

codecov-io commented 7 years ago

Codecov Report

Merging #44 into master will increase coverage by 1.11%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #44      +/-   ##
==========================================
+ Coverage   88.99%   90.11%   +1.11%     
==========================================
  Files          12       13       +1     
  Lines         309      344      +35     
==========================================
+ Hits          275      310      +35     
  Misses         34       34
Impacted Files          Coverage      Δ
R/tokenize_tweets.R     100% <100%>   (ø)


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update c79a17c...2ae6e8f.

lmullen commented 7 years ago

@kbenoit Thanks for the PR. Looks great. I'm glad to have this in tokenizers before the next release.

Clever solution for preserving the usernames and URLs. I wouldn't have thought of that.

Adding this to NEWS in a separate commit.