Closed: kbenoit closed this pull request 7 years ago.
Merging #44 into master will increase coverage by 1.11%. The diff coverage is 100%.
    @@            Coverage Diff             @@
    ##           master      #44      +/-   ##
    ==========================================
    + Coverage   88.99%   90.11%   +1.11%
    ==========================================
      Files          12       13       +1
      Lines         309      344      +35
    ==========================================
    + Hits          275      310      +35
      Misses         34       34
| Impacted Files | Coverage Δ | |
|---|---|---|
| R/tokenize_tweets.R | 100% <100%> (ø) | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c79a17c...2ae6e8f. Read the comment docs.
@kbenoit Thanks for the PR. Looks great. I'm glad to have this in tokenizers before the next release.
Clever solutions for preserving the usernames and URLs. I wouldn't have thought of that.
Adding this to NEWS in a separate commit.
What it does: adds a `tokenize_tweets()` function, with the new arguments `strip_punctuation = TRUE` and `strip_url = FALSE`.

This partially addresses #25 and lays out new function arguments and a possible framework for implementing other aspects of the #25 wishlist. It also defines the target behaviours needed to structure a later rewrite of (parts of) this in C++ for faster handling. (Although, in my experience of doing this in C++, stringi is just about as fast!)
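For reference, a minimal sketch of how the new function might be called; the argument names are taken from the description above and the example tweets are invented, so the released signature and exact output may differ.

```r
library(tokenizers)

tweets <- c(
  "Loving the new #rstats release, thanks @kbenoit!",
  "Full details at https://example.com/tokenizers #NLP"
)

# Punctuation is stripped, but @usernames, #hashtags, and (with
# strip_url = FALSE) URLs should survive as single tokens.
tokenize_tweets(tweets, strip_punctuation = TRUE, strip_url = FALSE)
```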
Additional notes:

- The new `strip_punctuation` argument adds slightly to the complexity of basic-tokenizers. I would, however, like to see this show up eventually in `tokenize_words()`.
- I flattened the output of `stri_split_boundaries()` into a single vector, after recording the positions so I could split it back later. This avoids a lot of difficult-to-construct (and read) `Map`/`mapply` operations on lists of logicals for the index operations designed to allow special handling. A sketch of this approach follows the list.
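To illustrate the flatten-and-resplit approach in the second note, here is a hypothetical sketch, not the PR's actual code; the variable names and the filtering step are invented, and only the stringi and base R calls shown are assumed.

```r
library(stringi)

texts <- c("Try #rstats with @kbenoit!", "See https://example.com now.")

# Split each text on word boundaries, flatten to one vector, and record
# which document each piece came from so it can be split back later.
pieces <- stri_split_boundaries(texts, type = "word", skip_word_none = FALSE)
doc_id <- rep(seq_along(pieces), lengths(pieces))
flat   <- unlist(pieces)

# Vectorised special handling on the single flat vector; here it only
# drops whitespace-only pieces, whereas the real function does more.
keep   <- !stri_detect_regex(flat, "^\\s*$")
flat   <- flat[keep]
doc_id <- doc_id[keep]

# Split back into one element per original document, avoiding
# Map()/mapply() over lists of logical indices.
split(flat, factor(doc_id, levels = seq_along(texts)))
```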