ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers
Other
184 stars 25 forks source link

Fix #31 #38

Closed Ironholds closed 7 years ago

Ironholds commented 7 years ago

This fixes #31 . It's a bit more complicated than one might think from the description, essentially because the most efficient way to check stopwords is with a set object, which needs to be constructed within C++. Accordingly I've shifted the vectorisation to compiled code, so that the set object only needs to be constructed once.

codecov-io commented 7 years ago

Codecov Report

Merging #38 into master will increase coverage by 0.33%. The diff coverage is 96.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
+ Coverage   87.69%   88.02%   +0.33%     
==========================================
  Files          11       11              
  Lines         260      284      +24     
==========================================
+ Hits          228      250      +22     
- Misses         32       34       +2
Impacted Files Coverage Δ
src/shingle_ngrams.cpp 100% <ø> (ø) :arrow_up:
src/init.c 100% <ø> (ø) :arrow_up:
R/ngram-tokenizers.R 96.55% <92.85%> (-3.45%) :arrow_down:
src/skip_ngrams.cpp 97.43% <97.43%> (-2.57%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update b6fc617...38802e9. Read the comment docs.

lmullen commented 7 years ago

@Ironholds This looks great. I'm not merging it immediately, simply because I am working on getting tokenize_skip_ngrams() to produce correct output. I'm at the place where you could take over that if you wanted, and combine the solutions to #31 and #24. Let me pick up the discussion in #24.

Ironholds commented 7 years ago

OK, this should be good to go unless you want me to add in the keyboard-interrupts for convenience and clear out 3 open tickets at once (which would also give me an opportunity to update news, which I forgot to do, which..doh)

lmullen commented 7 years ago

Just saw that last message, sorry. I've merged this in. Looks fantastic. I'm going to go through and add some more tests and rework some of the documentation/vignettes in light of this, but it looks like you've completed the bulk of the work. Thanks so much!!!

Ironholds commented 7 years ago

Yay!