ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Clarify what `n_min` means for n-gram tokenization. #72

Closed juliasilge closed 4 years ago

juliasilge commented 5 years ago

There has been some uncertainty about what `n_min` means, and I believe a little more detail in the documentation would be helpful. Addresses juliasilge/tidytext#148.
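For context, here is a minimal sketch (not part of this PR) of how `n_min` interacts with `n` in `tokenize_ngrams()`: the tokenizer returns all n-grams of every length from `n_min` up to `n`, and `n_min` defaults to `n`.

```r
library(tokenizers)

text <- "the quick brown fox"

# With the default n_min = n, only trigrams are returned:
# "the quick brown", "quick brown fox"
tokenize_ngrams(text, n = 3)

# Setting n_min = 1 returns every n-gram of length 1 through 3,
# i.e. unigrams, bigrams, and trigrams together
tokenize_ngrams(text, n = 3, n_min = 1)
```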

codecov-io commented 5 years ago

Codecov Report

Merging #72 into master will decrease coverage by 0.07%. The diff coverage is n/a.


```diff
@@            Coverage Diff             @@
##           master      #72      +/-   ##
==========================================
- Coverage   98.12%   98.04%   -0.08%     
==========================================
  Files          12       12              
  Lines         426      410      -16     
==========================================
- Hits          418      402      -16     
  Misses          8        8
```
| Impacted Files | Coverage Δ |
|---|---|
| R/ngram-tokenizers.R | 95.34% <ø> (-0.21%) ↓ |
| R/utils.R | 92.85% <0%> (-0.9%) ↓ |
| R/character-shingles-tokenizers.R | 90% <0%> (-0.48%) ↓ |
| R/tokenize_tweets.R | 97.5% <0%> (-0.12%) ↓ |
| src/skip_ngrams.cpp | 97.56% <0%> (-0.06%) ↓ |
| R/chunk-text.R | 100% <0%> (ø) ↑ |
| src/shingle_ngrams.cpp | 100% <0%> (ø) ↑ |
| R/ptb-tokenizer.R | 100% <0%> (ø) ↑ |
| R/wordcount.R | 100% <0%> (ø) ↑ |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update ca1674a...4a2f0d3.