Jauntbox commented 4 years ago

Related issues n/a

Describe the proposed solution Tests did not catch that the token length distributions added to TextStats were actually entry length distributions. This PR refactors some of the functions in TextTokenizer, SmartTextVectorizer, and SmartTextMapVectorizer so that they are directly testable. It also adds more robust tests to check desired behavior of the TextStats object.

Describe alternatives you've considered n/a

Additional context n/a

codecov[bot] commented 4 years ago

Codecov Report

Merging #464 into master will increase coverage by 0.00%. The diff coverage is 92.30%.

@@           Coverage Diff           @@
##           master     #464   +/-   ##
=======================================
  Coverage   86.98%   86.99%           
=======================================
  Files         345      345           
  Lines       11575    11616   +41     
  Branches      376      376           
=======================================
+ Hits        10069    10105   +36     
- Misses       1506     1511    +5

Impacted Files	Coverage Δ
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala	`67.64% <ø> (ø)`
.../scala/com/salesforce/op/dsl/RichTextFeature.scala	`81.94% <ø> (ø)`
...a/com/salesforce/op/filters/RawFeatureFilter.scala	`92.97% <ø> (ø)`
...main/scala/com/salesforce/op/test/TestCommon.scala	`40.90% <0.00%> (-9.10%)`	:arrow_down:
...e/op/stages/impl/feature/SmartTextVectorizer.scala	`95.58% <96.15%> (-0.03%)`	:arrow_down:
...om/salesforce/op/filters/FeatureDistribution.scala	`98.70% <100.00%> (+0.03%)`	:arrow_up:
...sification/BinaryClassificationModelSelector.scala	`98.24% <100.00%> (ø)`
...p/stages/impl/feature/SmartTextMapVectorizer.scala	`100.00% <100.00%> (ø)`
...esforce/op/stages/impl/feature/TextTokenizer.scala	`97.22% <100.00%> (+0.07%)`	:arrow_up:
...sforce/op/stages/OpPipelineStageReaderWriter.scala	`87.50% <100.00%> (+0.40%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update ed4abfd...d78868d. Read the comment docs.

Jauntbox commented 4 years ago

Just a heads up on a few more commits - adding a toggle for tokenization in text lengths. Will cause a problems with Chinese/Korean text based on our current tokenizers.

Jauntbox commented 4 years ago

Ok - ready for a final look. Sorry for the last-minute refactoring, but realized we needed this toggle exposed for experiments.

Final refactoring:

Removed tokenizeStringOpt method since we can get by without it (should be more readable too)
Moved methods to create TextStats instances from the SmartTextVectorizer objects and into the TextStats object since they make more sense there
Added toggle to SmartTextVectorizer and SmartTextMapVectorizer to enable/disable tokenization when calculating length distributions in TextStats
Added tests to check that this toggle does what we want
Added logging of derived quantities and vectorization method used in SmartTextVectorizer and SmartTextMapVectorizer

TuanNguyen27 commented 4 years ago

lgtm !

salesforce / TransmogrifAI

Allow TextStats length distribution to be token-based and refactor for testability #464

Codecov Report