salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License
2.24k stars 392 forks source link

Allow TextStats length distribution to be token-based and refactor for testability #464

Closed Jauntbox closed 4 years ago

Jauntbox commented 4 years ago

Related issues n/a

Describe the proposed solution Tests did not catch that the token length distributions added to TextStats were actually entry length distributions. This PR refactors some of the functions in TextTokenizer, SmartTextVectorizer, and SmartTextMapVectorizer so that they are directly testable. It also adds more robust tests to check desired behavior of the TextStats object.

Describe alternatives you've considered n/a

Additional context n/a

codecov[bot] commented 4 years ago

Codecov Report

Merging #464 into master will increase coverage by 0.00%. The diff coverage is 92.30%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #464   +/-   ##
=======================================
  Coverage   86.98%   86.99%           
=======================================
  Files         345      345           
  Lines       11575    11616   +41     
  Branches      376      376           
=======================================
+ Hits        10069    10105   +36     
- Misses       1506     1511    +5     
Impacted Files Coverage Δ
...n/scala/com/salesforce/op/dsl/RichMapFeature.scala 67.64% <ø> (ø)
.../scala/com/salesforce/op/dsl/RichTextFeature.scala 81.94% <ø> (ø)
...a/com/salesforce/op/filters/RawFeatureFilter.scala 92.97% <ø> (ø)
...main/scala/com/salesforce/op/test/TestCommon.scala 40.90% <0.00%> (-9.10%) :arrow_down:
...e/op/stages/impl/feature/SmartTextVectorizer.scala 95.58% <96.15%> (-0.03%) :arrow_down:
...om/salesforce/op/filters/FeatureDistribution.scala 98.70% <100.00%> (+0.03%) :arrow_up:
...sification/BinaryClassificationModelSelector.scala 98.24% <100.00%> (ø)
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100.00% <100.00%> (ø)
...esforce/op/stages/impl/feature/TextTokenizer.scala 97.22% <100.00%> (+0.07%) :arrow_up:
...sforce/op/stages/OpPipelineStageReaderWriter.scala 87.50% <100.00%> (+0.40%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update ed4abfd...d78868d. Read the comment docs.

Jauntbox commented 4 years ago

Just a heads up on a few more commits - adding a toggle for tokenization in text lengths. Will cause a problems with Chinese/Korean text based on our current tokenizers.

Jauntbox commented 4 years ago

Ok - ready for a final look. Sorry for the last-minute refactoring, but realized we needed this toggle exposed for experiments.

Final refactoring:

TuanNguyen27 commented 4 years ago

lgtm !