salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Add support for ignoring text that looks like IDs in SmartTextMapVectorizer #455

Status: Closed (Jauntbox closed this 4 years ago)

Jauntbox commented 4 years ago

Related issues
This is the map version of #448.

Describe the proposed solution
Adds a few parameters to SmartTextMapVectorizer so that text fields that would otherwise be hashed (i.e., not treated as categorical) can be ignored when their token length variance falls below a specified threshold (e.g., to catch machine-generated IDs).
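As a rough illustration of the idea (not the code in this PR), the per-key check could look like the Python sketch below. The function names, the regex tokenizer, the use of standard deviation as the spread measure, and the threshold value of 1.0 are all assumptions made for the example; the actual SmartTextMapVectorizer parameters are not shown in this description.

```python
import re
import statistics
from collections import defaultdict


def token_lengths(text):
    """Lengths of word tokens from a simple regex tokenizer (a stand-in for the real one)."""
    return [len(tok) for tok in re.findall(r"\w+", text.lower())]


def id_like_keys(rows, min_length_std_dev=1.0):
    """Return the map keys whose token lengths barely vary across all rows.

    rows: iterable of dicts mapping key -> text (a simplified TextMap).
    min_length_std_dev: hypothetical threshold; keys below it are flagged.
    """
    lengths_by_key = defaultdict(list)
    for row in rows:
        for key, text in row.items():
            lengths_by_key[key].extend(token_lengths(text))

    flagged = set()
    for key, lengths in lengths_by_key.items():
        # Need at least two tokens for a spread to be meaningful.
        if len(lengths) >= 2 and statistics.pstdev(lengths) < min_length_std_dev:
            flagged.add(key)
    return flagged


rows = [
    {"order_id": "a1b2c3d4", "comment": "arrived late but works fine"},
    {"order_id": "e5f6a7b8", "comment": "great"},
    {"order_id": "0c9d8e7f", "comment": "would not recommend this to anyone"},
]
print(id_like_keys(rows))  # {'order_id'}
```

Every `order_id` value is a single eight-character token, so its length spread is zero and the key is flagged, while the free-text `comment` key has varied token lengths and is kept.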

Describe alternatives you've considered
One alternative is a form of top-K token counting (e.g., with a count-min sketch). This works, but is difficult to scale robustly with dataset size; it may be implemented later via Algebird's TopKCMS data structure. Filtering by raw text length standard deviation, or by how well the text length distribution fits a Poisson distribution, performed better on synthetic data and requires fewer modifications to SmartTextVectorizer.

Additional context
One more thing to be careful about: we still use the CJK tokenizer for Chinese and Korean text (Japanese already uses a proper language-specific tokenizer). This tokenizer always splits text into character bigrams, so every token has the same length and such text would fail any length distribution check. We will need to switch the Chinese and Korean tokenizers to language-specific ones that pick out words rather than bigrams.
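To make the bigram problem concrete, here is a small Python sketch. The `char_bigrams` function is a simplified stand-in for a CJK bigram tokenizer, written only for this example; it is not the tokenizer TransmogrifAI uses.

```python
import statistics


def char_bigrams(text):
    """Overlapping character bigrams, ignoring whitespace (simplified stand-in)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


# Ordinary Chinese sentences, not machine-generated IDs.
texts = ["我今天去了北京的图书馆", "这个产品的质量非常好"]

lengths = [len(tok) for t in texts for tok in char_bigrams(t)]
spread = statistics.pstdev(lengths)

print(sorted(set(lengths)))  # [2]
print(spread)                # 0.0
```

Because every token is exactly two characters long, the length spread is zero, and a variance-based filter would wrongly treat ordinary Chinese or Korean text as ID-like.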

We also plan to add a way to filter based on the goodness of fit of the text length distribution to a Poisson distribution in a future PR. All the information needed is already available, so the modifications should be straightforward.
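One way such a goodness-of-fit check might be sketched in Python is shown below. The PR does not say which statistic would be used, so the total variation distance here is an assumption, as are the function names. The example also assumes that free text fits a Poisson distribution better than fixed-length IDs do.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Poisson probability mass at k for rate lam > 0, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_fit_distance(lengths):
    """Total variation distance between the empirical length distribution
    and a Poisson distribution with the same mean (0 = identical, 1 = disjoint)."""
    n = len(lengths)
    lam = sum(lengths) / n
    if lam == 0:
        return 0.0  # all lengths are zero, which is exactly Poisson(0)
    counts = Counter(lengths)
    max_k = max(counts)
    observed = sum(abs(counts.get(k, 0) / n - poisson_pmf(k, lam))
                   for k in range(max_k + 1))
    # Poisson mass beyond the largest observed length has no empirical counterpart.
    tail = max(0.0, 1.0 - sum(poisson_pmf(k, lam) for k in range(max_k + 1)))
    return 0.5 * (observed + tail)


# Fixed-length IDs: every text has length 8.
id_lengths = [8] * 100

# A deterministic sample shaped like Poisson(4), built from rounded expected counts.
poisson_like = [k for k in range(20)
                for _ in range(round(1000 * poisson_pmf(k, 4.0)))]

print(poisson_fit_distance(id_lengths) > poisson_fit_distance(poisson_like))  # True
```

A filter built on this would flag a field when the distance exceeds some threshold, which would have to be tuned on real data.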

codecov[bot] commented 4 years ago

Codecov Report

Merging #455 into master will increase coverage by 19.71%. The diff coverage is 86.53%.


@@             Coverage Diff             @@
##           master     #455       +/-   ##
===========================================
+ Coverage   67.23%   86.95%   +19.71%     
===========================================
  Files         337      340        +3     
  Lines       11161    11418      +257     
  Branches      350      371       +21     
===========================================
+ Hits         7504     9928     +2424     
+ Misses       3657     1490     -2167

Impacted Files                                            Coverage Δ
...ain/scala/com/salesforce/op/aggregators/Maps.scala     96.55% <ø> (+3.44%)
...alesforce/op/aggregators/TimeBasedAggregator.scala     100% <ø> (+100%)
...la/com/salesforce/op/test/TestFeatureBuilder.scala     100% <ø> (+100%)
...e/op/stages/impl/feature/SmartTextVectorizer.scala     95.61% <ø> (+4.38%)
...la/com/salesforce/op/features/FeatureBuilder.scala     35.17% <0%> (+6.5%)
.../scala/com/salesforce/op/dsl/RichTextFeature.scala     81.94% <0%> (+38.28%)
...sforce/op/stages/impl/feature/Transmogrifier.scala     98.05% <100%> (+29.59%)
...sforce/op/stages/OpPipelineStageReaderWriter.scala     87.09% <100%> (+0.43%)
...p/stages/impl/feature/SmartTextMapVectorizer.scala     100% <100%> (+2.5%)
...orce/op/aggregators/MonoidAggregatorDefaults.scala     100% <100%> (+1.81%)
... and 147 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 10644d8...069031c.

Jauntbox commented 4 years ago

@leahmcguire Sorry, you looked at it right before I got a fix and new test in. It should be ready to go now.