Closed Jauntbox closed 4 years ago
Merging #448 into master will decrease coverage by <.01%. The diff coverage is 94.91%.
@@ Coverage Diff @@
## master #448 +/- ##
==========================================
- Coverage 86.95% 86.95% -0.01%
==========================================
Files 337 337
Lines 11102 11131 +29
Branches 364 593 +229
==========================================
+ Hits 9654 9679 +25
- Misses 1448 1452 +4
Impacted Files | Coverage Δ |
---|---|
...om/salesforce/op/filters/FeatureDistribution.scala | 98.66% <100%> (ø) :arrow_up: |
...sforce/op/stages/OpPipelineStageReaderWriter.scala | 86.66% <100%> (+0.45%) :arrow_up: |
...p/stages/impl/feature/SmartTextMapVectorizer.scala | 100% <100%> (ø) :arrow_up: |
...e/op/stages/impl/feature/SmartTextVectorizer.scala | 95.61% <94.44%> (-3.24%) :arrow_down: |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bad14d5...8051f75.
@Jauntbox Regarding creating an enum to replace the booleans in SmartTextVectorizer: I've already done this on my personal branch for incorporating name detection in STV (https://github.com/MWYang/TransmogrifAI/pull/1/files; look for `SmartTextVectorizerAction`). Hopefully that's helpful, even though my changes are a lot to look through right now. 😅
"oof!" someone has been hanging out with @snabar :-P
Ooooooofffff!
@Jauntbox lgtm.
Compilation failed, though. I presume a merge conflict is to blame? - https://travis-ci.com/salesforce/TransmogrifAI/jobs/269487360#L695
Also, there's a header warning:
warning file=/home/travis/build/salesforce/TransmogrifAI/features/src/main/scala/com/salesforce/op/stages/impl/feature/TextVectorizationMethod.scala message=Header does not match expected text line=2
https://travis-ci.com/salesforce/TransmogrifAI/jobs/272953824#L508
Related issues: N/A
Describe the proposed solution: Adds a few parameters to SmartTextVectorizer to allow ignoring text fields that would otherwise be hashed (i.e. not treated as categorical) if their token length variance is below a specified threshold (e.g. to catch machine-generated IDs).
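To illustrate the idea, here is a minimal, hypothetical sketch of the token-length-variance check described above. The names (`tokenLengthVariance`, `shouldIgnore`, `minTokenLengthVariance`) are illustrative and are not the actual SmartTextVectorizer parameter names:

```scala
// Hypothetical sketch: drop text fields whose token lengths are nearly
// constant (e.g. fixed-width machine-generated IDs). Not actual
// SmartTextVectorizer API; names are made up for illustration.
object TokenLengthVarianceCheck {

  /** Population variance of token lengths. */
  def tokenLengthVariance(tokens: Seq[String]): Double = {
    val lengths = tokens.map(_.length.toDouble)
    val mean = lengths.sum / lengths.size
    lengths.map(l => math.pow(l - mean, 2)).sum / lengths.size
  }

  /** True if the field looks like an ID field: near-constant token lengths. */
  def shouldIgnore(tokens: Seq[String], minTokenLengthVariance: Double): Boolean =
    tokens.nonEmpty && tokenLengthVariance(tokens) < minTokenLengthVariance
}
```

For example, a column of fixed-width IDs (all tokens length 6) has zero length variance and would be ignored, while natural-language tokens with mixed lengths would pass the check.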
Describe alternatives you've considered: One alternative is a sort of top-K token counting (e.g. with a CountMinSketch). This works, but is difficult to scale robustly with dataset size, and may be implemented later via Algebird's TopKCMS data structure. Filtering data by raw text length standard deviation, or by how well the text length distribution fits a Poisson distribution, performed better on synthetic data and requires fewer modifications to SmartTextVectorizer.
Additional context: One extra thing we need to be careful of is that we still use the CJK tokenizer for Chinese and Korean text (Japanese already uses a proper language-specific tokenizer), and this tokenizer always splits text into character bigrams, which would cause those fields to fail any length distribution checks. We will need to update the Korean and Chinese tokenizers to language-specific ones that pick out words rather than bigrams.
We also plan to add a way to filter based on goodness of fit of the text length distribution to a Poisson distribution in a future PR. All the information needed is already available, so the modifications should be straightforward.
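A rough sketch of what such a goodness-of-fit check could look like, assuming a simple squared-error comparison between the empirical token-length frequencies and a Poisson pmf with rate equal to the mean length. None of these names exist in SmartTextVectorizer; this is only one possible fit statistic:

```scala
// Hypothetical sketch of a Poisson goodness-of-fit check on token lengths.
// Fixed-length tokens (IDs) fit a Poisson poorly; natural text fits better.
object PoissonLengthFit {

  /** Poisson pmf P(X = k) with rate lambda, computed in log space. */
  def poissonPmf(k: Int, lambda: Double): Double = {
    val logP = -lambda + k * math.log(lambda) -
      (1 to k).map(i => math.log(i.toDouble)).sum // log(k!)
    math.exp(logP)
  }

  /** Mean squared error between empirical length frequencies and the
    * Poisson pmf with lambda = mean observed length. Larger = worse fit. */
  def fitError(lengths: Seq[Int]): Double = {
    val lambda = lengths.sum.toDouble / lengths.size
    val freqs = lengths.groupBy(identity).map {
      case (k, occurrences) => k -> occurrences.size.toDouble / lengths.size
    }
    freqs.map { case (k, freq) =>
      math.pow(freq - poissonPmf(k, lambda), 2)
    }.sum / freqs.size
  }
}
```

A degenerate length distribution (all tokens the same length) produces a much larger `fitError` than lengths spread roughly like a Poisson around the same mean, which is the signal the planned filter would threshold on.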