Closed Jauntbox closed 4 years ago
Merging #448 into master will decrease coverage by <.01%. The diff coverage is 94.91%.
@@ Coverage Diff @@
## master #448 +/- ##
==========================================
- Coverage 86.95% 86.95% -0.01%
==========================================
Files 337 337
Lines 11102 11131 +29
Branches 364 593 +229
==========================================
+ Hits 9654 9679 +25
- Misses 1448 1452 +4
Impacted Files | Coverage Δ |
---|---|
...om/salesforce/op/filters/FeatureDistribution.scala | 98.66% <100%> (ø) :arrow_up: |
...sforce/op/stages/OpPipelineStageReaderWriter.scala | 86.66% <100%> (+0.45%) :arrow_up: |
...p/stages/impl/feature/SmartTextMapVectorizer.scala | 100% <100%> (ø) :arrow_up: |
...e/op/stages/impl/feature/SmartTextVectorizer.scala | 95.61% <94.44%> (-3.24%) :arrow_down: |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bad14d5...8051f75.
@Jauntbox Regarding creating an enum to replace the booleans in SmartTextVectorizer: I've already done this on my personal branch for incorporating name detection in STV (https://github.com/MWYang/TransmogrifAI/pull/1/files; look for `SmartTextVectorizerAction`). Hopefully that's helpful, even though my changes are a lot to look through right now. 😅
"oof!" someone has been hanging out with @snabar :-P
Ooooooofffff!
@Jauntbox lgtm.
Compilation failed, though. I presume a merge conflict is to blame? - https://travis-ci.com/salesforce/TransmogrifAI/jobs/269487360#L695
Also, there's a header warning:
warning file=/home/travis/build/salesforce/TransmogrifAI/features/src/main/scala/com/salesforce/op/stages/impl/feature/TextVectorizationMethod.scala message=Header does not match expected text line=2
https://travis-ci.com/salesforce/TransmogrifAI/jobs/272953824#L508
Related issues: N/A
Describe the proposed solution: Adds a few parameters to SmartTextVectorizer to allow ignoring text fields that would otherwise be hashed (i.e. not treated as categorical) if their token length variance is below a specified threshold (e.g. to catch machine-generated IDs).
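To illustrate the idea, here is a minimal, hypothetical sketch of the token-length-variance check described above. The names (`tokenLengthVariance`, `shouldIgnore`, `minTokenLengthVariance`) are illustrative and are not the actual SmartTextVectorizer parameter names:

```scala
// Hypothetical sketch: drop text fields whose token lengths are nearly
// constant (e.g. fixed-width machine-generated IDs). Not actual
// SmartTextVectorizer API; names are made up for illustration.
object TokenLengthVarianceCheck {

  /** Population variance of token lengths. */
  def tokenLengthVariance(tokens: Seq[String]): Double = {
    val lengths = tokens.map(_.length.toDouble)
    val mean = lengths.sum / lengths.size
    lengths.map(l => math.pow(l - mean, 2)).sum / lengths.size
  }

  /** True if the field looks like an ID field: near-constant token lengths. */
  def shouldIgnore(tokens: Seq[String], minTokenLengthVariance: Double): Boolean =
    tokens.nonEmpty && tokenLengthVariance(tokens) < minTokenLengthVariance
}
```

For example, a column of fixed-width IDs (all tokens length 6) has zero length variance and would be ignored, while natural-language tokens with mixed lengths would pass the check.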
Describe alternatives you've considered: One alternative is a sort of top-K token counting (e.g. with a CountMinSketch). This works, but is difficult to scale robustly with dataset size, and may be implemented later via Algebird's TopKCMS data structure. Filtering data by raw text length standard deviation, or by how well the text length distribution fits a Poisson distribution, performed better on synthetic data and requires fewer modifications to SmartTextVectorizer.
Additional context: One extra thing we need to be careful of is that we still use the CJK tokenizer for Chinese and Korean text (Japanese already uses a proper language-specific tokenizer), and this tokenizer always splits text into character bigrams, which would cause those fields to fail any length distribution checks. We will need to update the Korean and Chinese tokenizers to language-specific ones that pick out words rather than bigrams.
We also plan to add a way to filter based on goodness of fit of the text length distribution to a Poisson distribution in a future PR. All the information needed is already available, so the modifications should be straightforward.
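A rough sketch of what such a goodness-of-fit check could look like, assuming a simple squared-error comparison between the empirical token-length frequencies and a Poisson pmf with rate equal to the mean length. None of these names exist in SmartTextVectorizer; this is only one possible fit statistic:

```scala
// Hypothetical sketch of a Poisson goodness-of-fit check on token lengths.
// Fixed-length tokens (IDs) fit a Poisson poorly; natural text fits better.
object PoissonLengthFit {

  /** Poisson pmf P(X = k) with rate lambda, computed in log space. */
  def poissonPmf(k: Int, lambda: Double): Double = {
    val logP = -lambda + k * math.log(lambda) -
      (1 to k).map(i => math.log(i.toDouble)).sum // log(k!)
    math.exp(logP)
  }

  /** Mean squared error between empirical length frequencies and the
    * Poisson pmf with lambda = mean observed length. Larger = worse fit. */
  def fitError(lengths: Seq[Int]): Double = {
    val lambda = lengths.sum.toDouble / lengths.size
    val freqs = lengths.groupBy(identity).map {
      case (k, occurrences) => k -> occurrences.size.toDouble / lengths.size
    }
    freqs.map { case (k, freq) =>
      math.pow(freq - poissonPmf(k, lambda), 2)
    }.sum / freqs.size
  }
}
```

A degenerate length distribution (all tokens the same length) produces a much larger `fitError` than lengths spread roughly like a Poisson around the same mean, which is the signal the planned filter would threshold on.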