salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Add categorical detection to be coverage based in addition to unique count based #473

Closed: michaelweilsalesforce closed this 4 years ago

michaelweilsalesforce commented 4 years ago

Related issues

Currently, SmartTextVectorizer and SmartTextMapVectorizer count the number of unique entries in a text field (up to a threshold, currently 50) and treat the feature as categorical if it has fewer than 50 unique entries. Some features are effectively categorical but carry a long tail of low-frequency entries, which pushes them past the threshold so that they are hashed. We would get better signal extraction if we treated these as categorical instead of hashing them.
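To make the long-tail problem concrete, here is a minimal Python sketch of a unique-count check. It is an illustration only, not the library's Scala code: the function name `is_categorical_by_unique_count` and the sample data are hypothetical, and only the threshold of 50 comes from the description above.

```python
from collections import Counter

MAX_CARDINALITY = 50  # threshold quoted in the description above


def is_categorical_by_unique_count(values, max_cardinality=MAX_CARDINALITY):
    """Return True when the field has fewer than max_cardinality distinct values."""
    return len(set(values)) < max_cardinality


# Hypothetical field: three dominant values plus a long tail of 60 rare ones.
values = (
    ["red"] * 400
    + ["green"] * 350
    + ["blue"] * 190
    + [f"rare_{i}" for i in range(60)]
)

# Share of all entries taken by the three most frequent values.
top3_share = sum(count for _, count in Counter(values).most_common(3)) / len(values)

print(is_categorical_by_unique_count(values))  # 63 distinct values, so not categorical
print(top3_share)
```

In this made-up field the three most frequent values account for 940 of 1000 entries, yet the 63 distinct values put it over the threshold, so a count-only rule would send it to hashing.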

Describe the proposed solution

Add an extra check that allows Text(Map) features to be treated as categorical. The check only applies to features whose cardinality exceeds the threshold and that would therefore be hashed.

A better approach to detecting text features that are really categorical is to use a coverage criterion. For example, if the TopK entries with minimum support cover at least 90% of the entries, the feature is a good candidate to pivot by entry instead of hashing by token. The 90% value can be tuned by the user through a parameter.

Extra checks need to pass:

If there are only m < TopK elements with the required minimum support, coverage is computed over these m elements.
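The coverage criterion can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the code in this PR: the names `is_categorical_by_coverage`, `top_k`, `min_support`, and `coverage_pct` are hypothetical, the defaults for `top_k` and `min_support` are made up, and only the 90% figure comes from the description.

```python
from collections import Counter


def is_categorical_by_coverage(values, top_k=10, min_support=10, coverage_pct=0.90):
    """Return True when the supported top-k entries cover enough of the field.

    top_k, min_support and their defaults are assumptions for illustration;
    coverage_pct defaults to the 90% mentioned in the description.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return False
    # most_common sorts by descending count, so filtering the first top_k
    # entries by min_support keeps the m <= top_k best-supported entries.
    supported = [count for _, count in counts.most_common(top_k) if count >= min_support]
    # If only m < top_k entries have enough support, coverage uses those m.
    return sum(supported) / total >= coverage_pct


# Hypothetical field: three dominant values plus 60 rare ones (63 distinct values).
long_tail = (
    ["red"] * 400
    + ["green"] * 350
    + ["blue"] * 190
    + [f"rare_{i}" for i in range(60)]
)
# Hypothetical free-text-like field: every entry is distinct.
all_unique = [f"id_{i}" for i in range(1000)]

print(is_categorical_by_coverage(long_tail))
print(is_categorical_by_coverage(all_unique))
```

With these made-up inputs only three entries meet the minimum support, so coverage is computed over those three (m = 3 < TopK) and reaches 94%, while the all-distinct field has no supported entries and would still be hashed.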

Describe alternatives you've considered

I considered using Algebird's Count-Min Sketch to compute the current TextStats. However, I ran into multiple issues.

A branch still exists (mw/coverage), but it is in shambles.

Additional context

Some criticism regarding TextStats: it does not seem to be a semigroup, since it is not associative. Was that intended?
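As a hedged illustration of how a merge with a cardinality cap can fail associativity, consider the Python sketch below. Whether TextStats actually combines values this way is an assumption; `capped_merge` is a hypothetical function and the cap of 2 is chosen only to keep the example small.

```python
from collections import Counter


def capped_merge(left, right, max_cardinality):
    """Hypothetical merge: once one side exceeds the cap, keep it and stop adding."""
    if len(left) > max_cardinality:
        return left
    if len(right) > max_cardinality:
        return right
    return left + right  # Counter addition sums the counts per key


a = Counter({"x": 1, "y": 1})
b = Counter({"z": 1, "w": 1})
c = Counter({"x": 5})

# (a + b) + c: a + b already exceeds the cap, so c is ignored.
left_first = capped_merge(capped_merge(a, b, 2), c, 2)
# a + (b + c): b + c exceeds the cap, so a is ignored.
right_first = capped_merge(a, capped_merge(b, c, 2), 2)

print(left_first["x"], right_first["x"])
print(left_first == right_first)
```

The two groupings end up with different counts for `x`, so the result depends on the order in which partial results are combined, which is exactly what a semigroup's associativity law rules out.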

codecov[bot] commented 4 years ago

Codecov Report

Merging #473 into master will increase coverage by 0.01%. The diff coverage is 95.00%.

@@            Coverage Diff             @@
##           master     #473      +/-   ##
==========================================
+ Coverage   86.99%   87.00%   +0.01%     
==========================================
  Files         345      345              
  Lines       11624    11643      +19     
  Branches      386      604     +218     
==========================================
+ Hits        10112    10130      +18     
- Misses       1512     1513       +1     
Impacted Files                                          Coverage Δ
...e/op/stages/impl/feature/SmartTextVectorizer.scala   95.17% <90.00%> (-0.42%) ↓
...p/stages/impl/feature/SmartTextMapVectorizer.scala   100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 7d0c33e...ee1bdb2.

leahmcguire commented 4 years ago

So is this a WIP or something? :-P

michaelweilsalesforce commented 4 years ago

@leahmcguire @Jauntbox No longer WIPWIPWIPWIPWIPWIPWIPWIPWIP