salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License
2.24k stars 392 forks

[WIP] Experimenting with adaptive hash size based on token cardinality #460

Closed: TuanNguyen27 closed this pull request 4 years ago

TuanNguyen27 commented 4 years ago

Related issues

N/A

Describe the proposed solution

The current feature engineering for raw text features uses a hard-coded hash space size. For raw features that contain free text, this fixed hash space is too small to fully represent their content. This PR approximates the token cardinality of each raw text feature using HyperLogLog and uses that estimate to set the hash space size. See the related article on how to set this size.
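
To make the idea concrete, here is a minimal, language-neutral sketch in Python. It is not the code in this PR (TransmogrifAI itself is Scala) and the names `HyperLogLog`, `adaptive_hash_size`, `min_size`, `max_size`, and `load_factor` are hypothetical, as are the bounds used. It estimates the number of distinct tokens with a basic HyperLogLog and then rounds that estimate up to a power of two, clamped to a configurable range.

```python
import hashlib
import math


def _hash64(token: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HyperLogLog:
    """Basic HyperLogLog with 2**p registers and small-range correction."""

    def __init__(self, p: int = 12):
        if not 7 <= p <= 16:
            raise ValueError("p must be in [7, 16] for the alpha constant used here")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, token: str) -> None:
        h = _hash64(token)
        width = 64 - self.p
        idx = h >> width                  # top p bits select a register
        rest = h & ((1 << width) - 1)     # remaining bits
        # 1-based position of the leftmost 1-bit in `rest` (width + 1 if rest == 0)
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Linear counting is more accurate when many registers are empty
            return m * math.log(m / zeros)
        return raw


def adaptive_hash_size(estimated_cardinality: float,
                       min_size: int = 512,
                       max_size: int = 1 << 17,
                       load_factor: float = 1.0) -> int:
    """Next power of two >= cardinality / load_factor, clamped to [min_size, max_size].

    The default bounds are illustrative assumptions, not values taken from the PR.
    """
    target = max(1, math.ceil(estimated_cardinality / load_factor))
    size = 1 << (target - 1).bit_length()
    return max(min_size, min(size, max_size))


def hash_size_for_tokens(tokens, p: int = 12) -> int:
    """Convenience wrapper: sketch the tokens, then pick a hash space size."""
    hll = HyperLogLog(p)
    for t in tokens:
        hll.add(t)
    return adaptive_hash_size(hll.estimate())
```

With `p = 12` the sketch keeps 4096 small registers per feature regardless of how many tokens it sees, and the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. Rounding up to a power of two is one possible policy; a `load_factor` below 1.0 would trade memory for fewer hash collisions.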

Describe alternatives you've considered

Embeddings for text fields, which currently don't seem to work as well (to be investigated further). There is also an approach that combines hashing and embeddings and saves a lot of memory.

codecov[bot] commented 4 years ago

Codecov Report

Merging #460 into master will decrease coverage by 19.33%. The diff coverage is 83.76%.

@@             Coverage Diff             @@
##           master     #460       +/-   ##
===========================================
- Coverage   86.99%   67.65%   -19.34%     
===========================================
  Files         345      345               
  Lines       11622    11691       +69     
  Branches      609      602        -7     
===========================================
- Hits        10110     7910     -2200     
- Misses       1512     3781     +2269     
| Impacted Files | Coverage Δ |
|---|---|
| .../scala/com/salesforce/op/dsl/RichTextFeature.scala | 23.61% <ø> (-58.34%) |
| ...ce/op/stages/impl/feature/DateListVectorizer.scala | 0.00% <0.00%> (-98.02%) |
| ...s/impl/feature/OPCollectionHashingVectorizer.scala | 73.41% <63.15%> (-23.14%) |
| ...e/op/stages/impl/feature/SmartTextVectorizer.scala | 91.53% <93.93%> (-4.06%) |
| ...force/op/stages/impl/feature/OPMapVectorizer.scala | 96.35% <100.00%> (-1.46%) |
| ...p/stages/impl/feature/SmartTextMapVectorizer.scala | 97.69% <100.00%> (-2.31%) |
| ...sforce/op/stages/impl/feature/Transmogrifier.scala | 65.19% <100.00%> (-32.86%) |
| ...a/com/salesforce/op/utils/date/DateTimeUtils.scala | 100.00% <100.00%> (ø) |
| ...com/salesforce/op/test/TestOpWorkflowBuilder.scala | 0.00% <0.00%> (-100.00%) |
| ...om/salesforce/op/stages/impl/feature/OpNGram.scala | 0.00% <0.00%> (-100.00%) |

... and 137 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 2e19aee...2e19aee.