salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License
2.24k stars 393 forks source link

Fixed SmartTextMapVectorizer counting empty string entries #449

Closed MWYang closed 4 years ago

MWYang commented 4 years ago

Related issues Tangentially related to the name detection changes I've been making as part of my internship (#445 and other deprecated PRs).

Describe the proposed solution Previously, SmartTextMapVectorizer would treat empty string values in input maps as their own category. This could lead to edge case behavior where an otherwise categorical input would be treated as non-categorical due to the presence of empty strings (probably unlikely to happen IRL but it has tripped me up in tests for my other work). This PR implements:

codecov[bot] commented 4 years ago

Codecov Report

Merging #449 into master will decrease coverage by 4.11%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #449      +/-   ##
==========================================
- Coverage   86.93%   82.81%   -4.12%     
==========================================
  Files         337      337              
  Lines       11096    11098       +2     
  Branches      362      595     +233     
==========================================
- Hits         9646     9191     -455     
- Misses       1450     1907     +457
Impacted Files Coverage Δ
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100% <100%> (ø) :arrow_up:
...p/stages/impl/feature/TextMapPivotVectorizer.scala 100% <100%> (ø) :arrow_up:
...ce/op/stages/impl/feature/TextLenTransformer.scala 0% <0%> (-100%) :arrow_down:
...op/stages/impl/feature/TimePeriodTransformer.scala 0% <0%> (-100%) :arrow_down:
...force/op/stages/impl/feature/OpIndexToString.scala 0% <0%> (-100%) :arrow_down:
...alesforce/op/aggregators/TimeBasedAggregator.scala 0% <0%> (-100%) :arrow_down:
...com/salesforce/op/test/TestOpWorkflowBuilder.scala 0% <0%> (-100%) :arrow_down:
...e/op/stages/impl/insights/RecordInsightsLOCO.scala 0% <0%> (-96.85%) :arrow_down:
...es/src/main/scala/com/salesforce/op/OpParams.scala 0% <0%> (-85.72%) :arrow_down:
...ages/impl/regression/RegressionModelSelector.scala 18.51% <0%> (-79.63%) :arrow_down:
... and 36 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8c6110b...8ae18c6. Read the comment docs.

Jauntbox commented 4 years ago

Oh dear, I may have misinterpreted the issue you brought up @MWYang

Right now, we should be treating the empty string "" differently from an empty dataframe entry (a true null). A mediocre example as to why they could be different: "" could be the contents of a survey that was left blank, while null could correspond to someone who hasn't been presented with said survey.

So the main thing is that a Text column corresponding to ("A", null, "C") should be treated the same as a TextMap column containing (Map("f1" -> "A"), Map.empty, Map("f1" -> "C")) while a Text column corresponding to ("A", "", "C") should be treated the same as a TextMap column containing (Map("f1" -> "A"), Map("f1" -> ""), Map("f1" -> "C")), which I think it is.

Part of the confusion is because we store empty elements slightly differently between map and non-map features. An empty element in a Text field will show up as a None inside our Text object, while an empty element in a TextMap field will not be added to the map at all (since the values are String rather than Option[String])