MWYang commented 4 years ago

Related issues Tangentially related to the name detection changes I've been making as part of my internship (#445 and other deprecated PRs).

Describe the proposed solution Previously, SmartTextMapVectorizer would treat empty string values in input maps as their own category. This could lead to edge case behavior where an otherwise categorical input would be treated as non-categorical due to the presence of empty strings (probably unlikely to happen IRL but it has tripped me up in tests for my other work). This PR implements:

the one-line fix in SmartTextMapVectorizer to not count empty strings when determining which input features are categorical
a test for SmartTextMapVectorizer against this behavior
the one-line fix in TextMapPivotVectorizer to treat empty string entries as null values rather than other values

codecov[bot] commented 4 years ago

Codecov Report

Merging #449 into master will decrease coverage by 4.11%. The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #449      +/-   ##
==========================================
- Coverage   86.93%   82.81%   -4.12%     
==========================================
  Files         337      337              
  Lines       11096    11098       +2     
  Branches      362      595     +233     
==========================================
- Hits         9646     9191     -455     
- Misses       1450     1907     +457

Impacted Files	Coverage Δ
...p/stages/impl/feature/SmartTextMapVectorizer.scala	`100% <100%> (ø)`	:arrow_up:
...p/stages/impl/feature/TextMapPivotVectorizer.scala	`100% <100%> (ø)`	:arrow_up:
...ce/op/stages/impl/feature/TextLenTransformer.scala	`0% <0%> (-100%)`	:arrow_down:
...op/stages/impl/feature/TimePeriodTransformer.scala	`0% <0%> (-100%)`	:arrow_down:
...force/op/stages/impl/feature/OpIndexToString.scala	`0% <0%> (-100%)`	:arrow_down:
...alesforce/op/aggregators/TimeBasedAggregator.scala	`0% <0%> (-100%)`	:arrow_down:
...com/salesforce/op/test/TestOpWorkflowBuilder.scala	`0% <0%> (-100%)`	:arrow_down:
...e/op/stages/impl/insights/RecordInsightsLOCO.scala	`0% <0%> (-96.85%)`	:arrow_down:
...es/src/main/scala/com/salesforce/op/OpParams.scala	`0% <0%> (-85.72%)`	:arrow_down:
...ages/impl/regression/RegressionModelSelector.scala	`18.51% <0%> (-79.63%)`	:arrow_down:
... and 36 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8c6110b...8ae18c6. Read the comment docs.

Jauntbox commented 4 years ago

Oh dear, I may have misinterpreted the issue you brought up @MWYang

Right now, we should be treating the empty string "" differently from an empty dataframe entry (a true null). A mediocre example as to why they could be different: "" could be the contents of a survey that was left blank, while null could correspond to someone who hasn't been presented with said survey.

So the main thing is that a Text column corresponding to ("A", null, "C") should be treated the same as a TextMap column containing (Map("f1" -> "A"), Map.empty, Map("f1" -> "C")) while a Text column corresponding to ("A", "", "C") should be treated the same as a TextMap column containing (Map("f1" -> "A"), Map("f1" -> ""), Map("f1" -> "C")), which I think it is.

Part of the confusion is because we store empty elements slightly differently between map and non-map features. An empty element in a Text field will show up as a None inside our Text object, while an empty element in a TextMap field will not be added to the map at all (since the values are String rather than Option[String])

salesforce / TransmogrifAI

Fixed SmartTextMapVectorizer counting empty string entries #449

Codecov Report