Closed MWYang closed 4 years ago
Merging #449 into master will decrease coverage by
4.11%
. The diff coverage is100%
.
@@ Coverage Diff @@
## master #449 +/- ##
==========================================
- Coverage 86.93% 82.81% -4.12%
==========================================
Files 337 337
Lines 11096 11098 +2
Branches 362 595 +233
==========================================
- Hits 9646 9191 -455
- Misses 1450 1907 +457
Impacted Files | Coverage Δ | |
---|---|---|
...p/stages/impl/feature/SmartTextMapVectorizer.scala | 100% <100%> (ø) |
:arrow_up: |
...p/stages/impl/feature/TextMapPivotVectorizer.scala | 100% <100%> (ø) |
:arrow_up: |
...ce/op/stages/impl/feature/TextLenTransformer.scala | 0% <0%> (-100%) |
:arrow_down: |
...op/stages/impl/feature/TimePeriodTransformer.scala | 0% <0%> (-100%) |
:arrow_down: |
...force/op/stages/impl/feature/OpIndexToString.scala | 0% <0%> (-100%) |
:arrow_down: |
...alesforce/op/aggregators/TimeBasedAggregator.scala | 0% <0%> (-100%) |
:arrow_down: |
...com/salesforce/op/test/TestOpWorkflowBuilder.scala | 0% <0%> (-100%) |
:arrow_down: |
...e/op/stages/impl/insights/RecordInsightsLOCO.scala | 0% <0%> (-96.85%) |
:arrow_down: |
...es/src/main/scala/com/salesforce/op/OpParams.scala | 0% <0%> (-85.72%) |
:arrow_down: |
...ages/impl/regression/RegressionModelSelector.scala | 18.51% <0%> (-79.63%) |
:arrow_down: |
... and 36 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 8c6110b...8ae18c6. Read the comment docs.
Oh dear, I may have misinterpreted the issue you brought up @MWYang
Right now, we should be treating the empty string ""
differently from an empty dataframe entry (a true null
). A mediocre example as to why they could be different: ""
could be the contents of a survey that was left blank, while null
could correspond to someone who hasn't been presented with said survey.
So the main thing is that a Text column corresponding to ("A", null, "C")
should be treated the same as a TextMap column containing (Map("f1" -> "A"), Map.empty, Map("f1" -> "C"))
while a Text column corresponding to ("A", "", "C")
should be treated the same as a TextMap column containing (Map("f1" -> "A"), Map("f1" -> ""), Map("f1" -> "C"))
, which I think it is.
Part of the confusion is because we store empty elements slightly differently between map and non-map features. An empty element in a Text field will show up as a None inside our Text object, while an empty element in a TextMap field will not be added to the map at all (since the values are String rather than Option[String])
Related issues Tangentially related to the name detection changes I've been making as part of my internship (#445 and other deprecated PRs).
Describe the proposed solution Previously,
SmartTextMapVectorizer
would treat empty string values in input maps as their own category. This could lead to edge case behavior where an otherwise categorical input would be treated as non-categorical due to the presence of empty strings (probably unlikely to happen IRL but it has tripped me up in tests for my other work). This PR implements:SmartTextMapVectorizer
to not count empty strings when determining which input features are categoricalSmartTextMapVectorizer
against this behaviorTextMapPivotVectorizer
to treat empty string entries as null values rather than other values