erica-chiu closed 5 years ago
Merging #381 into master will increase coverage by 0.02%. The diff coverage is 100%.
@@ Coverage Diff @@
## master #381 +/- ##
==========================================
+ Coverage 86.84% 86.87% +0.02%
==========================================
Files 336 339 +3
Lines 10948 10988 +40
Branches 351 573 +222
==========================================
+ Hits 9508 9546 +38
- Misses 1440 1442 +2
Impacted Files | Coverage Δ | |
---|---|---|
...e/op/stages/impl/feature/MinMaxNormEstimator.scala | 100% <100%> (ø) | |
...sforce/op/stages/impl/feature/LabelEstimator.scala | 100% <100%> (ø) | |
.../op/stages/impl/feature/StandardMinEstimator.scala | 100% <100%> (ø) | |
...es/src/main/scala/com/salesforce/op/OpParams.scala | 85.71% <0%> (-4.09%) | :arrow_down: |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42fc765...e010715.
Is this really what we want? We would need to apply different normalization based on properties of the data, which we do not know beforehand. Rather than creating multiple estimators that do different things, don't we want one estimator that examines the data and does the correct thing for the data?
I would suggest one combined estimator with an enum of transformation types: "Norm, MinMax, StandardZeroMin, Auto" where auto will select between all possible methods to give the best result for the data.
Is this auto transformation selecting the best result based on data distribution or on actual experimentation?
You would need to do actual experiments to determine what the best transformation for a given distribution is. My point is that we are trying to build AutoML, so we need the transformation to be applied correctly within the stage, not based on predefined knowledge about the data.
@leahmcguire That is what we'd want to work towards, but we don't know what "auto" should be yet. We're planning on testing these different rescalings in automl workflows to see what would work best before enabling it.
We could have an estimator that takes an enum that just has the three scalings we have for now, and then add auto once we do some experiments.
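To make the single-estimator idea concrete, here is a minimal sketch in plain Scala of an enum that selects among three linear scalings. It is an illustration, not the code in this PR: `ScalingType`, `LinearScaler`, and `LabelScaling` are hypothetical names, "Norm" is assumed to mean z-score standardization, the standard deviation is the population one, and no Spark or TransmogrifAI types are used. `Auto` is left out because the thread has not yet defined what it should do.

```scala
// Hypothetical names throughout; none of these are TransmogrifAI APIs.
sealed trait ScalingType
object ScalingType {
  case object Standard extends ScalingType    // (x - mean) / std, assumed meaning of "Norm"
  case object MinMax extends ScalingType      // (x - min) / (max - min)
  case object StandardMin extends ScalingType // (x - min) / std, keeps values non-negative
}

// Every scaling here is a linear map: x => slope * x + intercept.
final case class LinearScaler(slope: Double, intercept: Double) {
  def scale(x: Double): Double = slope * x + intercept
}

object LabelScaling {
  // Population standard deviation (an assumption; the PR may use the sample version).
  private def std(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / xs.size)
  }

  // "Fit" step: derive the linear map for the requested scaling from the data.
  def fit(kind: ScalingType, xs: Seq[Double]): LinearScaler = {
    require(xs.nonEmpty, "need at least one value")
    val (shift, spread) = kind match {
      case ScalingType.Standard    => (xs.sum / xs.size, std(xs))
      case ScalingType.MinMax      => (xs.min, xs.max - xs.min)
      case ScalingType.StandardMin => (xs.min, std(xs))
    }
    // Constant data has zero spread; fall back to a pure shift.
    val s = if (spread == 0.0) 1.0 else spread
    LinearScaler(1.0 / s, -shift / s)
  }
}
```

For example, `LabelScaling.fit(ScalingType.MinMax, Seq(2.0, 4.0, 6.0)).scale(6.0)` evaluates to `1.0`. Adding a fourth case later would only require a new `case object` and one more branch in `fit`.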
Related issues: N/A
Describe the proposed solution:
- Adding a custom estimator that standardizes the data but keeps values non-negative by subtracting the minimum instead of the mean
- Adding an enum to allow a choice among the current linear estimators for the label
- Adding metadata to MinMaxNormEstimator to allow for descaling
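As a reading aid, the standardization described above (subtracting the minimum rather than the mean) can be written as a linear map together with its inverse. This is a sketch under assumptions: $\sigma_y$ is taken to be the standard deviation of the label, $y_{\min}$ its minimum, and $a$, $b$ are illustrative symbols rather than metadata fields confirmed by this PR.

$$
y' = \frac{y - y_{\min}}{\sigma_y} = a\,y + b, \qquad a = \frac{1}{\sigma_y}, \quad b = -\frac{y_{\min}}{\sigma_y}
$$

$$
y = \frac{y' - b}{a} = \sigma_y\,y' + y_{\min}
$$

Since $y \ge y_{\min}$ and $\sigma_y > 0$, the scaled label is never negative. Storing the two numbers $a$ and $b$ would be enough to invert the map, which is presumably the role the descaling metadata plays.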
Describe alternatives you've considered: The alternative is to create one large estimator that changes the linear function applied depending on the enum given as input. This would reduce repetitive code, but may become clunky with too many different functions.
Additional context: These changes are meant to experiment with GLMs and normalizing the label for regression problems.