salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning.
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Add descaling ability to MinMaxNormEstimator and a custom estimator with enum #381

Closed · erica-chiu closed this 5 years ago

erica-chiu commented 5 years ago

Related issues N/A

Describe the proposed solution

- Adding a custom estimator that standardizes the label while keeping values positive, by subtracting the min instead of the mean
- Adding an enum to allow a choice among the current linear estimators for the label
- Adding metadata to MinMaxNormEstimator to allow for descaling
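The two scalings under discussion can be sketched as plain functions. This is an illustrative sketch only, not TransmogrifAI's actual estimator API: the names `standardZeroMin`, `minMaxScale`, and `minMaxDescale` are hypothetical. The key ideas are (a) standardizing by shifting with the min rather than the mean, so outputs stay non-negative, and (b) min-max scaling that returns its `(min, max)` so predictions can later be descaled.

```scala
// Hypothetical sketch of the scalings discussed in this PR (names are illustrative).
object ScalingSketch {

  /** Standardize, but subtract the min rather than the mean so outputs stay >= 0. */
  def standardZeroMin(xs: Seq[Double]): Seq[Double] = {
    val min = xs.min
    val mean = xs.sum / xs.size
    val std = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / xs.size)
    xs.map(x => if (std == 0.0) 0.0 else (x - min) / std)
  }

  /** Min-max scale to [0, 1], returning the (min, max) metadata needed to descale. */
  def minMaxScale(xs: Seq[Double]): (Seq[Double], Double, Double) = {
    val (min, max) = (xs.min, xs.max)
    val range = max - min
    (xs.map(x => if (range == 0.0) 0.0 else (x - min) / range), min, max)
  }

  /** Invert min-max scaling using the stored (min, max) metadata. */
  def minMaxDescale(scaled: Seq[Double], min: Double, max: Double): Seq[Double] =
    scaled.map(x => x * (max - min) + min)
}
```

Storing `(min, max)` as metadata is what makes the descaling in this PR possible: without it, a downstream stage cannot map a model's predictions back to the label's original units.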

Describe alternatives you've considered

The alternative is to create one large estimator that changes the linear function applied depending on the enum given as input. This would reduce repetitive code, but may become clunky with too many different functions.

Additional context These changes are meant to experiment with GLMs and normalizing the label for regression problems.

codecov[bot] commented 5 years ago

Codecov Report

Merging #381 into master will increase coverage by 0.02%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #381      +/-   ##
==========================================
+ Coverage   86.84%   86.87%   +0.02%     
==========================================
  Files         336      339       +3     
  Lines       10948    10988      +40     
  Branches      351      573     +222     
==========================================
+ Hits         9508     9546      +38     
- Misses       1440     1442       +2
Impacted Files Coverage Δ
...e/op/stages/impl/feature/MinMaxNormEstimator.scala 100% <100%> (ø)
...sforce/op/stages/impl/feature/LabelEstimator.scala 100% <100%> (ø)
.../op/stages/impl/feature/StandardMinEstimator.scala 100% <100%> (ø)
...es/src/main/scala/com/salesforce/op/OpParams.scala 85.71% <0%> (-4.09%) ↓

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 42fc765...e010715.

leahmcguire commented 5 years ago

Is this really what we want? We would need to apply different normalization based on properties of the data, which we do not know beforehand. Rather than creating multiple estimators that do different things, don't we want one estimator that examines the data and does the correct thing for it?

I would suggest one combined estimator with an enum of transformation types: "Norm, MinMax, StandardZeroMin, Auto", where Auto selects among all possible methods to give the best result for the data.
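The single-estimator design suggested here could be sketched as an enum with dispatch inside one stage. This is a hypothetical sketch, not actual TransmogrifAI code: `ScalingMethod` and `CombinedScaler` are invented names, and the `Auto` branch is a placeholder since the thread agrees its selection criteria still need experiments.

```scala
// Hypothetical sketch of one combined estimator keyed by an enum of
// transformation types, as suggested in the comment above.
object ScalingMethod extends Enumeration {
  val Norm, MinMax, StandardZeroMin, Auto = Value
}

object CombinedScaler {
  def scale(xs: Seq[Double], method: ScalingMethod.Value): Seq[Double] = {
    import ScalingMethod._
    val mean = xs.sum / xs.size
    val std  = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / xs.size)
    val (min, max) = (xs.min, xs.max)
    method match {
      case Norm =>
        // Classic standardization: (x - mean) / std
        xs.map(x => if (std == 0.0) 0.0 else (x - mean) / std)
      case MinMax =>
        // Scale to [0, 1]
        xs.map(x => if (max == min) 0.0 else (x - min) / (max - min))
      case StandardZeroMin =>
        // Standardize but shift by min, keeping outputs non-negative
        xs.map(x => if (std == 0.0) 0.0 else (x - min) / std)
      case Auto =>
        // Placeholder: the right "auto" choice is still to be determined
        // by experiment, per the discussion in this thread.
        scale(xs, MinMax)
    }
  }
}
```

Dispatching on the enum inside one stage keeps the choice of normalization a runtime parameter of the workflow, which is what lets an Auto mode be added later without changing the pipeline's shape.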

erica-chiu commented 5 years ago

Is this auto transformation selecting the best result based on the data distribution, or on actual experimentation?

leahmcguire commented 5 years ago

You would need to do actual experiments to determine what the best transformation for a given distribution is. My point is that we are trying to build AutoML, so we need the correct transformation to be applied within the stage, not based on predefined knowledge about the data.

Jauntbox commented 5 years ago

@leahmcguire That is what we'd want to work towards, but we don't know what "auto" should be yet. We're planning on testing these different rescalings in automl workflows to see what would work best before enabling it.

We could have an estimator that takes an enum with just the three scalings we have for now, and then add Auto once we've done some experiments.