salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License
2.24k stars 393 forks

Eval dataset should not be balanced #429

Closed. AdamChit closed this 5 years ago

AdamChit commented 5 years ago

Related issues: validationPrepare was being called on the validation datasets, which balances them. This should not be done, because the validation set should represent the real class distribution.

Describe the proposed solution: Remove the call to validationPrepare.
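
To illustrate the intent, here is a minimal, library-agnostic sketch in plain Python. It does not use TransmogrifAI's API; the helper names `split` and `downsample_majority` and the toy data are assumptions made up for this example. Only the training split is balanced, while the validation split keeps whatever class distribution the raw data produced.

```python
import random
from collections import Counter


def split(rows, validation_fraction, seed):
    """Shuffle rows and split them into (train, validation). Hypothetical helper."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def downsample_majority(rows, seed):
    """Balance classes by sampling every class down to the smallest class size."""
    rng = random.Random(seed)
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append((features, label))
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Toy data (an assumption for illustration): 900 negatives, 100 positives.
rows = [(i, 0) for i in range(900)] + [(i, 1) for i in range(100)]

train, validation = split(rows, validation_fraction=0.2, seed=42)

# Balance the training split only; the validation split is left untouched
# so that metrics computed on it reflect the real class distribution.
balanced_train = downsample_majority(train, seed=42)

train_counts = Counter(label for _, label in balanced_train)
validation_counts = Counter(label for _, label in validation)
print("balanced train:", dict(train_counts))
print("validation (unbalanced):", dict(validation_counts))
```

If the validation split were balanced as well, metrics that depend on class prevalence (precision, for example) would be computed against a distribution the model will not see in production.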

codecov[bot] commented 5 years ago

Codecov Report

Merging #429 into master will decrease coverage by 0.01%. The diff coverage is n/a.


@@            Coverage Diff             @@
##           master     #429      +/-   ##
==========================================
- Coverage   86.97%   86.96%   -0.02%     
==========================================
  Files         337      337              
  Lines       11082    11082              
  Branches      355      588     +233     
==========================================
- Hits         9639     9637       -2     
- Misses       1443     1445       +2
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...op/stages/impl/tuning/OpTrainValidationSplit.scala | 100% <ø> (ø) | :arrow_up: |
| ...orce/op/stages/impl/tuning/OpCrossValidation.scala | 97.95% <ø> (ø) | :arrow_up: |
| ...es/src/main/scala/com/salesforce/op/OpParams.scala | 85.71% <0%> (-4.09%) | :arrow_down: |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 8624922...f52c945.

tovbinm commented 5 years ago

@leahmcguire @AdamChit Didn't we just change this in PR https://github.com/salesforce/TransmogrifAI/pull/424?

leahmcguire commented 5 years ago

Yes @tovbinm, I was too quick to make the change :-)

gerashegalov commented 5 years ago

@AdamChit @leahmcguire Can we capture the expected behavior in a test to reduce the risk of merging "too quick"?
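
One possible shape for such a test, sketched in plain Python rather than against the project's Scala test suite: `prepare_fold`, `drop_half_negatives`, and `label_ratio` are hypothetical stand-ins invented for this illustration, not TransmogrifAI functions. The test pins down the expected behavior that preparation may change the training fold but must return the validation fold unchanged.

```python
from collections import Counter


def label_ratio(rows):
    """Fraction of rows whose label is 1."""
    counts = Counter(label for _, label in rows)
    return counts[1] / len(rows)


def prepare_fold(train, validation, balance):
    """Hypothetical fold preparation: balance the training data only."""
    return balance(train), validation


def drop_half_negatives(rows):
    """Crude balancer for the test: keep all positives, every other negative."""
    negatives = [row for row in rows if row[1] == 0]
    positives = [row for row in rows if row[1] == 1]
    return positives + negatives[::2]


def test_validation_fold_is_not_balanced():
    train = [(i, 0) for i in range(80)] + [(i, 1) for i in range(20)]
    validation = [(i, 0) for i in range(18)] + [(i, 1) for i in range(2)]
    expected_ratio = label_ratio(validation)

    prepared_train, prepared_validation = prepare_fold(
        train, validation, drop_half_negatives
    )

    # The training fold is rebalanced toward the minority class...
    assert label_ratio(prepared_train) > label_ratio(train)
    # ...but the validation fold keeps its rows and its class distribution.
    assert prepared_validation == validation
    assert label_ratio(prepared_validation) == expected_ratio


test_validation_fold_is_not_balanced()
```

A test along these lines would fail if a balancing step were reintroduced on the validation data, which is the regression the comment above is concerned about.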