salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License
2.24k stars 393 forks source link

Ensure correct metrics despite model failures on some CV folds #404

Closed leahmcguire closed 5 years ago

leahmcguire commented 5 years ago

Related issues GLM optimizer is not very stable so sometimes runs will fail in some CV folds but not others - this was causing metrics and grids to not be computed correctly since we assumed that the same number of parameter grids successfully run in each cross validation fold

Describe the proposed solution throw out grids which are not in every cv run (since they are likely to fail in some modeling runs)

Describe alternatives you've considered use grids with metrics in a map to make sure that the correct metrics are combined in CV and simply take the best metric

Additional context Add any other context about the changes here.

codecov[bot] commented 5 years ago

Codecov Report

Merging #404 into master will decrease coverage by 0.01%. The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #404      +/-   ##
==========================================
- Coverage   86.96%   86.95%   -0.02%     
==========================================
  Files         337      337              
  Lines       11054    11060       +6     
  Branches      361      591     +230     
==========================================
+ Hits         9613     9617       +4     
- Misses       1441     1443       +2
Impacted Files Coverage Δ
...orce/op/stages/impl/tuning/OpCrossValidation.scala 97.95% <100%> (+0.28%) :arrow_up:
...es/src/main/scala/com/salesforce/op/OpParams.scala 85.71% <0%> (-4.09%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 95a77b1...6687a34. Read the comment docs.

leahmcguire commented 5 years ago

@tovbinm @gerashegalov I discovered some fun behavior using algebird implicit for maps:

val test = Seq(Map("a" -> 1, "b" -> 1), Map("a" -> 1, "b" -> 1)) val combined = test.reduce( + ) // Map("a" -> 2, "b" -> 2)) as expected

val test = Seq(Map("a" -> 0, "b" -> 0), Map("a" -> 0, "b" -> 0)) val combined = test.reduce( + ) // Map() :-(

Think I may need to write out the reduce function...

tovbinm commented 5 years ago

@leahmcguire Map behavior is as expected under monoid rules. In order to keep the map keys with 0 values use semigroup for map values. Example for Map[String, Long] below:

import com.twitter.algebird._
import com.twitter.algebird.Operators._

implicit val longSemigroup = Semigroup.from[Long](_ + _)
implicit val mapLongMonoid = Monoid.mapMonoid[String, Long](longSemigroup)

// works!
(Map("a" -> 0L) + Map("b" -> 0L)) shouldBe Map("a" -> 0L, "b" -> 0L)