Jauntbox commented 4 years ago

Related issues N/A

Describe the proposed solution The initial thresholded confusion matrix for binary classification metrics was added in https://github.com/salesforce/TransmogrifAI/pull/492, through selective overriding of Spark's BinaryClassificationMetrics class to expose only the thresholded confusion matrices. To do this, the calculation had to be done a second time.

Unfortunately, this calculation is non-deterministic. I haven't figured out exactly which piece, but I'm worried about the partitioning since a mapPartitions is used. This means that we need to go back to the original plan of copying the entire Spark class, since that's the only way to guarantee that the confusion matrices agrees with the rest of the thresholded metrics.

Describe alternatives you've considered Reimplementing the other thresholded metrics amounts to copy/pasting the entire class so isn't really an alternative.

Additional context N/A

codecov[bot] commented 4 years ago

Codecov Report

Merging #510 into master will decrease coverage by 0.01%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #510      +/-   ##
==========================================
- Coverage   86.72%   86.70%   -0.02%     
==========================================
  Files         347      347              
  Lines       11860    11830      -30     
  Branches      386      382       -4     
==========================================
- Hits        10285    10257      -28     
+ Misses       1575     1573       -2

Impacted Files	Coverage Δ
...p/evaluators/OpBinaryClassificationEvaluator.scala	`82.60% <100.00%> (ø)`
...aluation/ExtendedBinaryClassificationMetrics.scala	`100.00% <100.00%> (ø)`
...es/src/main/scala/com/salesforce/op/OpParams.scala	`85.71% <0.00%> (-4.09%)`	:arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update be60275...8a05fcc. Read the comment docs.

gerashegalov commented 4 years ago

Strong +1 for using reflection in the interim instead of copy&pasting. And let us make a PR to Spark

salesforce / TransmogrifAI

Fix binary classification threshold metrics non-determinism #510

Codecov Report