salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Refactor flaky local scoring tests #501

Closed by Jauntbox 4 years ago

Jauntbox commented 4 years ago

Related issues
N/A

Describe the proposed solution
The current local scoring tests are flaky when XGBoost models are included, because they use a tiny, hardcoded 8-row dataset. With so few rows, a train/validation split often ends up containing only a single class, which causes XGBoost models to throw an error.
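To get a feel for how often this happens, here is a rough back-of-the-envelope sketch. It is written in Python rather than the project's Scala, purely as an illustration. The class balance (half positive) and the split sizes are assumptions; the PR does not state them, and `single_class_probability` is a hypothetical helper, not part of TransmogrifAI.

```python
from math import comb


def single_class_probability(n_rows, n_positive, split_size):
    """Probability that a uniformly random subset of `split_size` rows
    (drawn without replacement) contains rows of only one class."""
    n_negative = n_rows - n_positive
    total = comb(n_rows, split_size)
    # math.comb returns 0 when split_size exceeds the class count,
    # so an impossible "all one class" subset contributes nothing.
    all_positive = comb(n_positive, split_size)
    all_negative = comb(n_negative, split_size)
    return (all_positive + all_negative) / total


# Assumed setup: 8 rows, 4 of each class, a 2-row validation split.
small = single_class_probability(n_rows=8, n_positive=4, split_size=2)
print(f"{small:.3f}")  # 0.429

# Assumed setup: 100 rows, 50 of each class, a 25-row validation split.
large = single_class_probability(n_rows=100, n_positive=50, split_size=25)
print(f"{large:.2e}")
```

Under these assumptions, roughly three in seven of the tiny validation splits are single-class, while the 100-row case makes such a split vanishingly unlikely.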

This PR replaces the hardcoded dataset with a synthetic one of adjustable size (currently 100 rows), which fixes the problem.
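As a rough illustration of what "a synthetic dataset of adjustable size" can look like, here is a minimal Python sketch. It is not the Scala code from this PR. The function name, the field names (`age`, `fare`, `survived`), and the value ranges are all made up for the example.

```python
import random


def make_synthetic_rows(n_rows=100, seed=42):
    """Build `n_rows` labeled records with a fixed seed so test runs are repeatable.

    Hypothetical schema: the field names are illustrative only.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n_rows):
        rows.append({
            "id": i,
            "age": round(rng.uniform(1.0, 80.0), 1),
            "fare": round(rng.uniform(5.0, 100.0), 2),
            # Alternate labels so both classes are equally represented.
            "survived": i % 2,
        })
    # Shuffle so the labels are not ordered by position.
    rng.shuffle(rows)
    return rows


rows = make_synthetic_rows(n_rows=100)
print(len(rows), sum(r["survived"] for r in rows))  # 100 50
```

Seeding the generator keeps the test deterministic, and the size parameter lets the dataset grow if a given model still needs more rows per split.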

Describe alternatives you've considered
We could also have used the full hardcoded Titanic dataset, but generating a synthetic one was much easier for me.

Additional context
N/A

codecov[bot] commented 4 years ago

Codecov Report

Merging #501 into master will increase coverage by 1.48%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #501      +/-   ##
==========================================
+ Coverage   80.02%   81.51%   +1.48%     
==========================================
  Files         346      346              
  Lines       11782    11782              
  Branches      385      385              
==========================================
+ Hits         9429     9604     +175     
+ Misses       2353     2178     -175     
Impacted Files                                         Coverage Δ
...ala/com/salesforce/op/utils/tuples/RichTuple.scala  0.00% <0.00%> (-100.00%)
...alesforce/op/aggregators/TimeBasedAggregator.scala  0.00% <0.00%> (-100.00%)
...stages/impl/feature/TimePeriodMapTransformer.scala  0.00% <0.00%> (-100.00%)
...e/op/stages/impl/insights/RecordInsightsCorr.scala  0.00% <0.00%> (-98.25%)
utils/src/main/scala/com/salesforce/op/UID.scala       0.00% <0.00%> (-91.67%)
...op/stages/impl/preparators/MinVarianceFilter.scala  0.00% <0.00%> (-91.31%)
...es/src/main/scala/com/salesforce/op/OpParams.scala  0.00% <0.00%> (-85.72%)
...ala/com/salesforce/op/stages/SparkStageParam.scala  0.00% <0.00%> (-77.42%)
...a/com/salesforce/op/utils/spark/RichMetadata.scala  15.78% <0.00%> (-73.69%)
...la/com/salesforce/op/utils/spark/RichDataset.scala  15.38% <0.00%> (-70.77%)
... and 103 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update f5aef4f...30cd461.

Jauntbox commented 4 years ago

Oof, I didn't realize you weren't a "code owner". I guess @leahmcguire needs to approve this too.

salesforce-cla[bot] commented 3 years ago

Thanks for the contribution! It looks like @Jauntbox is an internal user so signing the CLA is not required. However, we need to confirm this.