Thanks for the contribution! Before we can merge this, we need @AdamChit to sign the Salesforce.com Contributor License Agreement.
@TuanNguyen27 Could you review?
Merging #413 into master will increase coverage by 0.02%. The diff coverage is 100%.
@@ Coverage Diff @@
## master #413 +/- ##
==========================================
+ Coverage 86.97% 86.99% +0.02%
==========================================
Files 337 337
Lines 11060 11078 +18
Branches 357 597 +240
==========================================
+ Hits 9619 9637 +18
Misses 1441 1441
Impacted Files | Coverage Δ |
---|---|
...alesforce/op/stages/impl/tuning/DataBalancer.scala | 96.11% <ø> (-0.18%) :arrow_down: |
...om/salesforce/op/stages/impl/tuning/Splitter.scala | 98.07% <100%> (+0.34%) :arrow_up: |
...e/op/stages/impl/selector/ModelSelectorNames.scala | 100% <100%> (ø) :arrow_up: |
...alesforce/op/stages/impl/tuning/DataSplitter.scala | 90% <100%> (+23.33%) :arrow_up: |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1bf6fdf...cfbe22f.
An unrelated test fails during the Travis CI check (https://travis-ci.com/salesforce/TransmogrifAI/jobs/243091644). I'll create a ticket to increase the tolerance on that test.
Related issues
The DataBalancer for binary classification has a parameter that controls the maximum amount of data passed into modeling; regression should allow similar limits.
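For reference, a minimal sketch of the existing binary-classification side of this, assuming the DataBalancer factory accepts a maxTrainingSample argument (the argument names and defaults shown here are assumptions for illustration, not the verified API):

```scala
import com.salesforce.op.stages.impl.tuning.DataBalancer

// Assumed construction of the existing binary-classification splitter:
// maxTrainingSample caps how many records are passed into modeling.
// The regression DataSplitter currently has no equivalent cap.
val balancer = DataBalancer(
  maxTrainingSample = 1000000,
  seed = 42L
)
```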
Describe the proposed solution
Steps:
- Investigate where the check should occur (somewhere in DataSplitter)
- Add logic to downsample once the maximum number of records is exceeded (see the sketch after this list)
- Add the downsampling information to the summary object and to the log
- Add tests to DataSplitterTest and RegressionModelSelectorTest to cover the downsampling logic
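A minimal sketch of the downsampling step itself, written as a standalone Spark helper rather than against the actual DataSplitter internals (the helper name, signature, and return shape are assumptions for illustration):

```scala
import org.apache.spark.sql.{Dataset, Row}

object DownsampleSketch {

  /**
   * If the training data has more than maxTrainingSample rows, sample it down
   * so its expected size is roughly maxTrainingSample, and return the fraction
   * kept (1.0 means no downsampling) so it can be recorded in the summary and log.
   */
  def downsample(data: Dataset[Row], maxTrainingSample: Int, seed: Long): (Dataset[Row], Double) = {
    val total = data.count()
    if (total <= maxTrainingSample) (data, 1.0)
    else {
      // Fraction of records to keep so the expected sample size is maxTrainingSample
      val fraction = maxTrainingSample.toDouble / total
      (data.sample(withReplacement = false, fraction = fraction, seed = seed), fraction)
    }
  }
}
```

The returned fraction is what would be surfaced in the summary object and the log in the steps above.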
Describe alternatives you've considered
Not having a limit for any of the model types: this was not optimal because some Spark models can have very long runtimes or behave badly when given too much data. So the default will be to downsample once we pass 1M records, while giving the user the option to set their own maxTrainingSample if they are OK with working with a larger dataset (see the usage sketch below).
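A sketch of the intended user-facing call, assuming the new cap ends up as a maxTrainingSample argument on the DataSplitter factory with a 1M-record default (the exact signature in the merged change may differ):

```scala
import com.salesforce.op.stages.impl.tuning.DataSplitter

// Opt in to a larger training set than the assumed 1M-record default;
// leaving maxTrainingSample unset would downsample past that point.
val splitter = DataSplitter(
  seed = 42L,
  reserveTestFraction = 0.1,
  maxTrainingSample = 5000000
)
```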