AdamChit commented 5 years ago

Related issues

DataBalancer for binary classification has a parameter that controls the max data passed into modeling - Regression should allow similar limits

Describe the proposed solution

Steps:

Investigate where the check should occur (somewhere in DataSplitter)
Add logic to downsample when the maximum number of records is reached
Add downsampling information to the summary object and log
Add tests to DataSplitterTest and RegressionModelSelectorTest to cover the downsampling logic
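
A minimal sketch of the downsampling check described in the steps above, assuming the limit is expressed as a maximum record count. It is not the actual DataSplitter code: `DownsampleSketch`, `DownsampleSummary`, `summarize` and `downSampleFraction` are hypothetical names used only for illustration. Only `maxTrainingSample` and the 1M default come from this PR description.

```scala
// Hypothetical sketch, not the actual TransmogrifAI implementation.
object DownsampleSketch {

  // Default limit proposed in this PR description: 1M records.
  val DefaultMaxTrainingSample: Long = 1000000L

  // Illustrative container for the information that would be added
  // to the summary object and the log.
  final case class DownsampleSummary(
    recordCount: Long,
    maxTrainingSample: Long,
    downSampleFraction: Double
  )

  // Computes the fraction of records to keep so that the training set
  // does not exceed maxTrainingSample. A fraction of 1.0 means no downsampling.
  def summarize(
    recordCount: Long,
    maxTrainingSample: Long = DefaultMaxTrainingSample
  ): DownsampleSummary = {
    require(recordCount >= 0, "recordCount must be non-negative")
    require(maxTrainingSample > 0, "maxTrainingSample must be positive")
    val fraction =
      if (recordCount > maxTrainingSample) maxTrainingSample.toDouble / recordCount
      else 1.0
    DownsampleSummary(recordCount, maxTrainingSample, fraction)
  }
}
```

With a Spark `Dataset`, the resulting fraction could then be passed to `sample(withReplacement = false, fraction, seed)` when it is below 1.0. Whether the check belongs in DataSplitter itself is the open question in the first step.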

Describe alternatives you've considered

Not having a limit for any of the model types. This was not optimal because some Spark models may have very long runtimes or behave badly with too much data. So the default will be to downsample once we have passed 1M records, and users can set their own maxTrainingSample if they are OK with working with a large dataset.

salesforce-cla[bot] commented 5 years ago

Thanks for the contribution! Before we can merge this, we need @AdamChit to sign the Salesforce.com Contributor License Agreement.

AdamChit commented 5 years ago

@TuanNguyen27 Could you review?

codecov[bot] commented 5 years ago

Codecov Report

Merging #413 into master will increase coverage by 0.02%. The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #413      +/-   ##
==========================================
+ Coverage   86.97%   86.99%   +0.02%     
==========================================
  Files         337      337              
  Lines       11060    11078      +18     
  Branches      357      597     +240     
==========================================
+ Hits         9619     9637      +18     
  Misses       1441     1441

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...alesforce/op/stages/impl/tuning/DataBalancer.scala | `96.11% <ø> (-0.18%)` | :arrow_down: |
| ...om/salesforce/op/stages/impl/tuning/Splitter.scala | `98.07% <100%> (+0.34%)` | :arrow_up: |
| ...e/op/stages/impl/selector/ModelSelectorNames.scala | `100% <100%> (ø)` | :arrow_up: |
| ...alesforce/op/stages/impl/tuning/DataSplitter.scala | `90% <100%> (+23.33%)` | :arrow_up: |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 1bf6fdf...cfbe22f.

AdamChit commented 5 years ago

An unrelated test fails during the Travis CI check: https://travis-ci.com/salesforce/TransmogrifAI/jobs/243091644. I'll create a ticket to increase the tolerance on that test.

salesforce / TransmogrifAI

Regression training limit #413
