salesforce / TransmogrifAI

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Apache Spark with minimal hand-tuning
https://transmogrif.ai
BSD 3-Clause "New" or "Revised" License

Regression training limit #413

Closed: AdamChit closed this 5 years ago

AdamChit commented 5 years ago

Related issues

DataBalancer for binary classification has a parameter that controls the maximum amount of data passed into modeling. Regression should allow a similar limit.

Describe the proposed solution

Steps:

  1. Investigate where the check should occur (somewhere in DataSplitter)

  2. Add logic to downsample when the maximum number of records is reached

  3. Add downsampling information to the summary object and log

  4. Add tests to DataSplitterTest and RegressionModelSelectorTest to cover the downsampling logic
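The downsampling and summary steps above can be sketched in plain Python. This is only an illustration of the idea, not the code in this PR: `SplitSummary`, its field names, and the `downsample` helper are hypothetical, and the per-record Bernoulli draw stands in for a fraction-based sample such as Spark's `Dataset.sample`.

```python
import random
from dataclasses import dataclass


@dataclass
class SplitSummary:
    """Hypothetical summary record; the field names are illustrative only."""
    pre_split_count: int
    down_sample_fraction: float
    downsampled: bool


def downsample(records, max_training_sample, seed=42):
    """Return (kept_records, summary).

    If the input has more than max_training_sample records, keep each
    record independently with probability max_training_sample / total,
    so the result has roughly (not exactly) max_training_sample records.
    Otherwise the input is returned unchanged.
    """
    total = len(records)
    if total <= max_training_sample:
        return list(records), SplitSummary(total, 1.0, False)

    fraction = max_training_sample / total
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    kept = [r for r in records if rng.random() < fraction]
    return kept, SplitSummary(total, fraction, True)
```

Returning the summary alongside the data makes it straightforward to log the fraction and to assert on it in tests, which is the kind of check steps 3 and 4 call for.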

Describe alternatives you've considered

Not having a limit for any of the model types - this was not optimal because some Spark models may have very long runtimes or behave badly with too much data. So the default will be to downsample once we have passed 1M records, and to give users the option to set their own maxTrainingSample if they are OK with working with large datasets.
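As a rough sketch of how the default and the user override could interact, assuming the limit is applied as a sampling fraction: the function name and constant below are hypothetical, and only the 1M default and the `maxTrainingSample` option come from the description above.

```python
# Default limit taken from the PR description (1M records).
DEFAULT_MAX_TRAINING_SAMPLE = 1_000_000


def down_sample_fraction(record_count, max_training_sample=None):
    """Fraction of records to keep; 1.0 means no downsampling.

    max_training_sample=None falls back to the default limit.
    """
    limit = (DEFAULT_MAX_TRAINING_SAMPLE
             if max_training_sample is None
             else max_training_sample)
    if limit <= 0:
        raise ValueError("max_training_sample must be positive")
    if record_count <= limit:
        return 1.0
    return limit / record_count


print(down_sample_fraction(500_000))                # below the default limit
print(down_sample_fraction(4_000_000))              # default limit applies
print(down_sample_fraction(4_000_000, 10_000_000))  # user raised the limit
```

Under these assumptions, a user who is comfortable training on a large dataset only needs to raise the limit above the record count to turn downsampling off.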

salesforce-cla[bot] commented 5 years ago

Thanks for the contribution! Before we can merge this, we need @AdamChit to sign the Salesforce.com Contributor License Agreement.

AdamChit commented 5 years ago

@TuanNguyen27 Could you review?

codecov[bot] commented 5 years ago

Codecov Report

Merging #413 into master will increase coverage by 0.02%. The diff coverage is 100%.


@@            Coverage Diff             @@
##           master     #413      +/-   ##
==========================================
+ Coverage   86.97%   86.99%   +0.02%     
==========================================
  Files         337      337              
  Lines       11060    11078      +18     
  Branches      357      597     +240     
==========================================
+ Hits         9619     9637      +18     
  Misses       1441     1441

| Impacted Files | Coverage Δ |
|---|---|
| ...alesforce/op/stages/impl/tuning/DataBalancer.scala | 96.11% <ø> (-0.18%) :arrow_down: |
| ...om/salesforce/op/stages/impl/tuning/Splitter.scala | 98.07% <100%> (+0.34%) :arrow_up: |
| ...e/op/stages/impl/selector/ModelSelectorNames.scala | 100% <100%> (ø) :arrow_up: |
| ...alesforce/op/stages/impl/tuning/DataSplitter.scala | 90% <100%> (+23.33%) :arrow_up: |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 1bf6fdf...cfbe22f.

AdamChit commented 5 years ago

An unrelated test fails during the Travis CI check: https://travis-ci.com/salesforce/TransmogrifAI/jobs/243091644. I'll create a ticket to increase the tolerance on that test.