intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0
16 stars 3 forks

AutoML Topology/pipeline selection function for easy use #824

Open gganduu opened 4 years ago

gganduu commented 4 years ago

Hi, the current implementation generates a different recipe for each topology. To promote the AutoML function to users, a feature that automatically selects the topology would be more valuable. It would help data scientists obtain an accuracy baseline from the AutoML tool.

jason-dai commented 4 years ago

@gganduu To clarify, you would like to automatically select the neural networks to be used, yes?

shane-huang commented 4 years ago

The current implementation is able to select from different topologies (or neural network models), though to enable it users have to write their own recipe, which requires a certain level of understanding of the internals and some programming. Besides, it might be more convenient to output a leaderboard so users can get an idea of the accuracy of each topology/model family, as H2O and Azure do. Moreover, there may be other families of models (e.g. ARIMA) that are not suitable to be searched together with neural network topologies (statistical models may have different search strategies and may require less time to finish). Last, sometimes users may select a model not only by accuracy but also by cost - e.g. they might prefer simpler models if their accuracy is not significantly lower than that of complex models.

Based on the above reasons, a reasonable way to improve is to run (families of) models sequentially, enable distributed trials within each (family of) model, and output a leaderboard for each family/topology. The changes to zouwu will include the following:

  1. Add include_algos and exclude_algos to allow users to specify which algos to include or exclude. AutoTSTrainer outputs an AutoTSTrainResult object instead of a TSPipeline. This also allows for future extensions such as visualization, monitoring, and resumed training. A TSPipeline is accessible from an AutoTSTrainResult; TSPipeline is still intended for deployment and online training and shouldn't contain too much meta information about the auto training.
AutoTSTrainResult = AutoTSTrainer.fit(..., include_algos=["lstm", "seq2seq", "randomwalk"], exclude_algos=["algos"])
ts_pipeline = AutoTSTrainResult.get_pipeline()
  2. The order of running the algorithms is predefined. The order is generally as below:

    1. simple models (e.g. the average method, random walk; see https://otexts.com/fpp2/simple-methods.html for more)
    2. AR and ES families of models (e.g. Holt-Winters, ARIMA)
    3. neural network models
  3. Finally, we output a leaderboard of the algorithms' accuracy.

    leaderboard = AutoTSTrainResult.get_leaderboard()
    The output is a dataframe; an example leaderboard looks like below:

    | algorithm | MSE   | SMAP |
    |-----------|-------|------|
    | ARIMA     | 100.0 | 0.2  |
    | LSTM      | 12.0  | 0.07 |
    | MTNet     | 11.0  | 0.06 |
  4. Each algo's pipeline can be accessed from the AutoTSTrainResult by specifying the algo argument in get_pipeline(). The default is "auto", which selects the algo with the best accuracy; later we may change this to an ensemble of several of the best models instead of the single best one. Users can also specify "lstm" or "arima" to get the corresponding pipelines.

    ts_pipeline = AutoTSTrainResult.get_pipeline(algo="auto")
  5. Users can still specify the detailed hyper-params for each algo in a predefined file; the format would be something like below:

    lstm:lstm_1_units=(32, 16)
    mtnet:ar_windows=(4, 12)
    training_iteration=50

    In lstm:lstm_1_units, lstm is the namespace and lstm_1_units is a hyper-parameter of that algorithm. The hyper-params file will be read, and default parameters will be overridden by the user-specified hyper-params.
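The namespace:param=value file format above is simple enough to sketch a parser for. This is an illustrative sketch only - the function name and the convention that an un-namespaced line (e.g. training_iteration=50) applies globally are assumptions for the example, not part of zouwu:

```python
# Illustrative parser for the proposed "namespace:param=value" hyper-param
# file format; not the actual zouwu implementation.
import ast

def parse_hyperparams(text):
    """Parse lines like 'lstm:lstm_1_units=(32, 16)' into a nested dict.

    Lines without a namespace (e.g. 'training_iteration=50') are stored
    under the key None, treated here as applying to all algorithms.
    """
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        namespace, sep, name = key.partition(":")
        if not sep:                      # no "namespace:" prefix
            namespace, name = None, key
        # ast.literal_eval safely parses tuples, ints, floats and strings
        params.setdefault(namespace, {})[name.strip()] = ast.literal_eval(value.strip())
    return params

config = parse_hyperparams("""\
lstm:lstm_1_units=(32, 16)
mtnet:ar_windows=(4, 12)
training_iteration=50
""")
# config["lstm"]["lstm_1_units"] is now the tuple (32, 16)
```

The parsed dict could then be merged over each algorithm's default search space before trials start.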

@gganduu would you check if this is what you want?
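A minimal mock of the result-object design proposed above may make the API concrete. The class and method names follow the proposal, but the fields, metric names, and selection logic are illustrative assumptions, not zouwu's actual implementation:

```python
# Hypothetical mock of the proposed AutoTSTrainResult; illustrative only.
from typing import Dict, List, Tuple

class AutoTSTrainResult:
    def __init__(self,
                 metrics: Dict[str, Dict[str, float]],
                 pipelines: Dict[str, object]):
        # per-algorithm metrics, e.g. {"arima": {"MSE": 100.0, "SMAP": 0.2}}
        self.metrics = metrics
        # per-algorithm trained pipeline (any object stands in for TSPipeline)
        self.pipelines = pipelines

    def get_leaderboard(self) -> List[Tuple[str, Dict[str, float]]]:
        # Rank algorithms by MSE, best (lowest) first.
        return sorted(self.metrics.items(), key=lambda kv: kv[1]["MSE"])

    def get_pipeline(self, algo: str = "auto"):
        # "auto" picks the algorithm with the best accuracy (lowest MSE);
        # the proposal notes this may later become an ensemble instead.
        if algo == "auto":
            algo = self.get_leaderboard()[0][0]
        return self.pipelines[algo]

result = AutoTSTrainResult(
    metrics={"arima": {"MSE": 100.0, "SMAP": 0.2},
             "lstm": {"MSE": 12.0, "SMAP": 0.07},
             "mtnet": {"MSE": 11.0, "SMAP": 0.06}},
    pipelines={"arima": "arima_pipeline", "lstm": "lstm_pipeline",
               "mtnet": "mtnet_pipeline"},
)
best = result.get_pipeline()          # selects the mtnet pipeline (lowest MSE)
```

Keeping trial metadata in the result object, and only the chosen model in the pipeline, matches the point above that TSPipeline should stay lean for deployment.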

shane-huang commented 4 years ago

@jason-dai @yushan111 please also review if the general design above is okay.

gganduu commented 4 years ago

> To clarify, you would like to automatically select the neural networks to be used, yes?

Yes, it could be automatic as the default option, while also supporting manual selection by the user.

gganduu commented 4 years ago

It's what I want to have - looks good! The purpose is ease of use. Thanks, Shane!

shanyu-sys commented 4 years ago

@shane-huang LGTM

helenlly commented 3 years ago

@shane-huang @gganduu it seems the fix is available. May we close this issue? Thanks.