
AutoML #18

Open mcdoker18 opened 4 years ago

mcdoker18 commented 4 years ago

Summary: we need to provide a tool that lets the user run multiple training workers with different hyperparameter values.

AutoML theory:

Candidate open-source AutoML tool for integration into ODAHU

Subtasks:

  1. Develop the syntax for describing a strategy for hyperparameter variation (the examples below are only concepts – design the syntax yourself and get approval from the team). We suggest implementing a parent ODAHU API resource that wraps the regular ModelTrainingTemplate resource together with the extra parameters needed for scalable runs; a valid-YAML sketch of such a resource follows this subtask list. There are three common strategies for hyperparameter search:

     1.1. Manual – the user hardcodes fixed values for the hyperparameters, reviews the results, and modifies them for the next run (ODAHU already allows this). Example:

    hyperParameters:
      strategy: "default/fixed"
      alpha: "1.0"
      beta: "2.0"

    1.2. Random search in a space. Imagine we have X hyperparameters; if we brute-force all combinations with N candidate values per parameter, we need N^X runs. To reduce the cost of searching such a large space, values are usually sampled at random from the given ranges. Example:

    hyperParameters:
      strategy: "random"
      countOfSelections: 10
      alpha: "[0.1, 10]", step: "0.1"
      beta: "[2, 50]", step: "2"
      gamma: "[0.1, 0.5]", step: "0.1"

    1.3. Vary one hyperparameter over a range while the others stay fixed. One hyperparameter is described with the range syntax, the rest are hardcoded; the system guarantees that every value of the varied hyperparameter is run. Example:

    hyperParameters:
      strategy: "range"
      alpha: "[0.1, 10]", step: "0.1"
      beta: "2"
      gamma: "3"
  2. Develop the syntax for describing the resources for scalable model training according to the chosen hyperparameter strategy (also covered by the sketch below). Example:

    resources:
      workers: 5
      limits:
        cpu: 4024m
        memory: 4024Mi
      requests:
        cpu: 2024m
        memory: 2024Mi
  3. Implement the scalable training process itself: expand the strategy the user describes into individual training runs and distribute them across the requested resources (see the expansion sketch below).
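
To make subtasks 1 and 2 concrete, here is a minimal sketch of the parent resource in valid YAML. The kind name ModelTrainingGroup and all field names are hypothetical placeholders for the syntax to be designed, not an existing ODAHU API:

    # Hypothetical parent resource (subtasks 1 and 2); all field names are placeholders
    kind: ModelTrainingGroup
    id: wine-tuning
    spec:
      trainingTemplate: wine-training    # the regular ModelTrainingTemplate to run per combination
      hyperParameters:
        strategy: "random"
        countOfSelections: 10
        parameters:
          alpha: { min: 0.1, max: 10, step: 0.1 }
          beta: { min: 2, max: 50, step: 2 }
          gamma: { min: 0.1, max: 0.5, step: 0.1 }
      resources:
        workers: 5                       # parallel training workers
        limits:
          cpu: 4024m
          memory: 4024Mi
        requests:
          cpu: 2024m
          memory: 2024Mi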
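
For subtask 3, one possible design is a controller that expands the group into countOfSelections ordinary training runs with concrete sampled values and schedules at most `workers` of them in parallel. A hypothetical sketch of one generated run (the schema here is assumed, not taken from the ODAHU docs):

    # One of the 10 runs generated from the group above (hypothetical expansion)
    kind: ModelTraining
    id: wine-tuning-run-3
    spec:
      template: wine-training
      hyperParameters:          # concrete values sampled by the "random" strategy
        alpha: "3.7"
        beta: "14"
        gamma: "0.2"
      resources:
        limits:
          cpu: 4024m
          memory: 4024Mi
        requests:
          cpu: 2024m
          memory: 2024Mi
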
alinaignatiuk commented 2 years ago

We have competitors like SageMaker and Kubeflow; whether to build this depends on our positioning and strategy.