legion-platform / legion

The Toolchain agnostic Runtime AI platform
http://legion-platform.org
Apache License 2.0

Horizontal scaling for model training #1053

Open mcdoker18 opened 4 years ago

mcdoker18 commented 4 years ago

For now, we can only scale resources for model training vertically. We need to do research to determine how we can improve this. Subtasks:

vlad-tokarev commented 4 years ago

Distributed ML research

Questions

  1. How should a Legion user interact with the ModelTrainingScaling feature?
    1. What API should the user use?
      1. Should they use a unified Legion API for scaling? (A hypothetical sketch follows this list.)
      2. Or should they rely on a framework-specific scaling API? Examples:
        • TF distributed
        • Sklearn partial_fit
        • Horovod
    2. How many changes should the user have to make to their code in order to run training in a scaled way?
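
To make question 1.1 concrete, here is a purely hypothetical sketch of a unified Legion-side scaling spec; every field name below is an assumption for discussion, not an existing Legion API. The framework-specific alternative is illustrated by the partial_fit and Horovod sketches further down.

```python
# Hypothetical unified spec (all field names are assumptions, not Legion's
# actual API): the user declares *how much* to scale, Legion decides *how*.
training_spec = {
    "name": "wine-quality-training",
    "toolchain": "mlflow",
    "resources": {"cpu": "2", "memory": "4Gi"},  # vertical scaling (today)
    "scaling": {                                 # horizontal scaling (proposed)
        "workers": 4,
        "strategy": "gradient-averaging",
    },
}
```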

General concerns about the ML scaling process

  1. Scaling ML algorithms
    1. First of all, we need to realize that not all algorithms and libraries can support scaling
    2. The algorithm itself must support scaling; after that, the library that implements it must support scaling too
    3. For example (as I understand it): almost all NN methods and XGBoost can be scaled by distributing the computations, but the classic linear regression method cannot
  2. Two approaches for scaling models
    1. Parallelization (boost training speed with multiple workers)
      1. Gradient averaging separates gradient calculation from gradient application, so the calculations can be divided across a pool of nodes and the results of multiple independent workers combined. Obviously it can only be used for models trained with gradient descent. This approach is used by TF's native distributed framework and by Horovod (see the first sketch after this list)
      2. I have not found information about distributed computation for the very popular Sklearn library; it only supports single-node parallelization via the Python joblib library
    2. Incremental learning (boost resource usage and speed via stream data processing)
      1. This technique allows an ML algorithm to receive data incrementally. Not all algorithms support it
      2. Some Sklearn models support the partial_fit method for incremental learning, so a Legion user can write a notebook that consumes data streams from big-data storage (see the second sketch after this list)
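
To make gradient averaging (2.1.1) concrete, here is a minimal framework-free sketch in plain NumPy: each "worker" computes a gradient on its own data shard, and the averaged gradient is applied to the shared parameters. In a real system the averaging step is an allreduce across nodes (which is what Horovod provides); here it is simulated in-process.

```python
import numpy as np

# Linear least squares: loss = ||X w - y||^2 / n, grad = 2 X^T (X w - y) / n.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(5.0)
y = X @ true_w + 0.1 * rng.normal(size=1000)

# Split the data into one shard per worker.
n_workers = 4
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

w = np.zeros(5)
lr = 0.1
for _ in range(200):
    # Each worker computes a gradient on its own shard ...
    grads = [2.0 * Xs.T @ (Xs @ w - ys) / len(ys) for Xs, ys in shards]
    # ... and the averaged gradient (the allreduce step) is applied once.
    w -= lr * np.mean(grads, axis=0)

print(np.round(w, 2))  # close to true_w = [0, 1, 2, 3, 4]
```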
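
And for incremental learning (2.2.2), a minimal scikit-learn sketch: partial_fit is scikit-learn's real API, while the chunk generator below is a stand-in for a stream from big-data storage.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def data_stream(n_chunks=10, chunk_size=500, seed=0):
    """Stand-in for a data stream from big-data storage: yields chunks
    instead of loading the full dataset into memory."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, 20))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

clf = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared on the first call
for X_chunk, y_chunk in data_stream():
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```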

Tools

  1. Horovod
    • Was introduced by Uber
    • Uber was using the TF distributed approach but ran into two problems:
      • The TF distributed tool uses GPU resources in a suboptimal way
      • The TF distributed tool has a verbose API that introduces a lot of new concepts into an existing ML codebase
    • Relies on the MPI framework under the hood
    • Relies on ring-allreduce
    • Supports primarily NNs; does not support XGBoost (a minimal usage sketch follows this list)
  2. rabit
    • A library from the XGBoost authors: a Reliable Allreduce and Broadcast Interface for distributed machine learning
  3. Dask
  4. Amazon SageMaker: an AWS abstraction for running TF models with either Horovod or the TF-native approach
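
To illustrate how few changes Horovod asks of existing Keras code (the point it was built around, in contrast to TF distributed's verbose API), a minimal sketch using Horovod's documented Keras integration; the model and data here are placeholders:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per worker, launched e.g. via `horovodrun -np 4 ...`

# Pin each worker to its own GPU, if any are visible.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(5,))])

# Scale the learning rate by the number of workers and wrap the optimizer:
# DistributedOptimizer averages gradients across workers via ring-allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

callbacks = [
    # Make all workers start from identical weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# Each worker then trains on its own shard of the data (sharding not shown):
# model.fit(x_shard, y_shard, callbacks=callbacks)
```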

Problems

  1. ML training is actually run by a toolchain integration (TI), not by Legion itself. For example, the MLflow integration runs training using one of its backends (local conda runner, Databricks, k8s); our integration uses only the local backend. Since the Legion scaling feature probably should not depend directly on a TI, do we need to introduce a new abstraction level for running training? (A hypothetical sketch follows.)
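
One hypothetical shape for such an abstraction level, purely for discussion (the ScalingSpec fields and the TrainingRunner interface below are assumptions, not existing Legion code): Legion owns the scaling parameters and talks to a runner interface, and each TI provides an implementation.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ScalingSpec:
    workers: int = 1        # horizontal scaling parameters owned by Legion
    strategy: str = "none"  # e.g. "gradient-averaging", "incremental"

class TrainingRunner(Protocol):
    """Boundary between Legion and a toolchain integration (TI)."""
    def run(self, training_id: str, scaling: ScalingSpec) -> None: ...

class MLFlowLocalRunner:
    """TI-side implementation; today it would wrap MLflow's local backend."""
    def run(self, training_id: str, scaling: ScalingSpec) -> None:
        if scaling.workers > 1:
            raise NotImplementedError("local MLflow backend cannot scale out")
        ...  # invoke the MLflow project run here, as the integration does now
```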