legion-platform / legion

The Toolchain agnostic Runtime AI platform
http://legion-platform.org
Apache License 2.0

Horizontal scaling for model training #1053

Open mcdoker18 opened 4 years ago

mcdoker18 commented 4 years ago

For now, we can only scale resources for model training vertically. We need to do research to determine how we can improve this. Subtasks:

vlad-tokarev commented 4 years ago

Distributed ML research

Questions

  1. How should a Legion user interact with the ModelTrainingScaling feature?
    1. What API should the user use?
      1. Should they use a unified Legion API for scaling? (A hypothetical sketch follows this list.)
      2. Or should they rely on a framework-specific scaling API? Examples:
        • TF distributed
        • Sklearn partial_fit
        • Horovod
    2. How many changes should the user have to make to their code in order to run training in a scaled way?
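
To make question 1.1 concrete, here is a purely hypothetical sketch of a unified Legion-side scaling spec; every field name below is an assumption for discussion, not an existing Legion API. The framework-specific alternative is illustrated by the partial_fit and Horovod sketches further down.

```python
# Hypothetical unified spec (all field names are assumptions, not Legion's
# actual API): the user declares *how much* to scale, Legion decides *how*.
training_spec = {
    "name": "wine-quality-training",
    "toolchain": "mlflow",
    "resources": {"cpu": "2", "memory": "4Gi"},  # vertical scaling (today)
    "scaling": {                                 # horizontal scaling (proposed)
        "workers": 4,
        "strategy": "gradient-averaging",
    },
}
```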

General concerns about the ML scaling process

  1. Scaling ML algorithms
    1. First of all, we need to realize that not all algorithms and libraries can support scaling
    2. The algorithm itself must support scaling; after that, the library that implements it must support scaling too
    3. For example (as I understand it): almost all NN methods and XGBoost can be scaled by distributing the computations, but the classic linear regression method cannot
  2. Two approaches for scaling models
    1. Parallelization (boost training speed with multiple workers)
      1. Gradient averaging separates gradient calculation from gradient application, so the calculations can be divided across a pool of nodes and the results of multiple independent workers combined. Obviously it can only be used for models trained with gradient descent. This approach is used by TF's native distributed framework and by Horovod (see the first sketch after this list)
      2. I have not found information about distributed computation for the very popular Sklearn library; it only supports single-node parallelization via the Python joblib library
    2. Incremental learning (boost resource usage and speed via stream data processing)
      1. This technique allows an ML algorithm to receive data incrementally. Not all algorithms support it
      2. Some Sklearn models support the partial_fit method for incremental learning, so a Legion user can write a notebook that consumes data streams from big-data storage (see the second sketch after this list)
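
To make gradient averaging (2.1.1) concrete, here is a minimal framework-free sketch in plain NumPy: each "worker" computes a gradient on its own data shard, and the averaged gradient is applied to the shared parameters. In a real system the averaging step is an allreduce across nodes (which is what Horovod provides); here it is simulated in-process.

```python
import numpy as np

# Linear least squares: loss = ||X w - y||^2 / n, grad = 2 X^T (X w - y) / n.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(5.0)
y = X @ true_w + 0.1 * rng.normal(size=1000)

# Split the data into one shard per worker.
n_workers = 4
shards = list(zip(np.array_split(X, n_workers), np.array_split(y, n_workers)))

w = np.zeros(5)
lr = 0.1
for _ in range(200):
    # Each worker computes a gradient on its own shard ...
    grads = [2.0 * Xs.T @ (Xs @ w - ys) / len(ys) for Xs, ys in shards]
    # ... and the averaged gradient (the allreduce step) is applied once.
    w -= lr * np.mean(grads, axis=0)

print(np.round(w, 2))  # close to true_w = [0, 1, 2, 3, 4]
```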
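
And for incremental learning (2.2.2), a minimal scikit-learn sketch: partial_fit is scikit-learn's real API, while the chunk generator below is a stand-in for a stream from big-data storage.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def data_stream(n_chunks=10, chunk_size=500, seed=0):
    """Stand-in for a data stream from big-data storage: yields chunks
    instead of loading the full dataset into memory."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, 20))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

clf = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared on the first call
for X_chunk, y_chunk in data_stream():
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```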

Tools

  1. Horovod
    • Was introduced by Uber
    • Uber was using the TF distributed approach but ran into two problems:
      • The TF distributed tool uses GPU resources in a suboptimal way
      • The TF distributed tool has a verbose API that introduces a lot of new concepts into an existing ML codebase
    • Relies on the MPI framework under the hood
    • Relies on ring-allreduce
    • Supports primarily NNs; does not support XGBoost (a minimal usage sketch follows this list)
  2. rabit
    • A library from the XGBoost authors: a Reliable Allreduce and Broadcast Interface for distributed machine learning
  3. Dask
  4. Amazon SageMaker: an AWS abstraction for running TF models with either Horovod or the TF-native approach
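
To illustrate how few changes Horovod asks of existing Keras code (the point it was built around, in contrast to TF distributed's verbose API), a minimal sketch using Horovod's documented Keras integration; the model and data here are placeholders:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per worker, launched e.g. via `horovodrun -np 4 ...`

# Pin each worker to its own GPU, if any are visible.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(5,))])

# Scale the learning rate by the number of workers and wrap the optimizer:
# DistributedOptimizer averages gradients across workers via ring-allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="mse", optimizer=opt)

callbacks = [
    # Make all workers start from identical weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# Each worker then trains on its own shard of the data (sharding not shown):
# model.fit(x_shard, y_shard, callbacks=callbacks)
```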

Problems

  1. ML training is actually run by a toolchain integration (TI), not by Legion itself. For example, the MLflow integration runs training using one of its backends (local conda runner, Databricks, k8s); our integration uses only the local backend. Since the Legion scaling feature probably should not depend directly on a TI, do we need to introduce a new abstraction level for running training? (A hypothetical sketch follows.)
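
One hypothetical shape for such an abstraction level, purely for discussion (the ScalingSpec fields and the TrainingRunner interface below are assumptions, not existing Legion code): Legion owns the scaling parameters and talks to a runner interface, and each TI provides an implementation.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ScalingSpec:
    workers: int = 1        # horizontal scaling parameters owned by Legion
    strategy: str = "none"  # e.g. "gradient-averaging", "incremental"

class TrainingRunner(Protocol):
    """Boundary between Legion and a toolchain integration (TI)."""
    def run(self, training_id: str, scaling: ScalingSpec) -> None: ...

class MLFlowLocalRunner:
    """TI-side implementation; today it would wrap MLflow's local backend."""
    def run(self, training_id: str, scaling: ScalingSpec) -> None:
        if scaling.workers > 1:
            raise NotImplementedError("local MLflow backend cannot scale out")
        ...  # invoke the MLflow project run here, as the integration does now
```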