pentium3 / sys_reading

system paper reading notes

Big Model Tutorial: Techniques and Systems to Train and Serve Bigger Models #259

Closed pentium3 closed 9 months ago

pentium3 commented 1 year ago

https://zhuanlan.zhihu.com/p/562741952

pentium3 commented 9 months ago

https://alpa.ai/icml22-tutorial.html

https://sites.google.com/view/icml-2022-big-model

Trends Driving Big Models - Ion Stoica

intro

model sizes are growing much faster than the memory of a single accelerator, so we need to parallelize the model across multiple devices

two ways to parallelize model

1. inter-operator parallelism: assign different operators (e.g., different layers / subgraphs) of the model to different devices

image

2. intra-operator parallelism: partition the computation of a single operator (e.g., a matmul) across multiple devices

image

Challenges

Huge optimization space:

- Growing diversity of specialized hardware (e.g., GPUs, TPUs)
- Huge and growing diversity of NN architectures

pentium3 commented 9 months ago

https://icml.cc/virtual/2022/tutorial/18440

How to train and serve big models?

image

Using parallelization.

Scope of the tutorial: Parallelization techniques and systems for big models

pentium3 commented 9 months ago

Preliminaries

Historical context and problem overview

1. Data parallelism with Parameter Server

allows asynchronous training: workers push gradients to and pull updated parameters from the parameter servers independently

image

however, many existing systems like PyTorch DDP use all-reduce to update parameters synchronously, since communication between GPUs (within a single machine) is fast enough

image
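to make the all-reduce flavor of data parallelism concrete, here is a minimal sketch (assuming the process group is already initialized, e.g. via dist.init_process_group, and the model/optimizer/loss are placeholders) of averaging gradients across workers before the optimizer step:

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, loss_fn, inputs, targets):
    # each worker holds a full replica of the model and a different shard of the data
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # synchronous data parallelism: all-reduce (sum) the gradients across workers,
    # then divide by world size so every replica applies the same averaged gradient
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()
```

(PyTorch DDP does essentially this, but overlaps the all-reduce with the backward pass by bucketing gradients.)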

2. Computational Graph and Placement

TensorFlow: DL computation formulated as a dataflow graph

the memory of one GPU is limited, so we need to place different nodes of the dataflow graph on different devices (GPUs)

for big models, deciding this placement is a hard problem.

Background

1. distributed DL Computation

computation pattern of DL training:

for each epoch {
    for each batch of input data {
        L(): forward propagation: compute the prediction and loss with the model
        ∇L(): backward propagation: compute the gradient of the loss w.r.t. the model weights
        f(): use the gradient to update the model weights
    }
}

image
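a minimal single-device PyTorch version of this pattern (model, data loader, and loss are placeholders):

```python
import torch

def train(model, data_loader, loss_fn, num_epochs=1, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(num_epochs):                  # for each epoch
        for inputs, targets in data_loader:          # for each batch of input data
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)   # forward: prediction + loss
            loss.backward()                          # backward: compute gradients
            optimizer.step()                         # f(): apply gradients to the weights
```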

2. Classic Views of ML Parallelisms: Data Parallelism / Model Parallelism

Data Parallelism: replicate the whole model on every device; each device computes gradients on a different shard of the training data, and the gradients are synchronized (via parameter server or all-reduce)

Model Parallelism: partition the model itself across devices, used when the model does not fit in the memory of a single device

New view (this tutorial): Inter-op parallelism / Intra-op parallelism

1. formulate DL Computation as Graph

image

components: nodes are operators (e.g., matmul, relu), edges are the tensors flowing between them

3 steps: forward propagation, backward propagation, and gradient update, matching the training pattern above

2. formulate compute resources as Device Cluster

image

3. Problem formulation: How to partition the computational graph on the device cluster?

image

image

4. computation patterns inside Intra- and Inter-op Parallelism

In the following example, which computes a matrix multiplication:

image

partition strategy 1: Intra-op parallelism

image

(1):

(2):

image

pattern of Collective Communication: (from NCCL documentation)

use all-reduce to aggregate the partial results across all devices into a full tensor replicated on every device

image
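a small NumPy simulation of this strategy (shapes and device count are made up): each "device" holds a column slice of A and the matching row slice of B, computes a partial product, and an all-reduce (here just a sum) turns the partial results into the full output on every device:

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(4, 8)                       # e.g. activations
B = np.random.randn(8, 6)                       # e.g. weights
num_devices = 2

# intra-op partitioning: split A by columns and B by rows across devices
A_shards = np.split(A, num_devices, axis=1)
B_shards = np.split(B, num_devices, axis=0)

# each device computes a partial result of the full matmul
partials = [A_shards[d] @ B_shards[d] for d in range(num_devices)]

# all-reduce: sum the partial results so every device gets the full output tensor
full = sum(partials)
assert np.allclose(full, A @ B)
```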

partition strategy 2: Inter-op parallelism

image

pattern of Point-to-point Communication:

image
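a hedged sketch of this pattern using torch.distributed point-to-point calls (stage modules, shapes, and process-group setup are assumed; rank r runs stage r, and next_microbatch() is a hypothetical input source):

```python
import torch
import torch.distributed as dist

def run_stage(stage_module, rank, world_size, activation_shape):
    if rank == 0:
        x = next_microbatch()                  # hypothetical: first stage reads real input
    else:
        x = torch.empty(activation_shape)
        dist.recv(x, src=rank - 1)             # point-to-point: receive from previous stage

    y = stage_module(x)                        # run this stage's slice of the model

    if rank < world_size - 1:
        dist.send(y, dst=rank + 1)             # point-to-point: send to the next stage
    return y
```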

5. summary:

image

we can formulate the ultimate problem with this new view of parallelism:

What’s the best way to execute the graph subject to memory and communication constraints?

pentium3 commented 9 months ago

Inter-op parallelism

suppose we partition a computational graph into 4 stages with inter-op parallelism, and put each stage on one device:

image

pipeline parallelism

1. pipeline bubble

due to data dependencies, stage 2 can compute only after we get the result of stage 1. if we visualize the execution timeline, we can see many periods when devices are idle (gray areas), a.k.a. pipeline bubbles

for Inference case

image

for Training

image

2. How to Reduce Pipeline Bubbles for Training?

2.1 Device Placement

Idea: Slice the branches of a neural network into multiple stages so they can be calculated concurrently

image

Limitations: (1) only works for NNs with branches; (2) device utilization is still low, so device placement needs to be combined with the other pipeline schedules discussed later to further improve device utilization

2.2 Synchronous Pipeline Parallel Schedule

Idea: Modify pipeline schedule to improve efficiency, but keep the computation and convergence semantics exactly the same as if training with a single device.

✅ Pros: keeps exactly the same convergence/accuracy semantics as single-device training

❌ Cons: pipeline bubbles still leave devices idle, and a synchronization point is needed at the end of every mini-batch

2.2.1 GPipe

image

Limitation: the memory needed to store intermediate activations grows as the number of micro-batches increases
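as a rough way to see the trade-off (a back-of-the-envelope sketch, not GPipe itself): with p pipeline stages and m micro-batches, the bubble fraction of a GPipe-style synchronous schedule is about (p - 1) / (m + p - 1), so more micro-batches shrink the bubble but keep more activations alive:

```python
def gpipe_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """approximate idle-time fraction of a GPipe-style synchronous pipeline"""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)

# more micro-batches -> smaller bubble, but more stored activations
for m in (1, 4, 16, 64):
    print(m, round(gpipe_bubble_fraction(4, m), 3))
```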

2.2.2 1F1B:

optimizes the memory usage of GPipe by performing backward passes as early as possible, so we don't need to keep the intermediate activations of all micro-batches in memory

image

2.2.3 Interleaved 1F1B

assign multiple non-contiguous stages to each device and interleave them, which further reduces the bubble at the cost of more communication

image

2.2.4 TeraPipe

applies pipeline parallelism to transformer LMs along the token (sequence) dimension: since token t only depends on earlier tokens, different token slices can be pipelined across stages

image

2.2.5 Chimera

optimizes 1F1B by combining bidirectional pipelines (an extra pipeline running through the stages in the reverse direction) to fill the bubbles

image

2.3 Asynchronous Pipeline Parallel Schedule

Idea: Start the next round of forward passes before the previous backward pass finishes.

✅ Pros: fewer or no pipeline bubbles, so device utilization is higher

❌ Cons: weights become stale (different micro-batches see different weight versions), which changes the convergence semantics

2.3.1 AMPNet

different stages can use different versions of the weights.

e.g., in this example, suppose stage 1 is on dev 1, stage 2 on dev 2, and stage 3 on dev 3. For data #1, the forward passes of stages 1 and 2 use the initial weights, but its forward pass of stage 3 (and the backward passes of all stages) use weights already updated by data #0

this brings noise into the training process, and is hard to make work for larger datasets

image

2.3.2 PipeDream

image

2.4 Automatic Stage Partitioning

The assumption of the previous pipeline parallel schedule algorithms: balanced stages, i.e., the running time (latency) of each stage is the same. Pipeline schedules work best with balanced stages.

image

however, imbalanced stages create more pipeline bubbles. so an important research problem is to minimize the maximum stage latency and maximize parallelization

there are generally two types of solutions:

2.4.1 Reinforcement Learning Based (mainly for device placement):

image

2.4.2 Optimization (Dynamic Programming/Linear Programming) Based:

image

2.5 Summary

image

pentium3 commented 9 months ago

Intra-op parallelism

There are generally 2 ways to achieve Intra-op parallelism

1. parallelize a single operator

image

image

2. parallelize a subgraph

a case of Intra-op Parallelism

in the following example, we first look at a subgraph of the computation graph of a 2-layer MLP. there are 2 ways to partition the input matrices to achieve intra-op parallelism:

image

when we merge multiple subgraphs into the whole computation graph, different operators' (in different subgraphs) parallelization strategies may require different partition formats of the same tensor, so we need to consider the re-partition communication cost:

image

in this example, if subgraph 1 and 2 use different parallelization strategies, we need to re-partition the output of relu, which leads to some communication cost.

image
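a small NumPy illustration of re-partitioning (shapes made up): if subgraph 1 leaves each device with a row slice of the relu output but subgraph 2 needs column slices, every device has to exchange blocks with every other device, i.e. an all-to-all style communication:

```python
import numpy as np

np.random.seed(0)
X = np.maximum(np.random.randn(8, 8), 0)      # relu output produced by subgraph 1
num_devices = 2

# layout produced by subgraph 1: each device holds a slice of rows
row_shards = np.split(X, num_devices, axis=0)

# layout required by subgraph 2: each device needs a slice of columns.
# re-partitioning: every device sends one block to every other device (all-to-all)
blocks = [np.split(shard, num_devices, axis=1) for shard in row_shards]
col_shards = [np.concatenate([blocks[src][dst] for src in range(num_devices)], axis=0)
              for dst in range(num_devices)]

assert np.allclose(np.concatenate(col_shards, axis=1), X)
```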

Problem formulation

image

3. representative works using Intra-op Parallelism

3.1 Model-specific Intra-op Parallel Strategies

AlexNet

image

Megatron-LM

the weights are large, so we partition the weights; lightweight operators like dropout are replicated on all devices.

image
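a simplified single-process simulation of this partitioning (real Megatron-LM runs one process per GPU with NCCL all-reduce; the sizes and the use of ReLU instead of GeLU are just for illustration): the first linear layer is split by columns and the second by rows, so the whole 2-layer MLP needs only one all-reduce at the end:

```python
import numpy as np

np.random.seed(0)
hidden, ffn, num_devices = 8, 32, 2
X  = np.random.randn(4, hidden)              # input activations, replicated on all devices
W1 = np.random.randn(hidden, ffn)            # first MLP weight, split by columns
W2 = np.random.randn(ffn, hidden)            # second MLP weight, split by rows

W1_shards = np.split(W1, num_devices, axis=1)
W2_shards = np.split(W2, num_devices, axis=0)

partials = []
for d in range(num_devices):
    h = np.maximum(X @ W1_shards[d], 0)      # column-parallel linear + activation, no comm
    partials.append(h @ W2_shards[d])        # row-parallel linear gives a partial sum

Y = sum(partials)                            # one all-reduce combines the partial outputs
assert np.allclose(Y, np.maximum(X @ W1, 0) @ W2)
```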

GShard MoE

partition the weights of the MoE (Mixture-of-Experts) layers since they are large; the other layers are replicated.

image

ZeRO Optimizer

partitions the optimizer states, gradients, and (optionally) the parameters across data-parallel devices to eliminate redundant copies

image
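a back-of-the-envelope sketch of why this helps (following the mixed-precision Adam accounting in the ZeRO paper, so treat the constants as approximate): with Ψ parameters, fp16 weights and gradients take 2Ψ + 2Ψ bytes and the fp32 Adam states take about 12Ψ bytes; each ZeRO stage shards one more of these across the N data-parallel devices:

```python
def zero_memory_gb(num_params, num_devices, stage):
    """rough per-device memory (GB) for mixed-precision Adam under ZeRO stages 0-3"""
    params, grads, opt_states = 2 * num_params, 2 * num_params, 12 * num_params  # bytes
    if stage >= 1:
        opt_states /= num_devices    # stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_devices         # stage 2: also shard gradients
    if stage >= 3:
        params /= num_devices        # stage 3: also shard parameters
    return (params + grads + opt_states) / 1e9

# e.g. a 7.5B-parameter model on 64 devices: ~120 GB -> ~31 GB -> ~17 GB -> ~1.9 GB
for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1), "GB")
```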

4. Combining inter- and intra-operator parallelism scales to more devices.

assign each stage to a device mesh (a group of devices connected by fast links); use inter-op parallelism across meshes and intra-op parallelism within each mesh

image

5. summary

pentium3 commented 9 months ago

New frontiers: auto-parallelization

X: best supported models of each system

Y: techniques used by each system

image

Auto-parallelization Problem: (automatically) find best combination of inter-op and intra-op strategies to maximize the performance of a model on a cluster of devices.

image

challenges

The Search Space is Huge.

existing Automatic Parallelization Methods

they use 3 types of algorithms to find the parallelization strategy:

image

General Recipe of solving the Automatic Parallelization problem (a toy sketch of this recipe follows the list below):

image

  1. define a search space of different strategies
  2. reduce the search space using heuristics
  3. apply a search algorithm to find the best strategy
  4. evaluate whether the found strategy works well (by deploying the strategy or using a human-designed cost model)
  5. the evaluation result can give feedback to the search algorithm, if applicable
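a toy sketch of this recipe (the strategies, pruning rule, and cost model below are entirely made up, just to show the control flow): enumerate candidate plans from a pruned search space, score each with a cost model, and keep the cheapest one:

```python
import itertools

# 1. search space: an inter-op choice (number of pipeline stages) and an
#    intra-op choice (how to shard operators), both heavily simplified
stage_counts = [1, 2, 4]
shard_options = ["replicate", "shard_rows", "shard_cols"]

def heuristic_prune(candidates):
    # 2. reduce the search space, e.g. drop plans that obviously waste devices
    return [c for c in candidates if not (c[0] == 4 and c[1] == "replicate")]

def cost_model(plan):
    # 4. hypothetical cost model: pretend compute shrinks with more stages
    #    while sharding adds per-stage communication cost
    stages, shard = plan
    compute = 100.0 / stages
    comm = {"replicate": 0.0, "shard_rows": 5.0, "shard_cols": 8.0}[shard] * stages
    return compute + comm

candidates = heuristic_prune(list(itertools.product(stage_counts, shard_options)))
best = min(candidates, key=cost_model)       # 3. search (here: exhaustive enumeration)
print("best plan:", best, "estimated cost:", cost_model(best))
```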

representative existing works

1. FlexFlow: uses an MCMC search-based method

image

image

not practical for large models / clusters

2. ColocRL (Device Placement Optimization with Reinforcement Learning)

only considers inter-op parallelism (device placement)

3. Alpa: Optimization-based Method

intuition: map inter-op parallelism (less communication but more device idle time) to slow connections in the cluster, and map intra-op parallelism (more communication but less device idle time) to fast connections in the cluster. -> minimize computation cost + communication cost

Hierarchically reduce search space:

for more details see https://github.com/pentium3/sys_reading/issues/228

comparison of 3 Automatic Parallelization Methods

image

image

Future ML Systems and Challenges

automatically transform (scaling && scheduling) any prototype model code developed on a single machine into a large-scale parallel version that runs on a cluster of devices, while achieving any performance goal.

image

image