https://alpa.ai/icml22-tutorial.html
https://sites.google.com/view/icml-2022-big-model
models are becoming too large to train on a single device, so we need to parallelize the model
Huge optimization space:
Growing diversity of specialized hardware (e.g., GPUs, TPUs)
Huge and growing diversity of NN architectures
https://icml.cc/virtual/2022/tutorial/18440
Using parallelization.
Scope of the tutorial: Parallelization techniques and systems for big models
allows asynchronous training
however, many existing systems like PyTorch DDP use all-reduce to update parameters synchronously, since communication between GPUs (within a single machine) is fast enough
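A minimal sketch of what synchronous data parallelism looks like with PyTorch DDP, assuming one GPU per process and a launch via torchrun; the model, data, and hyperparameters are placeholders:

```python
# Minimal sketch of synchronous data parallelism with PyTorch DDP.
# Assumes launch via `torchrun --nproc_per_node=N train.py`, one GPU per process.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
device = torch.device(f"cuda:{rank}")

model = torch.nn.Linear(1024, 1024).to(device)      # placeholder model, replicated on every GPU
ddp_model = DDP(model, device_ids=[rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for step in range(10):
    x = torch.randn(32, 1024, device=device)        # each rank trains on its own shard of the batch
    y = torch.randn(32, 1024, device=device)
    loss = loss_fn(ddp_model(x), y)
    loss.backward()                                  # DDP all-reduces gradients here, synchronously
    optimizer.step()
    optimizer.zero_grad()
```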
TensorFlow: DL computation formulated as a dataflow graph
the memory of one GPU is limited, so we need to place different nodes of the dataflow graph on different devices (GPUs)
for big models, deciding placement is a hard problem.
computation pattern of DL training:
for each epoch {
  for each batch of input data {
    ∇L(): forward propagation: compute prediction with the model
    ∇L(): backward propagation: apply the loss function, then compute gradients
    f(): use gradients to update model weights
  }
}
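For concreteness, a minimal single-device PyTorch version of this loop (model, data, and hyperparameters are placeholders):

```python
# Single-device PyTorch version of the loop above; model and data are placeholders.
import torch

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for epoch in range(3):                                   # for each epoch
    for _ in range(5):                                   # for each batch of input data
        x, y = torch.randn(8, 16), torch.randn(8, 1)
        pred = model(x)                                  # ∇L(): forward propagation
        loss = loss_fn(pred, y)                          # apply the loss function
        loss.backward()                                  # ∇L(): backward propagation, compute gradients
        opt.step()                                       # f(): use gradients to update model weights
        opt.zero_grad()
```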
Data Parallelism:
Model Parallelism
components:
3 steps:
In the following example, which computes a matrix multiplication:
partition strategy 1: Intra-op parallelism
(1):
(2):
pattern of Collective Communication: (from NCCL documentation)
use all-reduce to aggregate the partial results across all devices into a full tensor on every device
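A toy NumPy simulation of this strategy, assuming 2 "devices" and that C = A @ B is partitioned by splitting A along columns and B along rows (the all-reduce is just a Python sum here; a real system would use NCCL):

```python
# Toy NumPy simulation of intra-op parallelism for C = A @ B on 2 "devices".
# A is split by columns, B by rows; each device multiplies its shards into a partial C.
import numpy as np

A = np.random.randn(4, 6)
B = np.random.randn(6, 8)

A_shards = np.split(A, 2, axis=1)       # device i holds A[:, i*3:(i+1)*3]
B_shards = np.split(B, 2, axis=0)       # device i holds B[i*3:(i+1)*3, :]

partials = [A_shards[i] @ B_shards[i] for i in range(2)]   # local matmul on each "device"

# all-reduce: sum the partial results so that every device ends up with the full C
C = sum(partials)                        # a real system would call an NCCL all-reduce here
assert np.allclose(C, A @ B)
```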
partition strategy 2: Inter-op parallelism
pattern of Point-to-point Communication:
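A hedged sketch of inter-op parallelism with point-to-point communication using torch.distributed send/recv; it assumes 2 CPU processes launched via torchrun, and the layers and shapes are illustrative only:

```python
# Sketch of inter-op parallelism via point-to-point communication with torch.distributed.
# Assumes 2 CPU processes launched via `torchrun --nproc_per_node=2 p2p.py`; shapes are illustrative.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
rank = dist.get_rank()

if rank == 0:                                     # stage 1 lives on rank 0
    stage1 = torch.nn.Linear(16, 32)
    x = torch.randn(8, 16)
    act = stage1(x)
    dist.send(act.detach(), dst=1)                # ship stage-1 activations to the next stage
else:                                             # stage 2 lives on rank 1
    stage2 = torch.nn.Linear(32, 4)
    act = torch.empty(8, 32)
    dist.recv(act, src=0)                         # receive activations from the previous stage
    out = stage2(act)
```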
we can formulate the ultimate problem with this new view of parallelism:
What’s the best way to execute the graph subject to memory and communication constraints?
suppose we partition a computational graph into 4 stages with inter-op parallelism, and put each stage on one device:
due to data dependencies, stage 2 can compute only after the result of stage 1 is available. if we visualize the execution timeline, we can see many periods when devices are idle (gray area), a.k.a. pipeline bubbles
Idea: Slice the branches of a neural network into multiple stages so they can be calculated concurrently
Limitation: (1) only works for NNs with branches; (2) device utilization is still low. Device placement needs to be combined with the pipeline schedules discussed later to further improve device utilization
Idea: Modify pipeline schedule to improve efficiency, but keep the computation and convergence semantics exactly the same as if training with a single device.
✅ Pros:
❌ Cons:
2.2.1 GPipe
Limitation: memory usage for storing intermediate activations grows as the number of micro-batches increases
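A toy printout of a GPipe-style timeline (not from the tutorial; it assumes 4 stages, 4 micro-batches, and that forward and backward each take one time slot, whereas a real backward is usually ~2x slower):

```python
# Toy GPipe timeline: 4 stages, 4 micro-batches; 'Fi'/'Bi' = forward/backward of micro-batch i,
# '.' = idle slot (pipeline bubble). Assumes forward and backward each take one time slot.
num_stages, num_microbatches = 4, 4
width = 2 * (num_stages + num_microbatches - 1)

for s in range(num_stages):
    row = ["."] * width
    for m in range(num_microbatches):
        row[s + m] = f"F{m}"                      # forwards fill the pipeline stage by stage
        row[width - 1 - (s + m)] = f"B{m}"        # backwards drain it in reverse order
    print(f"stage {s}: " + " ".join(cell.rjust(2) for cell in row))
```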
2.2.2 1F1B:
optimizes the memory usage of GPipe by performing backward as early as possible, so we don't need to keep the intermediate activations of all micro-batches in memory
2.2.3 Interleaved 1F1B
2.2.4 TeraPipe
applies pipeline parallelism within Transformer models at the token (sequence) level
2.2.5 Chimera
optimizes 1F1B by inserting extra (bidirectional) pipelines to fill the bubbles
Idea: Start the next round of forward passes before the backward pass finishes.
✅ Pros:
❌ Cons:
2.3.1 AMPNet
different stages can hold different versions of the weights.
e.g., in this example, suppose stage 1 is on dev1, stage 2 on dev2, and stage 3 on dev3. For data #1, the forward pass through stages 1 and 2 uses the initial weights, but its forward pass through stage 3 (and its backward pass through all stages) uses weights already updated by data #0
this can introduce noise into the training process, and it is hard to make it work for larger datasets
2.3.2 Pipedream
The previous pipeline-parallel schedule algorithms assume balanced stages: the running time (latency) of each stage is the same. Pipeline schedules work best with balanced stages.
however, imbalanced stages create more pipeline bubbles, so an important research problem is to minimize the maximum stage latency & maximize parallelization
there are generally two types of solutions:
2.4.1 Reinforcement Learning Based (mainly for device placement):
2.4.2 Optimization (Dynamic Programming/Linear Programming) Based (see the toy DP sketch below):
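As referenced above, a toy dynamic program (not any specific paper's algorithm) for the balanced-stage problem: split a chain of layers into k contiguous stages so the maximum stage latency is minimized; the latency numbers are made up and communication cost is ignored:

```python
# Toy dynamic program: split a chain of layer latencies into k contiguous pipeline stages
# so that the maximum stage latency (the pipeline bottleneck) is minimized.
# Latency numbers are made up; a real planner would also model communication cost.
from functools import lru_cache

layer_latency = [4, 2, 7, 1, 3, 6, 2, 5]    # per-layer time, arbitrary units
k = 4                                       # number of pipeline stages / devices
n = len(layer_latency)

prefix = [0]
for t in layer_latency:
    prefix.append(prefix[-1] + t)

@lru_cache(maxsize=None)
def best(i, stages):
    """Minimum achievable bottleneck latency for layers[i:] split into `stages` stages."""
    if stages == 1:
        return prefix[n] - prefix[i]         # a single stage takes all remaining layers
    # try every split point j: current stage = layers[i:j], recurse on the rest
    return min(max(prefix[j] - prefix[i], best(j, stages - 1))
               for j in range(i + 1, n - stages + 2))

print("minimum bottleneck latency:", best(0, k))
```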
There are generally 2 ways to achieve Intra-op parallelism
in the following example, we first look at a subgraph of the computation graph of a 2-layer MLP. there are 2 ways to partition the input matrix to achieve intra-op parallelism:
when we merge multiple subgraphs into the whole computation graph, different operators' (in different subgraphs) parallelization strategies may require different partition formats of the same tensor, so we need to consider the re-partition communication cost:
in this example, if subgraphs 1 and 2 use different parallelization strategies, we need to re-partition the output of relu, which incurs some communication cost.
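A small NumPy toy of this re-partition cost, assuming 2 "devices" and a relu output that must move from a column-sharded layout to a row-sharded layout (the exchange is done locally here; a real system would perform an all-to-all):

```python
# NumPy toy of the re-partition cost on 2 "devices": subgraph 1 leaves the relu output H
# sharded by columns, but subgraph 2 expects it sharded by rows.
import numpy as np

H = np.maximum(np.random.randn(4, 8), 0)     # full relu output, kept here only for reference

col_shards = np.split(H, 2, axis=1)          # layout produced by subgraph 1's strategy
# re-partition: each device keeps the quarter of H it already owns and exchanges the rest
# with its peer (an all-to-all in a real system); here we simply rebuild the target layout
row_shards = np.split(np.concatenate(col_shards, axis=1), 2, axis=0)

bytes_moved = H.nbytes // 2                  # total traffic: each device ships half of its shard
print("communication volume for the re-partition:", bytes_moved, "bytes")
assert all(np.allclose(r, ref) for r, ref in zip(row_shards, np.split(H, 2, axis=0)))
```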
the weights are large, so we partition the weights; lightweight operators like dropout are replicated.
partition the weights of MoE layers, since they are large.
assign each stage to a device mesh (a group of devices, where devices along different mesh dimensions may be connected with different bandwidths)
X: best supported models of each system
Y: techniques used by each system
Auto-parallelization Problem: (automatically) find the best combination of inter-op and intra-op strategies to maximize the performance of a model on a cluster of devices.
The Search Space is Huge.
use 3 types of algorithms to find the parallelization strategy
General recipe for solving the automatic parallelization problem:
not practical for large models / clusters
only considers inter-op
intuition: map inter-op (less communication but more device idle time) to slow connections in the cluster, and map intra-op (more communication but less device idle time) to fast connections in the cluster -> minimize computation cost + communication cost
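A back-of-the-envelope sketch of this intuition with made-up traffic volumes and bandwidths, comparing the "good" mapping (intra-op on fast intra-node links) against the reversed one:

```python
# Back-of-the-envelope comparison of two mappings; all numbers are made up.
intra_comm_bytes = 1e9      # per-step all-reduce traffic generated by intra-op parallelism
inter_comm_bytes = 1e7      # per-step activation traffic between pipeline stages (inter-op)
fast_bw = 100e9             # NVLink-class bandwidth within a machine (bytes/s)
slow_bw = 1e9               # Ethernet-class bandwidth across machines (bytes/s)

good = intra_comm_bytes / fast_bw + inter_comm_bytes / slow_bw   # intra-op on fast links
bad  = intra_comm_bytes / slow_bw + inter_comm_bytes / fast_bw   # intra-op on slow links
print(f"good mapping comm time: {good * 1e3:.2f} ms, reversed mapping: {bad * 1e3:.2f} ms")
```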
Hierarchically reduce search space:
for more details see https://github.com/pentium3/sys_reading/issues/228
automatically transform (scaling && scheduling) any prototype model code developed on a single machine into a large-scale parallel version that runs on a cluster of devices, while achieving a given performance goal.
https://zhuanlan.zhihu.com/p/562741952