https://alpa.ai/icml22-tutorial.html
https://sites.google.com/view/icml-2022-big-model
models are becoming too large to train on a single device, so we need to parallelize the model
Huge optimization space:
Growing diversity of specialized hardware (e.g., GPUs, TPUs)
Huge and growing diversity of NN architectures
https://icml.cc/virtual/2022/tutorial/18440
Using parallelization.
Scope of the tutorial: Parallelization techniques and systems for big models
allows asynchronous training
however, many existing systems like PyTorch DDP use all-reduce to update parameters synchronously, since communication between GPUs (within a single machine) is fast enough
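A minimal sketch of what synchronous data parallelism looks like with PyTorch DDP, assuming one GPU per process and a launch via torchrun; the model, data, and hyperparameters are placeholders:

```python
# Minimal sketch of synchronous data parallelism with PyTorch DDP.
# Assumes launch via `torchrun --nproc_per_node=N train.py`, one GPU per process.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
device = torch.device(f"cuda:{rank}")

model = torch.nn.Linear(1024, 1024).to(device)      # placeholder model, replicated on every GPU
ddp_model = DDP(model, device_ids=[rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for step in range(10):
    x = torch.randn(32, 1024, device=device)        # each rank trains on its own shard of the batch
    y = torch.randn(32, 1024, device=device)
    loss = loss_fn(ddp_model(x), y)
    loss.backward()                                  # DDP all-reduces gradients here, synchronously
    optimizer.step()
    optimizer.zero_grad()
```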
TensorFlow: DL computation formulated as a dataflow graph
the memory of one GPU is limited, so we need to place different nodes of the dataflow graph on different devices (GPUs)
for big models, deciding placement is a hard problem.
computation pattern of DL training:
for each epoch {
  for each batch of input data {
    ∇L(): forward propagation: compute prediction with the model
    ∇L(): backward propagation: apply the loss function, then compute gradients
    f(): use gradients to update model weights
  }
}
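For concreteness, a minimal single-device PyTorch version of this loop (model, data, and hyperparameters are placeholders):

```python
# Single-device PyTorch version of the loop above; model and data are placeholders.
import torch

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

for epoch in range(3):                                   # for each epoch
    for _ in range(5):                                   # for each batch of input data
        x, y = torch.randn(8, 16), torch.randn(8, 1)
        pred = model(x)                                  # ∇L(): forward propagation
        loss = loss_fn(pred, y)                          # apply the loss function
        loss.backward()                                  # ∇L(): backward propagation, compute gradients
        opt.step()                                       # f(): use gradients to update model weights
        opt.zero_grad()
```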
Data Parallelism:
Model Parallelism
components:
3 steps:
In the following example, which computes a matrix multiplication:
partition strategy 1: Intra-op parallelism
(1):
(2):
pattern of Collective Communication: (from NCCL documentation)
use all-reduce to aggregate the partial results across all devices into a full tensor on every device
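A toy NumPy simulation of this strategy, assuming 2 "devices" and that C = A @ B is partitioned by splitting A along columns and B along rows (the all-reduce is just a Python sum here; a real system would use NCCL):

```python
# Toy NumPy simulation of intra-op parallelism for C = A @ B on 2 "devices".
# A is split by columns, B by rows; each device multiplies its shards into a partial C.
import numpy as np

A = np.random.randn(4, 6)
B = np.random.randn(6, 8)

A_shards = np.split(A, 2, axis=1)       # device i holds A[:, i*3:(i+1)*3]
B_shards = np.split(B, 2, axis=0)       # device i holds B[i*3:(i+1)*3, :]

partials = [A_shards[i] @ B_shards[i] for i in range(2)]   # local matmul on each "device"

# all-reduce: sum the partial results so that every device ends up with the full C
C = sum(partials)                        # a real system would call an NCCL all-reduce here
assert np.allclose(C, A @ B)
```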
partition strategy 2: Inter-op parallelism
pattern of Point-to-point Communication:
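A hedged sketch of inter-op parallelism with point-to-point communication using torch.distributed send/recv; it assumes 2 CPU processes launched via torchrun, and the layers and shapes are illustrative only:

```python
# Sketch of inter-op parallelism via point-to-point communication with torch.distributed.
# Assumes 2 CPU processes launched via `torchrun --nproc_per_node=2 p2p.py`; shapes are illustrative.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
rank = dist.get_rank()

if rank == 0:                                     # stage 1 lives on rank 0
    stage1 = torch.nn.Linear(16, 32)
    x = torch.randn(8, 16)
    act = stage1(x)
    dist.send(act.detach(), dst=1)                # ship stage-1 activations to the next stage
else:                                             # stage 2 lives on rank 1
    stage2 = torch.nn.Linear(32, 4)
    act = torch.empty(8, 32)
    dist.recv(act, src=0)                         # receive activations from the previous stage
    out = stage2(act)
```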
we can formulate the ultimate problem with this new view of parallelism:
What’s the best way to execute the graph subject to memory and communication constraints?
suppose we partition a computational graph into 4 stages with inter-op parallelism, and put each stage on one device:
due to data dependencies, stage 2 can compute only after the result of stage 1 is available. if we visualize the execution timeline, we can see many periods when devices are idle (gray area), a.k.a. pipeline bubbles
Idea: Slice the branches of a neural network into multiple stages so they can be calculated concurrently
Limitation: (1) only works for NNs with branches; (2) device utilization is still low. Device placement needs to be combined with the pipeline schedules discussed later to further improve device utilization
Idea: Modify pipeline schedule to improve efficiency, but keep the computation and convergence semantics exactly the same as if training with a single device.
✅ Pros:
❌ Cons:
2.2.1 GPipe
Limitation: memory usage for storing intermediate activations grows as the number of micro-batches increases
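A toy printout of a GPipe-style timeline (not from the tutorial; it assumes 4 stages, 4 micro-batches, and that forward and backward each take one time slot, whereas a real backward is usually ~2x slower):

```python
# Toy GPipe timeline: 4 stages, 4 micro-batches; 'Fi'/'Bi' = forward/backward of micro-batch i,
# '.' = idle slot (pipeline bubble). Assumes forward and backward each take one time slot.
num_stages, num_microbatches = 4, 4
width = 2 * (num_stages + num_microbatches - 1)

for s in range(num_stages):
    row = ["."] * width
    for m in range(num_microbatches):
        row[s + m] = f"F{m}"                      # forwards fill the pipeline stage by stage
        row[width - 1 - (s + m)] = f"B{m}"        # backwards drain it in reverse order
    print(f"stage {s}: " + " ".join(cell.rjust(2) for cell in row))
```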
2.2.2 1F1B:
optimizes the memory usage of GPipe by performing backward as early as possible, so we don't need to keep the intermediate activations of all micro-batches in memory
2.2.3 Interleaved 1F1B
2.2.4 TeraPipe
applies pipeline parallelism within Transformer models at the token (sequence) level
2.2.5 Chimera
optimizes 1F1B by inserting extra (bidirectional) pipelines to fill the bubbles
Idea: Start the next round of forward passes before the backward pass finishes.
✅ Pros:
❌ Cons:
2.3.1 AMPNet
different stages can hold different versions of the weights.
e.g., in this example, suppose stage 1 is on dev1, stage 2 on dev2, and stage 3 on dev3. For data #1, the forward pass through stages 1 and 2 uses the initial weights, but its forward pass through stage 3 (and its backward pass through all stages) uses weights already updated by data #0
this can introduce noise into the training process, and it is hard to make it work for larger datasets
2.3.2 Pipedream
The previous pipeline-parallel schedule algorithms assume balanced stages: the running time (latency) of each stage is the same. Pipeline schedules work best with balanced stages.
however, imbalanced stages create more pipeline bubbles, so an important research problem is to minimize the maximum stage latency & maximize parallelization
there are generally two types of solutions:
2.4.1 Reinforcement Learning Based (mainly for device placement):
2.4.2 Optimization (Dynamic Programming/Linear Programming) Based (see the toy DP sketch below):
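As referenced above, a toy dynamic program (not any specific paper's algorithm) for the balanced-stage problem: split a chain of layers into k contiguous stages so the maximum stage latency is minimized; the latency numbers are made up and communication cost is ignored:

```python
# Toy dynamic program: split a chain of layer latencies into k contiguous pipeline stages
# so that the maximum stage latency (the pipeline bottleneck) is minimized.
# Latency numbers are made up; a real planner would also model communication cost.
from functools import lru_cache

layer_latency = [4, 2, 7, 1, 3, 6, 2, 5]    # per-layer time, arbitrary units
k = 4                                       # number of pipeline stages / devices
n = len(layer_latency)

prefix = [0]
for t in layer_latency:
    prefix.append(prefix[-1] + t)

@lru_cache(maxsize=None)
def best(i, stages):
    """Minimum achievable bottleneck latency for layers[i:] split into `stages` stages."""
    if stages == 1:
        return prefix[n] - prefix[i]         # a single stage takes all remaining layers
    # try every split point j: current stage = layers[i:j], recurse on the rest
    return min(max(prefix[j] - prefix[i], best(j, stages - 1))
               for j in range(i + 1, n - stages + 2))

print("minimum bottleneck latency:", best(0, k))
```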
There are generally 2 ways to achieve Intra-op parallelism
in the following example, we first look at a subgraph of the computation graph of a 2-layer MLP. there are 2 ways to partition the input matrix to achieve intra-op parallelism:
when we merge multiple subgraphs into the whole computation graph, different operators' (in different subgraphs) parallelization strategies may require different partition formats of the same tensor, so we need to consider the re-partition communication cost:
in this example, if subgraphs 1 and 2 use different parallelization strategies, we need to re-partition the output of relu, which incurs some communication cost.
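A small NumPy toy of this re-partition cost, assuming 2 "devices" and a relu output that must move from a column-sharded layout to a row-sharded layout (the exchange is done locally here; a real system would perform an all-to-all):

```python
# NumPy toy of the re-partition cost on 2 "devices": subgraph 1 leaves the relu output H
# sharded by columns, but subgraph 2 expects it sharded by rows.
import numpy as np

H = np.maximum(np.random.randn(4, 8), 0)     # full relu output, kept here only for reference

col_shards = np.split(H, 2, axis=1)          # layout produced by subgraph 1's strategy
# re-partition: each device keeps the quarter of H it already owns and exchanges the rest
# with its peer (an all-to-all in a real system); here we simply rebuild the target layout
row_shards = np.split(np.concatenate(col_shards, axis=1), 2, axis=0)

bytes_moved = H.nbytes // 2                  # total traffic: each device ships half of its shard
print("communication volume for the re-partition:", bytes_moved, "bytes")
assert all(np.allclose(r, ref) for r, ref in zip(row_shards, np.split(H, 2, axis=0)))
```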
the weights are large, so we partition the weights; lightweight operators like dropout are replicated.
partition the weights of MoE layers, since they are large.
assign each stage to a device mesh (a group of devices, where devices along different mesh dimensions may be connected with different bandwidths)
X: best supported models of each system
Y: techniques used by each system
Auto-parallelization Problem: (automatically) find the best combination of inter-op and intra-op strategies to maximize the performance of a model on a cluster of devices.
The Search Space is Huge.
use 3 types of algorithms to find the parallelization strategy
General recipe for solving the automatic parallelization problem:
not practical for large models / clusters
only considers inter-op
intuition: map inter-op (less communication but more device idle time) to slow connections in the cluster, and map intra-op (more communication but less device idle time) to fast connections in the cluster -> minimize computation cost + communication cost
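A back-of-the-envelope sketch of this intuition with made-up traffic volumes and bandwidths, comparing the "good" mapping (intra-op on fast intra-node links) against the reversed one:

```python
# Back-of-the-envelope comparison of two mappings; all numbers are made up.
intra_comm_bytes = 1e9      # per-step all-reduce traffic generated by intra-op parallelism
inter_comm_bytes = 1e7      # per-step activation traffic between pipeline stages (inter-op)
fast_bw = 100e9             # NVLink-class bandwidth within a machine (bytes/s)
slow_bw = 1e9               # Ethernet-class bandwidth across machines (bytes/s)

good = intra_comm_bytes / fast_bw + inter_comm_bytes / slow_bw   # intra-op on fast links
bad  = intra_comm_bytes / slow_bw + inter_comm_bytes / fast_bw   # intra-op on slow links
print(f"good mapping comm time: {good * 1e3:.2f} ms, reversed mapping: {bad * 1e3:.2f} ms")
```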
Hierarchically reduce search space:
for more details see https://github.com/pentium3/sys_reading/issues/228
automatically transform (scaling && scheduling) any prototype model code developed on a single machine into a large-scale parallel version that runs on a cluster of devices, while achieving a given performance goal.
https://zhuanlan.zhihu.com/p/562741952