pentium3 / sys_reading

system paper reading notes

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning #228

Closed by pentium3 7 months ago

pentium3 commented 1 year ago

https://arxiv.org/pdf/2201.12023.pdf

pentium3 commented 1 year ago

https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin

pentium3 commented 1 year ago

https://alpa.ai/opt

pentium3 commented 1 year ago

https://zhuanlan.zhihu.com/p/521211578

https://zhuanlan.zhihu.com/p/571836701

pentium3 commented 10 months ago

some background:

tensor axes (batch or non-batch):

In the context of tensors in deep learning, the term "axes" refers to the dimensions of the tensor. Understanding these axes is crucial for correctly structuring data and performing operations on tensors. There are two main types of axes commonly discussed: batch and non-batch axes.

Batch Axis: The batch axis is the dimension of the tensor that represents the number of data samples in a batch. In deep learning, it's common to process data in batches for efficiency reasons. For instance, when training a neural network, instead of updating the model's parameters with every single data point, it's more efficient to update them after processing a batch of data. In a tensor representing a batch of data, the batch axis is typically the first axis (axis 0 in Python, which uses zero-based indexing). For example, in a tensor representing a batch of images, the shape might be [batch_size,height,width,channels], where batch_size is the size of the batch.

Non-Batch Axes: Non-batch axes are the other dimensions of the tensor that represent the actual data dimensions (of one data sample). For instance, in a 4D tensor for images, the non-batch axes would represent the height, width, and color channels of the images. The interpretation of these axes depends on the specific data and the application. In the case of time series data, for example, one of the non-batch axes might represent time steps.

Example: Consider a tensor with the shape [32,224,224,3]. This tensor could represent a batch of 32 images. 32 is the batch size – it's the size of the batch axis. 224,224,3 represent the height, width, and number of color channels of each image – these are the non-batch axes.
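A minimal illustration with `jax.numpy` (shapes only; the array contents don't matter):

```python
import jax.numpy as jnp

# A batch of 32 RGB images, each 224x224, in NHWC layout.
images = jnp.zeros((32, 224, 224, 3))

print(images.shape[0])   # 32            -> batch axis (axis 0)
print(images.shape[1:])  # (224, 224, 3) -> non-batch axes: height, width, channels
```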

Understanding the concept of batch and non-batch axes is vital for designing neural network architectures, preprocessing data, and specifying operations in deep learning frameworks. The batch axis allows for efficient processing of large datasets, while the non-batch axes represent the actual dimensions of the data being processed.

dataflow graph

DL computation is commonly represented by popular ML frameworks as a dataflow graph. Edges in the graph represent multi-dimensional tensors (an input tensor can contain a batch of input records); nodes are computational operators, such as matrix multiplication (matmul), that transform input tensors into output tensors. Training a DL model for one iteration consists of computing a loss by forwarding a batch of data through the graph, deriving the updates via a reverse backward pass, and applying the updates to the parameters via weight update operations. In practice, model developers define the dataflow graph. An execution engine then optimizes and executes it on a compute device.
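A minimal sketch of one such training iteration in JAX (a toy linear model; all names are illustrative, not from the paper):

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Forward pass: a single matmul operator followed by a squared-error loss.
    pred = x @ params["w"]
    return jnp.mean((pred - y) ** 2)

@jax.jit
def train_step(params, x, y, lr=0.1):
    # Backward pass: reverse-mode autodiff over the dataflow graph.
    grads = jax.grad(loss_fn)(params, x, y)
    # Weight update: apply the gradients to the parameters.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = {"w": jnp.zeros((4, 1))}
x, y = jnp.ones((32, 4)), jnp.ones((32, 1))   # one batch of 32 samples
params = train_step(params, x, y)             # one training iteration
```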

Complicated models like Transformers and GPT (Generative Pre-trained Transformer) are still based on computation graphs, although these graphs are much more complex than those of simpler models. These models consist of multiple layers and operations, such as attention mechanisms, feed-forward networks, and activation functions. Each of these components can be represented as a series of nodes in a computation graph.

Example: see Figure 5 in https://arxiv.org/pdf/1706.04972.pdf

Dynamic Graphs vs. Static Graphs

Dynamic Graphs: Some deep learning frameworks, like PyTorch, create dynamic computation graphs, meaning the graph is built up on-the-fly as operations are performed. This is particularly useful for models like GPT, where the size of the graph can depend on the input (e.g., the sequence length). The computation graph is not predefined but is constructed during runtime as these operations are executed. This allows for flexibility and ease of modification of the graph as the program runs.

Static Graphs: Other frameworks, like TensorFlow, traditionally used static graphs: the graph structure is defined and compiled before the actual computation occurs. However, TensorFlow has since added dynamic graph capabilities with its Eager Execution mode.

In dynamic graph frameworks, the computation graph is built as the code executes, whereas in static graph frameworks the entire graph is defined and compiled before execution. The term "dynamic graph" in the context of frameworks like PyTorch or TensorFlow's Eager Execution refers to more than just the ability to handle varying input sizes. It encompasses several key characteristics: the graph is (re)constructed on every forward pass, ordinary Python control flow (loops, conditionals) can shape the graph, intermediate values can be inspected and debugged interactively, and input shapes can change between iterations.

In summary, while handling varying input tensor sizes is an important feature of dynamic graphs, it's just one part of the broader flexibility and interactivity that these graphs provide. This flexibility is especially useful in research and development settings where rapid prototyping and experimentation are necessary.
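A small illustration of the difference using JAX, whose default eager execution behaves like a dynamic graph while `jax.jit` / `jax.make_jaxpr` capture a static graph (the kind of representation a compiler like Alpa operates on):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.tanh(x @ x.T).sum()

x = jnp.ones((8, 8))

# Eager ("dynamic") style: each operation runs as the Python line executes,
# so intermediate values can be inspected with ordinary Python tools.
print(f(x))

# Traced ("static") style: the whole function is first captured as a
# dataflow graph (a jaxpr), then compiled and optimized as one unit.
print(jax.make_jaxpr(f)(x))
compiled_f = jax.jit(f)
print(compiled_f(x))
```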

pentium3 commented 7 months ago

summary

key problem

workload

large DNN model training. Here "large" means hundreds of billions of parameters (e.g., 175B parameters ≈ 350 GB of weights in fp16), far more than fits on a single GPU.

optimization goal

improve throughput

configurations to tune

partition the computational graph and decide its placement across GPUs, mixing two forms of parallelism with different overhead/resource-utilization tradeoffs: inter-operator parallelism (pipelining stages across device meshes) and intra-operator parallelism (partitioning each operator's tensors across the devices of a mesh).
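Alpa expresses the intra-operator flavor as a sharding spec per operator. As a rough analogy (not Alpa's API), here is a sketch of partitioning a matmul over a 2x2 logical device mesh with `jax.sharding`, assuming at least 4 accelerator devices are available:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes at least 4 accelerator devices; arrange them as a 2x2 logical mesh.
devices = np.array(jax.devices()[:4]).reshape(2, 2)
mesh = Mesh(devices, axis_names=("dp", "mp"))

x = jnp.ones((32, 1024))    # activations: (batch, hidden)
w = jnp.ones((1024, 4096))  # weights:     (hidden, out)

# Intra-operator parallelism: shard x along its batch axis ("dp") and w along
# a non-batch axis ("mp"); the compiler inserts the required collectives.
x = jax.device_put(x, NamedSharding(mesh, P("dp", None)))
w = jax.device_put(w, NamedSharding(mesh, P(None, "mp")))

y = jax.jit(jnp.dot)(x, w)  # executed as a sharded matmul across the mesh
print(y.shape)              # (32, 4096)
```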

scenario

datacenter. Each machine has multiple GPUs. Machines are homogeneous.

GPUs on the same machine are connected with high bandwidth. Different machines are connected with (relatively) lower bandwidth.

technique

ILP (for the intra-op pass) and DP (for the inter-op pass)

dynamic workload?

No. Training is a "batch" job that runs on a fixed training dataset.

multi-tenant?

No. The training job has a fixed set of resources, and the cluster is dedicated to a single training job.

implementation

Alpa uses Jax as the frontend and XLA as the backend. The compiler passes are implemented on Jax's and XLA's intermediate representations (Jaxpr and HLO). The distributed runtime uses Ray [37] actors to implement the device mesh workers, the XLA runtime for executing computation, and NCCL [41] for communication.

Problem and motivation

- what is the problem this paper is solving?
- why is it important?
- why is it challenging?

see https://github.com/pentium3/sys_reading/issues/259#issuecomment-1958514111

Main ideas and insights

- describe the paper gist in 1-2 sentences
- what is important to remember? What did we learn?

The paper proposes a new abstraction of distributed ML training plans as a hierarchy of inter-operator and intra-operator parallelism, which unifies previous distributed training paradigms (data, operator, and pipeline parallelism), together with an optimization method (DP for inter-op, ILP for intra-op) that automatically derives an execution plan.

Solution description

- explain how the solution work

Given a model description in the form of a Jax intermediate representation (IR), i.e., a computation graph (CG), Alpa works as follows: [ch3]

  1. inter-op pass: [ch5]

     - slice the CG into multiple stages, and slice the device cluster into multiple device meshes (submeshes of the cluster topology)
     - use dynamic programming (DP) to assign stages to meshes while minimizing the end-to-end pipeline execution latency (which depends on the latency of each stage, as computed by the intra-op pass)
       - the DP enumerates all possible stage slicings and submesh shapes; see formula (2) and ch5.2
     - call the intra-op pass on each stage-mesh pair to query the execution cost of that assignment
  2. intra-op pass: [ch4]

     - optimize the intra-op parallel execution plan of a stage on its assigned mesh by partitioning operators across the mesh's devices
     - use ILP to minimize the execution cost (latency) of the stage; the total execution cost of a computational graph G(V, E) is the sum of the compute and communication costs on all nodes v plus the resharding costs on all edges e
     - report the cost back to the inter-op pass

By repeatedly querying the intra-op pass for each stage-mesh assignment, the inter-op pass uses DP to minimize the inter-op parallel execution latency and obtains the best slicing scheme of stages and meshes. A simplified sketch of this two-level loop follows.
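A minimal sketch of the two-level search, assuming a linearized operator graph and a toy cost model; `intra_op_cost` is a hypothetical stand-in for the ILP-based intra-op pass, and this is not the paper's exact DP formulation from ch5.2:

```python
from functools import lru_cache

def intra_op_cost(stage_ops, n_devices):
    """Hypothetical stand-in for the ILP-based intra-op pass: returns the
    latency of one stage on n_devices (here just a toy FLOPs/devices model)."""
    return sum(stage_ops) / n_devices

def best_pipeline_latency(ops, n_devices, n_microbatches):
    """Split `ops` (a linearized operator graph, given as toy FLOP counts) into
    contiguous stages, give each stage some devices, and minimize the
    pipeline latency  sum_i t_i + (B - 1) * max_i t_i."""
    n = len(ops)

    @lru_cache(maxsize=None)
    def dp(start, devices_left):
        # Candidate (sum of stage latencies, max stage latency) pairs for ops[start:].
        if start == n:
            return {(0.0, 0.0)} if devices_left == 0 else set()
        candidates = set()
        for end in range(start + 1, n + 1):        # stage = ops[start:end]
            for d in range(1, devices_left + 1):   # devices assigned to this stage
                t = intra_op_cost(tuple(ops[start:end]), d)
                for s, m in dp(end, devices_left - d):
                    candidates.add((s + t, max(m, t)))
        return candidates

    return min(s + (n_microbatches - 1) * m for s, m in dp(0, n_devices))

# Example: 6 operators with toy FLOP counts, 4 devices, 8 microbatches.
print(best_pipeline_latency([4, 2, 6, 3, 5, 1], 4, 8))
```

The real inter-op pass additionally enumerates submesh shapes (not just device counts), prunes assignments by memory constraints, and uses the DP of formula (2); the per-stage cost comes from the ILP rather than a toy model.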

Important results

- describe the experimental setup
- summarize the main results

workload: training large-scale models with billions of parameters, including GPT-3, GShard Mixture-of-Experts (MoE), and Wide-ResNet.

devices: a cluster containing 8 nodes. each node has 8 V100 16GB GPUs. The 8 GPUs in a node are connected via NVLink. The 8 nodes are launched within one placement group with 25Gbps cross-node bandwidth.

baseline: Megatron-LM v2 for GPT-3. DeepSpeed for MoE. Simplified Alpa (to mimic PipeDream and Dapple) for Wide-ResNet.

Evaluation metric: throughput

results: Alpa's automatically generated plans match or outperform the hand-tuned specialized systems on the models they were designed for (on par with Megatron-LM for GPT, better than DeepSpeed for MoE), and it generalizes to Wide-ResNet, which has no specialized system, with good scaling up to the full 64-GPU cluster.

Limitations and opportunities for improvement

- when doesn't it work?
- what assumptions does the paper make and when are they valid?

Closely related work

- list of main competitors and how they differ

see https://zhuanlan.zhihu.com/p/676892633

Follow-up research ideas (Optional)

If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:
- Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
- Solve a generalization of the problem
- Address one of the work's limitations
- Solve the same problem in a different context
- Solve the problem in a much larger scale
- Apply the paper's methods to a different (but similar) problem
- Solve a new problem created by this work
pentium3 commented 7 months ago

followup: https://github.com/pentium3/sys_reading/issues/347