Closed pentium3 closed 7 months ago
In the context of tensors in deep learning, the term "axes" refers to the dimensions of the tensor. Understanding these axes is crucial for correctly structuring data and performing operations on tensors. There are two main types of axes commonly discussed: batch and non-batch axes.
Batch Axis: The batch axis is the dimension of the tensor that represents the number of data samples in a batch. In deep learning, it's common to process data in batches for efficiency reasons. For instance, when training a neural network, instead of updating the model's parameters with every single data point, it's more efficient to update them after processing a batch of data. In a tensor representing a batch of data, the batch axis is typically the first axis (axis 0 in Python, which uses zero-based indexing). For example, in a tensor representing a batch of images, the shape might be [batch_size,height,width,channels], where batch_size is the size of the batch.
Non-Batch Axes: Non-batch axes are the other dimensions of the tensor that represent the actual data dimensions (of one data sample). For instance, in a 4D tensor for images, the non-batch axes would represent the height, width, and color channels of the images. The interpretation of these axes depends on the specific data and the application. In the case of time series data, for example, one of the non-batch axes might represent time steps.
Example: Consider a tensor with the shape [32,224,224,3]. This tensor could represent a batch of 32 images. 32 is the batch size – it's the size of the batch axis. 224,224,3 represent the height, width, and number of color channels of each image – these are the non-batch axes.
Understanding the concept of batch and non-batch axes is vital for designing neural network architectures, preprocessing data, and specifying operations in deep learning frameworks. The batch axis allows for efficient processing of large datasets, while the non-batch axes represent the actual dimensions of the data being processed.
DL computation is commonly represented by popular ML frameworks as a dataflow graph. Edges in the graph represent multi-dimensional tensors (an input tensor can contain a batch of input records); nodes are computational operators, such as matrix multiplication (matmul), that transform input tensors into output tensors. Training a DL model for one iteration consists of computing a loss by forwarding a batch of data through the graph, deriving the updates via a reverse backward pass, and applying the updates to the parameters via weight update operations. In practice, model developers define the dataflow graph. An execution engine then optimizes and executes it on a compute device.
complicated models like Transformers and GPT (Generative Pre-trained Transformer) are still based on computation graphs, although these graphs are much more complex compared to simpler models. These models consist of multiple layers and operations, such as attention mechanisms, feed-forward networks, and activation functions. Each of these components can be represented as a series of nodes in a computation graph.
Example: see Figure 5 in https://arxiv.org/pdf/1706.04972.pdf
Dynamic Graphs: Some deep learning frameworks, like PyTorch, create dynamic computation graphs, meaning the graph is built up on-the-fly as operations are performed. This is particularly useful for models like GPT, where the size of the graph can depend on the input (e.g., the sequence length). The computation graph is not predefined but is constructed during runtime as these operations are executed. This allows for flexibility and ease of modification of the graph as the program runs.
Static Graphs: Other frameworks, like TensorFlow, traditionally used static graphs, which are defined prior to running the model. a static computation graph is a graph in which the structure is defined and compiled before the actual computation occurs. However, TensorFlow has also introduced dynamic graph capabilities with its Eager Execution mode.
In dynamic graph frameworks, the computation graph is built as the code executes. This is different from static graphs where the entire graph is defined and compiled before execution. The term "dynamic graph" in the context of deep learning frameworks like PyTorch or TensorFlow's Eager Execution refers to more than just the ability to handle varying input sizes. It encompasses several key characteristics:
In summary, while handling varying input tensor sizes is an important feature of dynamic graphs, it's just one part of the broader flexibility and interactivity that these graphs provide. This flexibility is especially useful in research and development settings where rapid prototyping and experimentation are necessary.
large DNN model training. Here "large" means hundreds of billions (数千亿) of parameters, which means ~350GB.
improve throughput
partition and decide the placement of Computational Graphs on diff GPUs, with the mix of two ways with diff overhead/resource utilization tradeoff:
datacenter. Each machine has multiple GPUs. Machines are homogeneous.
GPUs on the same machine are connected with high bandwidth. Different machines are connected with (relatively) lower bandwidth.
ILP and DP
No. training workload is a "batch" job than runs on a fixed training dataset.
No. the training job has a fixed set of resources, and the cluster is dedicated to one training job.
Alpa uses Jax as the frontend and XLA as the backend. The compiler passes are implemented on Jax’s and XLA’s intermediate representation (i.e., Jaxpr and HLO). For the distributed runtime, we use Ray [37] actor to implement the device mesh worker, XLA runtime for executing computation, and NCCL [41] for communication.
- what is the problem this paper is solving?
- why is it important?
- why is it challenging?
see https://github.com/pentium3/sys_reading/issues/259#issuecomment-1958514111
- describe the paper gist in 1-2 sentences
- what is important to remember? What did we learn?
The paper proposed a new view of abstracting distributed ML training scheduling plan, that could incorporate previous distributed ML diagrams. And proposed optimization method
- explain how the solution work
Given a model description, in the form of a Jax intermediate representation (IR), which is a computation graph (CG). Alpa works as follows: [ch3]
inter-op: [ch5]
intra-op: [ch4]
By repeatedly querying the intra-op pass for each allocation of a stage-mesh pair, the inter-op pass uses the DP to minimize the inter-op parallel execution latency and obtains the best slicing scheme of stages and meshes.
- describe the experimental setup
- summarize the main results
workload: training large-scale models with billions of parameters, including GPT-3, GShard Mixtureof-Experts (MoE), and Wide-ResNet.
devices: a cluster containing 8 nodes. each node has 8 V100 16GB GPUs. The 8 GPUs in a node are connected via NVLink. The 8 nodes are launched within one placement group with 25Gbps cross-node bandwidth.
baseline: Megatron-LM v2 for GPT-3. DeepSpeed for MoE. Simplified Alpa (to mimic PipeDream and Dapple) for Wide-ResNet.
Evaluation metric: throughput
results:
- when doesn't it work?
- what assumptions does the paper make and when are they valid?
- list of main competitors and how they differ
see https://zhuanlan.zhihu.com/p/676892633
If you were to base your next research project on this paper, what would you do?
Propose concrete ways to achieve one or more of the following:
- Build a better (faster, more efficient, more user-friendly...) system to solve the same problem
- Solve a generalization of the problem
- Address one of the work's limitations
- Solve the same problem in a different context
- Solve the problem in a much larger scale
- Apply the paper's methods to a different (but similar) problem
- Solve a new problem created by this work
https://arxiv.org/pdf/2201.12023.pdf