jasperzhong / read-papers-and-code

My paper/code reading notes in Chinese

Misc '20 | NVIDIA A100 Tensor Core GPU Architecture #220

Closed jasperzhong closed 3 years ago

jasperzhong commented 3 years ago

https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

Worth reading the DGX A100 datasheet alongside it: https://images.nvidia.com/aem-dam/Solutions/Data-Center/nvidia-dgx-a100-datasheet.pdf

jasperzhong commented 3 years ago

DGX A100 topology. (image)

This topology is very interesting. There are several important upgrades:

Impressive. This is an architecture built entirely for training AI.

jasperzhong commented 3 years ago

There is also the concept of CUDA Graphs: a whole sequence of operators is launched as one operation, cutting kernel-launch overhead. Define-once/run-repeatedly.

https://developer.nvidia.com/blog/cuda-graphs/

A task graph consists of a series of operations, such as memory copies and kernel launches, connected by dependencies, and is defined separately from its execution. Task graphs enable a define-once/run-repeatedly execution flow. A predefined task graph allows launch of any number of kernels in one single operation, greatly improving application efficiency and performance.

Execution of work on the GPU breaks down into three stages: launch, grid initialization, and kernel execution.

Note the fine print: the CPU-side saving comes from the shorter launch time, which is a fixed per-launch cost; but a task graph also lets the CUDA driver optimize, because the whole workflow (execution, data movement, synchronization, and so on) is visible to the driver, so execution itself can be sped up as well. For example, the figure below shows the grid-initialization time being reduced. (image)
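A minimal stream-capture sketch of the define-once/run-repeatedly flow (my own example; the kernels `step_a`/`step_b` are placeholders): record a sequence of launches into a graph once, then replay the whole graph with a single `cudaGraphLaunch` call.

```cpp
#include <cuda_runtime.h>

__global__ void step_a() {}
__global__ void step_b() {}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Define once: record the work issued to the stream into a graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step_a<<<1, 1, 0, stream>>>();
    step_b<<<1, 1, 0, stream>>>();  // dependencies come from stream order
    cudaStreamEndCapture(stream, &graph);

    // CUDA 11-style call; CUDA 12 replaces the last three parameters with a flags argument.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Run repeatedly: one launch call replays all captured kernels,
    // amortizing the per-kernel CPU launch overhead.
    for (int i = 0; i < 1000; ++i) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}
```

The repeated `cudaGraphLaunch` is what the figure above measures: the launch cost is paid once per graph replay instead of once per kernel.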

jasperzhong commented 3 years ago

GA100 GPU. GA100 is the GPU codename; its predecessors were GP100 (Pascal) and GV100 (Volta). (image)

128 SMs in the full GA100 (the A100 product enables 108 of them).

SM architecture. (image)

A100 supports a lot of Tensor Core precisions. Previously there was only FP16; now there are also INT1/INT4/INT8, TF32, and BF16. INT1?! Isn't that just binary? (image)

Speed comparison against V100 across the various precisions. (image)

---

An aside:

The comparison above says V100's FP16 Tensor Cores reach 125 TFLOPS, while plain FP16 is only 31.4 TFLOPS. According to the cuDNN documentation, the math mode has to be set to CUDNN_TENSOR_OP_MATH.

https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#tensor_ops
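A minimal sketch of that opt-in (my own example, not from the docs): setting the math type on a convolution descriptor tells cuDNN it may pick Tensor Core kernels for that convolution.

```cpp
#include <cudnn.h>

// Explicit opt-in: without this call, cuDNN keeps the default (scalar) math mode.
void opt_in_tensor_cores(cudnnConvolutionDescriptor_t conv_desc) {
    cudnnSetConvolutionMathType(conv_desc, CUDNN_TENSOR_OP_MATH);
}
```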

One passage in the documentation caught my attention:

For example, the result of multiplying two matrices using Tensor Core operations is very close, but not always identical, to the result achieved using a sequence of scalar floating-point operations. For this reason, the cuDNN library requires an explicit user opt-in before enabling the use of Tensor Core operations. However, experiments with training common deep learning models show negligible differences between using Tensor Core operations and scalar floating point paths, as measured by both the final network accuracy and the iteration count to convergence. Consequently, the cuDNN library treats both modes of operation as functionally indistinguishable and allows for the scalar paths to serve as legitimate fallbacks for cases in which the use of Tensor Core operations is unsuitable.

On how Tensor Cores work:

Tensor Cores operate on FP16 input data with FP32 accumulation. (image)

Further reading: https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/
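The blog above covers the CUDA WMMA API; here is a minimal single-tile sketch of my own (one 16x16x16 tile, one warp) showing exactly that pattern: FP16 fragments multiplied with an FP32 accumulator.

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Computes D = A*B for one 16x16x16 tile; launch with a single warp (32 threads).
__global__ void wmma_tile(const half* a, const half* b, float* d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;  // FP32 accumulation

    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // Tensor Core MMA
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```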

jasperzhong commented 3 years ago

TF32 is a rather odd format: it has 19 bits, which is not even a multiple of two, so I really don't see how it is supposed to be aligned. (image)

It looks like it is still stored as FP32? Yet the TFLOPS are 8x higher! How can it be that much faster??? The FP16 and BF16 speedups are even more striking. Tensor Cores are amazing! (image)

The INT numbers are impressive too: INT8 reaches twice the speed of FP16, and the fastest mode, BINARY, hits 4992 TOPS, 256x FP32!!! But can BINARY actually be used for training? I'm a bit curious.

Surprisingly, FP64 achieves the same speed as FP32. I don't know what the use cases for FP64 are, though.
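On the TF32 storage question above: as I understand it, TF32 operands stay in memory as ordinary FP32 and are only rounded to the 10-bit mantissa inside the Tensor Cores, which is why no new alignment is needed. A minimal sketch of my own (assuming cuBLAS 11+ on Ampere) of opting FP32 GEMMs into TF32 Tensor Core math:

```cpp
#include <cublas_v2.h>

// FP32 GEMMs issued on this handle may now use TF32 Tensor Cores;
// inputs and outputs remain regular FP32 arrays in memory.
void enable_tf32(cublasHandle_t handle) {
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
}
```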

jasperzhong commented 3 years ago

On the MIG (Multi-Instance GPU) architecture: a single A100 can be partitioned into up to 7 instances.

The earlier Volta architecture already had MPS (Multi-Process Service) support, which lets multiple applications execute concurrently on different SMs.

(image)

One sentence in it stands out:

However, because memory system resources were shared across all the applications, one application could interfere with the others if it had high demands for DRAM bandwidth or its requests oversubscribed the L2 cache. Volta MPS, which remains fully supported on Ampere, was designed for sharing the GPU among applications from a single user, but not for multi-user or multi-tenant use cases.

So MPS is really only suitable for a single user running multiple applications.

MIG is different. MIG splits one GPU into multiple GPU partitions, called GPU instances. The SMs of each instance get their own separate, isolated memory paths, i.e. the instances are isolated at the memory level, whereas under MPS applications could still interfere with one another. MIG also provides fault isolation.

A GPU instance is made up of one or more GPU slices, and one A100 GPU contains 7 GPU slices in total. A GPU instance can be further divided into multiple Compute Instances, which share the instance's memory.

(image)
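A hedged sketch of my own (assuming the NVML 11+ MIG query APIs, `nvmlDeviceGetMigMode` and friends, are available): check whether MIG mode is enabled on GPU 0 and enumerate the MIG devices it exposes.

```cpp
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit_v2();

    nvmlDevice_t gpu;
    nvmlDeviceGetHandleByIndex_v2(0, &gpu);

    // MIG mode is a per-GPU setting; "pending" takes effect after a reset.
    unsigned int current = 0, pending = 0;
    if (nvmlDeviceGetMigMode(gpu, &current, &pending) == NVML_SUCCESS) {
        std::printf("MIG mode: current=%u pending=%u\n", current, pending);
    }

    // Each MIG device corresponds to one GPU instance / compute instance pair.
    unsigned int max_mig = 0;
    nvmlDeviceGetMaxMigDeviceCount(gpu, &max_mig);
    for (unsigned int i = 0; i < max_mig; ++i) {
        nvmlDevice_t mig;
        if (nvmlDeviceGetMigDeviceHandleByIndex(gpu, i, &mig) == NVML_SUCCESS) {
            std::printf("found MIG device %u\n", i);
        }
    }

    nvmlShutdown();
    return 0;
}
```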

jasperzhong commented 3 years ago

On A100's support for sparsity. First, note that model compression involves two kinds of sparsity.

One is fine-grained sparsity: the nodes are kept, but some edges are removed irregularly. This makes compute and memory access irregular, which can actually hurt performance. (image)

The other is coarse-grained sparsity: put simply, an entire subnetwork is removed. This preserves the parallel nature of the workload and can improve performance, but it may cost some accuracy. (image)

Fine-grained structured sparsity combines the advantages of both, and it is what A100 supports. It requires the nodes to have an equal number of sparse connections; for example, every node in layers two and three has exactly two sparse connections. (image)

A100 supports 2:4 structured sparsity on rows. (image)
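To make the 2:4 pattern concrete, a small illustrative sketch of my own (not NVIDIA code): within every contiguous group of four weights in a row, keep the two with the largest magnitude and zero the other two.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Prune one weight row to the 2:4 pattern: at most 2 nonzeros per group of 4.
void prune_2_4(std::vector<float>& row) {
    for (size_t g = 0; g + 4 <= row.size(); g += 4) {
        // Indices 0..3 within the group, sorted by descending magnitude.
        int idx[4] = {0, 1, 2, 3};
        std::sort(idx, idx + 4, [&](int a, int b) {
            return std::fabs(row[g + a]) > std::fabs(row[g + b]);
        });
        // Zero the two smallest-magnitude entries of the group.
        row[g + idx[2]] = 0.0f;
        row[g + idx[3]] = 0.0f;
    }
}

int main() {
    std::vector<float> row = {0.9f, -0.1f, 0.3f, 0.05f, -0.7f, 0.2f, 0.6f, -0.4f};
    prune_2_4(row);
    for (float w : row) std::printf("%.2f ", w);  // exactly two zeros per group of four
    std::printf("\n");
    return 0;
}
```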

Accuracy is nearly unchanged, while the speed almost doubles!!! (image)

This is very interesting; I should find time to try it!