jasperzhong / read-papers-and-code

My paper/code reading notes in Chinese

Misc '20 | NVIDIA A100 Tensor Core GPU Architecture #220

Closed jasperzhong closed 3 years ago

jasperzhong commented 3 years ago

https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

Worth reading the DGX A100 datasheet alongside it: https://images.nvidia.com/aem-dam/Solutions/Data-Center/nvidia-dgx-a100-datasheet.pdf

jasperzhong commented 3 years ago

DGX A100 topology. (image)

This topology is very interesting. There are several important upgrades:

Impressive. This is an architecture built entirely for training AI.

jasperzhong commented 3 years ago

There is also the concept of CUDA Graphs: a whole sequence of operators is launched as one operation, cutting kernel-launch overhead. Define-once/run-repeatedly.

https://developer.nvidia.com/blog/cuda-graphs/

A task graph consists of a series of operations, such as memory copies and kernel launches, connected by dependencies, and is defined separately from its execution. Task graphs enable a define-once/run-repeatedly execution flow. A predefined task graph allows launch of any number of kernels in one single operation, greatly improving application efficiency and performance.

Execution of work on the GPU breaks down into three stages: launch, grid initialization, and kernel execution.

Note the fine print: the CPU-side saving comes from the shorter launch time, which is a fixed per-launch cost; but a task graph also lets the CUDA driver optimize, because the whole workflow (execution, data movement, synchronization, and so on) is visible to the driver, so execution itself can be sped up as well. For example, the figure below shows the grid-initialization time being reduced. (image)
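A minimal stream-capture sketch of the define-once/run-repeatedly flow (my own example; the kernels `step_a`/`step_b` are placeholders): record a sequence of launches into a graph once, then replay the whole graph with a single `cudaGraphLaunch` call.

```cpp
#include <cuda_runtime.h>

__global__ void step_a() {}
__global__ void step_b() {}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Define once: record the work issued to the stream into a graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step_a<<<1, 1, 0, stream>>>();
    step_b<<<1, 1, 0, stream>>>();  // dependencies come from stream order
    cudaStreamEndCapture(stream, &graph);

    // CUDA 11-style call; CUDA 12 replaces the last three parameters with a flags argument.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Run repeatedly: one launch call replays all captured kernels,
    // amortizing the per-kernel CPU launch overhead.
    for (int i = 0; i < 1000; ++i) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}
```

The repeated `cudaGraphLaunch` is what the figure above measures: the launch cost is paid once per graph replay instead of once per kernel.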

jasperzhong commented 3 years ago

GA100 GPU. GA100 is the GPU codename; its predecessors were GP100 (Pascal) and GV100 (Volta). (image)

128 SMs in the full GA100 (the A100 product enables 108 of them).

SM architecture. (image)

A100 supports a lot of Tensor Core precisions. Previously there was only FP16; now there are also INT1/INT4/INT8, TF32, and BF16. INT1?! Isn't that just binary? (image)

Speed comparison against V100 across the various precisions. (image)

---

An aside:

The comparison above says V100's FP16 Tensor Cores reach 125 TFLOPS, while plain FP16 is only 31.4 TFLOPS. According to the cuDNN documentation, the math mode has to be set to CUDNN_TENSOR_OP_MATH.

https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html#tensor_ops
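A minimal sketch of that opt-in (my own example, not from the docs): setting the math type on a convolution descriptor tells cuDNN it may pick Tensor Core kernels for that convolution.

```cpp
#include <cudnn.h>

// Explicit opt-in: without this call, cuDNN keeps the default (scalar) math mode.
void opt_in_tensor_cores(cudnnConvolutionDescriptor_t conv_desc) {
    cudnnSetConvolutionMathType(conv_desc, CUDNN_TENSOR_OP_MATH);
}
```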

One passage in the documentation caught my attention:

For example, the result of multiplying two matrices using Tensor Core operations is very close, but not always identical, to the result achieved using a sequence of scalar floating-point operations. For this reason, the cuDNN library requires an explicit user opt-in before enabling the use of Tensor Core operations. However, experiments with training common deep learning models show negligible differences between using Tensor Core operations and scalar floating point paths, as measured by both the final network accuracy and the iteration count to convergence. Consequently, the cuDNN library treats both modes of operation as functionally indistinguishable and allows for the scalar paths to serve as legitimate fallbacks for cases in which the use of Tensor Core operations is unsuitable.

On how Tensor Cores work:

Tensor Cores operate on FP16 input data with FP32 accumulation. (image)

Further reading: https://developer.nvidia.com/blog/programming-tensor-cores-cuda-9/
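The blog above covers the CUDA WMMA API; here is a minimal single-tile sketch of my own (one 16x16x16 tile, one warp) showing exactly that pattern: FP16 fragments multiplied with an FP32 accumulator.

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Computes D = A*B for one 16x16x16 tile; launch with a single warp (32 threads).
__global__ void wmma_tile(const half* a, const half* b, float* d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;  // FP32 accumulation

    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // Tensor Core MMA
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```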

jasperzhong commented 3 years ago

TF32 is a rather odd format: it has 19 bits, which is not even a multiple of two, so I really don't see how it is supposed to be aligned. (image)

It looks like it is still stored as FP32? Yet the TFLOPS are 8x higher! How can it be that much faster??? The FP16 and BF16 speedups are even more striking. Tensor Cores are amazing! (image)

The INT numbers are impressive too: INT8 reaches twice the speed of FP16, and the fastest mode, BINARY, hits 4992 TOPS, 256x FP32!!! But can BINARY actually be used for training? I'm a bit curious.

Surprisingly, FP64 achieves the same speed as FP32. I don't know what the use cases for FP64 are, though.
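On the TF32 storage question above: as I understand it, TF32 operands stay in memory as ordinary FP32 and are only rounded to the 10-bit mantissa inside the Tensor Cores, which is why no new alignment is needed. A minimal sketch of my own (assuming cuBLAS 11+ on Ampere) of opting FP32 GEMMs into TF32 Tensor Core math:

```cpp
#include <cublas_v2.h>

// FP32 GEMMs issued on this handle may now use TF32 Tensor Cores;
// inputs and outputs remain regular FP32 arrays in memory.
void enable_tf32(cublasHandle_t handle) {
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
}
```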

jasperzhong commented 3 years ago

On the MIG (Multi-Instance GPU) architecture: a single A100 can be partitioned into up to 7 instances.

The earlier Volta architecture already had MPS (Multi-Process Service) support, which lets multiple applications execute concurrently on different SMs.

(image)

One sentence in it stands out:

However, because memory system resources were shared across all the applications, one application could interfere with the others if it had high demands for DRAM bandwidth or its requests oversubscribed the L2 cache. Volta MPS, which remains fully supported on Ampere, was designed for sharing the GPU among applications from a single user, but not for multi-user or multi-tenant use cases.

So MPS is really only suitable for a single user running multiple applications.

MIG is different. MIG splits one GPU into multiple GPU partitions, called GPU instances. The SMs of each instance get their own separate, isolated memory paths, i.e. the instances are isolated at the memory level, whereas under MPS applications could still interfere with one another. MIG also provides fault isolation.

A GPU instance is made up of one or more GPU slices, and one A100 GPU contains 7 GPU slices in total. A GPU instance can be further divided into multiple Compute Instances, which share the instance's memory.

(image)
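A hedged sketch of my own (assuming the NVML 11+ MIG query APIs, `nvmlDeviceGetMigMode` and friends, are available): check whether MIG mode is enabled on GPU 0 and enumerate the MIG devices it exposes.

```cpp
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit_v2();

    nvmlDevice_t gpu;
    nvmlDeviceGetHandleByIndex_v2(0, &gpu);

    // MIG mode is a per-GPU setting; "pending" takes effect after a reset.
    unsigned int current = 0, pending = 0;
    if (nvmlDeviceGetMigMode(gpu, &current, &pending) == NVML_SUCCESS) {
        std::printf("MIG mode: current=%u pending=%u\n", current, pending);
    }

    // Each MIG device corresponds to one GPU instance / compute instance pair.
    unsigned int max_mig = 0;
    nvmlDeviceGetMaxMigDeviceCount(gpu, &max_mig);
    for (unsigned int i = 0; i < max_mig; ++i) {
        nvmlDevice_t mig;
        if (nvmlDeviceGetMigDeviceHandleByIndex(gpu, i, &mig) == NVML_SUCCESS) {
            std::printf("found MIG device %u\n", i);
        }
    }

    nvmlShutdown();
    return 0;
}
```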

jasperzhong commented 3 years ago

On A100's support for sparsity. First, note that model compression involves two kinds of sparsity.

One is fine-grained sparsity: the nodes are kept, but some edges are removed irregularly. This makes compute and memory access irregular, which can actually hurt performance. (image)

The other is coarse-grained sparsity: put simply, an entire subnetwork is removed. This preserves the parallel nature of the workload and can improve performance, but it may cost some accuracy. (image)

Fine-grained structured sparsity combines the advantages of both, and it is what A100 supports. It requires the nodes to have an equal number of sparse connections; for example, every node in layers two and three has exactly two sparse connections. (image)

A100 supports 2:4 structured sparsity on rows. (image)
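To make the 2:4 pattern concrete, a small illustrative sketch of my own (not NVIDIA code): within every contiguous group of four weights in a row, keep the two with the largest magnitude and zero the other two.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Prune one weight row to the 2:4 pattern: at most 2 nonzeros per group of 4.
void prune_2_4(std::vector<float>& row) {
    for (size_t g = 0; g + 4 <= row.size(); g += 4) {
        // Indices 0..3 within the group, sorted by descending magnitude.
        int idx[4] = {0, 1, 2, 3};
        std::sort(idx, idx + 4, [&](int a, int b) {
            return std::fabs(row[g + a]) > std::fabs(row[g + b]);
        });
        // Zero the two smallest-magnitude entries of the group.
        row[g + idx[2]] = 0.0f;
        row[g + idx[3]] = 0.0f;
    }
}

int main() {
    std::vector<float> row = {0.9f, -0.1f, 0.3f, 0.05f, -0.7f, 0.2f, 0.6f, -0.4f};
    prune_2_4(row);
    for (float w : row) std::printf("%.2f ", w);  // exactly two zeros per group of four
    std::printf("\n");
    return 0;
}
```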

Accuracy is nearly unchanged, while the speed almost doubles!!! (image)

This is very interesting; I should find time to try it!