microsoft / nnfusion

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
MIT License

QUESTION: any document about how vDevice and vEU are implemented? #370

Open puddingfjz opened 2 years ago

puddingfjz commented 2 years ago

Hi,

I am interested in Rammer and I am trying to understand how Rammer is implemented. I have some questions about the code:

  1. How are vDevice and vEU implemented?

  2. How is the mapping from a vEU to an SM done (which part of the code is relevant)?

Thanks a lot!

xysmlx commented 2 years ago

Hi, here is the terminology mapping between the paper and the open-source implementation.

The vDevice is defined in src/nnfusion/engine/pass/graph/blockfusion/block_parallel_device.hpp as BlockParallelDevice. The vEU (block_executor) in CUDA and ROCm GPUs is implemented as a thread block.

In the current implementation, the mapping from a vEU to an SM is done automatically by the CUDA runtime, because SMs in CUDA GPUs (and CUs in ROCm GPUs) are homogeneous. You may take a look at the generated kernel code.
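To make the vDevice/vEU relationship concrete, here is a minimal sketch (not the actual NNFusion code; `VDevice`, `schedule_kernel`, and `load` are illustrative names): a vDevice holds a fixed number of vEUs, and each scheduled micro-kernel's thread blocks are appended to per-vEU queues. On a real GPU each vEU is realized as one thread block, and the hardware scheduler decides which SM runs it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a vDevice: one instruction queue per vEU.
// Each entry in a queue is the id of a kernel whose thread block
// that vEU will execute.
struct VDevice {
    explicit VDevice(std::size_t num_veus) : queues(num_veus) {}

    // Assign each of the kernel's `num_blocks` thread blocks to the
    // currently least-loaded vEU (a simple load-balancing heuristic).
    void schedule_kernel(int kernel_id, std::size_t num_blocks) {
        for (std::size_t b = 0; b < num_blocks; ++b) {
            std::size_t target = 0;
            for (std::size_t i = 1; i < queues.size(); ++i)
                if (queues[i].size() < queues[target].size()) target = i;
            queues[target].push_back(kernel_id);
        }
    }

    std::size_t load(std::size_t veu) const { return queues[veu].size(); }

    std::vector<std::vector<int>> queues;  // one queue per vEU
};
```

For example, scheduling a 6-block kernel and then a 2-block kernel on a 4-vEU device balances each queue to 2 entries.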

puddingfjz commented 2 years ago

Hi,

Thanks a lot for the reply. I will take a look at the generated kernel code.

zqj2333 commented 2 years ago

Hi, I found that in the current implementation, "RESOURCE_CAPACITY" is set to 4×80 and "DEFAULT_BE" is set to 10240 (10240 = 4×80×32), where 80 is the number of SMs in a V100 GPU. It seems that the 4 in RESOURCE_CAPACITY and the 4×32 in DEFAULT_BE mean "over-allocation of blocks" to improve hardware utilization, because I found `DEFINE_int32(fnum_non_cpu, 1, "Number of gpus.")` in the code from osdi20_artifact, and it is not set to a specific value in the command from codegen_and_build.sh. But in "5.1 Experimental Setup" of the paper, there are four NVIDIA Tesla V100 (16GB) GPUs; how is this "four" reflected in the source code? Thanks a lot for responding!

xysmlx commented 2 years ago

Hi @zqj2333 , `DEFINE_int32(fnum_non_cpu, 1, "Number of gpus.")` is not used in the blockfusion pass. With -fblockfusion_level=1, barriers are implemented via global kernel launches, so there will be no deadlocks. Therefore, we can set DEFAULT_BE larger to improve hardware utilization. It is the configuration for setting the number of vEUs: we set a default number for general cases, and you can tune it for specific hardware. RESOURCE_CAPACITY is a heuristic rule to fetch a "resource-efficient" kernel if the kernel DB has one.
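The arithmetic behind these defaults can be checked directly (V100 figures; the constant names below are illustrative, not actual NNFusion identifiers):

```cpp
#include <cassert>

// Back-of-envelope check of the V100 numbers discussed above.
constexpr int kNumSMs = 80;           // SMs on a Tesla V100
constexpr int kOverAllocation = 4;    // thread blocks per SM (matches the 4 warp schedulers)
constexpr int kThreadsPerBlock = 32;  // one warp per vEU thread block

constexpr int kResourceCapacity = kOverAllocation * kNumSMs;      // 4 * 80 = 320
constexpr int kDefaultBE = kResourceCapacity * kThreadsPerBlock;  // 4 * 80 * 32 = 10240

static_assert(kResourceCapacity == 320, "RESOURCE_CAPACITY on V100");
static_assert(kDefaultBE == 10240, "DEFAULT_BE on V100");
```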

"Four" is not used. Our testbed has "Four" gpus and we need to report it. but we only used one of them.

zqj2333 commented 2 years ago

Thanks for your response! In the heuristic rule, why is RESOURCE_CAPACITY set to 320 (4×80) rather than 80 or another number?

xysmlx commented 2 years ago

It's an empirical parameter to check whether an rKernel candidate can occupy the whole GPU. Because one SM can run multiple thread blocks and a single thread block usually cannot occupy a whole SM, we set it to 4×80 instead of 80. Note that each SM of the V100 has 4 warp schedulers, so at least 4 warps are needed to occupy an SM. When the size of a thread block is 32 threads (i.e., one warp), only ">=4×80" thread blocks have a chance to occupy the V100 GPU (though even 4×80×32 threads usually cannot fully occupy a V100).
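The heuristic above can be sketched as a small predicate (illustrative only; `may_occupy_gpu` and its parameters are not NNFusion identifiers): with one-warp blocks, a grid can only keep every warp scheduler of every SM busy if it supplies at least `schedulers_per_sm * num_sms` warps.

```cpp
#include <cassert>

// Sketch of the occupancy heuristic: does this grid supply enough warps
// to cover every warp scheduler on every SM? Defaults are V100 figures.
bool may_occupy_gpu(int num_blocks, int threads_per_block,
                    int num_sms = 80, int schedulers_per_sm = 4,
                    int warp_size = 32) {
    // Warps per block, rounded up (a 33-thread block still needs 2 warps).
    int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    return num_blocks * warps_per_block >= schedulers_per_sm * num_sms;
}
```

For example, 80 one-warp blocks supply only 80 warps for 320 schedulers and fail the check, while 320 one-warp blocks pass it.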

zqj2333 commented 2 years ago

Thanks a lot! I understand it now!