microsoft / nnfusion

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
MIT License

[Bug fix] Register Fusion Pass fuse policy assign wrong output edges #514

Open LeiWang1999 opened 1 year ago

LeiWang1999 commented 1 year ago
  1. Add support for int16_t loads (Bloom fp16 model).

  2. For the Register Fusion pass (welder), the current code assigns the wrong output edges to fused nodes with multiple outputs, which causes incorrect results in some cases.

  3. Rewrite the CUDA_ARCH string in the CUDA codegen CMakeLists.txt in a friendlier way.

With the current setting of

    -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86

if we want to use a feature that is only available on sm_86, we have to comment out the lower CUDA arch gencode flags; otherwise compilation fails with an error like:

    ptxas /tmp/tmpxft_0000e00e_00000000-11_nnfusion_rt.compute_60.ptx, line 43059; error   : Feature '.m16n8k16' requires .target sm_80 or higher

With the new CUDA_ARCH SET() command

    SET(CUDA_ARCH "-gencode=arch=compute_60,code=compute_60 -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_80,code=compute_80" CACHE STRING "target architecture")

we no longer have this concern: with code=compute_XX, nvcc embeds PTX rather than building SASS for every listed architecture, so ptxas is not invoked per architecture at build time.
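In addition, because the value is declared as a CMake CACHE STRING, it can be overridden at configure time without editing the file. A hypothetical invocation (the flag value is illustrative, targeting only one architecture) might look like:

    # Override the cached CUDA_ARCH to build PTX for a single architecture
    cmake -DCUDA_ARCH="-gencode=arch=compute_80,code=compute_80" ..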

  4. Bug fix in cuda::FusionCudaEmitter::set_launch_config(): the z dimension of the grid is read from the wrong index.
    void cuda::FusionCudaEmitter::set_launch_config()
    {
        auto block = m_fusion_group["block_size"];
        auto grid = m_fusion_group["grid_size"];
        block[0].get_to(m_blockDim.x);
        block[1].get_to(m_blockDim.y);
        block[2].get_to(m_blockDim.z);
        grid[0].get_to(m_gridDim.x);
        grid[1].get_to(m_gridDim.y);
        grid[1].get_to(m_gridDim.z);
    }

    The last assignment should be grid[2].get_to(m_gridDim.z);