microsoft / nnfusion

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
MIT License

[Bug fix] Register Fusion Pass fuse policy assign wrong output edges #514

Open LeiWang1999 opened 1 year ago

LeiWang1999 commented 1 year ago
  1. Add support for int16_t loads (Bloom fp16 model).

  2. For the Register Fusion pass (welder), the current code assigns the wrong output edges to fused nodes with multiple outputs, which causes incorrect results in some cases.

  3. Rewrite the CUDA_ARCH string in the CUDA codegen CMakeLists.txt in a friendlier way.

With the current setting of

    -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86

if we want to use a feature that is only available on sm_86, we have to comment out the lower CUDA arch gencode flags; otherwise compilation fails with an error like:

    ptxas /tmp/tmpxft_0000e00e_00000000-11_nnfusion_rt.compute_60.ptx, line 43059; error   : Feature '.m16n8k16' requires .target sm_80 or higher

With the new CUDA_ARCH SET() command

    SET(CUDA_ARCH "-gencode=arch=compute_60,code=compute_60 -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_80,code=compute_80" CACHE STRING "target architecture")

we no longer have this concern: with code=compute_XX, nvcc embeds PTX rather than building SASS for every listed architecture, so ptxas is not invoked per architecture at build time.
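In addition, because the value is declared as a CMake CACHE STRING, it can be overridden at configure time without editing the file. A hypothetical invocation (the flag value is illustrative, targeting only one architecture) might look like:

    # Override the cached CUDA_ARCH to build PTX for a single architecture
    cmake -DCUDA_ARCH="-gencode=arch=compute_80,code=compute_80" ..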

  4. Bug fix in cuda::FusionCudaEmitter::set_launch_config(): the z dimension of the grid is read from the wrong index.
    void cuda::FusionCudaEmitter::set_launch_config()
    {
        auto block = m_fusion_group["block_size"];
        auto grid = m_fusion_group["grid_size"];
        block[0].get_to(m_blockDim.x);
        block[1].get_to(m_blockDim.y);
        block[2].get_to(m_blockDim.z);
        grid[0].get_to(m_gridDim.x);
        grid[1].get_to(m_gridDim.y);
        grid[1].get_to(m_gridDim.z);
    }

    The last assignment should be grid[2].get_to(m_gridDim.z);