gpgpu-sim / gpgpu-sim_distribution

GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as well as a performance visualization tool, AerialVision, and an integrated energy model, GPUWattch.

Segmentation fault (core dumped) when running GPGPU-Sim 4.0/4.2 with Cutlass 1.3 (maybe due to .loc instruction syntax error in PTX) #296

Open eecspan opened 1 month ago

eecspan commented 1 month ago

I'm trying to run GPGPU-Sim with Cutlass. Following the documentation requirements, I am using Cutlass 1.3 and testing with the examples from Cutlass 1.3. However, whether I use GPGPU-Sim 4.0, GPGPU-Sim 4.2, or GPGPU-Sim under Accel-Sim, the program crashes with a segmentation fault. The GPGPU-Sim output shows a syntax error while parsing the PTX, as shown below.

GPGPU-Sim PTX: __cudaRegisterFunction _ZN7cutlass4gemm16gemm_kernel_nolbINS0_12GemmMainloopINS0_10GemmTraitsINS0_11SgemmConfigINS_5ShapeILi8ELi128ELi128ELi1EEENS5_ILi8ELi8ELi8ELi1EEELi1ELi1ELb0EEENS0_16GlobalLoadStreamILNS_11GemmOperand4KindE0ENS0_20GemmGlobalIteratorAbINS0_20GemmGlobalTileTraitsILSB_0ELNS_12MatrixLayout4KindE1EKfNS5_ILi1ELi8ELi128ELi1EEENS5_ILi1ELi8ELi32ELi1EEELi1EEEiEENS_17TileStoreIteratorINS0_27GemmSharedStoreTileAbTraitsIfNS5_ILi2ELi8ELi128ELi1EEESI_Li1EEEfLNS_15IteratorAdvance4KindE1ELNS_11MemorySpace4KindE1EifLNS_19FragmentElementType4KindE0ENS5_ILi0ELi0ELi0ELi0EEEEENS_4CopyINS_8FragmentIfLi4ELm16EEEEEEENS9_ILSB_1ENSC_INSD_ILSB_1ELSF_1ESG_NS5_ILi1ELi128ELi8ELi1EEENS5_ILi1ELi32ELi8ELi1EEELi1EEEiEENSL_INS0_35GemmSharedStoreWithSkewTileAbTraitsIfSN_S13_Li1ELi4EEEfLSQ_1ELSS_1EifLSU_0ESV_EES10_EENS0_16SharedLoadStreamINS_16TileLoadIteratorINS0_25GemmSharedLoadTileATraitsISG_S6_NS5_ILi1ELi4ELi2ELi1EEENS5_ILi1ELi4ELi8ELi1EEENS5_ILi1ELi1ELi1ELi1EEELi2ELi4ELi0EEEfLSQ_1ELSS_1EifLSU_0ESV_EENSX_INSY_IfLi8ELm16EEEEEEENS1A_INS1B_INS0_25GemmSharedLoadTileBTraitsISG_S6_S1D_S1E_S1F_Li2ELi4ELi4EEEfLSQ_1ELSS_1EifLSU_0ESV_EES1J_EENS0_12GemmEpilogueINS0_28SimplifiedGemmEpilogueTraitsIS8_NS0_13LinearScalingIfNS0_19FragmentMultiplyAddIffLb1EEEEEiNS0_24GemmEpilogueTraitsHelperIS8_S1U_iEEEEEENS0_20IdentityBlockSwizzleEiNS0_17ClearAccumulatorsIfLi1EEEEEEEEEvNT_6ParamsE : hostFun 0x0x55fc5d804630, fat_cubin_handle = 1
GPGPU-Sim PTX: Parsing basic_gemm.sm_75.ptx
GPGPU-Sim PTX: allocating shared region for "_ZN7cutlass4gemm21GemmSharedStorageBaseE" from 0x0 to 0x0 (shared memory space)
GPGPU-Sim PTX: instruction assembly for function '_Z23InitializeMatrix_kernelPfiiii'...   done.
GPGPU-Sim PTX: Warning -- ignoring pragma 'nounroll'
GPGPU-Sim PTX: instruction assembly for function '_Z20ReferenceGemm_kerneliiifPKfiS0_ifPfi'...   done.
basic_gemm.sm_75.ptx:233 Syntax error:

   .loc 3 170 9, function_name $L__info_string0, inlined_at 2 81 3
               ^

GPGPU-Sim PTX: finished parsing EMBEDDED .ptx file basic_gemm.sm_75.ptx
GPGPU-Sim PTX: loading globals with explicit initializers... 
GPGPU-Sim PTX: finished loading globals (0 bytes total).
GPGPU-Sim PTX: loading constants with explicit initializers...  done.
GPGPU-Sim PTX: Loading PTXInfo from basic_gemm.sm_75.ptx
GPGPU-Sim PTX: Kernel '_ZN7cutlass4gemm16gemm_kernel_nolbINS0_12GemmMainloopINS0_10GemmTraitsINS0_11SgemmConfigINS_5ShapeILi8ELi128ELi128ELi1EEENS5_ILi8ELi8ELi8ELi1EEELi1ELi1ELb0EEENS0_16GlobalLoadStreamILNS_11GemmOperand4KindE0ENS0_20GemmGlobalIteratorAbINS0_20GemmGlobalTileTraitsILSB_0ELNS_12MatrixLayout4KindE1EKfNS5_ILi1ELi8ELi128ELi1EEENS5_ILi1ELi8ELi32ELi1EEELi1EEEiEENS_17TileStoreIteratorINS0_27GemmSharedStoreTileAbTraitsIfNS5_ILi2ELi8ELi128ELi1EEESI_Li1EEEfLNS_15IteratorAdvance4KindE1ELNS_11MemorySpace4KindE1EifLNS_19FragmentElementType4KindE0ENS5_ILi0ELi0ELi0ELi0EEEEENS_4CopyINS_8FragmentIfLi4ELm16EEEEEEENS9_ILSB_1ENSC_INSD_ILSB_1ELSF_1ESG_NS5_ILi1ELi128ELi8ELi1EEENS5_ILi1ELi32ELi8ELi1EEELi1EEEiEENSL_INS0_35GemmSharedStoreWithSkewTileAbTraitsIfSN_S13_Li1ELi4EEEfLSQ_1ELSS_1EifLSU_0ESV_EES10_EENS0_16SharedLoadStreamINS_16TileLoadIteratorINS0_25GemmSharedLoadTileATraitsISG_S6_NS5_ILi1ELi4ELi2ELi1EEENS5_ILi1ELi4ELi8ELi1EEENS5_ILi1ELi1ELi1ELi1EEELi2ELi4ELi0EEEfLSQ_1ELSS_1EifLSU_0ESV_EENSX_INSY_IfLi8ELm16EEEEEEENS1A_INS1B_INS0_25GemmSharedLoadTileBTraitsISG_S6_S1D_S1E_S1F_Li2ELi4ELi4EEEfLSQ_1ELSS_1EifLSU_0ESV_EES1J_EENS0_12GemmEpilogueINS0_28SimplifiedGemmEpilogueTraitsIS8_NS0_13LinearScalingIfNS0_19FragmentMultiplyAddIffLb1EEEEEiNS0_24GemmEpilogueTraitsHelperIS8_S1U_iEEEEEENS0_20IdentityBlockSwizzleEiNS0_17ClearAccumulatorsIfLi1EEEEEEEEEvNT_6ParamsE' : regs=124, lmem=0, smem=0, cmem=872
GPGPU-Sim PTX: Kernel '_Z20ReferenceGemm_kerneliiifPKfiS0_ifPfi' : regs=52, lmem=0, smem=0, cmem=412
GPGPU-Sim PTX: Kernel '_Z23InitializeMatrix_kernelPfiiii' : regs=8, lmem=0, smem=0, cmem=376
GPGPU-Sim PTX: __cudaRegisterFunction _Z20ReferenceGemm_kerneliiifPKfiS0_ifPfi : hostFun 0x0x55fc5d8027a0, fat_cubin_handle = 1
GPGPU-Sim PTX: __cudaRegisterFunction _Z23InitializeMatrix_kernelPfiiii : hostFun 0x0x55fc5d802990, fat_cubin_handle = 1
GPGPU-Sim PTX: Setting up arguments for 8 bytes starting at 0x7fff06ec4c10..
GPGPU-Sim PTX: Setting up arguments for 4 bytes starting at 0x7fff06ec4bf8..
GPGPU-Sim PTX: Setting up arguments for 4 bytes starting at 0x7fff06ec4bfc..
GPGPU-Sim PTX: Setting up arguments for 4 bytes starting at 0x7fff06ec4c00..
GPGPU-Sim PTX: Setting up arguments for 4 bytes starting at 0x7fff06ec4c04..

The error message indicates that the crash occurred during cudaLaunch for the address 0x55fc5d804630. This is the hostFun address that was registered with __cudaRegisterFunction for the Cutlass GEMM kernel (first line of the log above). Because the syntax error is reported while parsing the PTX file that contains this kernel, I suspect that the syntax error is what causes the cudaLaunch crash.

The relevant PTX code is as follows:

.loc    3 170 9, function_name $L__info_string0, inlined_at 2 81 3
.loc    4 85 18, function_name $L__info_string1, inlined_at 3 170 9
.loc    4 70 86, function_name $L__info_string2, inlined_at 4 85 18

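The first form of .loc (file index, line, and column only) is handled correctly, while the second form (with the function_name and inlined_at attributes, as in the lines above) triggers the syntax error.

So is this because GPGPU-Sim does not support the second syntax of the .loc instruction, as shown in the figure?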


Here is my environment:
OS version: Ubuntu 18.04.6 LTS
CUDA toolkit version: Cuda compilation tools, release 11.7, V11.7.99
gcc version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)

Looking forward to someone providing assistance. Thanks a lot~

eecspan commented 1 month ago

I finally solved this problem.

The method described at https://github.com/sxzhang1993/Run-cutlass-with-gpgpu-sim uses CUDA 9.1. With CUDA 9.1, the generated .loc instructions only use the first syntax, never the second. However, CUDA 9.1 does not support the Turing architecture. If you want to target Turing, you can use CUDA 11, but then the problem above occurs.

I found that .loc is related to debugging. cutlass_bench adds the -lineinfo option during compilation; if this option is omitted, no .loc instructions are generated. So we can comment out the -lineinfo option in cutlass_bench/CMakeLists.txt, and the final generated PTX will not contain .loc instructions.

However, GPGPU-Sim 4.0 will then hit the error mentioned in https://github.com/gpgpu-sim/gpgpu-sim_distribution/issues/247, so we need to use GPGPU-Sim 4.2.
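
For anyone who wants to confirm that a rebuilt PTX file no longer contains the second .loc syntax, here is a minimal Python sketch. It is not part of GPGPU-Sim or Cutlass; the function names `classify_loc` and `scan_ptx` and the regular expression are my own assumptions, and the first line of the sample is a made-up example of the first syntax. The other two sample lines are taken from the PTX quoted in this issue. The sketch treats a .loc line as "extended" when anything follows a comma after the column number, which appears to be the position the parser's caret points at.

```python
import re

# Assumed pattern (not taken from GPGPU-Sim's grammar):
#   basic form:    .loc <file> <line> <column>
#   extended form: same, followed by ", function_name ..., inlined_at ..."
LOC_RE = re.compile(r"^\s*\.loc\s+(\d+)\s+(\d+)\s+(\d+)\s*(,.*)?$")


def classify_loc(line):
    """Return 'basic', 'extended', or None if the line is not a .loc line."""
    m = LOC_RE.match(line)
    if m is None:
        return None
    # Group 4 is only set when a comma-separated tail follows the column.
    return "extended" if m.group(4) else "basic"


def scan_ptx(text):
    """Count .loc lines by form and list 1-based line numbers of extended ones."""
    counts = {"basic": 0, "extended": 0}
    extended_lines = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        kind = classify_loc(line)
        if kind is None:
            continue
        counts[kind] += 1
        if kind == "extended":
            extended_lines.append(lineno)
    return counts, extended_lines


# First line is a hypothetical basic-form example; the next two are from this issue.
sample = """\
.loc 3 170 9
.loc    3 170 9, function_name $L__info_string0, inlined_at 2 81 3
.loc    4 85 18, function_name $L__info_string1, inlined_at 3 170 9
mov.u32 %r1, %tid.x;
"""

counts, extended_lines = scan_ptx(sample)
print(counts, extended_lines)  # {'basic': 1, 'extended': 2} [2, 3]
```

To use it on a real file, read the PTX with `open(path).read()` and pass the text to `scan_ptx`. After removing -lineinfo, both counts should be zero if no .loc instructions remain.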