Open grisuthedragon opened 5 months ago
So I wrote an MWE performing a GEMM on CPU or GPU.
https://gist.github.com/grisuthedragon/0fa99935086a5945171ef63f185bbcee
With
STARPU_NCUDA=0 STARPU_SCHED=dmdas ./gemm_gpu
it works.
With
STARPU_NCUDA=1 STARPU_SCHED=dmdas ./gemm_gpu
it sometimes gives random errors.
And with
STARPU_NCUDA=4 STARPU_SCHED=dmdas ./gemm_gpu
it fails consistently.
Disabling the CPUs also makes the code fail:
STARPU_NCUDA=4 STARPU_SCHED=dmdas STARPU_NCPU=0 ./gemm_gpu
When it fails, the execution usually becomes slow beforehand.
The same holds true for the dmda, lws, ... schedulers.
GCC 12, StarPU 1.4.4, CUDA 12.2.128, 4x A100, CPU: 2x 32-core AMD Epyc
Hello,
I tried your MWE, but I'm getting
The codelet <gemm_kernel> defines the access mode 3 for the buffer 2 which is different from the mode 2 given to starpu_task_insert
and indeed the codelet says STARPU_RW for the last argument, while the insert call uses STARPU_W (and thus it is no wonder that the computation goes wrong, since the task does not declare to StarPU that it wants to read the previous value).
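Concretely, the modes array in the codelet and the mode passed to starpu_task_insert must agree. A minimal sketch of the matching declarations (the kernel body and handle names are placeholders, not taken from the gist):

```c
#include <starpu.h>

/* CPU kernel: C <- alpha*A*B + beta*C; the actual BLAS call is elided. */
static void gemm_cpu(void *buffers[], void *cl_arg)
{
    (void)buffers; (void)cl_arg;
}

/* The modes declared here must match what starpu_task_insert passes. */
static struct starpu_codelet gemm_cl =
{
    .cpu_funcs = { gemm_cpu },
    .nbuffers  = 3,
    .modes     = { STARPU_R, STARPU_R, STARPU_RW },
};

/* With beta != 0 the previous content of C is read, so the last
 * buffer must be submitted as STARPU_RW, not STARPU_W: */
void submit_gemm(starpu_data_handle_t A, starpu_data_handle_t B,
                 starpu_data_handle_t C)
{
    starpu_task_insert(&gemm_cl,
                       STARPU_R,  A,
                       STARPU_R,  B,
                       STARPU_RW, C,   /* matches gemm_cl.modes[2] */
                       0);
}
```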
Sorry, that was a copy-and-paste error. But after changing the call in the beta == 1 case to STARPU_RW, I still get the same incorrect results. I updated the gist as well. Btw. I did not get your error message.
> I did not get your error message.

If you configure with --enable-fast, you're jumping without a parachute.
> changing the call in the beta == 1 case to STARPU_RW it results in the same incorrect results

I do get correct results on a 3-gpu machine, with various schedulers.
I installed StarPU via Spack and have now disabled enable-fast, but this does not change the behavior when using CUDA. Here is my Spack environment file (spack.yaml):
spack:
  # add package specs to the `specs` list
  specs:
  - gcc@12.3.0+binutils+graphite
  - starpu@1.4.4~mpi+cuda~fast
  - hdf5+hl~mpi
  - cmake
  - intel-oneapi-mkl threads=openmp
  - cuda@12.2
  - gdb
  - hwloc
  view: true
  concretizer:
    unify: true
  packages:
    all:
      compiler:
      - gcc@12.3.0
The code runs on
I did some additional tests with varying BLAS implementations and got the following results:
* Intel OneMKL + CUDA 12.2 + GCC 12 + StarPU 1.4.4:
* OpenBLAS 0.3.26 + CUDA 12.2 + GCC 12 + StarPU 1.4.4: very slow, failed
@sthibaul Can you give some more details about your environment?
I used this source: gemm_gpu.c.txt
with this spec:
spack:
  # add package specs to the `specs` list
  specs:
  - gcc@12.3.0+binutils+graphite
  - starpu@1.4.4~mpi+cuda~fast
  - hdf5+hl~mpi
  - cmake
  - intel-oneapi-mkl threads=openmp
  - cuda@12.3
  - gdb
  - hwloc
  view: true
  concretizer:
    unify: true
  packages:
    all:
      compiler:
      - gcc@12.2.0
    cuda:
      buildable: false
      externals:
      - spec: cuda@12.3
        prefix: /usr/local/cuda-12.3/
  compilers:
  - compiler:
      spec: gcc@=12.2.0
      paths:
        cc: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gcc
        cxx: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/g++
        f77: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gfortran
        fc: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gfortran
      flags: {}
      operating_system: centos7
      target: x86_64
      modules: []
      environment: {}
      extra_rpaths: []
compiled with
gcc --std=gnu99 gemm_gpu.c -o gemm_gpu $( pkg-config --cflags starpu-1.4) $(pkg-config --libs starpu-1.4) -lcublas -lmkl_intel_lp64 -lmkl_sequential -lmkl_core
ran with
STARPU_WORKER_STATS=1 STARPU_SCHED=dmdas ./gemm_gpu
On CentOS 7.6.1810, with two GPUs, without any error.
I tried adding A[0]++ in the CPU codelet to make sure that errors get caught, and they do.
Note: the MKL/OpenBLAS library probably doesn't matter, since you said the issues appear when adding GPUs. You can even try with STARPU_NCPU=0 to rule out the CPU side.
I looked further into what happens and upgraded to CUDA 12.4 to match your environment. I also got hold of an older system with two P100s instead of the two to four A100 cards, and no error appears there. Back on the A100 system I get the following. Running
STARPU_NCUDA=1 STARPU_SCHED=dmdas compute-sanitizer --tool initcheck ./gemm_gpu
========= COMPUTE-SANITIZER
Start...
[starpu][starpu_interface_end_driver_copy_async] Warning: the submission of asynchronous transfer from NUMA 0 to CUDA 0 took a very long time (2.470755 ms)
For proper asynchronous transfer overlapping, data registered to StarPU must be allocated with starpu_malloc() or pinned with starpu_memory_pin()
Time: 4.59802
GFlops: 54.3712
========= ERROR SUMMARY: 0 errors
but running with
STARPU_NCUDA=2 STARPU_SCHED=dmdas compute-sanitizer --tool initcheck ./gemm_gpu
I get dozens of errors like
========= Uninitialized __global__ memory read of size 8 bytes
========= at void cutlass::Kernel2<cutlass_80_tensorop_d884gemm_32x32_16x5_nn_align1>(T1::Params)+0xe00
========= by thread (67,0,0) in block (0,0,0)
========= Address 0x2ad0ba000fb8
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x32e94f]
========= in /lib64/libcuda.so.1
========= Host Frame: [0x19758dc]
========= in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
========= Host Frame: [0x13c9924]
========= in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
========= Host Frame: [0x8d5e70]
========= in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
========= Host Frame: [0xa10a0e]
========= in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
========= Host Frame:cublasLtDDDMatmul [0xa2c440]
========= in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
========= Host Frame: [0x86acf4]
========= in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
========= Host Frame: [0x86d1b5]
========= in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
========= Host Frame: [0xb322c4]
========= in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
========= Host Frame:cublasDgemm_v2 [0x2cf279]
========= in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
========= Host Frame:cublas_mult in /mechthild/home/koehlerm/work/software/starputests/gemm_gpu.c:63 [0x15a5]
========= in /mechthild/home/koehlerm/work/software/starputests/./gemm_gpu
========= Host Frame:execute_job_on_cuda in drivers/cuda/driver_cuda.c:2009 [0x10e602]
========= in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
========= Host Frame:_starpu_cuda_driver_run_once in drivers/cuda/driver_cuda.c:2160 [0x10ed53]
========= in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
========= Host Frame:_starpu_cuda_worker in drivers/cuda/driver_cuda.c:2325 [0x10f710]
========= in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
========= Host Frame:start_thread [0x7ea4]
========= in /lib64/libpthread.so.0
========= Host Frame:clone [0xfe9fc]
========= in /lib64/libc.so.6
The reason seems to be that on the A100 cards cutlass kernels are used for the GEMM operations, while on the P100 they are not.
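Aside from the errors, the earlier warning about the slow asynchronous transfer suggests the host buffers are not pinned. A minimal sketch of allocating them through StarPU as the warning recommends (the matrix size and variable names are illustrative, not from the gist):

```c
#include <starpu.h>
#include <stdlib.h>

#define N 4096

int main(void)
{
    if (starpu_init(NULL) != 0)
        return 1;

    /* starpu_malloc() returns pinned memory, enabling truly
     * asynchronous CPU<->GPU transfers. */
    double *A;
    starpu_malloc((void **)&A, N * N * sizeof(*A));

    /* Alternatively, pin an already existing allocation: */
    double *B = malloc(N * N * sizeof(*B));
    starpu_memory_pin(B, N * N * sizeof(*B));

    /* ... register handles, submit tasks, wait ... */

    starpu_memory_unpin(B, N * N * sizeof(*B));
    free(B);
    starpu_free_noflag(A, N * N * sizeof(*A));
    starpu_shutdown();
    return 0;
}
```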
I updated my installation to StarPU 1.4.6 and CUDA 12.5 and ran the "dgemm" example from examples/mult on my 4x A100 machine. This way the error becomes independent of my code. Now the following errors appear:
$ STARPU_SCHED=dmdas STARPU_NCUDA=1 compute-sanitizer --tool initcheck ./dgemm
========= COMPUTE-SANITIZER
# x y z ms GFlop/s
3840 3840 3840 1327 85.3
========= ERROR SUMMARY: 0 errors
$ STARPU_SCHED=dmdas STARPU_NCUDA=2 compute-sanitizer --tool initcheck ./dgemm
========= COMPUTE-SANITIZER
# x y z ms GFlop/s
3840 3840 3840 864 131.1
========= ERROR SUMMARY: 0 errors
$ STARPU_SCHED=dmdas STARPU_NCUDA=3 compute-sanitizer --tool initcheck ./dgemm
========= COMPUTE-SANITIZER
# x y z ms GFlop/s
[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_copy_interface_from_cuda_to_cpu (drivers/cuda/driver_cuda.c:1680)... 719: unspecified launch failure
[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cublas_report_error+0x79)[0x2abcf1959b79]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10f173)[0x2abcf195d173]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x304)[0x2abcf195d7e4]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2abcf195e1a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2abcf1a3fea5]
/lib64/libc.so.6(clone+0x6d)[0x2abd10ee09fd]
[starpu][abort][starpu_cublas_report_error()@drivers/cuda/driver_cuda.c:2488]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors
* 4 GPUs (always crashes)
$ STARPU_SCHED=dmdas STARPU_NCUDA=4 compute-sanitizer --tool initcheck ./dgemm
========= COMPUTE-SANITIZER
[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_copy_interface_from_cuda_to_cpu (drivers/cuda/driver_cuda.c:1680)... 719: unspecified launch failure
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cuda_report_error+0x7b)[0x2b1ce4d41c9b]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10e8a3)[0x2b1ce4d448a3]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xadae5)[0x2b1ce4ce3ae5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae624)[0x2b1ce4ce4624]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae836)[0x2b1ce4ce4836]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae9b9)[0x2b1ce4ce49b9]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xaeaa5)[0x2b1ce4ce4aa5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x8ca)[0x2b1ce4d45daa]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2b1ce4d461a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2b1ce4e27ea5]
/lib64/libc.so.6(clone+0x6d)[0x2b1d042c89fd]
[starpu][abort][starpu_cuda_report_error()@drivers/cuda/driver_cuda.c:2494]
[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_test_request_completion (drivers/cuda/driver_cuda.c:1547)... 719: unspecified launch failure
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cuda_report_error+0x7b)[0x2b1ce4d41c9b]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10e40a)[0x2b1ce4d4440a]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xac4f2)[0x2b1ce4ce24f2]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae7dd)[0x2b1ce4ce47dd]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae98d)[0x2b1ce4ce498d]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xaeaa5)[0x2b1ce4ce4aa5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x8ca)[0x2b1ce4d45daa]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2b1ce4d461a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2b1ce4e27ea5]
/lib64/libc.so.6(clone+0x6d)[0x2b1d042c89fd]
[starpu][abort][starpu_cuda_report_error()@drivers/cuda/driver_cuda.c:2494]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors
or
========= COMPUTE-SANITIZER
[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cublas_report_error+0x79)[0x2ab39dcdcb79]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10f173)[0x2ab39dce0173]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x304)[0x2ab39dce07e4]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2ab39dce11a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2ab39ddc2ea5]
/lib64/libc.so.6(clone+0x6d)[0x2ab3bd2639fd]
[starpu][abort][starpu_cublas_report_error()@drivers/cuda/driver_cuda.c:2488]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors
It seems that among the changes introduced between 1.3 and 1.4, the asynchronous partitioning got broken. Basically, we have a code in which the partition submit / unpartition submit is left to the StarPU runtime. The kernels required for computing the tasks are available as both CPU and CUDA implementations. We observed the following cases:
* StarPU 1.3.11 / CUDA 11.8 / GCC 12
* StarPU 1.3.11 / CUDA 12.2 / GCC 12
* StarPU 1.4.4 / CUDA 12.2 / GCC 12
The tasks are only GEMM operations from cuBLAS or MKL. Due to ongoing research I cannot share the code, and I have not had time to build an MWE yet. But in general it seems to have something in common with https://gitlab.inria.fr/starpu/starpu/-/issues/43.
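For reference, the asynchronous partitioning pattern in question can be sketched with StarPU's planned-partitioning API as follows (the handle name, filter, codelet, and part count are illustrative, not taken from the actual code):

```c
#include <starpu.h>

#define NPARTS 4

/* Assumed registered/defined elsewhere, e.g. via
 * starpu_matrix_data_register(); the codelet is assumed to take a
 * single STARPU_RW buffer and to provide CPU and CUDA funcs. */
extern starpu_data_handle_t A_handle;
extern struct starpu_codelet work_cl;

void run(void)
{
    starpu_data_handle_t sub[NPARTS];
    struct starpu_data_filter f =
    {
        .filter_func = starpu_matrix_filter_block,
        .nchildren   = NPARTS,
    };

    /* Plan the partition once, then let the runtime insert the
     * partition/unpartition operations into the task graph. */
    starpu_data_partition_plan(A_handle, &f, sub);
    starpu_data_partition_submit(A_handle, NPARTS, sub);

    for (unsigned i = 0; i < NPARTS; i++)
        starpu_task_insert(&work_cl, STARPU_RW, sub[i], 0);

    /* Gather the pieces back on the home node asynchronously. */
    starpu_data_unpartition_submit(A_handle, NPARTS, sub, STARPU_MAIN_RAM);
    starpu_task_wait_for_all();
    starpu_data_partition_clean(A_handle, NPARTS, sub);
}
```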