starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1

Incorrect results with asynchronous partitioning on CUDA devices and StarPU 1.4 #37

Open grisuthedragon opened 5 months ago

grisuthedragon commented 5 months ago

It seems that the updates introduced between 1.3 and 1.4 broke asynchronous partitioning. Basically, we have code like the following (a fleshed-out sketch is given at the end of this comment):

starpu_data_partition_plan(...);

execute tasks on the partitioned data set

starpu_data_partition_clean(...); 

We leave the submit / unsubmit to the StarPU runtime. The kernels required for computing the tasks are available as CPU and CUDA implementations. Now we observed the following cases.

StarPU 1.3.11 / CUDA 11.8 / GCC 12

StarPU 1.3.11 / CUDA 12.2/ GCC 12

StarPU 1.4.4 / CUDA 12.2/ GCC 12

The tasks are only GEMM operations from cuBLAS or MKL. Due to ongoing research, I cannot share the code and have not had time to build an MWE until now. But in general it seems to be related to https://gitlab.inria.fr/starpu/starpu/-/issues/43.
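For completeness, here is a fleshed-out sketch of the pattern above. The names (NPARTS, gemm_cl, the matrix handle A, the block filter) are illustrative assumptions, not taken from our actual code:

struct starpu_data_filter f = {
    .filter_func = starpu_matrix_filter_block,  /* illustrative choice of filter */
    .nchildren   = NPARTS,
};
starpu_data_handle_t parts[NPARTS];

/* Plan the partitioning; the partition/unpartition submissions are left to StarPU. */
starpu_data_partition_plan(A, &f, parts);

/* Execute tasks on the partitioned data set. */
for (int i = 0; i < NPARTS; i++)
    starpu_task_insert(&gemm_cl, STARPU_RW, parts[i], 0);

starpu_task_wait_for_all();

/* Clean the partitioning plan once the pieces are no longer needed. */
starpu_data_partition_clean(A, NPARTS, parts);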

grisuthedragon commented 5 months ago

So I was able to write an MWE performing a GEMM on CPU or GPU.

https://gist.github.com/grisuthedragon/0fa99935086a5945171ef63f185bbcee

With

STARPU_NCUDA=0 STARPU_SCHED=dmdas ./gemm_gpu

it works.

With

STARPU_NCUDA=1 STARPU_SCHED=dmdas ./gemm_gpu

it sometimes gives random errors.

And with

STARPU_NCUDA=4 STARPU_SCHED=dmdas ./gemm_gpu

it fails every time.

Turning off the CPUs also makes the code fail:

STARPU_NCUDA=4 STARPU_SCHED=dmdas STARPU_NCPU=0 ./gemm_gpu

When it fails, execution mostly seems to have slowed down beforehand.

The same holds true for the dmda, lws, ... schedulers.

GCC: 12 / StarPU: 1.4.4 / CUDA: 12.2.128 / GPUs: 4x A100 / CPU: 2x 32-core AMD Epyc

sthibaul commented 5 months ago

Hello,

I tried your MWE, but I'm getting

The codelet <gemm_kernel> defines the access mode 3 for the buffer 2 which is different from the mode 2 given to starpu_task_insert

and indeed the codelet declares STARPU_RW for the last argument, while the insert call passes STARPU_W (so it is no wonder the computation goes wrong, since the task does not declare to StarPU that it wants to read the previous value)
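A hedged sketch of what this mismatch looks like; the codelet layout and the handle names here are assumptions, not copied from the gist (numerically, mode 3 in the message is STARPU_RW and mode 2 is STARPU_W):

static struct starpu_codelet gemm_kernel = {
    .cpu_funcs  = { gemm_cpu },       /* assumed CPU implementation */
    .cuda_funcs = { cublas_mult },    /* assumed CUDA implementation */
    .nbuffers   = 3,
    .modes      = { STARPU_R, STARPU_R, STARPU_RW },  /* C is read-write */
};

/* Mismatch: STARPU_W for C contradicts the codelet's STARPU_RW and hides the read of C. */
starpu_task_insert(&gemm_kernel, STARPU_R, A_h, STARPU_R, B_h, STARPU_W, C_h, 0);

/* Consistent version: declares that the previous value of C is read. */
starpu_task_insert(&gemm_kernel, STARPU_R, A_h, STARPU_R, B_h, STARPU_RW, C_h, 0);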

grisuthedragon commented 5 months ago

Sorry, that was a copy-and-paste error. But changing the call in the beta == 1 case to STARPU_RW still results in the same incorrect results. I updated the gist as well.

Btw. I did not get your error message.

sthibaul commented 5 months ago

I did not get your error message.

If you configure with --enable-fast, you're jumping without a parachute

changing the call in the beta == 1 case to STARPU_RW it results in the same incorrect results

I do get correct results on a 3-gpu machine, with various schedulers.

grisuthedragon commented 5 months ago

I installed StarPU via Spack and have now disabled --enable-fast, but it does not change the behavior when using CUDA.

Here is my environment file for spack: spack.yaml:

spack:
  # add package specs to the `specs` list
  specs:
  - gcc@12.3.0+binutils+graphite
  - starpu@1.4.4~mpi+cuda~fast
  - hdf5+hl~mpi
  - cmake
  - intel-oneapi-mkl threads=openmp
  - cuda@12.2
  - gdb
  - hwloc
  view: true
  concretizer:
    unify: true
  packages:
    all:
      compiler:
        - gcc@12.3.0

The code runs on

grisuthedragon commented 4 months ago

I did some additional tests with varying BLAS implementations and got the following results:

Intel OneMKL + CUDA 12.2 + GCC12 + StarPU 1.4.4:

OpenBLAS 0.3.26 + CUDA 12.2 + GCC12 + StarPU 1.4.4:

sthibaul commented 4 months ago

I used this source: gemm_gpu.c.txt

with this spec:

spack:
  # add package specs to the `specs` list
  specs:
  - gcc@12.3.0+binutils+graphite
  - starpu@1.4.4~mpi+cuda~fast
  - hdf5+hl~mpi
  - cmake
  - intel-oneapi-mkl threads=openmp
  - cuda@12.3
  - gdb
  - hwloc
  view: true
  concretizer:
    unify: true
  packages:
    all:
      compiler:
      - gcc@12.2.0
    cuda:
      buildable: false
      externals:
      - spec: cuda@12.3
        prefix: /usr/local/cuda-12.3/
  compilers:
  - compiler:
      spec: gcc@=12.2.0
      paths:
        cc: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gcc
        cxx: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/g++
        f77: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gfortran
        fc: /cm/shared/modules/intel/skylake/compiler/gcc/12.2.0/bin/gfortran
      flags: {}
      operating_system: centos7
      target: x86_64
      modules: []
      environment: {}
      extra_rpaths: []

compiled with

gcc --std=gnu99 gemm_gpu.c -o gemm_gpu $( pkg-config --cflags starpu-1.4) $(pkg-config --libs starpu-1.4)  -lcublas -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

ran with

STARPU_WORKER_STATS=1 STARPU_SCHED=dmdas ./gemm_gpu

On CentOS 7.6.1810, with two GPUs, without any error.

I tried adding A[0]++ in the CPU codelet to make sure that errors get caught, and they do.
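For reference, the kind of deliberate perturbation meant here, as a sketch (the codelet body is assumed, not copied from gemm_gpu.c):

static void gemm_cpu(void *buffers[], void *cl_arg)
{
    double *A = (double *)STARPU_MATRIX_GET_PTR(buffers[0]);
    A[0]++;  /* deliberately corrupt one input element so the result check must fail */
    /* ... perform the usual BLAS dgemm on the buffers ... */
}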

Note: the MKL/OpenBLAS choice probably doesn't matter, since you said the issues appear when adding GPUs. You can even try with STARPU_NCPU=0 to rule out the CPU question.

grisuthedragon commented 3 months ago

I looked further into what happens and upgraded to CUDA 12.4 to match your environment. I also got hold of an older system with two P100 cards instead of the two to four A100 cards, and there no error appears.

Back on the A100 system I get the following. Running

STARPU_NCUDA=1 STARPU_SCHED=dmdas compute-sanitizer --tool initcheck ./gemm_gpu
========= COMPUTE-SANITIZER
Start... 
[starpu][starpu_interface_end_driver_copy_async] Warning: the submission of asynchronous transfer from NUMA 0 to CUDA 0 took a very long time (2.470755 ms)
For proper asynchronous transfer overlapping, data registered to StarPU must be allocated with starpu_malloc() or pinned with starpu_memory_pin()
Time: 4.59802
GFlops: 54.3712
========= ERROR SUMMARY: 0 errors

but running with

STARPU_NCUDA=2 STARPU_SCHED=dmdas compute-sanitizer --tool initcheck ./gemm_gpu

I get dozens of errors like

========= Uninitialized __global__ memory read of size 8 bytes
=========     at void cutlass::Kernel2<cutlass_80_tensorop_d884gemm_32x32_16x5_nn_align1>(T1::Params)+0xe00
=========     by thread (67,0,0) in block (0,0,0)
=========     Address 0x2ad0ba000fb8
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x32e94f]
=========                in /lib64/libcuda.so.1
=========     Host Frame: [0x19758dc]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0x13c9924]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0x8d5e70]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0xa10a0e]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame:cublasLtDDDMatmul [0xa2c440]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublasLt.so.12
=========     Host Frame: [0x86acf4]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame: [0x86d1b5]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame: [0xb322c4]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame:cublasDgemm_v2 [0x2cf279]
=========                in /mechthild/home/koehlerm/spack/opt/spack/linux-centos7-zen2/gcc-12.3.0/cuda-12.4.0-lgoh44yqh4it7gbujfp7yvlmkxwwyro2/lib64/libcublas.so.12
=========     Host Frame:cublas_mult in /mechthild/home/koehlerm/work/software/starputests/gemm_gpu.c:63 [0x15a5]
=========                in /mechthild/home/koehlerm/work/software/starputests/./gemm_gpu
=========     Host Frame:execute_job_on_cuda in drivers/cuda/driver_cuda.c:2009 [0x10e602]
=========                in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
=========     Host Frame:_starpu_cuda_driver_run_once in drivers/cuda/driver_cuda.c:2160 [0x10ed53]
=========                in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
=========     Host Frame:_starpu_cuda_worker in drivers/cuda/driver_cuda.c:2325 [0x10f710]
=========                in /mechthild/home/koehlerm/spack/var/spack/environments/gcc-cuda-mkl2024-1-0/.spack-env/view/lib/libstarpu-1.4.so.4
=========     Host Frame:start_thread [0x7ea4]
=========                in /lib64/libpthread.so.0
=========     Host Frame:clone [0xfe9fc]
=========                in /lib64/libc.so.6

The reason seems to be that on the A100 cards cutlass kernels are used for the GEMM operations, while on the P100 they are not.
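As an aside, regarding the starpu_malloc() warning in the single-GPU run above: a minimal sketch of pinning the host buffers before registration, so CUDA transfers can overlap properly. Sizes and names are illustrative, not taken from the MWE:

double *A;
starpu_malloc((void **)&A, N * N * sizeof(double));   /* pinned host allocation */

starpu_data_handle_t A_h;
starpu_matrix_data_register(&A_h, STARPU_MAIN_RAM, (uintptr_t)A,
                            N, N, N, sizeof(double));  /* ld, nx, ny, elemsize */

/* ... submit tasks, wait for them ... */

starpu_data_unregister(A_h);
starpu_free_noflag(A, N * N * sizeof(double));  /* pairs with starpu_malloc() in 1.4 */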

grisuthedragon commented 2 months ago

I updated my installation to StarPU 1.4.6 and CUDA 12.5 and ran the "dgemm" example from examples/mult on my 4x A100 machine. This way, the error is independent of my code.

Now the following errors appear:

[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cublas_report_error+0x79)[0x2abcf1959b79]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10f173)[0x2abcf195d173]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x304)[0x2abcf195d7e4]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2abcf195e1a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2abcf1a3fea5]
/lib64/libc.so.6(clone+0x6d)[0x2abd10ee09fd]
[starpu][abort][starpu_cublas_report_error()@drivers/cuda/driver_cuda.c:2488]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

* 4 GPUs (always crashes)

$ STARPU_SCHED=dmdas STARPU_NCUDA=4 compute-sanitizer --tool initcheck ./dgemm
========= COMPUTE-SANITIZER

x y z ms GFlop/s

[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_copy_interface_from_cuda_to_cpu (drivers/cuda/driver_cuda.c:1680)... 719: unspecified launch failure

/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cuda_report_error+0x7b)[0x2b1ce4d41c9b]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10e8a3)[0x2b1ce4d448a3]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xadae5)[0x2b1ce4ce3ae5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae624)[0x2b1ce4ce4624]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae836)[0x2b1ce4ce4836]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae9b9)[0x2b1ce4ce49b9]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xaeaa5)[0x2b1ce4ce4aa5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x8ca)[0x2b1ce4d45daa]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2b1ce4d461a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2b1ce4e27ea5]
/lib64/libc.so.6(clone+0x6d)[0x2b1d042c89fd]
[starpu][abort][starpu_cuda_report_error()@drivers/cuda/driver_cuda.c:2494]

[starpu][starpu_cuda_report_error] Error: oops in _starpu_cuda_test_request_completion (drivers/cuda/driver_cuda.c:1547)... 719: unspecified launch failure

/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cuda_report_error+0x7b)[0x2b1ce4d41c9b]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10e40a)[0x2b1ce4d4440a]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xac4f2)[0x2b1ce4ce24f2]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae7dd)[0x2b1ce4ce47dd]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xae98d)[0x2b1ce4ce498d]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0xaeaa5)[0x2b1ce4ce4aa5]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x8ca)[0x2b1ce4d45daa]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2b1ce4d461a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2b1ce4e27ea5]
/lib64/libc.so.6(clone+0x6d)[0x2b1d042c89fd]
[starpu][abort][starpu_cuda_report_error()@drivers/cuda/driver_cuda.c:2494]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors

or 

========= COMPUTE-SANITIZER

x y z ms GFlop/s

[starpu][starpu_cublas_report_error] oops in cublas_mult (mult/xgemm.c:147)... 13: execution failed
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(starpu_cublas_report_error+0x79)[0x2ab39dcdcb79]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x10f173)[0x2ab39dce0173]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(_starpu_cuda_driver_run_once+0x304)[0x2ab39dce07e4]
/mechthild/home/koehlerm/spack/var/spack/environments/gcc12-cuda12-openblas/.spack-env/view/lib/libstarpu-1.4.so.6(+0x1101a1)[0x2ab39dce11a1]
/lib64/libpthread.so.0(+0x7ea5)[0x2ab39ddc2ea5]
/lib64/libc.so.6(clone+0x6d)[0x2ab3bd2639fd]
[starpu][abort][starpu_cublas_report_error()@drivers/cuda/driver_cuda.c:2488]
========= Error: process didn't terminate successfully
========= Target application returned an error
========= ERROR SUMMARY: 0 errors