starpu-runtime / starpu

This is a mirror of https://gitlab.inria.fr/starpu/starpu where our development happens, but contributions are welcome here too!
https://starpu.gitlabpages.inria.fr/
GNU Lesser General Public License v2.1

Runtime error in the Cholesky examples #12

Closed weslleyspereira closed 1 year ago

weslleyspereira commented 1 year ago

I get the following error when I try to run the example cholesky_implicit:

$ ./build_install/lib/starpu/examples/cholesky_implicit
Invalid MIT-MAGIC-COOKIE-1 key[starpu][weslleyp-XPS-15-9510][initialize_lws_policy] Warning: you are running the default lws scheduler, which is not a very smart scheduler, while the system has GPUs or several memory nodes. Make sure to read the StarPU documentation about adding performance models in order to be able to use the dmda or dmdas scheduler instead.
./build_install/lib/starpu/examples/cholesky_implicit(+0x5c22)[0x5619c208cc22]
/home/weslleyp/storage/starpu/build_install/lib/libstarpu-1.4.so.1(+0x127fcb)[0x7fec7f26efcb]
/home/weslleyp/storage/starpu/build_install/lib/libstarpu-1.4.so.1(_starpu_cuda_driver_run_once+0x60b)[0x7fec7f26f99b]
/home/weslleyp/storage/starpu/build_install/lib/libstarpu-1.4.so.1(+0x128ffd)[0x7fec7f26fffd]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fec7f12c609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fec66375133]
cholesky_implicit: cholesky/cholesky_kernels.c:289: chol_common_codelet_update_potrf: Assertion `0 && "sstatus == CUSOLVER_STATUS_SUCCESS"' failed.
Aborted (core dumped)

More details

nx 960
sub11 0x7fea6ea8c000
ld 960
workspace 0x7fea6f194000
Lwork 936

My configurations

starpu$ readelf -d  build_install/lib/libstarpu-1.4.so

Dynamic section at offset 0x193b28 contains 37 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [librt.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libOpenCL.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libcudart.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libcublas.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libcusparse.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libcusolver.so.11]
 0x0000000000000001 (NEEDED)             Shared library: [libnvidia-ml.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libhwloc.so.15]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x000000000000000e (SONAME)             Library soname: [libstarpu-1.4.so.1]
 0x000000000000000c (INIT)               0x23000
 0x000000000000000d (FINI)               0x132130
 0x0000000000000019 (INIT_ARRAY)         0x194190
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x194198
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x2f0
 0x0000000000000005 (STRTAB)             0xc7d8
 0x0000000000000006 (SYMTAB)             0x2c68
 0x000000000000000a (STRSZ)              45299 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000003 (PLTGOT)             0x195000
 0x0000000000000002 (PLTRELSZ)           20280 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x1d920
 0x0000000000000007 (RELA)               0x187c0
 0x0000000000000008 (RELASZ)             20832 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffe (VERNEED)            0x185c0
 0x000000006fffffff (VERNEEDNUM)         11
 0x000000006ffffff0 (VERSYM)             0x178cc
 0x000000006ffffff9 (RELACOUNT)          641
 0x0000000000000000 (NULL)               0x0

Please, let me know if you need more information. Thanks!

nfurmento commented 1 year ago

Hello, could you please try again with the latest master branch? We have fixed many bugs. If it keeps failing, please send us your StarPU config.log file. Thanks, Nathalie

sthibaul commented 1 year ago

Hello, CUSOLVER_STATUS_INTERNAL_ERROR suggests that something bad is happening inside CUDA. I tried installing Ubuntu 20.04, and the Cholesky example ran fine with our RTX 6000/8000. Apparently your CUDA isn't the one shipped by Ubuntu; how did you install it? I'm afraid the problem could be specific to your card. Perhaps the CUDA tools can report more precise errors than the mere CUSOLVER_STATUS_INTERNAL_ERROR? Perhaps dmesg can tell more?

weslleyspereira commented 1 year ago

Hi, I have just run the example cholesky_implicit after a fresh start of StarPU (commit 368cf4d4f59176780f06d02825d53680478328a7). I obtained the same error. Please, see the log: config.log

I am using the driver nvidia-driver-525. Please, see details below:

weslleyp@weslleyp-XPS-15-9510:~$ nvidia-smi
Mon May  1 17:54:31 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P0    N/A /  40W |      9MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1817      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      3503      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
weslleyp@weslleyp-XPS-15-9510:~$ sudo apt install nvidia-driver-525 nvidia-dkms-525
Reading package lists... Done
Building dependency tree       
Reading state information... Done
nvidia-dkms-525 is already the newest version (525.105.17-0ubuntu1).
nvidia-driver-525 is already the newest version (525.105.17-0ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 6 not upgraded.

This is not the open kernel version (see the screenshot attached to the original issue).

weslleyspereira commented 1 year ago

Perhaps dmesg can tell more?

Where should I add this command?

sthibaul commented 1 year ago

Perhaps dmesg can tell more?

Where should I add this command?

You can run it with sudo.

sthibaul commented 1 year ago

Weslley S. Pereira wrote on Mon, 01 May 2023 17:02:49 -0700:

@.***:~$ sudo apt install nvidia-driver-525 nvidia-dkms-525

Which apt repository is this coming from?

weslleyspereira commented 1 year ago

Which apt repository is this coming from?

https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64

weslleyspereira commented 1 year ago

Output of dmesg --follow:

NVRM: Xid (PCI:0000:01:00): 31, pid=154328, name=cholesky_implic, Ch 00000018, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE

sthibaul commented 1 year ago

Just realizing:

 0x0000000000000001 (NEEDED)             Shared library: [libcudart.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libcublas.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libcusparse.so.12]
 0x0000000000000001 (NEEDED)             Shared library: [libcusolver.so.11]

It looks bogus to have cuda libraries version 12, but cusolver version 11.

weslleyspereira commented 1 year ago

Thanks! That may be it. I don't think I have libcusolver.so.11 installed at all. I am looking into that right now.

weslleyspereira commented 1 year ago

@sthibaul, which CUDA version and repository did you use for the tests on your Ubuntu 20.04 with the RTX 6000/8000? Thanks.

sthibaul commented 1 year ago

I was using the stock ubuntu cuda 10 stack.

sthibaul commented 1 year ago

Thanks! That may be it. I don't think I have libcusolver.so.11 installed at all. I am looking into that right now.

Well, you do have a libcusolver.so.11 installed, otherwise the StarPU build wouldn't have found it :)

weslleyspereira commented 1 year ago

So, about libcusolver.so.12: I have the libcusolver-12-0 and libcusolver-12-1 packages installed, but the library's major version hasn't changed. It is still libcusolver.so.11:

$ dpkg -L libcusolver-12-0
/.
/usr
/usr/local
/usr/local/cuda-12.0
/usr/local/cuda-12.0/targets
/usr/local/cuda-12.0/targets/x86_64-linux
/usr/local/cuda-12.0/targets/x86_64-linux/lib
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcusolver.so.11.4.3.1
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcusolverMg.so.11.4.3.1
/usr/share
/usr/share/doc
/usr/share/doc/libcusolver-12-0
/usr/share/doc/libcusolver-12-0/changelog.Debian.gz
/usr/share/doc/libcusolver-12-0/copyright
/usr/local/cuda-12.0/lib64
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcusolver.so.11
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcusolverMg.so.11
$ dpkg -L libcusolver-12-1
/.
/usr
/usr/local
/usr/local/cuda-12.1
/usr/local/cuda-12.1/targets
/usr/local/cuda-12.1/targets/x86_64-linux
/usr/local/cuda-12.1/targets/x86_64-linux/lib
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcusolver.so.11.4.5.107
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcusolverMg.so.11.4.5.107
/usr/share
/usr/share/doc
/usr/share/doc/libcusolver-12-1
/usr/share/doc/libcusolver-12-1/changelog.Debian.gz
/usr/share/doc/libcusolver-12-1/copyright
/usr/local/cuda-12.1/lib64
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcusolver.so.11
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcusolverMg.so.11

So, I don't see a real problem in this case.

In any case, I will try to install cuda-10 instead.
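
To make the naming in the listing above concrete, here is a small sketch that extracts the major version from a shared-library file name. The function soname_major is a hypothetical helper written only for this illustration; it is not a tool shipped with CUDA or StarPU. It assumes the usual Linux convention in which the major version is the number that directly follows ".so.".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the major version that follows ".so." in a
 * shared-library file name, or -1 if the name carries no such number. */
long soname_major(const char *filename)
{
    assert(filename != NULL);

    /* Locate the ".so." marker; the major version starts right after it. */
    const char *p = strstr(filename, ".so.");
    if (p == NULL)
        return -1;
    p += strlen(".so.");

    /* Require a digit so that signs and whitespace are rejected. */
    if (*p < '0' || *p > '9')
        return -1;

    return strtol(p, NULL, 10);
}
```

Applied to the file names listed by dpkg, soname_major("libcusolver.so.11.4.3.1") and soname_major("libcusolver.so.11.4.5.107") both give 11, while a name such as "libcudart.so.12" gives 12. The package name (libcusolver-12-0) and the library major version (11) are therefore two separate numbers.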

weslleyspereira commented 1 year ago

About the Cholesky example. I see:

https://github.com/starpu-runtime/starpu/blob/9563a47472940f4be9f199ffba10d40ef327cb44/examples/cholesky/cholesky_kernels.c#L555-L563

Shouldn't it be starpu_variable_data_register(&scratch, -1, 0, Lwork * sizeof(float)); instead? This change does not solve the issue I am having, but I wanted to point it out anyway.

sthibaul commented 1 year ago

About the Cholesky example. I see:

https://github.com/starpu-runtime/starpu/blob/9563a47472940f4be9f199ffba10d40ef327cb44/examples/cholesky/cholesky_kernels.c#L555-L563

Shouldn't it be starpu_variable_data_register(&scratch, -1, 0, Lwork * sizeof(float)); instead? This change does not solve the issue I am having, but I wanted to point it out anyway.

The documentation is not very clear on this, but the code snippets I can find in various places seem to agree on this indeed.

weslleyspereira commented 1 year ago

I saw it here: https://docs.nvidia.com/cuda/cusolver/index.html?highlight=bufferSize#id5 "xyz_bufferSize returns bufferSize for each device. The size is number of elements, not number of bytes."
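
A minimal sketch of the conversion that quote implies, assuming a single-precision workspace as in the Cholesky example. The helper name workspace_bytes is hypothetical and is not part of StarPU or cuSOLVER; it only turns an element count such as Lwork into the byte count that an allocator or a size argument in bytes expects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert an element count, as reported by a
 * *_bufferSize query, into a size in bytes. */
size_t workspace_bytes(int lwork_elems, size_t elem_size)
{
    assert(lwork_elems >= 0);
    assert(elem_size > 0);

    size_t count = (size_t)lwork_elems;

    /* Refuse counts whose byte size would overflow size_t. */
    assert(count <= SIZE_MAX / elem_size);

    return count * elem_size;
}
```

With the Lwork value of 936 reported at the top of this issue, workspace_bytes(936, sizeof(float)) gives 3744 bytes on platforms where a float is 4 bytes.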

sthibaul commented 1 year ago

I saw it here: https://docs.nvidia.com/cuda/cusolver/index.html?highlight=bufferSize#id5 "xyz_bufferSize returns bufferSize for each device. The size is number of elements, not number of bytes."

Ah, I was looking at the CUDA 11 documentation. I guess they fixed that in CUDA 12.

sthibaul commented 1 year ago

After installing cuda 12.0 from that repository, I could reproduce the issue with starpu, but also with this small example:

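#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>

int main(void) {
    cusolverDnHandle_t handle;
    cusolverStatus_t ret;
    ret = cusolverDnCreate(&handle);
    assert(ret == CUSOLVER_STATUS_SUCCESS);

    /* Query the workspace size; cuSOLVER reports it in elements, not bytes. */
    int Lwork;
    int nx = 320, ld = 320;
    ret = cusolverDnSpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, nx, NULL, ld, &Lwork);
    assert(ret == CUSOLVER_STATUS_SUCCESS);

    /* Workspace buffer. */
    float *ptr;
    cudaMalloc((void**) &ptr, Lwork * sizeof(float));
    /* Matrix buffer; its contents are left uninitialized. */
    float *ptr2;
    cudaMalloc((void**) &ptr2, ld * nx * sizeof(float));
    /* cuSOLVER documents devInfo as a device pointer, so allocate it on the GPU. */
    int *devInfo;
    cudaMalloc((void**) &devInfo, sizeof(int));

    ret = cusolverDnSpotrf(handle, CUBLAS_FILL_MODE_LOWER, nx, ptr2, ld, ptr, Lwork, devInfo);
    assert(ret == CUSOLVER_STATUS_SUCCESS);

    return 0;
}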

So it doesn't seem to be a StarPU-specific issue, but rather one inside CUDA itself?

weslleyspereira commented 1 year ago

Oh! Ok. I am convinced. I confirm this error does not appear when using libcusolver.so.10. I will close this issue. Thanks a lot for the guidance here!

Do you have a channel to report those bugs to Nvidia? I have never done that.

sthibaul commented 1 year ago

I'd say report it to Ubuntu first. That 11 vs 12 thing is really odd.