Closed weslleyspereira closed 1 year ago
Hello, could you please try again with the latest master branch? We fixed many bugs. If it keeps failing, please send us your StarPU config.log file. Thanks, Nathalie
Hello,
CUSOLVER_STATUS_INTERNAL_ERROR looks like something bad is happening in CUDA.
I tried installing Ubuntu 20.04, and the cholesky example went fine with our RTX 6000/8000.
Apparently your CUDA isn't the one shipped by Ubuntu; how did you install it? I'm afraid the problem could be specific to your card. Perhaps the CUDA tools can report more precise errors than the mere CUSOLVER_STATUS_INTERNAL_ERROR? Perhaps dmesg can tell more?
Hi, I have just run the example cholesky_implicit after a fresh start of StarPU (commit 368cf4d4f59176780f06d02825d53680478328a7). I obtained the same error. Please see the log:
config.log
I am using the driver nvidia-driver-525. Please see details below:
weslleyp@weslleyp-XPS-15-9510:~$ nvidia-smi
Mon May 1 17:54:31 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| N/A 49C P0 N/A / 40W | 9MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1817 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 3503 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
weslleyp@weslleyp-XPS-15-9510:~$ sudo apt install nvidia-driver-525 nvidia-dkms-525
Reading package lists... Done
Building dependency tree
Reading state information... Done
nvidia-dkms-525 is already the newest version (525.105.17-0ubuntu1).
nvidia-driver-525 is already the newest version (525.105.17-0ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 6 not upgraded.
This is not the open kernel version. Please, see the image below:
Perhaps dmesg can tell more?
Where should I add this command?
you can run it with sudo
Weslley S. Pereira wrote on Mon, 01 May 2023 17:02:49 -0700:
@.***:~$ sudo apt install nvidia-driver-525 nvidia-dkms-525
Which apt repository is this coming from?
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64
Output of dmesg --follow:
NVRM: Xid (PCI:0000:01:00): 31, pid=154328, name=cholesky_implic, Ch 00000018, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x0_00000000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_WRITE
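When scanning a long dmesg log for lines like the one above, it can help to pull out the Xid code and the pid automatically. The following is a minimal sketch in plain C; the helper name parse_xid is hypothetical, and the format string only assumes the line layout shown above (an optional timestamp prefix, then "NVRM: Xid (PCI:...): <code>, pid=<pid>"). Lines where the pid is not numeric are simply reported as not matching.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the Xid code and the pid from an NVRM kernel
 * log line. Returns 1 on success, 0 if the line does not have the expected
 * layout. Any prefix before "NVRM: Xid" (e.g. a dmesg timestamp) is skipped. */
int parse_xid(const char *line, int *xid, int *pid)
{
    const char *start = strstr(line, "NVRM: Xid");
    if (start == NULL)
        return 0;
    /* %*[^)] skips the PCI address without storing it. */
    return sscanf(start, "NVRM: Xid (PCI:%*[^)]): %d, pid=%d", xid, pid) == 2;
}
```

Applied to the line above, this should yield Xid 31 and pid 154328, which matches the "MMU Fault" text that follows in the message.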
Just realizing:
0x0000000000000001 (NEEDED) Shared library: [libcudart.so.12]
0x0000000000000001 (NEEDED) Shared library: [libcublas.so.12]
0x0000000000000001 (NEEDED) Shared library: [libcusparse.so.12]
0x0000000000000001 (NEEDED) Shared library: [libcusolver.so.11]
It looks bogus to have cuda libraries version 12, but cusolver version 11.
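The comparison above was done by eye on the NEEDED entries. As a rough sketch, the same check can be written in plain C; the helper names soname_major and same_major are hypothetical, and the code only assumes the usual "lib<name>.so.<major>" naming. A mismatch flagged this way is only a prompt to look closer at the install, not proof that it is broken.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the major version encoded in a soname such as
 * "libcusolver.so.11", or -1 if there is no ".so.<number>" part. */
int soname_major(const char *soname)
{
    const char *p = strstr(soname, ".so.");
    if (p == NULL)
        return -1;
    p += 4; /* skip ".so." */
    char *end;
    long major = strtol(p, &end, 10);
    if (end == p || major < 0)
        return -1;
    return (int) major;
}

/* Hypothetical helper: return 1 if all n sonames parse and share the same
 * major version, 0 otherwise. An empty list counts as consistent. */
int same_major(const char *const *sonames, int n)
{
    int first = -1;
    for (int i = 0; i < n; i++) {
        int major = soname_major(sonames[i]);
        if (major < 0)
            return 0;
        if (i == 0)
            first = major;
        else if (major != first)
            return 0;
    }
    return 1;
}
```

Feeding it the four names listed above would report an inconsistency, because libcusolver.so.11 is the only one with major version 11.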
Thanks! That may be it. I don't think I have libcusolver.so.11 installed at all. I am looking into that right now.
@sthibaul, what cuda version / repository did you use for the tests in your ubuntu 20.04 with RTX 6000/8000 ? Thanks.
I was using the stock ubuntu cuda 10 stack.
Well, you do have a libcusolver.so.11 installed, otherwise the StarPU build wouldn't have found it :)
So, about libcusolver for CUDA 12: I have it installed, but the library's major version hasn't changed; it is still libcusolver.so.11:
$ dpkg -L libcusolver-12-0
/.
/usr
/usr/local
/usr/local/cuda-12.0
/usr/local/cuda-12.0/targets
/usr/local/cuda-12.0/targets/x86_64-linux
/usr/local/cuda-12.0/targets/x86_64-linux/lib
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcusolver.so.11.4.3.1
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcusolverMg.so.11.4.3.1
/usr/share
/usr/share/doc
/usr/share/doc/libcusolver-12-0
/usr/share/doc/libcusolver-12-0/changelog.Debian.gz
/usr/share/doc/libcusolver-12-0/copyright
/usr/local/cuda-12.0/lib64
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcusolver.so.11
/usr/local/cuda-12.0/targets/x86_64-linux/lib/libcusolverMg.so.11
$ dpkg -L libcusolver-12-1
/.
/usr
/usr/local
/usr/local/cuda-12.1
/usr/local/cuda-12.1/targets
/usr/local/cuda-12.1/targets/x86_64-linux
/usr/local/cuda-12.1/targets/x86_64-linux/lib
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcusolver.so.11.4.5.107
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcusolverMg.so.11.4.5.107
/usr/share
/usr/share/doc
/usr/share/doc/libcusolver-12-1
/usr/share/doc/libcusolver-12-1/changelog.Debian.gz
/usr/share/doc/libcusolver-12-1/copyright
/usr/local/cuda-12.1/lib64
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcusolver.so.11
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcusolverMg.so.11
So, I don't see a real problem in this case.
In any case, I will try to install cuda-10 instead.
About the Cholesky example. I see:
Wouldn't it be starpu_variable_data_register(&scratch, -1, 0, Lwork * sizeof(float)); ? This change does not solve the issue I am having, but I wanted to point it out anyway.
The documentation is not very clear on this, but the code snippets I can find in various places seem to agree on this indeed.
I saw it here: https://docs.nvidia.com/cuda/cusolver/index.html?highlight=bufferSize#id5 "xyz_bufferSize returns bufferSize for each device. The size is number of elements, not number of bytes."
Ah, I was looking at the cuda 11 documentation, I guess they fixed that in cuda 12
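To make the elements-versus-bytes point concrete, here is a small sketch in plain C of the conversion an allocator call needs when a size query returns an element count. The helper name workspace_bytes is hypothetical and is not part of StarPU or cuSOLVER; it only shows the multiplication by the element size, with a guard against negative counts and overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert an element count, as reported by a
 * *_bufferSize style query, into the byte count an allocator expects.
 * Returns 0 for a negative count, a zero element size, or on overflow. */
size_t workspace_bytes(int lwork, size_t elem_size)
{
    if (lwork < 0 || elem_size == 0)
        return 0;
    if ((size_t) lwork > SIZE_MAX / elem_size)
        return 0;
    return (size_t) lwork * elem_size;
}
```

For a single-precision factorization the call would be workspace_bytes(Lwork, sizeof(float)), which corresponds to the Lwork * sizeof(float) expression suggested earlier in the thread.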
After installing cuda 12.0 from that repository, I could reproduce the issue with starpu, but also with this small example:
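#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>

int main(void) {
    cusolverDnHandle_t handle;
    cusolverStatus_t ret;
    ret = cusolverDnCreate(&handle);
    assert(ret == CUSOLVER_STATUS_SUCCESS);

    /* Query the workspace size, in elements, for an nx-by-nx factorization. */
    int Lwork;
    int nx = 320, ld = 320;
    cusolverDnSpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER, nx, NULL, ld, &Lwork);

    /* Workspace buffer: Lwork elements of type float. */
    float *ptr;
    cudaMalloc((void**) &ptr, Lwork * sizeof(float));
    /* Matrix buffer: ld * nx floats, contents left uninitialized. */
    float *ptr2;
    cudaMalloc((void**) &ptr2, ld * nx * sizeof(float));
    /* devInfo is documented as a device pointer, so it is allocated on the GPU. */
    int *devInfo;
    cudaMalloc((void**) &devInfo, sizeof(int));

    ret = cusolverDnSpotrf(handle, CUBLAS_FILL_MODE_LOWER, nx, ptr2, ld, ptr, Lwork, devInfo);
    assert(ret == CUSOLVER_STATUS_SUCCESS);
    return 0;
}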
So it doesn't seem like a StarPU-specific issue, but rather one inside CUDA itself?
Oh! Ok. I am convinced. I confirm this error does not appear when using libcusolver.so.10. I will close this issue.
Thanks a lot for the guidance here!
Do you have a channel to report those bugs to Nvidia? I never did it.
I'd say rather first report it to ubuntu. That 11 vs 12 thing is really odd.
I get the following error when I try to run the example cholesky_implicit:
More details:
cholesky_compil cholesky_grain_tag cholesky_tag cholesky_tile_tag
sstatus evaluates to CUSOLVER_STATUS_INTERNAL_ERROR, and the problem occurs the first time chol_common_codelet_update_potrf is called.
axpy dgemm alloc add_vectors
My configurations:
Please let me know if you need more information. Thanks!