mumax / 3

GPU-accelerated micromagnetic simulator

mumax3 raises CUDA_ERROR_UNKNOWN #273

Open CasperSchippers opened 3 years ago

CasperSchippers commented 3 years ago

Hi,

When I try to run mumax3 (version 3.10, happens both with pre-compiled binaries and when built from source) I get the following error:

Try running: sudo nvidia-modprobe -u

/home/azken/go/src/github.com/mumax/3/cuda/init.go:60 CUDA_ERROR_UNKNOWN

I tried the nvidia-modprobe suggestion, but it doesn't help. I'm on Arch Linux (Linux version 5.9.1.arch1-1) with the following package versions: nvidia 455.28-7, nvidia-utils 455.28-1, cuda 11.1.0-2.

The nvidia-smi command gives the following output:

Wed Oct 28 11:18:56 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.28       Driver Version: 455.28       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX TITAN   Off  | 00000000:01:00.0  On |                  N/A |
| 30%   37C    P8    16W / 250W |     38MiB /  6080MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       428      G   /usr/lib/Xorg                      35MiB |
+-----------------------------------------------------------------------------+

Could you help me?

Thanks in advance!

Regards, Casper

Artemkth commented 3 years ago

Hey @CasperSchippers, I ran into the same issue. Does it still persist on your setup? If so, can you check whether the following minimal example reproduces the issue for you as well?

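#include <iostream>
#include <cuda.h>

int main()
{
    std::cout << "Hello cuda world\n";
    // cuInit(0) initializes the CUDA driver API; the flags argument must be 0.
    CUresult err = cuInit(0);
    if (err != CUDA_SUCCESS) {
        // CUresult is a plain enum, so this prints the numeric error code.
        std::cout << "Got error: " << err << " while initializing :'(" << std::endl;
        return 1;
    }
    std::cout << "CUDA init success!" << std::endl;
    return 0;
}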

Compile with nvcc -lcuda. Error code 999 is CUDA_ERROR_UNKNOWN. I started getting it after upgrading the drivers, CUDA, and the kernel on Debian testing, so if this is urgent you can try downgrading ;) If you still get the error, I think there is good reason to close the issue here and open one with your distro's maintainers and NVIDIA support.
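For reading the number that the test program prints, a small lookup can help. The sketch below is only an illustration: the function name cuda_driver_error_name is hypothetical, it covers just a handful of CUresult values as defined in cuda.h, and it deliberately avoids the CUDA headers so it compiles anywhere. In a program that already links the driver API, calling cuGetErrorName would be the more usual route.

```cpp
#include <string>

// Hypothetical helper: translate a few numeric CUresult codes into their
// enum names. Only a small subset of the codes in cuda.h is listed here.
std::string cuda_driver_error_name(int code) {
    switch (code) {
        case 0:   return "CUDA_SUCCESS";
        case 1:   return "CUDA_ERROR_INVALID_VALUE";
        case 2:   return "CUDA_ERROR_OUT_OF_MEMORY";
        case 3:   return "CUDA_ERROR_NOT_INITIALIZED";
        case 100: return "CUDA_ERROR_NO_DEVICE";
        case 999: return "CUDA_ERROR_UNKNOWN";
        // Anything not in the subset above is reported by number only.
        default:  return "unlisted code " + std::to_string(code);
    }
}
```

Passing 999, the value reported in this thread, yields CUDA_ERROR_UNKNOWN, the same name that mumax3 prints from cuda/init.go.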

Artemkth commented 3 years ago

Please include the dmesg output along with your minimal sample run. I just looked at mine and it is pretty informative:

[15793.136747] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[15793.137060] nvidia_uvm: Unknown symbol radix_tree_preloads (err -2)
[15793.137100] nvidia_uvm: Unknown symbol set_cpus_allowed_ptr (err -2)
[15793.137148] nvidia_uvm: Unknown symbol mmu_notifier_unregister (err -2)
[15793.137268] nvidia_uvm: Unknown symbol __mmu_notifier_register (err -2)

Edit: I found a bug report. I hope people with the same issue find it helpful: https://bugs.archlinux.org/task/68312
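The "Unknown symbol ... (err -2)" lines in the dmesg output above mean that the kernel could not resolve those symbols while loading nvidia_uvm (-2 is -ENOENT). That usually points to a module built for a different kernel than the one running. As a rough sketch, such lines could be picked out of a saved log with a helper like the one below. The function name unresolved_uvm_symbol is hypothetical, and the pattern is assumed from the five log lines quoted here, not from any documented dmesg format.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: if the line looks like
//   "... nvidia_uvm: Unknown symbol NAME (err -2)"
// return NAME; otherwise return an empty string.
std::string unresolved_uvm_symbol(const std::string& line) {
    // Pattern assumed from the log excerpt in this thread.
    static const std::regex pattern(
        R"(nvidia_uvm: Unknown symbol (\w+) \(err -?\d+\))");
    std::smatch match;
    if (std::regex_search(line, match, pattern)) {
        return match[1].str();  // first capture group is the symbol name
    }
    return std::string();
}
```

Feeding each line of the log through this helper and collecting the non-empty results gives the list of symbols the module could not find, which is the detail worth including in a bug report to the distribution or to NVIDIA.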

CasperSchippers commented 3 years ago

Hi @Artemkth, thanks for the reply. At the moment, I have downgraded the system because I couldn't get it to work, and people needed it urgently. Whenever the system is available to me again, I will try your suggestion.