Closed: ted-kord closed this issue 1 year ago
I have re-installed Dex a few times, all with the same result: the cuda backend is not recognized.
option --backend: Bad option. Expected one of: ["llvm","llvm-mc","mlir","interpreter"]
I am running the sample script like this: dex script sierpinski.dx --backend llvm-cuda
I also get the same result (Expected one of: ["llvm","llvm-mc","mlir","interpreter"]) when running with --backend llvm-cuda.
. Some build and install log lines include:
STEP 11: RUN cd dex-lang && make
clang++-9 -fPIC -I/usr/local/cuda/include -DDEX_CUDA -std=c++11 -fno-exceptions -fno-rtti -DDEX_LIVE -c -emit-llvm src/lib/dexrt.cpp -o src/lib/dexrt.bc
stack build --flag dex:cuda
...
[ 1 of 50] Compiling CUDA
...
STEP 13: RUN cd dex-lang && make install
stack install --flag dex:optimized --flag dex:cuda
Also, nvidia-smi works in the podman container I'm running inside of.
#663 should fix the CUDA issues. I'll look into the multicore backend later, but at this point it is even less polished than the CUDA one.
I tried this branch. It doesn't reject the llvm-cuda backend outright, but it also doesn't seem to work for me:
$ dex --backend llvm-cuda repl
>=> Small = Fin 5
>=> x = linspace Small (-1.0) 1.5
CUDA driver error at cuDeviceGet (CUDA_ERROR_NOT_INITIALIZED): initialization error
Aborted (core dumped)
$ dex repl
>=> Small = Fin 5
>=> x = linspace Small (-1.0) 1.5
>=> x
[-1., -0.5, 0., 0.5, 1.]
>=>
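(Aside for readers comparing the repl output to the arguments: the result above is consistent with Dex's linspace excluding the upper endpoint, i.e. splitting the interval into n equal steps and keeping the first n points. A hedged Python sketch of that behavior; the exact definition is an assumption inferred from the output, not taken from Dex's source:)

```python
# Sketch of what linspace appears to compute here (endpoint excluded).
# This matches the transcript above; the precise Dex definition is assumed.
def dex_linspace(n, low, high):
    step = (high - low) / n
    return [low + step * i for i in range(n)]

print(dex_linspace(5, -1.0, 1.5))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
```

That would explain why 1.5 itself never shows up in the printed array.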
When running my full script, I also get this output:
$ dex --backend llvm-cuda script --outfmt result-only basic.dx
dex: Not a cons list: (((Fin 5) => Float32))
CallStack (from HasCallStack):
error, called at src/lib/Builder.hs:443:8 in dex-0.1.0.0-EDxLzvE2iIKBpgSlPTyhod:Builder
unpackRightLeaningConsList, called at src/lib/Parallelize.hs:221:18 in dex-0.1.0.0-EDxLzvE2iIKBpgSlPTyhod:Parallelize
Of course, those are one or two separate issues from the original report; I just figured I'd mention the status. I'm not sure how they fit into the grand scheme of things, but if you want me to write up separate issues, let me know.
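(For anyone hitting the same "Not a cons list" trace: the failing helper, unpackRightLeaningConsList, expects a value encoded as right-nested pairs. The real code is Haskell in src/lib/Builder.hs; the following is only a rough Python analogy, and the encoding details are illustrative assumptions:)

```python
# A right-leaning cons list encodes [a, b, c] as (a, (b, (c, UNIT))).
# Unpacking walks the right spine; anything not in that shape triggers
# an error analogous to the "Not a cons list" message above.
UNIT = ()

def unpack_cons(v):
    items = []
    while v != UNIT:
        if not (isinstance(v, tuple) and len(v) == 2):
            raise ValueError(f"Not a cons list: {v!r}")
        head, v = v
        items.append(head)
    return items

print(unpack_cons(("a", ("b", ("c", UNIT)))))  # ['a', 'b', 'c']
```

In the trace above, the parallelization pass handed that helper a single table type rather than a nested pair, hence the crash.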
Hmm that's weird. I just ran your repl example and it seems to work out fine for me:
$ dex --backend llvm-cuda repl
>=> Small = Fin 5
>=> linspace Small (-1.0) (1.5)
[-1., -0.5, 0., 0.5, 1.]
Could you try opening up src/lib/dexrt.cpp, replacing cuInit(0) on line 310 with dex_ensure_has_cuda_context(), and recompiling? I don't think it should matter, but maybe for some CUDA versions it does?
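(As I read that suggestion, the point is to route initialization through a helper that sets up the driver and a context once, lazily, instead of a bare init call. The real dex_ensure_has_cuda_context is C++ in src/lib/dexrt.cpp; this Python sketch is only an analogy, and its behavior is an assumption:)

```python
calls = []

def fake_cu_init(flags):
    # Stands in for cuInit(0); records that it ran.
    calls.append("init")

def fake_make_context():
    # Stands in for creating/binding a CUDA context.
    calls.append("ctx")
    return "ctx0"

_ctx = None

def ensure_has_context():
    """Analogy of an 'ensure context' helper: initialize the driver and
    bind a context only on first use; later calls reuse the cached one."""
    global _ctx
    if _ctx is None:
        fake_cu_init(0)
        _ctx = fake_make_context()
    return _ctx

ensure_has_context()
ensure_has_context()
print(calls)  # ['init', 'ctx']  (the second call was a no-op)
```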
And the issue you see in your second script looks like a legit bug inside the parallelization pass. I'll try to investigate it this week, but for now I opened #665.
Assuming I did that correctly, the new error message after changing dexrt.cpp is more informative:
$ dex --backend llvm-cuda repl
CUDA driver error at cuInit (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH): system has unsupported display driver / cuda driver combination
Aborted (core dumped)
Here is some other info:
$ nvidia-smi
Mon Oct 18 13:48:01 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01 Driver Version: 470.63.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 0% 35C P8 11W / 170W | 5MiB / 12053MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
Ah interesting! What kind of container are you using? It seems like an issue some people have encountered before (e.g. here). I'd consider this a system misconfiguration on your side.
I'm using podman. For reference, here are the same commands on my host system.
$ nvidia-smi
Mon Oct 18 08:00:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01 Driver Version: 470.63.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 0% 29C P8 11W / 170W | 5MiB / 12053MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1763 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
But in any case, I'll see if I get a chance to explore, and if I find anything meaningful, I'll add it to the other issue and/or write up a new one.
(And if it's just an issue on my end, there's nothing further to add here, except as information for others, of course.)
tjpalmer's repl example works for me too, but trying the CUDA backend with sierpinski.dx and ode-integrator.dx now produces a different error. For example, dex script ode-integrator.dx --backend llvm-cuda gives the error:
CUDA driver error at cuLaunchKernel (CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES): too many resources requested for launch
Aborted (core dumped)
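(For context: CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES generally means the launch configuration asked for more per-block resources, such as threads or registers, than the device allows. A hedged Python sketch of the kind of feasibility check involved; the limit values below are illustrative examples, not queried from any real device:)

```python
# Illustrative limits for a hypothetical device (real values vary by GPU
# and are queried via the driver API, e.g. device attributes).
MAX_THREADS_PER_BLOCK = 1024
MAX_REGS_PER_BLOCK = 65536

def launch_fits(threads_per_block, regs_per_thread):
    # A launch fails with "out of resources" when either bound is exceeded.
    return (threads_per_block <= MAX_THREADS_PER_BLOCK
            and threads_per_block * regs_per_thread <= MAX_REGS_PER_BLOCK)

print(launch_fits(256, 64))    # True: 16384 registers fit in the block budget
print(launch_fits(1024, 128))  # False: 131072 registers exceed the block budget
```

So a kernel that compiles fine can still fail at launch time if its register usage times the block size exceeds the device budget.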
@ted-kord can you please paste the output of nvidia-smi in this thread?
This is the output of nvidia-smi:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Aug_15_21:14:11_PDT_2021
Cuda compilation tools, release 11.4, V11.4.120
Build cuda_11.4.r11.4/compiler.30300941_0
Uh sorry, I was wondering what card you have, but it's not visible on the screenshot
It's NVIDIA Corporation TU106M [GeForce RTX 2060 Max-Q].
Closing as obsolete
I get a segmentation fault (core dumped) when I try the llvm-mc multicore backend on some of the examples: sierpinski, ode-integrator, and nn. With the GPU backend (llvm-cuda), I get the following:
option --backend: Bad option. Expected one of: ["llvm","llvm-mc","mlir","interpreter"]
I have the CUDA SDK/toolkit installed and it was picked up when installing Dex.