google-research / dex-lang

Research language for array processing in the Haskell/ML family
BSD 3-Clause "New" or "Revised" License

GPU and Multicore backends not working #651

Closed ted-kord closed 1 year ago

ted-kord commented 2 years ago

I get a segmentation fault (core dumped) when I try the llvm-mc multicore backend on some of the examples: sierpinski, ode_integrator, and nn.

With the GPU backend (llvm-cuda), I get the following: option --backend: Bad option. Expected one of: ["llvm","llvm-mc","mlir","interpreter"].

I have the CUDA SDK/toolkit installed, and it was picked up when installing Dex.

ted-kord commented 2 years ago

I have re-installed Dex a few times, all with the same result: the CUDA backend is not recognized. option --backend: Bad option. Expected one of: ["llvm","llvm-mc","mlir","interpreter"].

I am running the sample script like this: dex script sierpinski.dx --backend llvm-cuda

tjpalmer commented 2 years ago

I also get the same result (Expected one of: ["llvm","llvm-mc","mlir","interpreter"]) when running with --backend llvm-cuda. Some build and install log lines include:

STEP 11: RUN cd dex-lang && make
clang++-9 -fPIC -I/usr/local/cuda/include -DDEX_CUDA -std=c++11 -fno-exceptions -fno-rtti -DDEX_LIVE -c -emit-llvm src/lib/dexrt.cpp -o src/lib/dexrt.bc
stack build --flag dex:cuda
...
[ 1 of 50] Compiling CUDA
...
STEP 13: RUN cd dex-lang && make install
stack install  --flag dex:optimized --flag dex:cuda

Also, nvidia-smi works in the podman container I'm running inside of.

apaszke commented 2 years ago

#663 should fix the CUDA issues. I'll look into the multicore backend later, but it is even less polished than the CUDA one at this point.

tjpalmer commented 2 years ago

> #663 should fix the CUDA issues. I'll look into the multicore backend later, but it is even less polished than the CUDA one at this point.

I tried this branch. It doesn't reject the llvm-cuda backend outright, but it also doesn't seem to work for me:

$ dex --backend llvm-cuda repl
>=> Small = Fin 5

>=> x = linspace Small (-1.0) 1.5
CUDA driver error at cuDeviceGet (CUDA_ERROR_NOT_INITIALIZED): initialization error
Aborted (core dumped)
$ dex repl
>=> Small = Fin 5

>=> x = linspace Small (-1.0) 1.5

>=> x
[-1., -0.5, 0., 0.5, 1.]
>=> 

When running my full script, I also get this output:

$ dex --backend llvm-cuda script --outfmt result-only basic.dx 
dex: Not a cons list: (((Fin 5) => Float32))
CallStack (from HasCallStack):
  error, called at src/lib/Builder.hs:443:8 in dex-0.1.0.0-EDxLzvE2iIKBpgSlPTyhod:Builder
  unpackRightLeaningConsList, called at src/lib/Parallelize.hs:221:18 in dex-0.1.0.0-EDxLzvE2iIKBpgSlPTyhod:Parallelize
tjpalmer commented 2 years ago

Of course, those are one or two separate issues from the original report. Just figured I'd mention the status. I'm not sure how they fit into the grand scheme of things, but if you want me to write up separate issues, let me know.

apaszke commented 2 years ago

Hmm, that's weird. I just ran your repl example and it seems to work fine for me:

$ dex --backend llvm-cuda repl
>=> Small = Fin 5

>=> linspace Small (-1.0) (1.5)
[-1., -0.5, 0., 0.5, 1.]

Could you try opening up src/lib/dexrt.cpp, replacing cuInit(0) on line 310 with dex_ensure_has_cuda_context() and recompiling? I don't think it should matter, but maybe for some CUDA versions it does?
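
For anyone reading along, here is a rough, self-contained sketch of what a lazy-initialization helper along those lines could look like (illustrative only; check_cu and ensure_cuda_context are made-up names, not the actual contents of dexrt.cpp):

// Illustrative sketch, not the real dexrt.cpp code: make CUDA initialization
// lazy and idempotent, so any entry point can call ensure_cuda_context()
// instead of relying on a single cuInit(0) call elsewhere.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <mutex>

static void check_cu(CUresult res, const char* what) {
  if (res != CUDA_SUCCESS) {
    const char* msg = nullptr;
    cuGetErrorString(res, &msg);
    fprintf(stderr, "CUDA driver error at %s: %s\n", what, msg ? msg : "unknown");
    abort();
  }
}

// Hypothetical stand-in for dex_ensure_has_cuda_context(): initialize the
// driver once, then make device 0's primary context current on this thread
// if no context is current yet.
static void ensure_cuda_context() {
  static std::once_flag once;
  std::call_once(once, [] { check_cu(cuInit(0), "cuInit"); });
  CUcontext ctx = nullptr;
  check_cu(cuCtxGetCurrent(&ctx), "cuCtxGetCurrent");
  if (ctx == nullptr) {
    CUdevice dev;
    check_cu(cuDeviceGet(&dev, 0), "cuDeviceGet");
    check_cu(cuDevicePrimaryCtxRetain(&ctx, dev), "cuDevicePrimaryCtxRetain");
    check_cu(cuCtxSetCurrent(ctx), "cuCtxSetCurrent");
  }
}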

apaszke commented 2 years ago

And the issue you see in your second script looks like a legit bug inside the parallelization pass. I'll try to investigate it this week, but for now I opened #665.

tjpalmer commented 2 years ago

If it's accurate, the new error message after changing dexrt.cpp is more informative:

$ dex --backend llvm-cuda repl
CUDA driver error at cuInit (CUDA_ERROR_SYSTEM_DRIVER_MISMATCH): system has unsupported display driver / cuda driver combination
Aborted (core dumped)

Here is some other info:

$ nvidia-smi
Mon Oct 18 13:48:01 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   35C    P8    11W / 170W |      5MiB / 12053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
apaszke commented 2 years ago

Ah interesting! What kind of container are you using? Seems like it's an issue that some people have encountered before (e.g. here). I'd consider this to be a system misconfiguration on your side.
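
As a quick way to check for that kind of mismatch (illustrative snippet, not part of Dex), the driver API can report which CUDA version the installed user-mode driver library supports, which can then be compared against the host's nvidia-smi output and the toolkit the binary was built against:

// Illustrative diagnostic, not from the Dex sources. cuDriverGetVersion does
// not require cuInit, so it still works when initialization itself fails.
// A mismatch between the libcuda.so visible inside the container and the
// host's kernel driver is one common cause of CUDA_ERROR_SYSTEM_DRIVER_MISMATCH.
#include <cuda.h>
#include <cstdio>

int main() {
  int driver_version = 0;  // e.g. 11020 means CUDA 11.2
  if (cuDriverGetVersion(&driver_version) != CUDA_SUCCESS) {
    fprintf(stderr, "cuDriverGetVersion failed\n");
    return 1;
  }
  printf("driver supports CUDA %d.%d, compiled against CUDA %d.%d\n",
         driver_version / 1000, (driver_version % 1000) / 10,
         CUDA_VERSION / 1000, (CUDA_VERSION % 1000) / 10);
  return 0;
}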

tjpalmer commented 2 years ago

I'm using podman. For reference, here are the same commands on my host system.

$ nvidia-smi 
Mon Oct 18 08:00:37 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   29C    P8    11W / 170W |      5MiB / 12053MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1763      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

But in any case, I'll see if I get a chance to explore, and if I find anything meaningful, I'll add it to the other issue and/or write up a new one.

tjpalmer commented 2 years ago

(And if it's just an issue on my end, nothing further to add here, except for information for others, of course.)

ted-kord commented 2 years ago

tjpalmer's repl example works for me too, but trying the CUDA backend with sierpinski.dx and ode-integrator.dx now produces a different error. For example,

dex script ode-integrator.dx --backend llvm-cuda gives the error:

CUDA driver error at cuLaunchKernel (CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES): too many resources requested for launch
Aborted (core dumped)
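
For anyone hitting the same error, a small standalone check of the device's per-block launch limits (illustrative, not part of Dex) can help confirm whether the generated launch configuration simply asks for more threads, registers, or shared memory than the card allows:

// Illustrative only: CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES usually means the
// launch configuration exceeds a per-block device limit. Printing those
// limits is a first step in narrowing it down.
#include <cuda.h>
#include <cstdio>

int main() {
  if (cuInit(0) != CUDA_SUCCESS) { fprintf(stderr, "cuInit failed\n"); return 1; }
  CUdevice dev;
  if (cuDeviceGet(&dev, 0) != CUDA_SUCCESS) { fprintf(stderr, "cuDeviceGet failed\n"); return 1; }
  int max_threads = 0, max_regs = 0, max_smem = 0;
  cuDeviceGetAttribute(&max_threads, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);
  cuDeviceGetAttribute(&max_regs, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK, dev);
  cuDeviceGetAttribute(&max_smem, CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, dev);
  printf("max threads/block: %d, max registers/block: %d, max shared mem/block: %d bytes\n",
         max_threads, max_regs, max_smem);
  return 0;
}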

apaszke commented 2 years ago

@ted-kord can you please paste the output of nvidia-smi in this thread?

ted-kord commented 2 years ago

This is the output of nvidia-smi (attached as a screenshot): nvidia-smi-output

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Aug_15_21:14:11_PDT_2021
Cuda compilation tools, release 11.4, V11.4.120
Build cuda_11.4.r11.4/compiler.30300941_0

apaszke commented 2 years ago

Uh, sorry, I was wondering what card you have, but it's not visible in the screenshot.

ted-kord commented 2 years ago

It's NVIDIA Corporation TU106M [GeForce RTX 2060 Max-Q].

dougalm commented 1 year ago

Closing as obsolete