CUDA Support - Githubissues

ax3l commented 7 years ago

Hi ROOT team,

I am opening this issue to document and discuss our plan to add CUDA support to Cling. Maybe someone can open and link an issue on https://root.cern/bugs, I neither have a CERN nor an external account and the tracker is not public for people without registration :)

The discussion started after a post of mine in the cling-dev mailing list, where @Axel-Naumann picked me up and we spinned of a longer private discussion.

On 07.11.2016 10:34, Huebl, Axel wrote: [...] I am trying to compile a simple CUDA program, which is supported by clang already:

[...] internal include issues when starting cling via ./cling -x cuda --cuda-path=$CUDA_ROOT --cuda-gpu-arch=sm_35 -L$CUDA_ROOT/lib -lstdc++ -lcudart [...]

By now, @Axel-Naumann did already fix all issues during startup and we think we are now at the point where one could work on accessing clang's PTX emitter to generate PTX code and then pass it to the driver API.

$ cling -x cuda --cuda-gpu-arch=sm_35 -nocudainc
atexit not in Module!
at_quick_exit not in Module!

****************** CLING ******************
* Type C++ code and press enter to run it *
*             Type .q to exit             *
*******************************************
[cling]$

Clang currently translates CUDA code by linking the PTX code in a fat binary and passing it the CUDA driver during runtime, which generates SASS code (shader assembly) from it to execute. This is similar to what nvcc provides, besides that nvcc can additionally generate and link SASS code for a specific compute architecture directly, but that's not important here (see this page for further details).

From our discussions I understood, that cling already has similar functionality for e.g. PowerPC in place to target specific emitters and execute their assembler artifacts. Can you guide us how one could add the same functionality for PTX?

At GCoE Dresden, which is a collaboration of research groups in and around Dresden (Technical University, Max-Planck, Helmholtz-Zentrum Dresden-Rossendorf), we are currently discussing the possibilities for interactive simulations, RT profiling and tuning, teaching, rapid prototyping and interactive simulations and much more that one could get from a CUDA capable interpreter. Long story short: exciting possibilities!

From what I know about the routines in ROOT, there is no wide spread manycore or GPU acceleration available up to now. Adding CUDA support from cling will provide native CUDA support in your framework which is probably something that could be of interest from your side. Maybe you also want to build on that and directly add general manycore support in a more performance portable and abstract way, a topic on which we have experience, too.

We have currently one interested student that could work on the topic and any support and docs would be greatly appreciated. Two other groups from TU Dresden and Max-Planck also seemed interested and we might be able to contribute further resources (although that is not up to me). Due to our GCoE we also have a fruitful collaboration with Nvidia, which might be necessary, too.

CCing @harrism you might be interested in this thread.

ax3l commented 7 years ago

Internal CERN/ROOT Jira cross-link: https://sft.its.cern.ch/jira/browse/ROOT-8593

ax3l commented 6 years ago

On 27.09.2017 19:32, Simeon Ehrig (@SimeonEhrig) wrote on LLVM-dev:

Dear LLVM-Developers and Vinod Grover,

we are trying to extend the cling C++ interpreter (https://github.com/root-project/cling) with CUDA functionality for Nvidia GPUs.

I already developed a prototype based on OrcJIT and am seeking for feedback. I am currently a stuck with a runtime issue, on which my interpreter prototype fails to execute kernels with a CUDA runtime error.

=== How to use the prototype

This application interprets cuda runtime code. The program needs the whole cuda-program (.cu-file) and its pre-compiled device code (as fatbin) as an input:
command: cuda-interpreter [source].cu [kernels].fatbin
I also implemented an alternative mode, which is generating an object file. The object file can be linked (ld) to an exectuable. This mode is just implemented to check if the LLVM module generation works as expected. Activate it by changing the define INTERPRET from 1 to 0 .

=== Implementation

The prototype is based on the clang example in

https://github.com/llvm-mirror/clang/tree/master/examples/clang-interpreter

I also pushed the source code to github with the install instructions and examples: https://github.com/SimeonEhrig/CUDA-Runtime-Interpreter

The device code generation can be performed with either clang's CUDA frontend or NVCC to ptx.

Here is the workflow in five stages:

generating ptx device code (a kind of nvidia assembler)

translate ptx to sass (machine code of ptx)

generate a fatbinray (a kind of wrapper for the device code)

generate host code object file (use fatbinary as input)

link to executable

(The exact commands are stored in the commands.txt in the github repo)

The interpreter replaces the 4th and 5th step. It interprets the host code with pre-compiled device code as fatbinary. The fatbinary (Step 1 to 3) will be generated with the clang compiler and the nvidia tools ptxas and fatbinary.

=== Test Cases and Issues

You find the test sources on GitHub in the directory "example_prog".

Run the tests with cuda-interpeter and the two arguments as above:

[1] path to the source code in "example_prog"

note: even for host-only code, use the file-ending .cu

[2] path to the runtime .fatbin

note: needs the file ending .fatbin

a fatbin file is necessary, but if the program doesn't need a kernel, the content of the file will ignore

Note: As a prototype, the input is just static and barely checked yet.

hello.cu: simple c++ hello world program with cmath library call sqrt() -> works without problems

pthread_test.cu: c++ program, which starts a second thread -> works without problems

fat_memory.cu: use cuda library and allocate about 191 MB of VRAM. After the allocation, the program waits for 3 seconds, so you can check the memory usage with the nvidia-smi -> works without problems

runtime.cu: combine cuda library with a simple cuda kernel -> Generating an object file, which can be linked (see 5th call in commands above -> ld ...) to a working executable.

The last example has the following issues: Running the executable works fine. Interpreting the code instead does not work. The Cuda Runtime returns the error 8 (cudaErrorInvalidDeviceFunction) , the kernel failed.

Do you have any idea how to proceed?

Best regards, Simeon Ehrig

Axel-Naumann commented 6 years ago

Yup, no reaction from anyone for now... Do you know whether anyone has ever tried OrcJIT and CUDA? Any traces anywhere?

ax3l commented 6 years ago

Only the few answers we got on the list [1] [2] so not much yet but there seems to be no fundamental issue why it should not work.

SimeonEhrig commented 6 years ago

At the moment, I've developed three version of the cuda-interpreter. The first was with a MCJIT-backend. The second used OrcJIT and the third use a implementation from cling (incrementalExecutor). All implementations have the same issue. But I fond some "hacks" at the interpreter.cpp for the executer, so I think I can't simple extract it and use it for my project. Next up, I try to hack the cling-Frontend to accept a cuda-kernel-launch and load the precompiled device-kernel from a file.

Beside, I try to check out, if the jit process and the normal linking with ld are the same process. So I can say, if I can link a object-file generated from cuda with a normal linker I can interpreter it with a normal JIT too. I found no modification at ld or lld for cuda, so I believe the stages after generating the object-file from cuda-C++ or C++ are the same. If I can prove it, I can say my problem is a configuration issue or a bug at the execution engine and the problem is generally solvable.

ax3l commented 6 years ago

Just a note for the interested reader: we are still actively working on it, started a collaboration and @SimeonEhrig continuing as a Master's thesis on it starting next month.

Also, the prototype is running now :)

DavidPoliakoff commented 6 years ago

@ax3l @SimeonEhrig Have either of you tried using Jitify or NVRTC within Cling? I run into surprising problems when I attempt to do so. I load libraries, include headers, and try this program:

#include<jitify.hpp>
const char* program_source1 = "my_program\n __global__\n void my_kernel() {\n }\n";
static jitify::JitCache kernel_cache;
dim3 grid(1);
dim3 block(1);
jitify::detail::vector<std::string> nothing;
jitify::Program program = kernel_cache.program(program_source1, nothing, nothing);
program.kernel("my_kernel").instantiate();//.configure(grid,block).launch();

This goes smoothly through PTX creation, printing out valid PTX here, but then crashes somewhere within this module creation code.

If anybody has had success with using NVRTC, I'd be curious to hear, I'd love to have that capability.

ax3l commented 6 years ago

@DavidPoliakoff Hi, yes NVRTC and the high-level C++ library Jitify around it are alternative approaches to add new CUDA kernels at runtime. But it's not "single-source" in the sense that the OpenCL-style stringification is something one has to like :)

I think it would be awesome if jitify works with cling. Maybe you want to open an independent issue for it since it is, as an orthogonal approach, outside of our current scope of having a single-source, fully-C++ CUDA backend.

As a note, in case we succeed you can also embed the cling interpreter in a compiled program and pass strings to it - just as in Jitify :)

SimeonEhrig commented 6 years ago

@DavidPoliakoff Hi, there is also a technical difference between Jitify and our approach. Jitify use the CUDA Driver API and we want to use the CUDA Runtime API. For cling, two differences of the APIs are really important:

the Runtime API allows single-source code
the Driver API is fully C/C++ compliant, Runtime API not

I have tested the CUDA Driver API (without the NVRTC API) and it worked out of the box, because it is a simple C/C++ API. The Runtime API needs some modification of the cling (or compiler in general). Therefore, it doesn't work out of the box.

I agree with Axel. It would be a good idea, if you open a new issue. I think your kind of problem is another like mine. I hope, the problem will be fixed. It's sounds really interesting to use jitify with cling.

DavidPoliakoff commented 6 years ago

Hey @ax3l and @SimeonEhrig, appreciate the response. I'll file this as a separate issue. I recently found out about embedding Cling in a compiled program, and have a project that centers around using that and Jitify to JIT RAJA kernels for Sierra. You have a really amazing feature there that I hope you advertise loudly, it's enabling us to do some fairly interesting things.

I'll talk more in the other issue, but the use case I was looking for was really to use CUDA in a Jupyter notebook, using the aforementioned JIT work I can get around "single-source" problems by automatically generating the strings. I'm backing off on that use case, but if Cling ever supports NVRTC or Jitify I'd love to pick it back up.

Anyway, I'll open another issue, thanks again for your responses

ax3l commented 6 years ago

Thank you for the warm words! We will definitely keep you posted and update here as well. @SimeonEhrig 's MA thesis is still in progress :)

mnicely commented 5 years ago

Has there been anymore progress on this front?

ax3l commented 5 years ago

Sure thing! We have a working prototype in this repo and are currently stabilizing and refactoring it in #284

ax3l commented 5 years ago

It's also worth noting, although it's an orthogonal effort, that @DavidPoliakoff and @hfinkel independently just published a nice paper on arXiv on their project called ClangJIT: https://arxiv.org/abs/1904.08555 (congrats, btw! repo: https://github.com/hfinkel/llvm-project-cxxjit) Don't mismatch it with the full flavored REPL in cling, but their approach for heavily templated C++ code is nice and intriguing.

Sorry for not coming back on this in the last months, I was busy writing my PhD thesis and have my defense coming up soon.

mnicely commented 5 years ago

Axel,

Thanks for the feedback. I'm trying to write a notebook using OpenMP, OpenACC, BLAS, cuBLAS, and CUDA. Just started this morning but I was able to get cuBLAS to work correctly. I plan to work on OpenMP and BLAS later this week.

And of course, there are issues when launching the CUDA kernel.

ax3l commented 5 years ago

With notebook you mean jupyter notebook? Probably also depends on the kernel you use. Here is one in the repo and we usually take xeus bindings: https://github.com/QuantStack/xeus-cling/pull/169

Feel free to open problems you experience (with reproducible code) with the latest master or even better cross-checked with the updates in PR #284 as independent issues.

Axel-Naumann commented 3 years ago

@SimeonEhrig what's left to do here?

SimeonEhrig commented 3 years ago

@Axel-Naumann Some basic things like CUDA __constant__ memory. I don't have a complete list because I haven't had time to fix the known issues. Also some compiler flags are not working at the moment, which are common. For example, I'm stuck with my PR to support the --cuda-path argument. I'm trying to solve the failed root tests, but it's not any easier to get into. Maybe the LLVM 9 upgrade will fix some of the tests.

fecet commented 1 year ago

what's the cuda version cling currently support？I simply try cling -x cuda and get ”unsupport cuda version”

Cling is installed from conda-forge

SimeonEhrig commented 1 year ago

The conda-forge base on LLVM 9. Therefore it is CUDA 10.1 The master branch base on LLVM 13. Therefore it is CUDA 11.2

Here is an overview about the compatibility: https://gist.github.com/ax3l/9489132#clang--x-cuda

fecet commented 1 year ago

Edit: So the command should be cling --cuda-path=/opt/cuda-10.1 --cuda-gpu-arch=sm_70 -x cuda -L /opt/cuda-10.1/lib

root-project / cling

CUDA Support #115