rayon-rs / rayon

Rayon: A data parallelism library for Rust

Automatic GPU offloading #798

Open jon-chuang opened 4 years ago

jon-chuang commented 4 years ago

It seems to me that rayon's awareness of iterators, types and mutability ought to make automatic GPU offloading a feasible project.

An idea would be to use compiler plugins/proc-macros to precompile some of rayon's data structures: parse the par_iter chain (by brute force, e.g. pattern matching on the AST) into a kernel plus data movement, and automatically generate the necessary wgpu-rs code. By isolating the kernel and precompiling it to LLVM-IR (using some off-the-shelf solution), one could do SPIR-V code generation at compile time. It should be possible to extract all the needed info from the LLVM-IR and inline everything into a single kernel.

One can also look into composing kernels the same way iterators can be composed. Currently, is rayon able to compose closures such as .map(|x| x*2).map(|x| if x % 32 == 0 { x } else { x + 1})?
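
For concreteness, this is the kind of chained pipeline I mean; rayon already composes such adapters lazily on the CPU side, so both closures run back-to-back per element. A minimal sketch:

```rust
use rayon::prelude::*;

fn main() {
    // Two chained maps compose into one lazy pipeline: each element
    // passes through both closures in a single traversal, in parallel.
    let out: Vec<i32> = (0..1_000_000)
        .into_par_iter()
        .map(|x| x * 2)
        .map(|x| if x % 32 == 0 { x } else { x + 1 })
        .collect();
    assert_eq!(out[1], 3); // 1 * 2 = 2; 2 % 32 != 0, so 2 + 1 = 3
}
```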

One can also manually compose functions e.g. here

@calebwin maybe relevant to you

Walther commented 4 years ago

This could be super interesting! I have a toy raytracer project where I would be delighted to play around with this. Sadly, I don't have much knowledge of rayon's or SPIR-V's internals, or of general-purpose GPU compute kernels, so I wouldn't be of much use on the implementation side without significant guidance.

In a broader perspective, it would be great to be able to write GPU-accelerated Rust without having to write too much in some GPU-specific DSL or framework - even if the computation would need to be designed with a bit of GPU "style" in mind.

dlight commented 4 years ago

OpenMP, which is similar in spirit to Rayon, is able to offload portions of C and C++ programs to GPUs. Maybe Rayon could leverage the existing OpenMP infrastructure instead of building one from scratch. Being able to compile Rust to SPIR-V, nvptx, or amdgpu would also be needed, though.

There's also SYCL, which leverages OpenCL to provide a high-level way to build heterogeneous C++ programs that run on both CPU and GPU; it could be an interesting model for a Rust library to be based on.

Walther commented 4 years ago

Possibly relevant comment in this issue https://github.com/rayon-rs/rayon/issues/778

There's an unstable target for nvptx64-nvidia-cuda, but it doesn't seem like anyone is working on it. Here's the tracking bug: rust-lang/rust#38789

Walther commented 4 years ago

Related and probably of interest: https://github.com/EmbarkStudios/rust-gpu

rcarson3 commented 3 years ago

As someone who's spent their fair share of time working with HPC applications and the various abstraction layers/libraries used to get things onto the GPU in C++ with a real speed boost, it doesn't appear to me that Rayon is an appropriate crate for such GPU offloading. Another crate would be more suitable, one capable at the very least of fusing small kernels together and targeting various backends (CUDA, HIP, SYCL, etc.). Data management between device and host would probably also need to be its own crate, given how complex an issue that can be. You could look at what's being done over in C++ land for ideas on how to accomplish this (CARE, Umpire, RAJA, Kokkos, and many others). Honestly, the field is evolving quite a bit right now in terms of what these abstraction layers should look like, and we're still figuring out ways to improve them and make them easier to use.

dlight commented 3 years ago

But why can't rayon just replicate whatever OpenMP does? Down to using the same LLVM mechanisms that implement OpenMP.

This is already done! You don't need to bring CUDA or other stuff into the picture.

Here's a link: Can OpenMP be used for GPUs?

cuviper commented 3 years ago

Rayon is a pure library, with no compiler integration at all.

rcarson3 commented 3 years ago

So, I'm well aware of the use of OpenMP for target offloading. It's pretty much the only reasonable way to get Fortran code to run on the GPU, and you can even get fairly decent speed-ups using OpenMP target offloading if you do everything right. However, OpenMP is deeply integrated with compiler internals - the C/C++/Fortran versions all require compiler directives (#pragma / !$OMP), for example. I don't know many people who like to monkey around with compiler intrinsics in order to get things to work.

If the Rayon team wanted to, they definitely could go down that route, but I would imagine it would also require working more closely with the Rust compiler team to get things up and going. Alternatively, another crate could be formed to take up that task.

dlight commented 3 years ago

@cuviper yes, but perhaps it should hook into the compiler. Other languages do this, why not Rust?

(But in this case, this issue is filed at the wrong repo)

dlight commented 3 years ago

But another route would be to write a Rayon-like GPU library with rust-gpu, and make it an optional dependency of Rayon.

This sounds more actionable (as in, work on an MVP could be started right now), but currently rust-gpu is very restricted.

jon-chuang commented 3 years ago

@dlight I am in agreement; I think rust-gpu would be a very good route, and one could even start experimenting immediately with integrating it into the rayon interface on simple examples. I think it would be really cool if an FMA could immediately work with a rayon branch. @cuviper thoughts on opening an experimental branch on this repo for that once some work is underway?
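
To make the FMA idea concrete, the target workload might look like the sketch below - plain rayon on CPU threads today, where the hope is that a rust-gpu-backed branch could lower the same closure to a SPIR-V kernel (the fma helper here is just an illustration, not an existing API):

```rust
use rayon::prelude::*;

// Illustrative element-wise fused multiply-add over slices. This is
// plain rayon on CPU threads; the experiment proposed above would
// lower the same closure to a SPIR-V kernel via rust-gpu.
fn fma(a: &[f32], b: &[f32], c: &[f32]) -> Vec<f32> {
    a.par_iter()
        .zip(b.par_iter())
        .zip(c.par_iter())
        .map(|((&a, &b), &c)| a.mul_add(b, c)) // a * b + c per element
        .collect()
}

fn main() {
    let out = fma(&[1.0, 2.0], &[3.0, 4.0], &[5.0, 6.0]);
    assert_eq!(out, vec![8.0, 14.0]); // 1*3+5 = 8, 2*4+6 = 14
}
```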

@rcarson3 I think in this case kernel fusion could possibly be automatic; it depends on how inlining is handled. But we'd probably need to look in greater depth into how rust-gpu handles functions in SPIR-V.

In the case of rustc's PTX backend, PTX functions are generated; I think these are essentially fused kernels, since there is no data movement between device memory and registers between function calls.

Syntactically speaking, in terms of rayon, this is handled by chaining ops on a fixed iterator. Branching combinators like .filter(..) or step(..) will lead to thread divergence, however. Special code like thresholding, or a specialised fn like pooled_filter/pooled_step, could help make them more efficient, for instance by generating a new iterator in which the selected data is temporarily buffered, sacrificing data movement to increase core efficiency (see the sketch below). The user could then toggle whichever is better for their use case.
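
A rough sketch of that buffering idea, assuming a hypothetical pooled_filter and emulating its two-phase shape on the CPU with plain rayon:

```rust
use rayon::prelude::*;

// Sketch of the buffering idea: rather than letting every thread branch
// on the predicate (divergence on a GPU), first compact the selected
// elements into a dense buffer, then run the hot closure branch-free
// over the dense data - trading extra data movement for uniform work.
// `pooled_filter` itself is hypothetical; this only emulates its shape.
fn pooled_filter_map(data: &[u32]) -> Vec<u32> {
    // Phase 1: compaction (the extra data movement).
    let selected: Vec<u32> = data
        .par_iter()
        .copied()
        .filter(|x| x % 2 == 0)
        .collect();
    // Phase 2: uniform, branch-free work over the dense buffer.
    selected.par_iter().map(|x| x * 3 + 1).collect()
}

fn main() {
    assert_eq!(pooled_filter_map(&[1, 2, 3, 4]), vec![7, 13]); // 2*3+1, 4*3+1
}
```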

In terms of partitioning the data load across heterogeneous, multi-GPU/mixed CPU/GPU systems, one may want to rely on cached runtime profiling, since we are not guaranteed near-homogeneous performance across cores as in the multicore case - which is all rayon is currently designed for.

@rcarson3 in terms of host code, wgpu already does this for the SPIR-V kernels that rust-gpu generates.

rcarson3 commented 3 years ago

@jon-chuang it might make more sense to move this thread over to another repo (such as the rust-cuda/wg) or maybe start a new one up rather than keeping things here in this Rayon thread.

Also, I can't say I'm really familiar with how wgpu does things, since it's never been on my radar for use in scientific computing. In that field, a majority of the work has until recently been geared towards CUDA, although things might now change with HIP and SYCL being a thing. I probably won't have the time to invest in this new Rust library, but I would strongly suggest looking into what's already been done in C++ for these types of things. A lot of time and research has been invested over there, and it wouldn't hurt to use it as a starting point for a Rust library.

jon-chuang commented 3 years ago

@rcarson3 sorry, I don't appreciate your bland and uninsightful opinions

cuviper commented 3 years ago

@jon-chuang - please keep it civil here. Better to say nothing if you have no constructive response.

cuviper commented 3 years ago

> @cuviper thoughts on opening an experimental branch on this repo for that once some work is underway?

Anyone is free to experiment and collaborate on branches of their own fork. Once you have something to show, we can talk about whether that has an integration path back into the main Rayon.

Jasper-Bekkers commented 3 years ago

While everybody in the rust-gpu project is currently on holidays - feel free to reach out to us on our public Discord server (https://discord.gg/dAuKfZS) to see if we can align on this project.

jon-chuang commented 3 years ago

Some interesting resources on GPU work stealing:

(2011): http://www.owlnet.rice.edu/~jmg3/wstgpu.pdf
(2016): https://pavanbalaji.github.io/pubs/2016/ccpe/ccpe16.ga_gpu.pdf

Challenges: https://distributed.dask.org/en/latest/work-stealing.html

Some arguments against work stealing for the GPU: GPU jobs are usually quite regular and avoid branches, which makes them fairly uniform in size. Further, communication between GPU and CPU is much more expensive than communication between CPU cores. Although this can be managed by a GPU-dedicated worker thread, the ability to successfully balance and offload computations to the GPU depends on job size and execution time. Obviously, if a job executes quickly on the CPUs alone, the user should not opt for the CPU+GPU version. Still, this raises the question of a scheduler that is aware of the potential cost in advance; I am not sure of the wisdom of building setup-specific microbenchmarking (as opposed to function-specific).

The problem is that it is hard to imagine how to saturate the GPU this way. For one, if the worker needs to wait for a job to complete before stealing the next job, there is the overhead of sending data to the GPU and then sending it back for every job.

It would be worth looking into what SYCL does, in particular hipSYCL, since I know it has made efforts to schedule CPU and GPU operations in conjunction.

LifeIsStrange commented 3 years ago

For inspiration, TornadoVM is likely the state of the art for the JVM; one of its strengths is that it is free to choose between offloading a workload to the GPU or to other accelerators such as AVX vector units.

Corallus-Caninus commented 2 years ago

TensorFlow may be interesting for this, since kernels can be lazily evaluated into a single graph/scope, using something similar to the thread_local macro for TensorFlow sessions. tensorflow-rs would need ops::constant to impl Into and From for native types (i32, etc.), plus impls mapping ops::add/ops::mul to Rust's std::ops::Add etc. If done correctly this could be efficient, since every iter would already be loaded into the graph at runtime.
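
Roughly, the idea is that ordinary Rust arithmetic would build graph nodes instead of computing values. A toy sketch of that shape (hypothetical types, not the real tensorflow-rs API):

```rust
use std::ops::{Add, Mul};

// Toy lazy expression node standing in for a TensorFlow graph op.
// Hypothetical type; the real tensorflow-rs `ops` module differs.
#[derive(Debug, Clone)]
enum Expr {
    Const(i32),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

// The `ops::constant` role: lift a native value into the graph.
impl From<i32> for Expr {
    fn from(v: i32) -> Self {
        Expr::Const(v)
    }
}

// The std::ops impls record nodes rather than computing anything.
impl Add for Expr {
    type Output = Expr;
    fn add(self, rhs: Expr) -> Expr {
        Expr::Add(Box::new(self), Box::new(rhs))
    }
}

impl Mul for Expr {
    type Output = Expr;
    fn mul(self, rhs: Expr) -> Expr {
        Expr::Mul(Box::new(self), Box::new(rhs))
    }
}

fn main() {
    // `2 * x + 1` is captured as a graph, to be evaluated lazily later.
    let x = Expr::from(3);
    let graph = Expr::from(2) * x + Expr::from(1);
    println!("{graph:?}"); // Add(Mul(Const(2), Const(3)), Const(1))
}
```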

eliasboegel commented 1 year ago

I must agree with @rcarson3 here. There has been a monumental amount of funded work on this topic, taking several approaches. I think the best starting point would be to have a look at Kokkos, particularly as it has essentially achieved mapping of a single-source kernel to both CPUs and GPUs with a library-only approach. A compiler-based approach (e.g. SYCL) is another option, which for Rust is somewhat easier to deal with than for C++, since in Rust only a single compiler is widely used instead of several different vendor compilers. Kokkos (and the other libraries @rcarson3 mentioned) enjoy the benefits of a lot of engineering work, with many lessons learned along the way from many different architectures. Moreover, there is some cooperation between hardware vendors and the developers of the large parallel programming models. I don't see any reason to start design work from scratch instead of aligning with the existing, mature work on this topic.