japaric commented 7 years ago

The NVPTX backend has been available since: nightly-2017-01-XX

This is a collections of bugs and TODOs related to it.

Documentation

How to: Run Rust code on your NVIDIA GPU

Bugs

LLVM assertion when compiling core to PTX. #38824
LLVM error when emitting PTX code with debuginfo. #38785
NVPTX: No "undefined reference" error is raised when it should be. #38786
NVPTX: non-inlined functions can't be used cross crate. #38787

Missing features

The equivalent to CUDA's __shared__ modifier. Probably needs an RFC to land in the compiler as we don't have anything similar to it (AFAIK).

Stabilization

Stabilize the nvptx targets. IOW, add them to the compiler. Candidates for merging.
All the non-trivial kernels make use of intrinsics like blockIdx.x. These will have to be stabilized. Right now these intrinsics are implemented as "plaform-intrinsics" but that feature is unstable.
Stabilize the "ptx-kernel" ABI. #38788

cc @rkruppe

gnzlbg commented 6 years ago

It would be better if instead of having to create a new module / crate, adding the #![feature(abi_ptx)] feature, and then declaring the kernels as extern "ptx-kernel" fn foo() {}, we could just handle this using #[target_feature], so that one can add kernels within a non-kernel Rust module. It would be great if this attribute could be applied to closures as well.

That way we could just write our kernels inline with normal Rust code.

#[target_feature(enabled = "ptx_device_fn")] unsafe fn a_device_fn(...) { ... }
#[target_feature(enabled = "ptx_kernel")] unsafe fn a_kernel(...) { ... a_device_fn(...) ... }

unsafe fn bar() {
   cuda::driver::launch(x, y, z, w, a_kernel);
   cuda::driver::launch(x, y, z, w, #[target_feature(enabled = "ptx_kernel")] |...| a_kernel(...) );
}

This way users can use the typical target_feature facilities to write portable code.

gnzlbg commented 6 years ago

@japaric @rkruppe @alexcrichton I'd like to work on this.

hanna-kruppe commented 6 years ago

You seem to assume a "single source" model? We don't have that currently. You have to compile your kernels as one crate for a nvptx-* target, then embed the resulting PTX asm as string/resource into a crate that's compiled for the host target. The latter crate passes it to the driver for compilation. There's no way to mix host and device code in one compilation unit like nvcc allows.

Supporting that would require novel frontend integration (novel for Rust; clang has something like this already). For example rustc would have to decide for each translation item whether it should be compiled for the host, for the device, or both -- and then combining the resulting PTX and host object files.

Additionally, even if/once we have "single source", target_feature seems inappropriate, since target_feature is for modifying the available instruction set within one target, while this here requires compiling the code for a different target altogether (and in many cases, for two targets as mentioned before). A custom attribute might be more appropriate but since this is really about "entry point for device code or not" rather than "can run on device or not" an ABI seems fine too.

gnzlbg commented 6 years ago

Additionally, even if/once we have "single source", target_feature seems inappropriate, since target_feature is for modifying the available instruction set within one target,

Indeed, makes sense.

Supporting that would require novel frontend integration (novel for Rust; clang has something like this already). For example rustc would have to decide for each translation item whether it should be compiled for the host, for the device, or both -- and then combining the resulting PTX and host object files.

A custom attribute might be more appropriate but since this is really about "entry point for device code or not" rather than "can run on device or not" an ABI seems fine too.

I'd like to work on enabling this via a mixture of the ABI solution to choose, e.g., the ptx or ptx-kernel ABIs, and target_feature to choose, e.g., sm30 vs sm70.

If we could have these multiple ABIs into a single source file, we could have #[target_device] and #[target_device_kernel] procedural macros that just generates copies of a function for different ABIs:

#[inline]
fn baz() { }

#[target_device(host, nvptx(sm = "40"), spirv(version = "1.0"))]
fn bar(...) { ... }

#[target_device_kernel(host, nvptx(sm = "40"), spirv(version = "1.0"))]
fn foo(...) { 
  #[device] bar(...); // device attribute indicates that this fn is a device fn
  baz(); // this function will be used as is
}

that expand to:

fn baz_host(...) { bar_host(...); baz(); }
#[target_feature(enabled = "sm40")]
extern "ptx" fn baz_nvptx(...) { bar_nvptx(...); baz();  }
extern "spriv" fn baz_spirv(...) { bar_spirv(...); baz();  }

fn foo_host(...) { bar_host(...); baz(); }
#[target_feature(enabled = "sm40")]
extern "ptx-kernel" fn foo_nvptx(...) { bar_nvptx(...); baz();  }
extern "spirv-kernell" fn foo_spirv(...) { bar_device(...); baz();  }

And then just launch the kernels using another procedural macro, e.g., kernel_launch!(foo, args...).

One cool feature of clang and nvcc is to allow whoever builds the library to easily choose the devices to target. Procedural macros could allow these via feature flags: --features target_device_nvptx_sm_35, nvptx_sm_70, ... (it's not nice, but should be at least doable).

This approach leaves the door open to doing something nicer in the language in the future, while allowing libraries to experiment with better APIs. I wonder whether these two building blocks (extern ABIs in a single source file, and #[target_feature]) are a good way to do this, and whether this is something that a nicer approach is going to need in one form or another, or whether this approach is completely wrong and there is a better alternative.

hanna-kruppe commented 6 years ago

Frankly, I don't see how any of the tools we have in the language now (target_feature, proc macros, ABIs) can help at all with single source support. Right now, one crate is compiled for one target, period. Subsets of the crate can tweak some parts of the target (e.g., use non-standard ABIs or enable/disable instruction set extensions) but that's a far cry from slicing out a subset of the crate, compiling it for a completely different target, and then stitching the results back together -- and that's precisely what is necessary for single-source offloading (not just CUDA, but also everything else along these lines that I've seen).

In fact the assumption that one crate == one target goes as far as rustc being literally unable to store more than one target per compilation session (Session). Not to mention how you need to generate a whole separate LLVM module for the PTX code.

Even if one attempts to minimize the amount of compiler changes needed during prototyping for faster iteration (generally a good idea) by e.g. splitting the crate into two crates with an external tool and invoking rustc twice, there is ample room for compiler hacking. Even the bare minimum of single source support requires name resolution information, and being able to use generic library code will require type system integration as well.

So in my opinion, this is a rather big feature with at least as much need for experimentation and compiler hacking and design work as SIMD intrinsics. I say this not to discourage you but because your posts so far ignore the technical challenges that are, in my opinion, the biggest obstacle to single source support.

I'm also rather puzzled by the priorities here. Before experimenting with the best way to allow users to compile their single-source applications not just for multiple CUDA devices but also for entirely different targets like SPIR-V, basic features like an equivalent to __shared__ seem like a simpler and more important first step. Again, not trying to discourage, but even the greatest most ergonomic portable offloading solution seems pointless if the kernels can't even use group-shared memory.

gnzlbg commented 6 years ago

Frankly, I don't see how any of the tools we have in the language now (target_feature, proc macros, ABIs) can help at all with single source support.

Oh no, I think I expressed myself wrong. I meant that once we get single source support using extern ABIs, the combination of tools that we have already available in the language can allow for pretty nice APIs.

basic features like an equivalent to shared seem like a simpler and more important first step.

I think that __shared__ can be implemented as a core::intrinsic and is thus not a big deal: fn shared() -> *mut u8 would do.

hanna-kruppe commented 6 years ago

I meant that once we get single source support using extern ABIs,

What does this mean? I am not aware of any plans for any kind of single source support. And what does "using extern ABIs" mean? It seems to presuppose some strategy for single source support but it's not clear to me which one (and it doesn't sound like any of the strategies that I am aware of). Finally, assuming I'm correct that single source support is not on the horizon, I'm puzzled why we're hashing out details of how it could be exposed better to the user if the basic technical prerequsites aren't even on the horizon.

I think that shared can be implemented as a core::intrinsic and is thus not a big deal: fn shared() -> *mut u8 would do.

IIUC such an intrinsic would be basically like an alloca intrinsic, which Rust has rejected in favor of better support for unsized (DST) values. So while this is a possible strategy (although I can think of some technical challenges as well) that a prototype implementation might choose, it is far from clear to me that it's the approach we'd want to adopt.

gnzlbg commented 6 years ago

IIUC such an intrinsic would be basically like an alloca intrinsic

How so? The kernel does not allocate anything: __shared__ just initializes a pointer with a value.

Finally, assuming I'm correct that single source support is not on the horizon

I'd like to work on enabling single source support and I'd like to enable it in such a way that it is useful.

And what does "using extern ABIs" mean?

In a single source model:

extern "ptx-kernel" unsafe fn foo(...) { ... is compiled to a ptx kernel ... }
fn bar(...) { ... is compiled for the host ... }

hanna-kruppe commented 6 years ago

How so? The kernel does not allocate anything: shared just initializes a pointer with a value.

It's a storage specifies in C parlance. You declare variables to live in shared memory as opposed to thread-private memory or global memory or constant memory. In pointer types it's just an optional hint that the pointee lives in shared memory, that aspect isn't even needed. What is absolutely necessary is to be able to do declare locals like __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; (from http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory) which is like let As: [[f32; BLOCK_SIZE]; BLOCK_SIZE]; except there's just one array per thread group, accessed by all threads in the group.

gnzlbg commented 6 years ago

It's a storage specifies in C parlance.

Sure, but what's the point of making it a storage specifier in Rust? You can't have two variables on shared memory, that is, the following is not valid CUDA C:

__global__ void foo(float* foo) {
  __shared__ a float[];
  __shared__ b float[];  // ERROR: you can only have one pointer to shared memory per kernel
  foo[0] = a[0] + b[t0]; 
}

What is absolutely necessary is to be able to do declare locals like shared float As[BLOCK_SIZE][BLOCK_SIZE]; (from http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory)

In particular, here you are not allocating a float [BLOCK_SIZE][BLOCK_SIZE]; region of shared memory, but assigning the pointer to shared memory to a float [BLOCK_SIZE][BLOCK_SIZE] array (C arrays are just pointers). This might lead to UB if no memory actually has been allocated, or if some other kernel reinterpreted it to have another type, or if less memory than the one required here was allocated, etc.

So IMO, independently of what the spec says, we should focus on the actual semantics of __shared__. The only thing __shared__ does is initializing some pointer to point to the shared memory region with the restriction that one cannot initialize two pointers to it. Nothing more, nothing less. It does not allocate anything, it does not guarantee that any memory exists, or was allocated, or that the data in the memory has some "type", ... no nothin'.

This

extern __shared__ a float[];

is just:

float* a = __get_ptr_to_shared_memory();

hanna-kruppe commented 6 years ago

I want to point out that we're pretty badly derailing this metabug. You should probably open a thread on internals.rlo if you want to take this discussion much further.

But first I would invite you to double check your facts. Many things you've said go against everything I've ever heard and seen about CUDA (and other offloading solutions, for that matter).

Sure, but what's the point of making it a storage specifier in Rust?

I'm not saying it should be a storage specifier in Rust. We don't even have such a concept at the moment. I'm just saying, it's fundamentally a variation on variable declaration (like DST locals), not something about pointers.

You can't have two variables on shared memory, that is, the following is not valid CUDA C:

???? Check the code I linked earlier, it has at least two such variables. Or pretty a random non-trivial CUDA program using shared memory. I don't know why the code you give is rejected, but I've never even seen this syntax so I'm not even sure what it means.

For all you know, As[0] will fail because no shared memory actually has been allocated.

Are you saying Nvidia's examples, as well as all the other programs using shared memory, are effectively borked? That can't be right. In fact, I'm pretty sure the size of the is recorded (provided it has a size -- again I don't know wth __shared__ a float[]; is) in the binary and the scheduler makes sure to only place as many thread groups on one core as the available shared memory permits. I know this because it means the amount of shared memory you use impacts occupancy and thus performance.

That reminds me, another way in which such intrinsic for __shared__ would be weird is that it should probably require the size to be a compile time constant (unlike alloca).

gnzlbg commented 6 years ago

but I've never even seen this syntax so I'm not even sure what it means.

That syntax is the dynamic shared memory allocation syntax.

That reminds me, another way in which such intrinsic for shared would be weird is that it should probably require the size to be a compile time constant (unlike alloca).

The size of the shared memory region can be specified at run-time, at least in CUDA.

Reading through the docs of the example you mention, if the size of the variables allocated in shared memory are compile-time constants, and no dynamic shared memory allocation occurs, it looks like one does not need to specify the memory to allocate during kernel launch because the compiler does it for you (but I've always used dynamic shared memory so I am not sure).

hanna-kruppe commented 6 years ago

Good to know that there's a dynamic allocation strategy as well. But it seems that you still specify the size, just at kernel invocation time? That seems like it would still allow the driver to make sure enough memory is available (i.e., as much memory as the kernel invocation specified; of course the kernel still needs to obtain and use that number correctly). For dynamic an intrinsic might be good, but since static shared memory allocation seems extremely common, we'd probably want to support it as well and an intrinsic can't really do that (well).

gnzlbg commented 6 years ago

But it seems that you still specify the size, just at kernel invocation time?

Yes.

That seems like it would still allow the driver to make sure enough memory is available (i.e., as much memory as the kernel invocation specified; of course the kernel still needs to obtain and use that number correctly).

Exactly. The typical way in which this is used is by passing something that correlates with the allocated size as a run-time argument to the kernel.

For dynamic an intrinsic might be good, but since static shared memory allocation seems extremely common, we'd probably want to support it as well and an intrinsic can't really do that (well).

Yes definitely. Since dynamic shared memory is more powerful (it allows doing everything that can be done with static shared memory and some more), has no drawbacks over static shared memory beyond ergonomics (shared memory is always allocated at run-time, whether the size is known are compile-time or not is pretty much irrelevant), and can probably just be an nvptx intrinsic that returns a pointer to the shared memory region, I think that implementing support for it would be the tiniest incremental step that delivers the most value.

Adding support for __shared__ static memory would be a nice ergonomic addition. If you have device-only functions that need to be called from kernels, using static shared memory in them allows you to bump the size of the allocated memory region transparently (using dynamic shared memory these functions would need to take as argument a pointer into a suitable part of the shared memory region).

AFAIK only fixed-size arrays are allowed in static shared memory and the memory must be uninitialized. So while something like let floats: #[shared] [f32; N] = mem::unitialized(); might work, I think here it would be better to just provide an nvptx::shared_array<[T; N]>type, implemented using compiler magic to put it always on static shared memory, that provides a minimal API that makes sense for shared memory array since things like bounds checking by default (as provided by arrays) make little sense in device kernels where you might not have a way to panic, abort, print anything, etc.

lilith commented 6 years ago

Note that I'm interested in funding work on this: https://internals.rust-lang.org/t/nvptx-funding/7441

I'd like to get this to work out of the box on nightly.

Jasper-Bekkers commented 6 years ago

Hi,

I've started using the NVPTX backend for some simple experiments, I'm listing my experiences here so far since I don't know what the proper protocol is. We can turn these into specific issues on the right repo's later on.

[ ] Name mangling is sometimes incorrect because the LLVM backend outputs names with . in some cases and PTX only accepts [a-zA-Z0-9_$]+

pub struct MyStruct {
    data: u32,
}

impl PartialEq for MyStruct {
    fn eq(&self, other: &Self) -> bool {
        return self.data == other.data;
    }
}

Leads to invalid PTX since symbols are being generated with dots in them:


.visible .func  (.param .b32 func_retval0) _ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE(
    .param .b64 _ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_0,
    .param .b64 _ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_1
)
{
    .reg .pred  %p<2>;
    .reg .b32   %r<4>;
    .reg .b64   %rd<3>;

    ld.param.u64    %rd1, [_ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_0];
    ld.u32  %r1, [%rd1];
    ld.param.u64    %rd2, [_ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_1];
    ld.u32  %r2, [%rd2];
    setp.eq.s32     %p1, %r1, %r2;
    selp.u32    %r3, 1, 0, %p1;
    st.param.b32    [func_retval0+0], %r3;
    ret;
}

I haven't looked much into it but I have a feeling that it's due to https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_utils/symbol_names.rs#L424

[ ] So far I've only been able to get this to run and compile with the nightly-2018-02-12 compiler (due to this PR https://github.com/japaric/core64/pull/4 ) anything newer seems to spew quite a few compile errors after trying to manually create my own core64 lib.
[ ] It would be nice to see if we can add a "kernel" attribute to expose some (all) of these https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#performance-tuning-directives to tweak occupancy & register counts similar to CUDA
[ ] Find a way to expose vector loads & stores (ld.v4.f32 etc) since they're significantly faster. One way would be through the platform intrinsics
[ ] Expose __shared__ memory (as per discussion previous) though preferably as static kernel side memory - though ideally both options are available since this too factors into the kernel occupancy
[ ] Expose __constant__ memory similar to __shared__.
[ ] Expose cross-lane swizzle ops (shfl.bfly, shfl.up etc) as platform intrinsics
[ ] Right now the target json determines the sm target machine and fixes it at sm_20 which is extremely dated (Fermi gpus). It would be nice if this was more flexible - though for now patching the json works too.
[ ] I have some confusion whether repr(C) would be good enough of a layout to share data between GPU and CPU or if we'd eventually need another one.
[ ] It's not entirely clear to me if __syncthreads is also a compiler barrier or not but it ought to be since the compiler shouldn't schedule memory requests across it.

On the bright side: this has been a really pleasant GPU programming experience so far (other then actually getting it set up) because it's extremely valuable to share the same codebase between CPU and GPU.

bheisler commented 6 years ago

I have not been able to compile to PTX with cargo or xargo, I've only been able to do so using accel. Therefore, some of the following may be issues with Accel. That seems unlikely, so I'll report them here.

I see an internal compiler error if the PTX code references the core::intrinsics::sinf32 or cosf32 functions
INVALID_PTX error if I index into a slice (I saw the same error if I used get_unchecked) in the PTX code
INVALID_PTX error if I use the for x in 0..y loop in the PTX code
INVALID_PTX error if any struct in the PTX code implements core::ops traits like Add or Mul
INVALID_PTX error if any struct in the PTX code derives Debug or Display

Most of these are probably due to references to missing functions in the final PTX, but they'll need to be dealt with somehow.

I am interested in contributing to improve the state of GPGPU in Rust. Not sure where to start.

termoshtt commented 6 years ago

Name mangling is sometimes incorrect Leads to invalid PTX since symbols are being generated with dots in them:

I've met this issue while developing accel , and it prevent me to use libcore for nvptx target. accel cannot link libcore or other std libraries currently.

I recently start to write a patch to rustc to enable nvptx target.

I haven't looked much into it but I have a feeling that it's due to https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_utils/symbol_names.rs#L424

this seems to be a good information for me :)

bheisler commented 6 years ago

53099 is relevant here. I haven't been able to compile any kernel recently, because of a segfault while compiling `libcore`. This happens on all machines where I've tried it (Windows and Linux). However, other people say that they've been able to compile NVPTX kernels without running into this issue. Has anyone else seen this?

denzp commented 6 years ago

@bheisler I believe it somehow related to definition json. It doesn't happend to me with json from ptx-linker, but I saw the problem with another one.

denzp commented 6 years ago

I'm finally proud to announce my progress on CUDA integration. I've made several tools to ease development and currently working on a tutorial and high-level crate (it will probably be a custom rustc driver because compiler plugins are de-facto deprecated).

First one is a ptx-linker that solves several important problems:

recently mentioned issue about dots in functions or global variables names,
38786 "undefined references" validation,
38787 linking of multiple LLVM modules.

I started work on the linker about a year ago, and today achieved important milestone: it doesn't depend on any external tools and libs anymore. So end users don't need to care about matching Rust's and system's LLVM versions (which became a problem when Rust switched to LLVM 7.0).

The second crate is a ptx-builder that improves development convenience dramatically. It's a build.rs helper that ensures all needed tools (xargo and ptx-linker atm) are present. It also manages a build environment and xargo runs.

Also worth checking, an incomplete tutorial about CUDA development flow with more or less real example. The tutorial evolved as mentioned before tools did. And more chapters are yet to come :) Sometimes I run tests from there, to ensure PTX assembly is correct and the whole thing still works :)

termoshtt commented 6 years ago

@denzp Can ptx-linker link with libcore? I am creating a toolchain to link libcore using llvm-link in rust-accel/nvptx. Linking of libcore will cause the symbol name issue as reported by @Jasper-Bekkers due to the difference between GAS and PTX, and I avoid it by rewriting librustc_codegen_utils/symbol_names.rs.

denzp commented 6 years ago

@termoshtt The linker suppose to fix this, it has a special "pass" that does renaming. The problem can happen not only with libcore though: consts or structs can also produce invalid PTX sometimes:

src_image.pixel(i, j)

call.uni (retval0), 
_ZN32_$LT$example..Image$LT$T$GT$$GT$5pixel17h81db5ad692bcf640E, 
(param0, param1, param2);

I found the linker robust enough about solving the issue. But still, I'd probably prefer this to be fixed in rustc.

bheisler commented 6 years ago

@denzp - I don't think it's caused by the target JSON, unfortunately. When I add the obj-is-bitcode flag to my target file it compiles without segfaulting, but that means it no longer writes PTX files but instead writes LLVM bitcode.

denzp commented 6 years ago

@bheisler I can confirm that obj-is-bitcode indeed helps to avoid segfault (that's the reason I've never seen the problem before, I always use the flag). But it doesn't affect assembly file creation for me. Let's move further discussion into #53099

juchiast commented 6 years ago

38824 is closed about a month ago. I think we can remove it from the list?

steveklabnik commented 5 years ago

Triage: this is a metabug. Not aware of anything particular going on with this target lately.

pnkfelix commented 2 years ago

Visited for T-compiler backlog bonanza. It seems like there are some unresolved questions about scope and design, with respect to the concerns that were raised on this thread between @hanna-kruppe and @gnzlbg . (Basically, its not clear to me whether the work remaining here is "just" more implementation and fixing bugs, or if there's some design stuff that needs to be revisited.)

@rustbot label: +S-tracking-needs-summary

OwenTrokeBillard commented 9 months ago

@kjetilkjeka Thank you for all your excellent recent work on Rust PTX support.

What, in your opinion, is required to bring Rust CUDA support to parity with C++? Is it even possible?

I'm curious to hear from others too.

kjetilkjeka commented 9 months ago

I think the most prominent issue with using CUDA from Rust is referenced by this issue. It's the lacking support for shared memory.

In the bright future, I think that Rust with CUDA could in some ways even surpass using CUDA from C++. To avoid adding a lot of discussion around what CUDA using Rust should and should not be in this issue, I think we should continue discussing it in the newly created zulip thread for gpgpu.

https://rust-lang.zulipchat.com/#narrow/stream/422870-t-compiler.2Fgpgpu-backend

If you start a thread there I'm happy to discuss it further :smile:

rust-lang / rust

NVPTX backend metabug #38789

Documentation

Bugs

Missing features

Stabilization

38786 "undefined references" validation,

38787 linking of multiple LLVM modules.

38824 is closed about a month ago. I think we can remove it from the list?