kunzmi / managedCuda

ManagedCUDA aims at easy integration of NVIDIA's CUDA into .NET applications written in C#, Visual Basic or any other .NET language.

Is it possible to use Dynamic Parallelism code with ManagedCuda? #38

Closed EdenRidgway closed 7 years ago

EdenRidgway commented 7 years ago

Hi,

Thanks for the great work on this project. I was wondering if anyone has gotten a kernel that uses dynamic parallelism working with ManagedCuda (see: https://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/). To generate a PTX file for such a kernel, it has to be compiled with the -arch=sm_35 (or greater) and -rdc=true (relocatable device code) settings.
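For reference, an nvcc invocation along those lines would look roughly like this (file names are placeholders):

nvcc -ptx -rdc=true -arch=sm_35 kernel.cu -o kernel.ptx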

The example I've been trying to get working is the advanced quicksort sample provided by NVIDIA, which can be found in their sample directory CUDA Samples\v8.0\6_Advanced\cdpAdvancedQuicksort. When I generate a PTX file from a cut-down version of the sample (just the device functions), loading it fails with the error ErrorInvalidPtx: This indicates that a PTX JIT compilation failed.

Has anyone gotten an example with dynamic parallelism working with ManagedCuda, or is that simply not possible? The Stack Overflow answer at https://stackoverflow.com/questions/26147981/nvlink-relocatable-device-code-and-static-device-libraries makes it sound like this may only be achievable via a P/Invoke call to a DLL with exported functions (inferred from what is said there). However, maybe the problem is in the way I'm compiling the PTX file (though this post suggests it may not be: https://devtalk.nvidia.com/default/topic/668017/dynamic-parallelism-with-cuda-driver-api/)?

Any help or guidance would be greatly appreciated.

kunzmi commented 7 years ago

Of course this is possible. See also https://stackoverflow.com/questions/27829906/cuda-dynamic-parallelism-with-driver-api

You have two choices: either compile directly to a cubin with cudadevrt linked in (not good once you upgrade your GPU :), or compile to PTX and link cudadevrt at runtime. Compiling to cubin is straightforward; nothing changes except the compiler/linker settings for nvcc. You can link a PTX file with the CUDA device runtime library as follows with managedCuda:

CudaContext ctx = new CudaContext(0);
CudaLinker linker = new CudaLinker();
linker.AddFile("kernel.ptx", CUJITInputType.PTX, null);
// link against the CUDA device runtime library, required for dynamic parallelism
linker.AddFile(@"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\lib\x64\cudadevrt.lib", CUJITInputType.Library, null);
byte[] image = linker.Complete();   // the linked cubin image
linker.Dispose();

CudaKernel k = ctx.LoadKernelPTX(image, "kernelname");
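A kernel loaded this way is launched like any other managedCuda kernel; the kernel name above and the launch configuration below are only placeholders:

// assumes: using ManagedCuda.VectorTypes; for dim3
k.GridDimensions = new dim3(16, 1, 1);
k.BlockDimensions = new dim3(256, 1, 1);
k.Run();            // kernel arguments, if any, go here
ctx.Synchronize();  // wait for the kernel and any child launches to finish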
EdenRidgway commented 7 years ago

Thank you so much for getting back to me. That has done the trick. Thanks!

IzarUrdin commented 4 years ago

Hi all ... I'm trying to compile a simple kernel (recursive):

__device__ int v = 0;

__global__ void Recursive(int depth)
{
    // recurse up to depth 6
    if (depth == 6) return;
    v++;
    Recursive<<<2, 2>>>(depth + 1);
}

When compiling (in my first attempt) I got the error "MyKernels.c(52): error: device-side kernel launch could not be processed as the required runtime APIs are not declared"

I fixed it by adding options to the compilation:

rtc.Compile(new string[] { "-arch=compute_35", "-rdc=true" });

But now I have the error "ErrorInvalidPtx: This indicates that a PTX JIT compilation failed."

Any idea what the problem is? Do I have to link cudadevrt.lib before compilation? How do I do that?
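For context, rtc above is presumably ManagedCuda's NVRTC wrapper; a minimal sketch of that compile step, with the kernel source string and module name as placeholders, would look like this:

// assumes: using ManagedCuda.NVRTC;
CudaRuntimeCompiler rtc = new CudaRuntimeCompiler(kernelSource, "MyKernels");
rtc.Compile(new string[] { "-arch=compute_35", "-rdc=true" });
string log = rtc.GetLogAsString();   // compiler output, useful when diagnosing errors
string ptx = rtc.GetPTXAsString();   // PTX to be linked against cudadevrt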

IzarUrdin commented 4 years ago

Ok ... I've finally got it ... first I save the PTX file after compilation:

File.WriteAllText(filename + ".ptx", rtc.GetPTXAsString());

and then, as kunzmi said, I load the lib with the linker:

CudaLinker linker = new CudaLinker();
linker.AddFile(filename + ".ptx", CUJITInputType.PTX, null);
linker.AddFile(@"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\lib\x64\cudadevrt.lib", CUJITInputType.Library, null);
byte[] image = linker.Complete();
linker.Dispose();
reccount = ctx.LoadKernelPTX(image, "RecursiveCount");
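For completeness, launching the linked kernel then works like any other managedCuda kernel. A minimal sketch, assuming RecursiveCount takes the starting depth as its only argument (grid/block sizes are placeholders):

reccount.GridDimensions = new dim3(1, 1, 1);
reccount.BlockDimensions = new dim3(1, 1, 1);
reccount.Run(0);     // start the recursion at depth 0
ctx.Synchronize();   // wait until all child grids have completed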