Closed: nakul02 closed this issue 1 week ago
Thanks for this pointer. I think I had a glance at CUB when I first visited the NVLABS site, but that was a while ago, and I had not investigated it further until now.
One of the reasons may be that I thought there was nothing that could be brought to the Java world "directly".
This may be a misunderstanding, and I'll read the docs more thoroughly and try out some of the examples, but my first impression is that the code offered in CUB is purely device code. So it should (naively) already be possible to write a kernel that uses CUB routines, compile this kernel into a CUBIN/PTX, and then load and execute it with JCuda.
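To illustrate that "naive" path, here is a minimal sketch (assuming CUB's documented `cub::BlockReduce` primitive) of what such a kernel could look like. The kernel source is held as a Java string; the class and field names are made up for illustration, not an existing JCuda API.

```java
// Hypothetical holder for a CUB-based kernel source. Only the string
// handling is Java; the kernel itself is plain CUDA C++ device code
// that could be compiled to PTX (offline with nvcc, or via NVRTC).
public class CubBlockReduceKernel {

    // A block-wide sum using cub::BlockReduce, with a block size of
    // 256 threads hard-coded as the template parameter.
    public static final String SOURCE =
        "#include <cub/cub.cuh>\n" +
        "extern \"C\" __global__ void blockSum(const float *in, float *out, int n)\n" +
        "{\n" +
        "    typedef cub::BlockReduce<float, 256> BlockReduce;\n" +
        "    __shared__ typename BlockReduce::TempStorage tempStorage;\n" +
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n" +
        "    float value = (i < n) ? in[i] : 0.0f;\n" +
        "    float sum = BlockReduce(tempStorage).Sum(value);\n" +
        "    if (threadIdx.x == 0) { out[blockIdx.x] = sum; }\n" +
        "}\n";
}
```

This string could be written to a `.cu` file, compiled with nvcc into PTX, and then loaded and launched through JCuda's driver API, exactly as described above.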
Of course, it would definitely make sense to offer (at least some of) the CUB functions in a more Java-idiomatic way. Maybe similar to the https://github.com/jcuda/jcuda-vec library, where some standard operations are offered as Java classes that internally manage the (precompiled) kernels to offer the desired functionality. The main goal here was to offer basic building blocks, so that people would not have to create and manage a lot of kernels for "trivial" operations. Something similar might be doable for CUB.
(This could, in fact, be a substitute for JCudpp to some extent. CUDPP is a bit tricky to compile. And I even thought that it was an abandoned project, but just noticed that they released an update at https://github.com/cudpp/cudpp only a few days ago).
Is this roughly what you had in mind?
In any case, I'll investigate the options here, and see how well the "CUB API" (i.e. the low-level CUB functions) can be brought into the Java world conveniently.
I think this would be more along the lines of JCudpp. (BTW: is JCudpp still available in JCuda 8?) NVIDIA CUB has a bunch of "Warp", "Block" and "Device" level primitives. The "Warp" and "Block" primitives are to be called from within a kernel; the "Device" primitives are host code.
To CUB's advantage, a lot of the code is templatized. I guess JCuda would have to provide an implementation for each supported type, with instantiation happening in the kernel. For things like reduction, where the operation can be specified, maybe JCuda could initially just support the operations that ship with CUB, and later figure out a way to extend the mechanism.
Regarding JCudpp: I haven't yet updated it for CUDA 8. Although this would only mean recompiling it for CUDA 8, the user base of JCudpp seems to be very small, and admittedly, this had lower priority than other tasks. But considering the recent update of CUDPP to 2.3, the priority has now increased, and I'll try to do the update soon, although I can't make any promises right now regarding the exact timeline.
Regarding CUB, I still have to study it further. Broadly speaking (and without having looked at the code), there are different possible "levels" of integration:

- **Precompiled PTX:** Offering precompiled kernels, as in jcuda-vec. The main problem here is the number of degrees of freedom: first regarding the launch configuration (`BLOCK_SIZE` etc.), second regarding the type template parameters (whether a reduction operates on `float` or `int`), and third regarding the operation parameters. A quick glance at the headers shows dedicated methods like `DeviceReduce::Sum`, but they seem to delegate to a generic reduction that receives a `ReductionOpT` parameter for "any" binary operator. I'll have to review the code more thoroughly to see how exactly these degrees of freedom could be exposed. (Another caveat of PTX is that it is, to some extent, specific for the target architecture, e.g. regarding the compute capability...)
- **Runtime compilation:** `String` (!) templates for various operations, which additionally allow the configuration parameters mentioned above to be inserted as strings as well; the result would then be compiled with the NVRTC.
- **API:** Dedicated methods for the most common types (`float`, `int`, ...), probably with many parameters for the remaining parts of the configuration.

It would also be desirable to have multiple levels of abstraction here. For example, it would be nice to have a convenient "API" layer, but still have the option to go one level deeper and, for example, manually specify some of the parameters that are not exposed in the API level.
But again, this classification is very rough and rather a "brainstorming". I'll have to allocate some time to familiarize myself with CUB.
I like the runtime compilation and API approaches.
Like you said, the precompiled PTX approach is a slippery slope because of the degrees of freedom.
From a usage standpoint, the API is very easy to approach, and when something a little more complex is needed, runtime compilation can be used.
An example of this is custom reduction operations: if one sticks to Sum, Min, or Max, the API could provide them; if one needs something more complex, one can create a string with the relevant C++ code and use that.
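That split could look roughly like this in Java: an enum for the operations that ship with CUB, plus an escape hatch that accepts the operator as a raw C++ expression for the runtime-compilation path. This is purely hypothetical API sketching, not an existing JCuda class.

```java
// Hypothetical API sketch: the built-in operations are typesafe
// constants, and anything else is given as a C++ expression over
// the operands "a" and "b".
public class ReduceOp {

    // Operations that CUB ships with could be offered directly:
    public enum BuiltIn { SUM, MIN, MAX }

    private final String cppExpression;

    private ReduceOp(String cppExpression) {
        this.cppExpression = cppExpression;
    }

    // The simple, convenient path through the API layer:
    public static ReduceOp of(BuiltIn op) {
        switch (op) {
            case SUM: return new ReduceOp("a + b");
            case MIN: return new ReduceOp("min(a, b)");
            default:  return new ReduceOp("max(a, b)");
        }
    }

    // Escape hatch: any binary operator, as C++ source code,
    // which would go through the runtime-compilation path.
    public static ReduceOp custom(String cppExpression) {
        return new ReduceOp(cppExpression);
    }

    public String getCppExpression() { return cppExpression; }
}
```

A caller that only needs `Sum` never touches C++ strings, while `ReduceOp.custom("a * b + 1")` would feed the runtime-compilation level described above.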
At some point, one has to accept that it's not gonna happen, and close an issue like this 😞
Would it make sense for JCuda and the Mavenized-JCuda project to include Java bindings to NV Labs' CUB (http://nvlabs.github.io/cub/index.html)? The bindings would only be for the device-wide primitives. These include Scan, Reduce, Select, Sort, SpMv, SegmentedSort, and SegmentedReduce.