gpu / JOCL

Java bindings for OpenCL
http://www.jocl.org

Struct implementation #25

Open mambastudio opened 5 years ago

mambastudio commented 5 years ago

Hello Marco,

I've come across an issue with the memory consumption of my ray tracer, which implements stream compaction as described here, using the prefix sum approach. My intersection struct has about 10 fields of types float4, float2, int and int4, so I resorted to the experimental JOCL struct implementation from the JOCL website.

Due to the large memory consumption, running the ray tracer on a 4GB laptop is a struggle. The JVM needs to run in server mode to go beyond 1200m Xmx in Java 8. My intersection struct array has 480000 elements (800 x 600 screen space), and that takes a lot of Java memory, together with a temporary array of the same size that is used for swapping when compacting the intersections.

I did make a simple struct implementation that is backed by an array but only uses Java primitives (int/float). It is only useful when a struct uses a single type of variable, and OpenCL loves that approach of forced struct data alignment. It has worked well for loading large scenes (Wavefront OBJ), where everything is a float4 and hence an array of floats, and it's the default implementation for scene data. Unlike instances of a Java class, which store extra data per object and are therefore resource-consuming in large arrays, primitives do well in Java. I discovered alternative approaches to struct implementation, like JUnion (TehLeo/junion), which takes a similar approach. It allows different primitive types and inner structs, and lays them out in a ByteBuffer. It works like a class but is fast like a primitive in Java. Unfortunately, if I transfer the ByteBuffer to OpenCL using JOCL, a simple struct with fields of different types doesn't work: the values come out jumbled up when they are assigned in a kernel. JUnion uses native order, but it seems that the byte order of the ByteBuffer can be configured. If I could ask, how does jocl experimental Struct able to map data well to GPU?

By the way, I can make a simple example of prefix sum for use in your website since prefix sum is like the hello world of parallel programming, compatible with OpenCL 1.2.

Joe.

mambastudio commented 5 years ago

Lol. I think I figured it out. A struct in OpenCL has to be laid out so that its size is a multiple of the size of its largest member type. That explains why most OpenCL code I see has structs with padding variables.
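For reference, the rule described above can be sketched in a few lines of Java. This is only an illustration and not code from JOCL or JOCLStructs: the class name `StructLayoutSketch` and its methods are made up for this example, and it assumes that every field is an OpenCL scalar or vector type whose alignment equals its size (so a float4 counts as 16 bytes, a float2 as 8, an int as 4).

```java
import java.util.Arrays;

// Hypothetical helper, not part of JOCL or JOCLStructs.
final class StructLayoutSketch {

    /** Rounds offset up to the next multiple of alignment. */
    static int alignUp(int offset, int alignment) {
        return (offset + alignment - 1) / alignment * alignment;
    }

    /**
     * Returns the byte offset of each field, followed by the padded
     * struct size as the last element.
     * Assumption: each field's alignment equals its size in bytes.
     */
    static int[] layout(int... fieldSizes) {
        int[] result = new int[fieldSizes.length + 1];
        int offset = 0;
        int maxAlignment = 1;
        for (int i = 0; i < fieldSizes.length; i++) {
            int alignment = fieldSizes[i];
            offset = alignUp(offset, alignment); // padding before the field
            result[i] = offset;
            offset += fieldSizes[i];
            maxAlignment = Math.max(maxAlignment, alignment);
        }
        // Trailing padding: the size is a multiple of the largest alignment
        result[fieldSizes.length] = alignUp(offset, maxAlignment);
        return result;
    }

    public static void main(String[] args) {
        // struct { float4 a; float2 b; int c; int4 d; }
        System.out.println(Arrays.toString(layout(16, 8, 4, 16))); // [0, 16, 24, 32, 48]
        // struct { float4 a; int c; } needs 12 bytes of trailing padding
        System.out.println(Arrays.toString(layout(16, 4))); // [0, 16, 32]
    }
}
```

The second case shows why padding variables appear in so much OpenCL code: the fields only occupy 20 bytes, but consecutive array elements have to be 32 bytes apart.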

gpu commented 5 years ago

Sorry for the delayed response here. The question/issue contains quite some information that I'll have to read through more carefully.

Beyond that, you may have noticed the disclaimers surrounding the JOCL structs library: It's a very early, very basic approach, and "structs are difficult" (mainly due to the alignment requirements that you also seem to have stumbled over).

A few months ago, I had moved the JOCL struct library to GitHub, but since then, did not proceed with the development there. I'll read your issue and the references more thoroughly ASAP, maybe I'll get a clearer picture or possible ideas for further functionalities in the structs library.

For the meantime, I have added you there as a collaborator, so you should now have access to https://github.com/gpu/JOCLStructs

(Note that this is intended to be purely informative, and maybe to discuss issues or further developments. If you are eager and want to contribute, it should still go via PRs).

mambastudio commented 5 years ago

Much appreciated Marco. I'll look at the code of JOCLStructs in detail, and see its approach. Seems there aren't many source files, so it will be easier to go through them.

gpu commented 5 years ago

So I had a short look at the article that you linked to, and your actual ray tracer, and the related projects (JOCLWrapper, Coordinate, ...): This is quite a lot of projects to go through, and I'd definitely have to allocate much more time for that. I remember your initial demo of your ray tracer that you sent me 2 years ago, and you seem to have made a lot of progress there in the meantime.


BTW, two small pointers:


In the initial comment here, you mentioned that a "Prefix sum" sample would be nice. The most similar one might be JOCLReduction from http://www.jocl.org/samples/samples.html , but a prefix sum is a bit more tricky. It's true that prefix sums are a fundamental building block for many operations. I remember when I read Vector Models for Data-Parallel Computing from Guy Blelloch, and was surprised about the versatility of the operation. This Guy was really ahead of his time. Nowadays, with GPUs, everything that he wrote in his thesis in 1990 is more relevant than ever.


Coming to the core of your question, which seems to be this

If I could ask, how does jocl experimental Struct able to map data well to GPU?

Yeah, it's difficult ;-) I noticed that you also wrote about this in the junion issue tracker, at https://github.com/TehLeo/junion/issues/6#issuecomment-531303921 . The main points that are relevant here are 1. the order of the fields, and 2. the alignment requirements and the resulting padding.

The JOCLStructs library is, obviously, tailored for JOCL/OpenCL specifically. But the problem of mapping structs to Java is a very generic one, and I even considered dropping the "JOCL" part of the name and offer it as a generic "Java Structs" library. I wasn't aware of junion until now, but this seems to be the path that they have taken.

I'm pretty sure that they tackled many problems that I didn't even think of until now. But from the perspective of OpenCL, it raises some questions for me. For example, having types like cl_float4 in JOCLStructs may offer some convenience. Otherwise, the user of the library would have to define these data types manually, and one would have to think about a nice and convenient way to do that. This has some caveats. Naively saying that a cl_float4 is structurally the same as a float[4] might work. But a cl_float3 is not the same as a float[3]: The spec explicitly says that it has the same data layout and size (!) as a cl_float4 ...
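To make the cl_float3 caveat concrete, here is a minimal sketch of how tightly packed xyz triples could be written into a direct ByteBuffer with the 16-byte stride that a float3 occupies. The class `Float3PackingSketch` and the method `packFloat3` are hypothetical names for this example and do not exist in JOCL or JOCLStructs. The sketch assumes native byte order, which is what the OpenCL device is expected to see for host memory.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of JOCL or JOCLStructs.
final class Float3PackingSketch {

    /** A float3 occupies the same 16 bytes as a float4. */
    static final int FLOAT3_STRIDE_BYTES = 16;

    /** Packs tightly stored x,y,z triples into a buffer of padded float3 elements. */
    static ByteBuffer packFloat3(float[] xyz) {
        if (xyz.length % 3 != 0) {
            throw new IllegalArgumentException("Length must be a multiple of 3");
        }
        int count = xyz.length / 3;
        ByteBuffer buffer = ByteBuffer
            .allocateDirect(count * FLOAT3_STRIDE_BYTES)
            .order(ByteOrder.nativeOrder());
        for (int i = 0; i < count; i++) {
            int base = i * FLOAT3_STRIDE_BYTES;
            buffer.putFloat(base, xyz[3 * i]);
            buffer.putFloat(base + 4, xyz[3 * i + 1]);
            buffer.putFloat(base + 8, xyz[3 * i + 2]);
            // Bytes base+12 .. base+15 remain zero: the padding component
        }
        return buffer;
    }

    public static void main(String[] args) {
        ByteBuffer packed = packFloat3(new float[] {1, 2, 3, 4, 5, 6});
        System.out.println(packed.capacity());   // 32, not 24
        System.out.println(packed.getFloat(16)); // 4.0, first component of the second element
    }
}
```

Treating the same data as a plain float[3] per element would put the second element at byte 12 instead of byte 16, which is one way to end up with jumbled values in a kernel.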

However, now that you have access to the source code: There are not many classes, but at least one of them is large (and should be split up into multiple classes, by the way) : There is some magic hidden in https://github.com/gpu/JOCLStructs/blob/master/src/main/java/org/jocl/struct/StructAccess.java#L259 , which computes the "accessors" for the struct fields, and takes all the alignment and packing issues into account...

According to https://tehleo.github.io/junion/features.html , the junion library has the option to show the "layout string" of a struct. A similar option is offered in JOCLStructs via Struct.createLayoutString(structClass) - see https://github.com/gpu/JOCLStructs/blob/master/src/main/java/org/jocl/struct/Struct.java#L441

I probably should emphasize that the whole library makes the assumption that the order in which structClass.getDeclaredFields() returns the fields is basically the same as in the source code. This seems to be true in all cases that I have encountered, but is not guaranteed by the specs. The junion library seems to offer a dedicated annotation for that, which is certainly a reasonable approach in order to make the actual layout unambiguously clear.
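A sketch of what such an explicit ordering could look like with plain reflection is shown below. The `@Order` annotation, the `Intersection` class and the method `orderedFieldNames` are purely hypothetical: they are neither the junion annotation nor anything that JOCLStructs currently offers. The idea is only that the layout is derived from the annotation values instead of from the order that getDeclaredFields() happens to return.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not part of JOCLStructs or junion.
final class FieldOrderSketch {

    /** Hypothetical annotation that states the position of a field in the struct. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Order {
        int value();
    }

    /** Example struct class; the declaration order deliberately differs from the layout order. */
    static class Intersection {
        @Order(1) float[] normal;
        @Order(0) float[] position;
        @Order(2) int primitiveId;
    }

    /** Returns the names of the annotated fields, sorted by their @Order value. */
    static List<String> orderedFieldNames(Class<?> structClass) {
        List<Field> fields = new ArrayList<>();
        for (Field field : structClass.getDeclaredFields()) {
            // Unannotated (for example, synthetic) fields are ignored
            if (field.isAnnotationPresent(Order.class)) {
                fields.add(field);
            }
        }
        fields.sort(Comparator.comparingInt(f -> f.getAnnotation(Order.class).value()));
        List<String> names = new ArrayList<>();
        for (Field field : fields) {
            names.add(field.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(orderedFieldNames(Intersection.class));
        // [position, normal, primitiveId]
    }
}
```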

Again, I'd have to invest far more time into JOCL and JOCLStructs, and also for reading about junion - which might well replace JOCLStructs, in the best case. Moving the JOCL Samples from the website into their own GitHub repo is another point on my TODO list, and once I tackle this, I might add a "prefix sum" example. And there's still an (unpublished) JOCLUtils library here, which might cover some of the convenience functionality that you offer via JOCLWrapper, but that's another thing that I have to read more thoroughly....
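As an illustration of the prefix sum and the stream compaction mentioned in this thread, here is a sequential CPU sketch in Java. It is not a JOCL sample and not an OpenCL kernel: the class `PrefixSumSketch` is a made-up name, and the loops only simulate, one after another, the up-sweep and down-sweep phases of the work-efficient exclusive scan that is usually attributed to Blelloch. On a GPU, the iterations of each inner loop would be executed by parallel work-items. The compaction assumes that the flags are either 0 or 1.

```java
import java.util.Arrays;

// Hypothetical CPU-side sketch, not a JOCL sample.
final class PrefixSumSketch {

    /** Exclusive prefix sum: result[i] is the sum of input[0] .. input[i-1]. */
    static int[] exclusiveScan(int[] input) {
        int n = 1;
        while (n < input.length) {
            n <<= 1; // pad with zeros to a power of two
        }
        int[] data = Arrays.copyOf(input, n);

        // Up-sweep (reduce): build partial sums in place
        for (int stride = 1; stride < n; stride <<= 1) {
            for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
                data[i] += data[i - stride];
            }
        }

        // Down-sweep: clear the root, then push the sums back down
        data[n - 1] = 0;
        for (int stride = n >> 1; stride >= 1; stride >>= 1) {
            for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
                int left = data[i - stride];
                data[i - stride] = data[i];
                data[i] += left;
            }
        }
        return Arrays.copyOf(data, input.length);
    }

    /** Stream compaction: keeps values[i] where flags[i] is 1 (flags must be 0 or 1). */
    static int[] compact(int[] values, int[] flags) {
        int[] offsets = exclusiveScan(flags);
        int last = values.length - 1;
        int count = values.length == 0 ? 0 : offsets[last] + flags[last];
        int[] result = new int[count];
        for (int i = 0; i < values.length; i++) {
            if (flags[i] != 0) {
                result[offsets[i]] = values[i]; // scatter to the scanned position
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] flags = {1, 0, 1, 1, 0, 1, 0, 0};
        int[] values = {10, 11, 12, 13, 14, 15, 16, 17};
        System.out.println(Arrays.toString(exclusiveScan(flags)));   // [0, 1, 1, 2, 3, 3, 4, 4]
        System.out.println(Arrays.toString(compact(values, flags))); // [10, 12, 13, 15]
    }
}
```

In a ray tracer, the flags would mark the rays that are still active, and the scanned offsets tell each active ray where to write itself in the compacted array.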

cursor42 commented 4 years ago

Hello, I also played around with your '0.0.1' alpha code of "JOCL Structs". For my purpose (passing complex input to kernel functions) it works in my hardware environment with a few 'workarounds'. Maybe I can provide changes/ideas/test cases to the community (e.g. added support for 'double3' data type).

Could you please allow me to access

JOCL Structs on GitHub

or open it to be public?

gpu commented 4 years ago

@cursor42 Sorry for the delayed response.

I have added you as a collaborator (but note that changes should still be done via PRs).

Specifically: There is not really anything to do for double3 data types: According to the OpenCL specification, they have the same memory layout as double4.

cursor42 commented 4 years ago

Sorry to ask again: I still cannot access https://github.com/gpu/JOCLStructs.

gpu commented 4 years ago

@cursor42 Sorry, there seems to be a maximum of 3 collaborators for private repositories. So now I made https://github.com/gpu/JOCLStructs public, (naively) hoping that people will apply the appropriate scrutiny due to the experimental nature of the project.