gpu / JOCL

Java bindings for OpenCL
http://www.jocl.org

Native pointer address! #45

Open mambastudio opened 2 years ago

mambastudio commented 2 years ago

Hi Marco!

Not sure if this is appropriate to ask here, but I was curious about how the native pointer address concept is implemented in JOCL.

Since Java might start using a safe API for foreign memory access (currently incubating in JDK 19, and proposed in JEP 424), this might open the door to off-heap access of data (outside the JVM), where the accessible memory could be terabytes in size, depending on the physical memory available.

Of course, native memory can already be accessed through unconventional methods, such as the infamous Unsafe API (this API should have been made official aeons ago - but it is a good thing the OpenJDK team is making strides to create official equivalents of the Unsafe methods), and also through a direct buffer, with a possible hack to get its pointer: public static long addressOfDirectBuffer(ByteBuffer buffer) { return ((DirectBuffer) buffer).address(); } (caution: this uses internal sun.* code, and hence doesn't work in every Java version), whereby the buffer parameter is a direct buffer. I'm aware that a direct ByteBuffer does use Unsafe internally, but unfortunately the off-heap data is still read into the heap first, and a single buffer is still limited to a size of 2 GB, i.e. 2^31 bytes.
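To make that 2 GB limit concrete: since a ByteBuffer capacity is an int, the only way to manage more than 2^31-1 bytes with plain NIO is to split the allocation across several direct buffers. A minimal sketch (the allocateChunked helper and the sizes are mine, purely for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ChunkedOffHeap {
    // Splits a (possibly > 2 GB) total size into int-sized direct buffers,
    // since a single ByteBuffer cannot exceed Integer.MAX_VALUE bytes.
    static ByteBuffer[] allocateChunked(long totalBytes, int chunkBytes) {
        int chunks = (int) ((totalBytes + chunkBytes - 1) / chunkBytes);
        ByteBuffer[] result = new ByteBuffer[chunks];
        long remaining = totalBytes;
        for (int i = 0; i < chunks; i++) {
            int size = (int) Math.min(chunkBytes, remaining);
            result[i] = ByteBuffer.allocateDirect(size).order(ByteOrder.nativeOrder());
            remaining -= size;
        }
        return result;
    }

    public static void main(String[] args) {
        // 10 MiB in 4 MiB chunks: two full chunks and one 2 MiB remainder
        ByteBuffer[] bufs = allocateChunked(10L * 1024 * 1024, 4 * 1024 * 1024);
        System.out.println(bufs.length);        // 3
        System.out.println(bufs[2].capacity()); // 2097152
    }
}
```

Of course this gives up contiguity of the memory, which is exactly what makes chunking unattractive for use cases that need a single native pointer.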

Some great insights here

Based on the knowledge above, I was thinking of the possibility of implementing "addressOfDirectBuffer", taking the resulting long address and using it as an actual pointer address, as a way to experiment with native pointers and OpenCL's CL_MEM_USE_HOST_PTR. This is shown below.

package wrapper.core;

import org.jocl.NativePointerObject;
import org.jocl.Pointer;

/**
 *
 * @author user
 */
public class CPointer extends NativePointerObject{
    private final long pointer;

    public CPointer(long pointer)
    {
        super();
        this.pointer = pointer;
    }

    @Override
    protected long getNativePointer()
    {
        return pointer;
    }

    public Pointer getPointer()
    {
        return Pointer.to(this);
    }

    @Override
    public int hashCode()
    {
        int result = 227;
        int c = (int)(pointer ^ (pointer >>> 32));
        return 37 * result + c;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final CPointer other = (CPointer) obj;
        return this.pointer == other.pointer;
    }
}

Unfortunately, this approach doesn't work, and the data gets corrupted in some way. Would you mind giving some insight into how native pointers are implemented in JOCL? For example, are the pointers accessed in JOCL through JNI, similar to the hacks described above? I know this might seem out of scope for what JOCL tries to provide (safe code), but I was a bit curious. There is a project I'm implementing that might require off-heap data only, due to out-of-memory heap issues.

You can play around with this project. It uses JDK 1.8 (Windows only - the file dialog explorer is quite custom), but it is good for debugging my other ray tracing code (like implementing new concepts rapidly).

gpu commented 2 years ago

Sure, I'm always open to talking about this. But maybe with a small disclaimer: It has been quiet around JOCL recently, also because it has been relatively quiet around OpenCL in general. So I might actually have to look up certain points, at a certain technical level...


Since Java might start using a safe API for foreign memory access ...

I'm roughly aware of the efforts for providing foreign memory access functionality in the JDK. Back when https://openjdk.org/projects/panama/ started, I actually talked a bit with Maurizio Cimadamore about possible uses of the new API - although that referred rather to JCuda and jextract. I wrote a few notes at https://mail.openjdk.org/pipermail/panama-dev/2019-February/004443.html . I tried to follow some of the discussions on the Panama mailing list (regarding the memory interface and the vector API), but haven't been able to really track that closely for a while now...


... Unsafe API ... Some great insights here

I always tried to avoid the Unsafe API. Not so much because it is 'unsafe'. (I mean, JNI is as unsafe as it can be. One could do everything in these C functions...). But it always felt a bit clumsy, potentially incompatible (due to sun.misc), and ... I still have to find a use case where it actually produces a noticeable performance benefit in client code. The fact that Unsafe is used in the DirectByteBuffer* implementations is not a concern here. They could throw out Unsafe today - if the ByteBuffer worked as before, I wouldn't care.

(An aside: I'm a bit surprised to see that the article does not mention intrinsics - but that's also a point where one could quickly dive into the deepest guts of modern VMs...)


Based on the knowledge above, I was thinking of the possibility of implementing "addressOfDirectBuffer", taking the resulting long address and using it as an actual pointer address, as a way to experiment with native pointers

The Pointer class extends NativePointerObject, and this already does store the native pointer. I think that this will usually be similar to the private final long pointer in your example. When you say that it "...doesn't work, and the data is corrupted", then I'm not entirely sure how exactly you tried to use this class.

But maybe that is the core of the question:

For example, are the pointers accessed in JOCL through JNI, similar to the hacks described above?

The Pointer class (or rather the parent class NativePointerObject) may already store the actual address as its long nativePointer value. And it is accessed from the JNI side. (This is not really a 'hack', but ... just the stuff that one usually does in JNI...).

But this address only makes sense for NativePointerObjects where the actual memory is allocated natively - for example, when creating a cl_mem object.

One difficulty for the Pointer class is: There may be different types of "memory".

  1. Pointers with real, native memory - like a cl_mem
  2. Pointers that are created from direct byte buffers
  3. Pointers that are created from non-direct byte buffers (i.e. from Java arrays like float[])
  4. Pointers that are actually pointers to arrays of other pointers - that's the most tricky thing here...
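The difference between cases 2 and 3 above is visible directly in the NIO API: isDirect() is essentially what a JNI layer checks before deciding between GetDirectBufferAddress and pinning/copying the backing array. A small illustration (the class name is mine):

```java
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;

public class PointerKinds {
    public static void main(String[] args) {
        // Case 2: a direct buffer is backed by native memory with a stable
        // address, which JNI can obtain via GetDirectBufferAddress.
        ByteBuffer direct = ByteBuffer.allocateDirect(16);

        // Case 3: a buffer wrapping a float[] lives on the Java heap; the
        // JNI side has to pin or copy it (e.g. GetPrimitiveArrayCritical).
        FloatBuffer fromArray = FloatBuffer.wrap(new float[4]);

        System.out.println(direct.isDirect());    // true
        System.out.println(fromArray.isDirect()); // false
        System.out.println(fromArray.hasArray()); // true
    }
}
```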

Back when I started JOCL, I tried to handle these cases transparently (including things like pointer arithmetic, with Pointer#withByteOffset).

If I had to re-design JOCL from scratch, I'd probably re-consider this. I would certainly change some details of the implementation. But maybe I would even drop the support for non-direct buffers altogether: Most JNI libraries only support direct buffers, because everything else is just very complicated. But I really wanted to be able to access something like a float[] array without first having to copy it into a direct buffer...

On the JNI side, such a Pointer object (where I don't know which sort of pointer it is) is handled with a PointerData structure. The initPointerData function documentation contains some further details about how these different types of pointers are handled, but ... I didn't touch that code in years...


Some higher-level thoughts:


You can play around with this project.

I'll try to allocate some time for that. I occasionally pointed to your repositories as examples for a project that uses JOCL and structs (!), and we talked a bit about that back in 2016/2017, but I'm sure that a lot has changed in the meantime.

mambastudio commented 2 years ago

Thank you Marco,

This is a well-detailed explanation, and thank you for taking the time to respond. It actually made me scratch my head to understand the concepts behind the implementation, and I had to delve deep into your code. The problem with my ray tracer is that it threw an out-of-memory exception when I tried to create an image larger than 2000 x 2000, due to the large memory allocation of direct buffers through the custom structs I've implemented.

I noticed that your Pointer class has to have a buffer regardless, as you explained well in the numbered list above. Hence I couldn't avoid using the buffer class (the JNI side requires the buffer object). The downside is that I will have to conform to int-sized buffer capacities instead of the desired long-sized buffers.

My code in the ray tracer has to create some complex structs such as:

public class RIntersection extends Structure
{        
    public RPoint3 p;
    public RPoint3 n;
    public RPoint3 d;    
    public RPoint2 uv;
    public int mat;
    public int id;
    public int hit;   
}

which is around 80 bytes. So creating such structs for, say, a 2000 x 2000 image would mean a lot of bytes in total, but they would certainly fit in an array. So the question is: why was the virtual machine throwing an out-of-memory error?
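As a rough sanity check on the numbers (my own arithmetic, using the quoted ~80 bytes per struct):

```java
public class StructBudget {
    public static void main(String[] args) {
        long bytesPerStruct = 80;      // quoted size of RIntersection
        long pixels = 2000L * 2000L;   // one struct per pixel
        long totalBytes = bytesPerStruct * pixels;
        System.out.println(totalBytes);                 // 320000000
        System.out.println(totalBytes / (1024 * 1024)); // 305 (MiB)
    }
}
```

So roughly 305 MiB - well below Integer.MAX_VALUE and far below typical heap or direct-memory limits, which is what makes the out-of-memory exception puzzling.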

After scouring the internet, I discovered that the NIO classes tend to show this kind of behaviour (out-of-memory exceptions) even with a large -Xmx heap size, or even with -XX:MaxDirectMemorySize set. After further research, I presumed the cause is probably the large amount of work done in the bounds checks of the direct buffer allocation code, as shown in the website I referenced earlier (reference). With that in mind, I allocated a direct buffer through a hack using the Unsafe class, by taking the address of a freshly created direct buffer and replacing it with one allocated by Unsafe.

import java.lang.reflect.Field;
import java.nio.Buffer;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import sun.misc.Unsafe;
import sun.nio.ch.DirectBuffer;

public static ByteBuffer allocateDirectBufferUnsafe(int cap) throws Exception
{
    Field capacityField = Buffer.class.getDeclaredField("capacity");
    capacityField.setAccessible(true);
    Field addressField = Buffer.class.getDeclaredField("address");
    addressField.setAccessible(true);

    Unsafe unsafe = getUnsafe();
    long address = unsafe.allocateMemory(cap); //allocate memory of size cap from Unsafe

    ByteBuffer byteBuffer = ByteBuffer.allocateDirect(0).order(ByteOrder.nativeOrder()); //create a tiny direct buffer
    unsafe.freeMemory(((DirectBuffer) byteBuffer).address()); //release the address from the tiny direct buffer

    addressField.setLong(byteBuffer, address); //replace the address with the one generated by Unsafe
    capacityField.setInt(byteBuffer, cap); //replace the capacity with the size allocated by Unsafe

    byteBuffer.clear();
    return byteBuffer;
}

This actually resolved the allocation (temporarily), by bypassing the expensive direct buffer allocation. The problem now came from garbage collection when I re-initialized everything, which crashed my ray tracer every time. It seems the VM does some magic behind the scenes to track and garbage collect direct buffers. I had almost given up, until I noticed that JNA has similar classes for creating direct buffers without the expensive approach taken by the standard classes (almost similar to the approach above). I didn't check their implementation, but I just used their API to call the necessary classes, as shown below.

import com.sun.jna.Memory;

public static ByteBuffer allocateDirectBufferJNA(long cap) //actually the capacity here still cannot go beyond Integer.MAX_VALUE
{
    Memory m = new Memory(cap);
    ByteBuffer buf = m.getByteBuffer(0, m.size()).order(ByteOrder.nativeOrder());
    return buf;
}

This actually resolved everything. It's crazy how memory management works in Java, especially for direct buffers.

Now I can rest. Lol.

gpu commented 2 years ago

I haven't had the chance to really look more closely at the latest state of the ray tracer. But some quick thoughts for now.


My code in the ray tracer has to create some complex structs such as:

which is around 80 bytes. So creating such structs for, say, a 2000 x 2000 image would mean a lot of bytes in total, but they would certainly fit in an array. So the question is: why was the virtual machine throwing an out-of-memory error?

My first thought here was: Do you really have to allocate this at once, for all pixels? I could imagine an approach that divides the work into "chunks", and I actually think that this would have made sense for an OpenCL-based implementation regardless of whether there is memory pressure. So, roughly speaking, I'd have expected some pseudocode like

int chunkSize = 128;
for (int x=0; x<2048; x+=chunkSize) {
    for (int y=0; y<2048; y+=chunkSize) {
        // Allocate buffer for 128x128 RIntersection objects (this could
        // and SHOULD probably be done outside of the loop...)
        Buffer buffer = allocate(chunkSize * chunkSize, RIntersection.class);

        // What actually launches the OpenCL kernel: Computes the intersections
        // of 128x128 rays and writes the results into the buffer
        computeIntersections(x, y, chunkSize, chunkSize, buffer);

        // Convert the final intersection info into image pixels
        convertToPixels(buffer, image);
    }
}

But of course, this is a very naive sketch, and there certainly are many reasons why it is not as simple as suggested in this pseudocode...


My second thought was: This may be related to memory fragmentation. This goes down to a level of memory management that I'm not deeply familiar with (because in 99.99% of all cases, this is not a concern for Java developers). But as far as I know, memory allocation on the operating system level can fail in some cases even when enough memory is free in total, namely when no single contiguous block is large enough to satisfy the request.
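A toy sketch of that fragmentation scenario (the numbers and names are mine, purely illustrative): the free memory may be sufficient in total, while no single contiguous block can satisfy the request.

```java
public class FragmentationSketch {
    // Is there enough free memory in total, across all free blocks?
    static boolean enoughInTotal(long[] freeBlocks, long request) {
        long total = 0;
        for (long b : freeBlocks) total += b;
        return total >= request;
    }

    // Does any single contiguous block fit the request (what a
    // malloc-style allocation actually needs)?
    static boolean singleBlockFits(long[] freeBlocks, long request) {
        for (long b : freeBlocks) if (b >= request) return true;
        return false;
    }

    public static void main(String[] args) {
        // Two 1.5 GB holes: 3 GB free in total, but a 2 GB contiguous
        // allocation would still fail.
        long[] holes = { 1_500_000_000L, 1_500_000_000L };
        long request = 2_000_000_000L;
        System.out.println(enoughInTotal(holes, request));   // true
        System.out.println(singleBlockFits(holes, request)); // false
    }
}
```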

But again, it might be the case that this is handled by some magic of the operating system nowadays, even if the allocation takes place on the level of JNI.


The problem came now from garbage collection when I re-initialized everything, in which my ray-tracer was crashing every time. Seems the VM does some magic behind to track and garbage collect direct buffers.

Yes, I know that there is some 'magic' involved. There was the sun.misc.Cleaner class, which is now part of the standard API as java.lang.ref.Cleaner, and which I think is responsible for this (but I haven't looked at the details or the latest version).
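For reference, the java.lang.ref.Cleaner pattern (available since JDK 9) looks roughly like this - NativeBlock, isReleased, and the placeholder "allocation" are my own names, and a direct buffer stands in for a real native allocation:

```java
import java.lang.ref.Cleaner;
import java.nio.ByteBuffer;

public class NativeBlock implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup action must be a static class that does NOT capture
    // 'this' - otherwise the NativeBlock could never become unreachable.
    private static final class State implements Runnable {
        ByteBuffer memory; // stand-in for a real native allocation
        State(int size) { memory = ByteBuffer.allocateDirect(size); }
        @Override public void run() { memory = null; } // here: free the native memory
    }

    private final State state;
    private final Cleaner.Cleanable cleanable;

    public NativeBlock(int size) {
        state = new State(size);
        // Runs State.run() when 'this' becomes phantom reachable, or
        // when clean() is called explicitly - whichever comes first.
        cleanable = CLEANER.register(this, state);
    }

    public boolean isReleased() { return state.memory == null; }

    @Override public void close() { cleanable.clean(); } // deterministic release
}
```

An explicit close() (or try-with-resources) gives deterministic release; the Cleaner then only acts as a safety net when close() is forgotten.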

There is one general problem with native allocations: They are outside the scope of the Garbage Collector. In fact, people occasionally suggested that I could tweak the Pointer class in JOCL and JCuda so that it automatically deallocates memory, maybe by overriding finalize(). But this plainly does not work. In a pseudocode like

for (int i=0; i<100; i++) {
    Pointer pointer = allocateWithUnsafe(100000000);
    doSomethingWith(pointer);
    // Rely on 'pointer' being garbage collected here
}

one could assume that the pointer is garbage collected at the end of the loop (meaning that finalize() should be called, and the native/unsafe memory could be freed there). But Java does not know about these 100000000 bytes that have been allocated there. From the perspective of the JVM, this Pointer object will only occupy ~40 bytes or so, and it might not see the need to clean that up immediately. After a few iterations, the native allocation will not work any more.


Now, ... this doesn't help you much. But when you refer to JNA and say

This actually resolved everything.

A word of warning: JNA actually did try to use the finalize() approach for cleaning up its Memory objects. According to a quick web search, they changed this recently to use the Cleaner that I mentioned earlier: https://stackoverflow.com/a/69444249

Again, I'm not up to date with the details. But I wanted to mention that you still might have to think about the proper freeing/deallocation even if you use Memory. It's indeed complicated...