gpu / JOCL

Java bindings for OpenCL
http://www.jocl.org

Crashes my mac #40

Open GeorgeKosrs opened 3 years ago

GeorgeKosrs commented 3 years ago

Can you please help me? When I enable the GPU plugin, it crashes my Mac after a few hours.

Here are my logs:

jvm_crash_pid_71781.log jvm_crash_pid_99320.log jvm_crash_pid_7039.log jvm_crash_pid_30251.log jvm_crash_pid_56604.log jvm_crash_pid_7992.log

gpu commented 3 years ago

It's hard to make a guess from the traces alone. A segfault can have many reasons. The confusing thing is that (from what I've seen), the traces in the log files refer to clCreateCommandQueueWithPropertiesAPPLE, but a quick search on the runelite repo does not bring up any results for this. Are you on some sort of experimental branch for runelite or so?

(@LlemonDuck You seem to be familiar with Apple+JOCL+runelite - maybe you have an idea, or can provide some pointers on what I could look at - either here, or in the runelite issue that George linked to.)

LlemonDuck commented 3 years ago

We've instructed @GeorgeKosrs that this is a driver issue in the AMD macOS driver. Sorry to bother you, this is definitely not a JOCL thing.

For reference, yes, we don't use clCreateCommandQueueWithPropertiesAPPLE, but the stack traces also don't include any Java calls whatsoever, so there's nothing we can do. Something deep inside the driver causes segfaults only intermittently, and without the Java stack we can't trace it down to specific calls.

GeorgeKosrs commented 3 years ago

Thank you for looking, I really appreciate it @gpu

I am currently running runelite in debug mode to see if it spits out anything useful that can diagnose the issue. @aHooder has provided me with a potential fix also which I am testing.

gpu commented 3 years ago

Thanks for the feedback @LlemonDuck .

I wondered whether it might be the case that, very roughly speaking, the current call to clCreateCommandQueue that is done from runelite dispatches to clCreateCommandQueueWithPropertiesAPPLE internally (i.e. inside of Apple's CL implementation). This could to some extent explain the not-so-informative stack trace. But that's a wild guess.

My first, naive approach to "trying things out" here would be very pragmatic: create a program that does nothing other than

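// Sketch only: context and device must already exist. l[0] is not defined
// in this snippet; it presumably holds a queue-properties bit mask.
int count = 0;
while (true) {
    cl_command_queue commandQueue = clCreateCommandQueue(context, device, l[0] & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, null);
    clReleaseCommandQueue(commandQueue);
    System.out.println("Are we there yet? " + (count++));
}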

to see whether this is some sort of "internal memory leak" that comes from creating+destroying too many command queues and eventually causes the creation to fail. If this is not the case, any further analysis might be difficult, though....

GeorgeKosrs commented 3 years ago

I got it to crash again in debug mode. maybe there is some useful information

Screenshot 2021-03-27 at 4 25 30 pm

hs_err_pid21891.log

gpu commented 3 years ago

Thanks for the details. But the most important part for diving into the reason for the error is the stack trace, and this only contains the stack trace that we already saw.

I just hacked together a basic test that simply creates command queues (schedules some small task, so that they are used at all), and destroys them immediately, in a loop.

Of course, this is only a very simple test to see whether the reason for the crash might be the repeated creation/deletion of the command queues. But runelite does more complex stuff, and the crash may be caused by an "invalid operation" in some way.

But if you want to give it a try, that might bring some insights. (Note that I added a short sleep there. You might want to remove this, depending on the outcome of the first tests).

Again: That's a VERY crude test, but may be a first step for narrowing down the search space.

(E.g. if this turns out to be a problem, one could consider proposing a PR for runelite where the command queue is not re-created, but re-used as long as possible or so ...)

package org.jocl.test;

import static org.jocl.CL.*;

import org.jocl.CL;
import org.jocl.Pointer;
import org.jocl.Sizeof;
import org.jocl.cl_command_queue;
import org.jocl.cl_context;
import org.jocl.cl_context_properties;
import org.jocl.cl_device_id;
import org.jocl.cl_mem;
import org.jocl.cl_platform_id;

public class AppleCommandQueueTest
{
    private static cl_context context;
    private static cl_device_id device;
    private static cl_mem buffer;

    public static void main(String[] args)
    {
        defaultInitialization();
        createCommandQueuesUntilCrash();
    }

    private static void createCommandQueuesUntilCrash()
    {
        final long delayMs = 50;

        int count = 0;
        while (true)
        {
            cl_command_queue commandQueue = clCreateCommandQueue(context,
                device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, null);

            // Do some work to use the command queue at least once...
            clEnqueueFillBuffer(commandQueue, buffer,
                Pointer.to(new float[] { 0.0f }), Sizeof.cl_float,
                0, 1 * Sizeof.cl_float, 0, null, null);
            clFinish(commandQueue);

            clReleaseCommandQueue(commandQueue);
            System.out.println("Are we there yet? " + (count++));

            try
            {
                Thread.sleep(delayMs);
            }
            catch (InterruptedException e)
            {
                e.printStackTrace();
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private static void defaultInitialization() 
    {
        // The platform, device type and device number
        // that will be used
        final int platformIndex = 0;
        final long deviceType = CL_DEVICE_TYPE_ALL;
        final int deviceIndex = 0;

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Obtain the number of platforms
        int numPlatformsArray[] = new int[1];
        clGetPlatformIDs(0, null, numPlatformsArray);
        int numPlatforms = numPlatformsArray[0];

        // Obtain a platform ID
        cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_platform_id platform = platforms[platformIndex];

        // Initialize the context properties
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);

        // Obtain the number of devices for the platform
        int numDevicesArray[] = new int[1];
        clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
        int numDevices = numDevicesArray[0];

        // Obtain a device ID 
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        device = devices[deviceIndex];

        // Create a context for the selected device
        context = clCreateContext(
            contextProperties, 1, new cl_device_id[]{device}, 
            null, null, null);

        buffer = clCreateBuffer(context, CL_MEM_READ_ONLY, 
            1 * Sizeof.cl_float, null, null);
    }

}
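The re-use idea mentioned above (keep one command queue alive instead of re-creating it) could be sketched roughly as follows. This is only an illustration under assumptions: `ReusableHandle` and `ReusableHandleDemo` are hypothetical names, not part of JOCL or runelite, and a plain `Object` stands in for `cl_command_queue` so that the sketch runs without an OpenCL device. How runelite actually manages its queue is not shown in this thread.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of JOCL or runelite): creates a native
 * resource lazily, hands out the same instance on every call, and
 * releases it at most once per creation.
 */
final class ReusableHandle<T> implements AutoCloseable
{
    private final Supplier<T> factory;
    private final Consumer<T> releaser;
    private T handle;
    private int creations;

    ReusableHandle(Supplier<T> factory, Consumer<T> releaser)
    {
        this.factory = factory;
        this.releaser = releaser;
    }

    /** Returns the cached handle, creating it on first use. */
    synchronized T get()
    {
        if (handle == null)
        {
            handle = factory.get();
            creations++;
        }
        return handle;
    }

    /** Number of times the factory was actually invoked. */
    synchronized int creationCount()
    {
        return creations;
    }

    /** Releases the handle if one exists; safe to call repeatedly. */
    @Override
    public synchronized void close()
    {
        if (handle != null)
        {
            releaser.accept(handle);
            handle = null;
        }
    }
}

final class ReusableHandleDemo
{
    public static void main(String[] args)
    {
        // A plain Object stands in for cl_command_queue (assumption for the demo).
        try (ReusableHandle<Object> queue = new ReusableHandle<>(
            Object::new, q -> System.out.println("released")))
        {
            for (int i = 0; i < 1000; i++)
            {
                queue.get(); // same instance on every iteration
            }
            System.out.println("created " + queue.creationCount() + " time(s)");
        }
    }
}
```

With JOCL, the factory and releaser might be wired up as `() -> clCreateCommandQueue(context, device, 0, null)` and `q -> clReleaseCommandQueue(q)`, assuming `context` and `device` were initialized as in the test above. Whether this avoids the crash is untested; it only removes the repeated create/release cycle from the picture.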

aHooder commented 3 years ago

@gpu thanks for your time looking into this. I'll see if I can help @GeorgeKosrs test this when I have time and we'll let you know how it goes.

petermckeown commented 10 months ago

I had a similar issue on macOS that seems to be fixed by a newer JDK (Temurin 21 or OpenJDK 8u392).

After I ran a few hundred OpenCL contexts, I got an error that contained:

Problematic frame: V [libjvm.dylib+0x345bac] JNIHandles::destroy_global(_jobject*)+0x10

Running with the VM argument -Xcheck:jni gave:

FATAL ERROR in native method: Bad global or local ref passed to JNI
    at org.jocl.CL.clReleaseContextNative(Native Method)
    at org.jocl.CL.clReleaseContext(CL.java:4677)
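For anyone trying to reproduce this, it may help to confirm which JDK build is actually running and whether -Xcheck:jni really reached the JVM, since launchers sometimes swallow VM arguments. A minimal sketch using only standard JDK APIs follows; the class name `JniCheckInfo` is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class JniCheckInfo
{
    /** True if the given JVM argument list contains -Xcheck:jni. */
    static boolean hasCheckJni(List<String> jvmArgs)
    {
        return jvmArgs.contains("-Xcheck:jni");
    }

    public static void main(String[] args)
    {
        // Arguments passed to the JVM itself (not the program arguments)
        List<String> jvmArgs =
            ManagementFactory.getRuntimeMXBean().getInputArguments();

        System.out.println("java.version   = "
            + System.getProperty("java.version"));
        System.out.println("java.vendor    = "
            + System.getProperty("java.vendor"));
        System.out.println("-Xcheck:jni on = " + hasCheckJni(jvmArgs));
    }
}
```

Started as `java -Xcheck:jni JniCheckInfo`, the last line should report `true`. The same check could be dropped into the start of a test program before any OpenCL calls are made.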