Open GeorgeKosrs opened 3 years ago
It's hard to make a guess from the traces alone. A segfault can have many causes. The confusing thing is that (from what I've seen) the traces in the log files refer to clCreateCommandQueueWithPropertiesAPPLE, but a quick search on the runelite repo does not bring up any results for this. Are you on some sort of experimental branch of runelite?
(@LlemonDuck You seem to be familiar with Apple+JOCL+runelite. Maybe you have an idea, or can provide some pointers on what I could look at, either here or in the runelite issue that George linked to.)
We've told @GeorgeKosrs that this is an issue in the AMD macOS driver. Sorry to bother you, this is definitely not a JOCL thing.
For reference: yes, we don't use clCreateCommandQueueWithPropertiesAPPLE, but the stack traces also don't include any Java frames whatsoever, so there's nothing we can do. Something deep inside the driver causes segfaults only some of the time, and without the Java stack we can't trace it down to specific calls.
Thank you for looking, I really appreciate it @gpu
I am currently running runelite in debug mode to see if it spits out anything useful for diagnosing the issue. @aHooder has also provided me with a potential fix, which I am testing.
Thanks for the feedback @LlemonDuck .
I wondered whether it might be the case that, very roughly speaking, the current call to clCreateCommandQueue that is done from runelite dispatches to clCreateCommandQueueWithPropertiesAPPLE internally (i.e. inside of Apple's CL implementation). This could to some extent explain the not-so-informative stack trace. But that's a wild guess.
My first, naive approach of "trying things out" here would be very pragmatic: Create a program that does nothing else than
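// Assumes context and device are already initialized, and that l[0]
// holds the device's queue properties bitfield (presumably as in runelite)
int count = 0;
while (true) {
    cl_command_queue commandQueue = clCreateCommandQueue(context, device, l[0] & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, null);
    clReleaseCommandQueue(commandQueue);
    System.out.println("Are we there yet? " + (count++));
}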
to see whether this is some sort of "internal memory leak" from creating+destroying too many command queues, and eventually causes the creation to fail. If this is not the case, any further analysis might be difficult, though....
I got it to crash again in debug mode. Maybe there is some useful information.
Thanks for the details. But the most important part for diving into the reason for the error is the stack trace, and this only contains the stack trace that we already saw.
I just hacked together a basic test that simply creates command queues (schedules some small task, so that they are used at all), and destroys them immediately, in a loop.
Of course, this is only a very simple test to see whether the reason for the crash might be the repeated creation/deletion of the command queues. But runelite does more complex stuff, and the crash may be caused by an "invalid operation" of some kind.
But if you want to give it a try, that might bring some insights. (Note that I added a short sleep there. You might want to remove it, depending on the outcome of the first tests.)
Again: That's a VERY crude test, but may be a first step for narrowing down the search space.
(E.g. if this turns out to be a problem, one could consider proposing a PR for runelite where the command queue is not re-created, but re-used as long as possible or so ...)
package org.jocl.test;

import static org.jocl.CL.*;

import org.jocl.CL;
import org.jocl.Pointer;
import org.jocl.Sizeof;
import org.jocl.cl_command_queue;
import org.jocl.cl_context;
import org.jocl.cl_context_properties;
import org.jocl.cl_device_id;
import org.jocl.cl_mem;
import org.jocl.cl_platform_id;

public class AppleCommandQueueTest
{
    private static cl_context context;
    private static cl_device_id device;
    private static cl_mem buffer;

    public static void main(String[] args)
    {
        defaultInitialization();
        createCommandQueuesUntilCrash();
    }

    private static void createCommandQueuesUntilCrash()
    {
        final long delayMs = 50;
        int count = 0;
        while (true)
        {
            cl_command_queue commandQueue = clCreateCommandQueue(context,
                device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, null);

            // Do some work to use the command queue at least once...
            clEnqueueFillBuffer(commandQueue, buffer,
                Pointer.to(new float[] { 0.0f }), Sizeof.cl_float,
                0, 1 * Sizeof.cl_float, 0, null, null);
            clFinish(commandQueue);

            clReleaseCommandQueue(commandQueue);
            System.out.println("Are we there yet? " + (count++));
            try
            {
                Thread.sleep(delayMs);
            }
            catch (InterruptedException e)
            {
                e.printStackTrace();
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private static void defaultInitialization()
    {
        // The platform, device type and device number
        // that will be used
        final int platformIndex = 0;
        final long deviceType = CL_DEVICE_TYPE_ALL;
        final int deviceIndex = 0;

        // Enable exceptions and subsequently omit error checks in this sample
        CL.setExceptionsEnabled(true);

        // Obtain the number of platforms
        int numPlatformsArray[] = new int[1];
        clGetPlatformIDs(0, null, numPlatformsArray);
        int numPlatforms = numPlatformsArray[0];

        // Obtain a platform ID
        cl_platform_id platforms[] = new cl_platform_id[numPlatforms];
        clGetPlatformIDs(platforms.length, platforms, null);
        cl_platform_id platform = platforms[platformIndex];

        // Initialize the context properties
        cl_context_properties contextProperties = new cl_context_properties();
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform);

        // Obtain the number of devices for the platform
        int numDevicesArray[] = new int[1];
        clGetDeviceIDs(platform, deviceType, 0, null, numDevicesArray);
        int numDevices = numDevicesArray[0];

        // Obtain a device ID
        cl_device_id devices[] = new cl_device_id[numDevices];
        clGetDeviceIDs(platform, deviceType, numDevices, devices, null);
        device = devices[deviceIndex];

        // Create a context for the selected device
        context = clCreateContext(
            contextProperties, 1, new cl_device_id[]{device},
            null, null, null);

        // Create a one-element buffer that the test loop fills
        buffer = clCreateBuffer(context, CL_MEM_READ_ONLY,
            1 * Sizeof.cl_float, null, null);
    }
}
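If the test above does point to repeated creation/deletion as the trigger, the "re-use as long as possible" idea mentioned earlier could look roughly like the following. This is only a sketch and not runelite's or JOCL's code: the class name ReusableResource and its methods are hypothetical, and it is written against plain java.util.function types so that it does not depend on JOCL. It creates the resource lazily on first use, hands out the same instance afterwards, and releases it at most once.

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of JOCL or runelite): creates a resource
 * lazily, returns the same instance on every call, and releases it at
 * most once.
 */
final class ReusableResource<T> implements AutoCloseable
{
    private final Supplier<T> factory;
    private final Consumer<T> releaser;
    private T instance;
    private boolean closed;

    ReusableResource(Supplier<T> factory, Consumer<T> releaser)
    {
        this.factory = factory;
        this.releaser = releaser;
    }

    /** Returns the shared instance, creating it on first use. */
    public synchronized T get()
    {
        if (closed)
        {
            throw new IllegalStateException("Resource was already released");
        }
        if (instance == null)
        {
            instance = factory.get();
        }
        return instance;
    }

    /** Releases the instance if one was created; further calls do nothing. */
    @Override
    public synchronized void close()
    {
        if (instance != null)
        {
            releaser.accept(instance);
            instance = null;
        }
        closed = true;
    }
}
```

With JOCL, the factory could be a lambda like () -> clCreateCommandQueue(context, device, 0, null) and the releaser a method reference like CL::clReleaseCommandQueue, assuming the context and device outlive the queue. Whether this actually avoids the crash is unknown until the loop test has been run.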
@gpu thanks for your time looking into this. I'll see if I can help @GeorgeKosrs test this when I have time and we'll let you know how it goes.
I had a similar issue on macOS that seems to be fixed by a newer JDK (Temurin 21 or OpenJDK 8u392).
After I ran a few hundred OpenCL contexts, I got an error that contained: Problematic frame: V [libjvm.dylib+0x345bac] JNIHandles::destroy_global(_jobject*)+0x10
Running with the VM argument -Xcheck:jni gave: FATAL ERROR in native method: Bad global or local ref passed to JNI at org.jocl.CL.clReleaseContextNative(Native Method) at org.jocl.CL.clReleaseContext(CL.java:4677)
Can you please help me? When I enable the GPU plugin, it crashes my Mac after a few hours.
Here are my logs:
jvm_crash_pid_71781.log jvm_crash_pid_99320.log jvm_crash_pid_7039.log jvm_crash_pid_30251.log jvm_crash_pid_56604.log jvm_crash_pid_7992.log