gpu / JOCL

Java bindings for OpenCL
http://www.jocl.org

Non deterministic segfault with Intel CPU #12

Open TPolzer opened 7 years ago

TPolzer commented 7 years ago

I have some Scala code that puts the contents of an iterator into OpenCL device memory. If I do this in parallel on the Intel CPU OpenCL implementation, it segfaults most of the time. I have reduced it to the following code:

import org.jocl.CL._
import org.jocl._
import java.nio.ByteOrder
import java.nio.ByteBuffer

object OpenCL
{
  setExceptionsEnabled(true)
  val deviceType = CL_DEVICE_TYPE_CPU
  val devices = {
    val numPlatforms = Array(0)
    clGetPlatformIDs(0, null, numPlatforms)
    val platforms = new Array[cl_platform_id](numPlatforms(0))
    clGetPlatformIDs(platforms.length, platforms, null)
    platforms.flatMap(platform => {
      try {
        val contextProperties = new cl_context_properties
        contextProperties.addProperty(CL_CONTEXT_PLATFORM, platform)
        val numDevices = Array(0)
        clGetDeviceIDs(platform, deviceType, 0, null, numDevices)
        val devices = new Array[cl_device_id](numDevices(0))
        clGetDeviceIDs(platform, deviceType, numDevices(0), devices, null)
        devices.flatMap(device => {
          try{
            val vendorIdBuffer = new Array[Byte](1024)
            clGetDeviceInfo(device, CL_DEVICE_VENDOR, 1024, Pointer.to(vendorIdBuffer), null)
            val vendorId = new String(vendorIdBuffer, "UTF-8")
            if(!vendorId.matches(".*Intel.*")) {
              None
            } else {
              println(vendorId)
              val context = clCreateContext(contextProperties, 1, Array(device), null, null, null)
              val queue = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE, null)
              Some(new OpenCLSession(context, queue, device))
            }
          } catch {
            case e: CLException => None
          }
        })
      } catch {
        case e: CLException => Nil
      }
    })
  }
  def main(args: Array[String]) : Unit = println((0 to 30).par.map(x => OpenCL.devices(0).stream((0 to 1024*1024*256).iterator.map(_.toDouble),1024*1024*256)))
}

class OpenCLSession (val context: cl_context, val queue: cl_command_queue, val device: cl_device_id)
{
  def stream(it: Iterator[Double], groupSize: Int = 1024*1024*256) : cl_mem = {
    var on_host : Option[cl_mem] = None
    try {
      on_host = Some(clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, groupSize, null, null))
      val rawBuffer = clEnqueueMapBuffer(queue, on_host.get, true, CL_MAP_WRITE, 0, groupSize, 0, null, null, null)
      val buffer = rawBuffer.order(ByteOrder.nativeOrder).asDoubleBuffer
      var copied = 0
      while(copied < groupSize/Sizeof.cl_double && it.hasNext) {
        buffer.put(copied, it.next)
        copied += 1
      }
      clEnqueueUnmapMemObject(queue, on_host.get, rawBuffer, 0, null, null)
      clRetainMemObject(on_host.get)
      on_host.get
    } finally {
      on_host.foreach(clReleaseMemObject)
    }
  }

  override def finalize = {
    clReleaseCommandQueue(queue)
    clReleaseContext(context)
  }
}

I suspect it is some unfortunate interference between the JVM and the Intel OpenCL implementation. I would be glad for some expert judgment on this.

gpu commented 7 years ago

Sorry for the delay. "Non deterministic" sounds concerning. Admittedly, I'll have to re-read this, as I'm not really familiar with Scala. Maybe I can also try it out; I wanted to give Scala another try anyhow, but I'm not sure when I will be able to do this.

Until then: Where does the parallelism come into play here? I assume that you are really talking about multiple host threads, right?

And... is there a Java program that shows the same problem? (If not, I'll try out the Scala version ASAP).

TPolzer commented 7 years ago

Nothing Scala-specific here really; it's just that my code was already in Scala.

I am talking about multiple host threads all mapping and writing memory at the same time.

gpu commented 7 years ago

The main question was: Which parts of this code, exactly, are executed in parallel? I just don't know for sure what def stream(...) etc. is actually doing. The documentation at https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clEnqueueMapBuffer.html lists many cases in which the behavior is undefined, and I'd like to figure out whether one of these cases applies here. For example, it says that

Mapping (and unmapping) overlapped regions of a buffer or image memory object for writing is undefined.

and I'm not sure whether this is the case here.

Otherwise, I'll try to have a closer look at this soon, but the mix of guessing, reading Scala-docs and deriving what it is likely doing in the background seems ... non-deterministic ;-) I'd really like to pin this down to a case that I can analyze quickly and reliably.

TPolzer commented 7 years ago

This is stream in Java:

import static org.jocl.CL.*;

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

import org.jocl.*;

public class OpenCLSession {
    final cl_context context;
    final cl_command_queue queue;
    final cl_device_id device;

    OpenCLSession(cl_context context, cl_command_queue queue, cl_device_id device) {
        this.context = context;
        this.queue = queue;
        this.device = device;
    }
    cl_mem stream(scala.collection.Iterator<Double> it) {
        int groupSize = 1024*1024*256;
        cl_mem on_host = null;
        try {
            on_host = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, groupSize, null, null);
            ByteBuffer rawBuffer = clEnqueueMapBuffer(queue, on_host, true, CL_MAP_WRITE, 0, groupSize, 0, null, null, null);
            DoubleBuffer buffer = rawBuffer.order(ByteOrder.nativeOrder()).asDoubleBuffer();
            int copied = 0;
            while(copied < groupSize/Sizeof.cl_double && it.hasNext()) {
                buffer.put(copied, it.next());
                copied += 1;
            }
            clEnqueueUnmapMemObject(queue, on_host, rawBuffer, 0, null, null);
            clRetainMemObject(on_host);
            return on_host;
        } finally {
            if(on_host != null)
                clReleaseMemObject(on_host);
        }
    }

    @Override
    protected void finalize() {
        clReleaseCommandQueue(queue);
        clReleaseContext(context);
    }
}

OpenCLSession is instantiated once by the Scala code, and then stream is called in parallel from a thread pool (with a fresh iterator for each call).

Since the buffers are all freshly allocated by clCreateBuffer and not used after initialization, they should not be overlapping.
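
For illustration, the calling pattern looks roughly like this when translated to plain Java (a sketch only: the fixed thread count and the PrimitiveIterator.OfDouble-based variant of stream are illustrative; the real code uses Scala's .par and passes a scala.collection.Iterator):

import java.util.ArrayList;
import java.util.List;
import java.util.PrimitiveIterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

import org.jocl.cl_mem;

public class ParallelStreamDriver {
    // Sketch: assumes a stream(...) variant that accepts a java.util.PrimitiveIterator.OfDouble;
    // the version posted above takes a scala.collection.Iterator<Double>.
    static void driveInParallel(OpenCLSession session) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // thread count is illustrative
        List<Future<cl_mem>> results = new ArrayList<>();
        for (int i = 0; i <= 30; i++) {
            // Each task gets its own fresh iterator, mirroring (0 to 30).par.map(...)
            results.add(pool.submit(() -> {
                PrimitiveIterator.OfDouble it =
                        IntStream.rangeClosed(0, 1024 * 1024 * 256).asDoubleStream().iterator();
                return session.stream(it);
            }));
        }
        for (Future<cl_mem> f : results) {
            System.out.println(f.get()); // each call maps, fills and unmaps its own buffer
        }
        pool.shutdown();
    }
}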

gpu commented 7 years ago

So I have tried out the original program that you posted, adjusted only so that it uses my CPU, which is an AMD one. I started the program several times and did not experience any crashes.

If I understood this correctly, then the answer to my question above (namely, where the actual parallelism comes into play) is

  def main(args: Array[String]): Unit = println((0 to 30).par.map(....
  //                                                       ^ here

Did I understand this correctly: when you remove this, you do not see any crashes?


During the crash, it should write the infamous hs_err... file somewhere. Can you post the relevant part (i.e. the top, including the stack traces) of this file? I just want to know, roughly, where the crash happens.

(If possible, I'd try to write a plain OpenCL (C++) program that uses 2 host threads, to narrow down the search space, and to check whether this problem might be caused by the Intel OpenCL implementation.)

TPolzer commented 7 years ago

I have tested this on several OpenCL implementations, and the Intel CPU one is the only one that crashes. I have only observed the crash under parallel execution (the more threads, the more likely it is to crash).

When it crashes, it produces no hs_err file. The core dump file (not sure if that helps) is huge.

TPolzer commented 7 years ago

Turns out the compressed core dump is not so huge after all; here it is (with 3 threads): http://bulsa.faui2k11.de/core.xz

gpu commented 7 years ago

Sorry, I'm not familiar with Linux and core dump analysis.

But when you say that it only happens on Intel CPUs, then it's not unlikely that the reason is actually a bug/limitation/constraint of the Intel OpenCL implementation. (I don't say that it is, only that it's not unlikely.)

In cases like this, I usually try to write a native OpenCL program that "does the same thing" (as far as reasonably possible), and consider the result as "ground truth": When it also crashes without the JOCL layer, then the reason is somewhere else.

Again: This could really be caused by JOCL or the "unfortunate interference" between JVM and OpenCL that you mentioned.

But finding a definite answer here may be tricky.

I'll try to allocate some time for creating a native implementation, with multiple threads, each mapping buffers, based on the given example. It will likely work for me (I can't test it on an Intel CPU), but maybe I can provide a minimal test case for you to try out on Intel. However, I can't give an exact time frame for this.
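
Just as a sketch (not the native program, and untested here): a Scala-free, JOCL-only variant where several plain Java threads each allocate, map, fill and unmap their own buffer could look roughly like this - the platform/device selection, the thread count and the (smaller) buffer size are only placeholders:

import static org.jocl.CL.*;

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

import org.jocl.*;

// Rough sketch of a Scala-free stress test; first platform, first CPU device,
// 4 threads and 64 MB buffers are arbitrary placeholders.
public class MapUnmapStressTest {
    public static void main(String[] args) throws InterruptedException {
        setExceptionsEnabled(true);

        // Pick the first platform and its first CPU device
        cl_platform_id[] platforms = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_device_id[] devices = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_CPU, 1, devices, null);

        cl_context_properties props = new cl_context_properties();
        props.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
        cl_context context = clCreateContext(props, 1, devices, null, null, null);

        final long bufferSize = 64 * 1024 * 1024; // smaller than the original 256 MB

        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, null);
                cl_mem mem = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, bufferSize, null, null);
                // Map, fill and unmap a private buffer, as in the posted stream(...) method
                ByteBuffer raw = clEnqueueMapBuffer(queue, mem, true, CL_MAP_WRITE,
                        0, bufferSize, 0, null, null, null);
                DoubleBuffer buf = raw.order(ByteOrder.nativeOrder()).asDoubleBuffer();
                for (int i = 0; i < bufferSize / Sizeof.cl_double; i++) {
                    buf.put(i, i);
                }
                clEnqueueUnmapMemObject(queue, mem, raw, 0, null, null);
                clFinish(queue);
                clReleaseMemObject(mem);
                clReleaseCommandQueue(queue);
            });
            threads[t].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        clReleaseContext(context);
        System.out.println("done");
    }
}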

TPolzer commented 7 years ago

I've tested this some more, and indeed, it looks like a HotSpot/Intel problem:

The same code (modified to use DoubleStream instead of a Scala Iterator) never segfaults when using the IBM Java implementation. With the OpenJDK one it segfaults most of the time, but more interestingly, the segfaults are still there on successful runs (as can be seen with strace -fe '' java ...); it's just that they are handled silently.

So my conclusion is that the HotSpot JVM deliberately relies on segfaults at some point, and the Intel SDK probably masks or deregisters the signal handler.

The only remaining question now is where to report this. :disappointed:

gpu commented 7 years ago

Thanks for the further investigation!

I'd first have to do some research on the options for debugging a segfault in such a complex setup (OpenCL accessed by JOCL accessed by a JVM...), particularly on Linux. But you already mentioned strace - shouldn't this generate some information about where the segfault happened? Interpreting this information is far from trivial, according to my first web search results - but maybe you could provide an example output, so that either a Linux guru could give a hint, or I could speculate and guess (although I doubt that this would help you much)?

And maybe it's possible to derive a hint about where to report this. Most likely, the OpenCL/JVM implementors will look skeptically at such a report and say "Are you sure it's not caused by the JVM/OpenCL implementation?" (respectively). Having a stack trace might help to pin down the culprit. (And I'm still crossing fingers that it's not actually JOCL in the end...)

(BTW: Apologies that I can't be as supportive as I'd like to be right now. The task of increasing test coverage is already on my todo list, together with e.g. issue 7, and this should probably include more extensive testing on other VMs and OSes - I'm currently very focused on Windows/Oracle...)

TPolzer commented 7 years ago

I posted this issue to the Intel forums: https://software.intel.com/en-us/forums/opencl/topic/733905 Let's see what they think.

gpu commented 7 years ago

Thanks for keeping this up. I think that only the Intel folks can really give an answer here, if they are willing to investigate it (the setup and configuration are very specific, and the issue is hard to reproduce). The first response at least sounds encouraging that they will have a look at this.

TPolzer commented 7 years ago

It shouldn't be that hard to reproduce with the jar I uploaded, although my mentioning that I basically don't care anymore seems to have caused a loss of interest on Intel's side.

gpu commented 7 years ago

Maybe a "bump" there could also be helpful, but it's not unlikely that they consider the problematic case as "too narrow" (or "too specific"). As for the JOCL side, I'm not sure what I could do now. (I've recently been busy with other stuff, and the JOCL work was only a few updates to JOCLBlast, but the open issues here are sill nagging me). In any case, I'll leave this one open as well, at least until it's clear whether Intel will still respond or not.