Non-blocking clEnqueueWriteBuffer does not work in the runtime

Bastacyclop commented 3 years ago

In the "one copy" buffer runtime, deviceBufferSync uses clEnqueueWriteBuffer. I would have thought that we could use a non-blocking call there, but that produces bugs (e.g. in https://github.com/rise-lang/shine/pull/203, output is fixed by changing to a blocking call https://github.com/rise-lang/shine/pull/203/commits/3eb189abef2447041a0e009f6b07ca072a06ae6c). The idea was that before accessing the ptr on the host, there should be a blocking call to clEnqueueReadBuffer via hostBufferSync (which should be ordered after the clEnqueueWriteBuffer and most likely a kernel call on the command queue).

Any idea why the non-blocking call does not work and how it could be fixed? @fedepiz @michel-steuwer

The code where things go wrong:

#include "ocl/ocl.h"
struct foo_t {
  Kernel k0;
};

typedef struct foo_t foo_t;

void foo_init(Context ctx, foo_t* self){
  (*self).k0 = loadKernel(ctx, k0);
}

void foo_destroy(Context ctx, foo_t* self){
  destroyKernel(ctx, (*self).k0);
}

void foo_run(Context ctx, foo_t* self, Buffer moutput, int n1, Buffer me2){
  {
    Buffer mx101 = createBuffer(ctx, n1 * sizeof(int32_t), HOST_WRITE | DEVICE_READ);
    {
      int32_t* x101 = (int32_t*)hostBufferSync(ctx, mx101, n1 * sizeof(int32_t), HOST_WRITE);
      int32_t* e2 = (int32_t*)hostBufferSync(ctx, me2, n1 * sizeof(int32_t), HOST_READ);
      /* mapSeq */
      for (int i_110 = 0; i_110 < n1; i_110 = 1 + i_110) {
        x101[i_110] = ((int32_t)2) + e2[i_110];
      }

    }

    {
      DeviceBuffer b0 = deviceBufferSync(ctx, moutput, n1 * sizeof(int32_t), DEVICE_WRITE);
      DeviceBuffer b2 = deviceBufferSync(ctx, mx101, n1 * sizeof(int32_t), DEVICE_READ);
      const size_t global_size[3] = (const size_t[3]){n1 / 2, 1, 1};
      const size_t local_size[3] = (const size_t[3]){2, 1, 1};
      const KernelArg args[3] = (const KernelArg[3]){KARG(b0), KARG(n1), KARG(b2)};
      launchKernel(ctx, (*self).k0, global_size, local_size, 3, args);
    }

    destroyBuffer(ctx, mx101);
  }

}

void foo_init_run(Context ctx, Buffer moutput, int n1, Buffer me2){
  foo_t foo;
  foo_init(ctx, &foo);
  foo_run(ctx, &foo, moutput, n1, me2);
  foo_destroy(ctx, &foo);
}

const int N = 64;
int main(int argc, char** argv) {
  Context ctx = createDefaultContext();
  Buffer input = createBuffer(ctx, N * sizeof(int32_t), HOST_READ | HOST_WRITE | DEVICE_READ);
  Buffer output = createBuffer(ctx, N * sizeof(int32_t), HOST_READ | HOST_WRITE | DEVICE_WRITE);

  int32_t* in = hostBufferSync(ctx, input, N * sizeof(int32_t), HOST_WRITE);
  for (int i = 0; i < N; i++) {
    in[i] = 0;
  }

  foo_init_run(ctx, output, N, input);

  int32_t* out = hostBufferSync(ctx, output, N * sizeof(int32_t), HOST_READ);

  for (int i = 0; i < N; i++) {
    if (out[i] != 3) {
      fprintf(stderr, "wrong output: %i\n", out[i]);
      exit(EXIT_FAILURE);
    }
  }

  destroyBuffer(ctx, input);
  destroyBuffer(ctx, output);
  destroyContext(ctx);
  return EXIT_SUCCESS;
}

Bastacyclop commented 3 years ago

buffer_one_copy.c

fedepiz commented 3 years ago

I suspect this may be because the clEnqueueWriteBuffer may not be blocking for the device as well as the host. So even enqueing WRITE -> KERNEL -> READ_BLOCKING does not mean that KERNEL starts after WRITE ends. I suspect what we want is rather WRITE -> WAIT(event from the write) -> KERNEL -> WAIT(event from the kernel) -> READ_BLOCKING

In general, the specification claims that the application cannot use the target pointer of a non-blocking write before waiting on the event. I originally took it to only be relevant for host-pointers, but the spec doesn't specify this. It jus says "the application" and "the memory pointed"

Link to relevant docs: https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueWriteBuffer.html

Bastacyclop commented 3 years ago

Consolidates what you said: https://community.intel.com/t5/OpenCL-for-CPU/clEnqueueWriteBuffer-does-not-finish-before-Kernel/td-p/1077032

michel-steuwer commented 3 years ago

If I understand your speculations here correctly, you think there might be an issue with the synchronization of the memory operation and the kernel execution.

If you are using different command queues for memory operations and kernels, then you must use events to synchronize them. If you use only a single command queue for both (and you configure it to be in order), then every command will wait until all prior commands have finished before execution.

fedepiz commented 3 years ago

Mh. I am getting contradictory statements from the documentation. One one hand, the https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html docs agree with what you say

If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order. For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed. If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A.@

On the other hand, the documentations for https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueWriteBuffer.html states prosaically

If blocking_write is CL_FALSE, the OpenCL implementation will use ptr to perform a nonblocking write. As the write is non-blocking the implementation can return immediately. The memory pointed to by ptr cannot be reused by the application after the call returns. The event argument returns an event object which can be used to query the execution status of the write command. When the write command has completed, the memory pointed to by ptr can then be reused by the application.

Could it be that for kernel launches an in-order queue guarantees sequential execution, but the same is not true for a non-blocking clEnqueueWriteBuffer?

I think I may experiment on the side and see

EDIT:

A bit later, the docs on the command queue state

Similarly, commands to read, write, copy or map memory objects that are enqueued after clEnqueueNDRangeKernel, clEnqueueTask or clEnqueueNativeKernel commands are not guaranteed to wait for kernels scheduled for execution to have completed (if the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property is set). To ensure correct ordering of commands, the event object returned by clEnqueueNDRangeKernel, clEnqueueTask or clEnqueueNativeKernel can be used to enqueue a wait for event or a barrier command can be enqueued that must complete before reads or writes to the memory object(s) occur.

Which is a bit vague (strictly speaking, it doesn't't tell me what happens when OUT_OF_ORDER is disabled, or what happens if the events are BEFORE the kernel, as it said "after" the enqueuing of a kernel) but seems to imply @michel-steuwer a bit more. If that's the case, I am really not sure.

Bastacyclop commented 3 years ago

@fedepiz @michel-steuwer Here is a log of the runtime and associated OpenCL calls for this example:

hostBufferSync:    (none)                      // for input write
// host code: in[i] = 0;
hostBufferSync:    (none)                      // for temporary write
hostBufferSync:    (none)                      // for input read
// host code: x101[i] = 2 + e2[i];
deviceBufferSync:  (none)                      // for output write
deviceBufferSync:  clEnqueueWriteBuffer        // for temporary read
launchKernel:      clEnqueueNDRangeKernel
hostBufferSync:    clEnqueueReadBuffer         // for output read

rise-lang / shine

Non-blocking clEnqueueWriteBuffer does not work in the runtime #204