halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org
Other
5.77k stars 1.07k forks source link

OpenCL error with `CL_MISALIGNED_SUB_BUFFER_OFFSET` #8235

Open micragz opened 1 month ago

micragz commented 1 month ago

Hello, and thanks for this amazing project.

We are seeing Halide errors on some devices. They seem to originate from Halide's OpenCL call to clCreateSubBuffer:

https://github.com/halide/Halide/blob/33d5ba9536f5aa6d3702c2c764b899a8f45b3b62/src/runtime/opencl.cpp#L1156

With CL_MISALIGNED_SUB_BUFFER_OFFSET being returned from here. This error code is described as follows from clCreateSubBuffer:

CL_MISALIGNED_SUB_BUFFER_OFFSET if there are no devices in context associated with buffer for which the origin field of the cl_buffer_region structure passed in buffer_create_info is aligned to the CL_DEVICE_MEM_BASE_ADDR_ALIGN value.

With the mentioned origin field being set from the Halide buffer's own offset.

In particular, we have seen this error on the Intel OpenCL CPU runtime and on AMD GPUs. On the other hand, NVidia and Intel GPUs in particular do not seem to report the misalignment error, as far as we have seen.

We have observed that the call to clCreateSubBuffer from Halide does seem to violate this condition occasionally on all devices, even though only some of them report OpenCL errors as a result. As an example, we can see an offset of 321984 bytes (aligned to 64 bytes), with a required device alignment of 128 bytes. So it seems that the driver implementations diverge in how lenient they are on this condition, but Halide often violates it regardless.

Are there some steps that can be taken to prevent violating this alignment condition from the Halide side?

abadams commented 1 month ago

Ugh, if some devices require that alignment, I think we need to reconsider using sub buffers at all, because we can't guarantee that sort of crop alignment.

zvookin commented 3 weeks ago

This happens when passing cropped device buffers to an OpenCL kernel compiled by Halide. Halide implements halide_device_crop for OpenCL by storing an offset in the Halide buffer structure. In order to pass a buffer with a non-zero offset to an OpenCL kernel, at least as Halide currently generates code for OpenCL, a sub-buffer must be created or the crop area copied to a new device buffer. The halide_opencl_run calls clCreateSubBuffer to create a sub-buffer as this is more efficient than copying. (And since the buffers are general read/write, copying requires copy-in/copy-out, though it is possible we have enough information to eliminate one side or the other in many cases.)

Unfortunately, sub-buffers must be aligned to the CL_DEVICE_MEM_BASE_ADDR_ALIGN value, which cannot be assumed to be 1. On devices where stricter alignment is required, a cropped device buffer can potentially cause a failure in halide_opencl_run as documented here. The issue does not include a reproducing case, but the easiest way to make a test case is to call halide_device_crop to form a buffer with an offset of 1 and pass that to Halide generated computation scheduled using OpenCL.

There are three possible ways to go here:

  1. Declare it an error. Likely improve the reporting in halide_opencl_run and document how it occurs.

  2. Query the OpenCL device for the required alignment and if the supplied input or output buffer does not meet the requirements, have the runtime copy it to a newly allocated buffer. This involves taking care to copy in and copy out as required. The copies can likely be asynchronous but the allocation and free probably have to serialize somewhat. And as with many things where we take care of this inside the runtime, there is little opportunity to reuse the temporary buffer or otherwise optimize/amortize the cost. That said, the cost would only be incurred when the current implementation would crash. We should likely think a bit on whether copying preserves semantics in all cases. E.g. aliasing.

  3. Compile the OpenCL kernel code to take an offset in addition to the buffer. Doing this for every single buffer is likely enough of a performance issue to merit programmer control in the schedule. I.e. one would have to mark something as unaligned to support this. (Or optionally, introduce a concept akin to host alignment in the schedule for GPU to allow generating more efficient code when proper alignment is asserted. However this is fairly dubious as the required alignment is not known at compile time, only runtime.)

Option 2 is likely the way to go. It's not a one liner, but it's not that difficult to code. It's more tricky to test without a platform handy that exhibits the behavior.