microsoft / OpenCLOn12

The OpenCL-on-D3D12 mapping layer
MIT License
104 stars 13 forks source link

doesn't expand localsize in clEnqueueNDRangeKernel() #50

Closed min-lee closed 7 months ago

min-lee commented 9 months ago

I found that this code in clEnqueueNDRangeKernel() doesn't work. This code is meant to expand size of localsize, but it fails because the condition DispatchDimensions[i] > D3D12_CS_DISPATCH_MAX_THREAD_GROUPS_PER_DIMENSION is almost always false.

        // Try to expand local size to avoid having to loop Dispatches
        while (DispatchDimensions[i] > D3D12_CS_DISPATCH_MAX_THREAD_GROUPS_PER_DIMENSION)
        {
            auto OldLocalSize = LocalSizes[i];
            LocalSizes[i] *= 2;
            if ((uint64_t)LocalSizes[0] * (uint64_t)LocalSizes[1] * (uint64_t)LocalSizes[2] > D3D12_CS_THREAD_GROUP_MAX_THREADS_PER_GROUP ||
                LocalSizes[i] > MaxDims[i] ||
                global_work_size[i] % LocalSizes[i] != 0)
            {
                LocalSizes[i] = OldLocalSize;
                break;
            }
            DispatchDimensions[i] /= 2;
        }
jenatali commented 8 months ago

The logic as written is correct, if conservative. While the number of thread-groups being launched is too many to fit in a single Dispatch call, we increase the size of the thread group. Otherwise, if it can fit in a single Dispatch with the base thread group sizes, we just use those thread group sizes. The base sizes could probably be made larger though...

jenatali commented 8 months ago

Reopening. This issue doesn't describe the problem well, but there is a problem: for a work group with prime / odd sizes, the local size will be shrunk down to 1 in that dimension. We should either pick a better size that still evenly divides the global size, or else we should pick a reasonable size for performance and kick the rest of the threads to a separate dispatch with a smaller thread group size.