inducer / pyopencl

OpenCL integration for Python, plus shiny features
http://mathema.tician.de/software/pyopencl

GenericScanKernel produces wrong output for segmented scan, GenericDebugScanKernel output is correct. #411

Closed sinsemellan closed 3 years ago

sinsemellan commented 3 years ago

When the scan expression contains constant values, e.g. scan_expr="across_seg_boundary ? 10 : 5", the first value (10) is selected for all remaining entries once the first segment boundary has been crossed. A kernel generated by GenericDebugScanKernel produces correct results. Tested with version 2020.3.1 on oclgrind, an Intel CPU, and an NVIDIA device. Failing test case below: the first assertion (using GenericDebugScanKernel) passes, the second one (using GenericScanKernel) fails:

    def test_segmented_scan_2(self):
        ctx = cl.create_some_context(interactive=False, answers=["1", "1"])
        queue = cl.CommandQueue(ctx)

        values = np.zeros(4, dtype=np.int32)
        values[0] = 5
        values[1] = 2
        values[2] = 3
        values[3] = 1

        group_heads = np.zeros(4, dtype=np.int8)
        group_heads[0] = 0
        group_heads[1] = 0
        group_heads[2] = 1
        group_heads[3] = 0

        g_values = clarray.to_device(queue, values)
        g_group_heads = clarray.to_device(queue, group_heads)
        g_results = clarray.empty_like(g_values)

        knl = GenericDebugScanKernel(
            ctx,
            np.int32,
            arguments=["__global const int *values", "__global const char *group_heads", "__global int *results"],
            input_expr="values[i]",
            is_segment_start_expr="(group_heads[i] == 1)",
            scan_expr="across_seg_boundary ? 10 : 5",
            neutral="0",
            output_statement="results[i] = item",
        )

        knl(g_values, g_group_heads, g_results, queue=queue)

        expected_result = np.empty_like(values)
        expected_result[0] = 5
        expected_result[1] = 5
        expected_result[2] = 10
        expected_result[3] = 5

        np.testing.assert_array_equal(expected_result, g_results.get())

        knl = GenericScanKernel(
            ctx,
            np.int32,
            arguments=["__global const int *values", "__global const char *group_heads", "__global int *results"],
            input_expr="values[i]",
            is_segment_start_expr="(group_heads[i] == 1)",
            scan_expr="across_seg_boundary ? 10 : 5",
            neutral="0",
            output_statement="results[i] = item",
        )

        knl(g_values, g_group_heads, g_results, queue=queue)

        np.testing.assert_array_equal(expected_result, g_results.get())
inducer commented 3 years ago

Sorry, across_seg_boundary doesn't work like that; you can't use it to set a flag at exactly the segment boundary. In a truly parallel scan (unlike the debug version), the scan expression is evaluated for array elements that are not neighbors. See this slide for an idea:

https://andreask.cs.illinois.edu/cs598apk-f18/notes.pdf#page=206

When across_seg_boundary is true, the scan expression must act as if the input a had not been supplied to the expression. Yours does not meet this criterion. I'd be grateful for a doc contribution explaining this.
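
To make the criterion concrete, here is a minimal pure-Python sketch (no OpenCL involved). It is only a model: serial_scan and doubling_scan are hypothetical helpers written for this illustration, and the doubling (Hillis-Steele style) scheme is assumed as a stand-in for a parallel scan, not a description of how GenericScanKernel is actually implemented. It runs the expression from the report and a segmented sum on the data from the test case above.

```python
def serial_scan(values, seg_starts, scan_expr):
    """Left-to-right scan: every combine step involves direct neighbors."""
    out = [values[0]]
    for i in range(1, len(values)):
        out.append(scan_expr(out[-1], values[i], bool(seg_starts[i])))
    return out


def doubling_scan(values, seg_starts, scan_expr):
    """Doubling scan: combines partial results 1, 2, 4, ... elements apart."""
    vals = list(values)
    # crossed[i]: does the range summarized by vals[i] contain a segment start?
    crossed = [bool(f) for f in seg_starts]
    stride = 1
    while stride < len(vals):
        new_vals, new_crossed = list(vals), list(crossed)
        for i in range(stride, len(vals)):
            new_vals[i] = scan_expr(vals[i - stride], vals[i], crossed[i])
            new_crossed[i] = crossed[i - stride] or crossed[i]
        vals, crossed = new_vals, new_crossed
        stride *= 2
    return vals


# Python stand-ins for the C scan expressions (a, b, across_seg_boundary).
def flag_expr(a, b, across):
    # Models "across_seg_boundary ? 10 : 5" from the report.
    return 10 if across else 5


def seg_sum(a, b, across):
    # Models "across_seg_boundary ? b : (a + b)": a is dropped at a boundary.
    return b if across else a + b


values = [5, 2, 3, 1]
group_heads = [0, 0, 1, 0]

print(serial_scan(values, group_heads, flag_expr))    # [5, 5, 10, 5]
print(doubling_scan(values, group_heads, flag_expr))  # [5, 5, 10, 10]
print(serial_scan(values, group_heads, seg_sum))      # [5, 7, 3, 4]
print(doubling_scan(values, group_heads, seg_sum))    # [5, 7, 3, 4]
```

In this model the constant-valued expression gives the expected [5, 5, 10, 5] only when neighbors are combined one at a time; once partial results further apart are combined, the 10 spreads to the remaining entries, which matches the behavior described in the report. The segmented sum returns b unchanged across a boundary, which is one way of acting as if a had not been supplied, and it produces the same output under both evaluation orders.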