halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Internal Error at .../anderson2021/SearchSpace.cpp:486 ... Condition failed: !parallel_tilings.empty(): zero parallel tilings #8246

Open jansel opened 4 months ago

jansel commented 4 months ago

repro.py:

import halide as hl

@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
    in_ptr1 = hl.InputBuffer(hl.Float(32), 1)
    out_ptr0 = hl.OutputBuffer(hl.Float(32), 1)

    def generate(g):
        in_ptr0 = g.in_ptr0
        in_ptr1 = g.in_ptr1
        out_ptr0 = g.out_ptr0
        # Pointwise add of the single element of each input buffer.
        tmp0 = in_ptr0[0]
        tmp1 = in_ptr1[0]
        tmp2 = tmp0 + tmp1
        out_ptr0[hl.Var()] = tmp2

        # Single-element estimates for the autoscheduler.
        assert g.using_autoscheduler()
        in_ptr0.set_estimates([hl.Range(0, 1)])
        in_ptr1.set_estimates([hl.Range(0, 1)])
        out_ptr0.set_estimates([hl.Range(0, 1)])

if __name__ == "__main__":
    import sys, tempfile

    with tempfile.TemporaryDirectory() as out:
        sys.argv = ['repro.py', '-g', 'kernel', '-o', out, '-f', 'halide_kernel', '-e', 'static_library,h,schedule',
                    '-p', '/home/jansel/conda/envs/pytorch/lib/libautoschedule_anderson2021.so',
                    'target=host-cuda-cuda_capability_86-strict_float-no_asserts', 'autoscheduler=Anderson2021']
        hl.main()

Note: you will need to update the path to libautoschedule_anderson2021.so for your system.
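
If you don't know where the plugin lives, something like the following can help track it down. This is purely illustrative, standard-library only, and the search roots are guesses that should be adjusted for your setup:

import glob
import os

# Illustrative only: search a few common install prefixes for the
# Anderson2021 autoscheduler plugin.
roots = [os.environ.get("CONDA_PREFIX", ""), "/usr/local", os.path.expanduser("~/Halide")]
for root in filter(None, roots):
    hits = glob.glob(os.path.join(root, "**", "libautoschedule_anderson2021.*"), recursive=True)
    if hits:
        print(hits[0])
        break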

Output:

Unhandled exception: Internal Error at /home/jansel/Halide/src/autoschedulers/anderson2021/SearchSpace.cpp:486 triggered by user code at : Condition failed: !parallel_tilings.empty():  zero parallel tilings

Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 32, in <module>
    hl.main()
RuntimeError: Generator failed: -1

This example is just adding two 1-element tensors.

Possible workarounds:

abadams commented 4 months ago

So this pipeline is a single scalar add operation?

I don't think any of us expected anyone to try to autoschedule a pipeline that does O(1) work. I think the appropriate schedule is gpu_single_thread(), but nobody taught the autoscheduler how to use that.
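
For reference, a rough sketch of what that hand-written schedule could look like if the autoscheduler is bypassed (i.e. dropping the -p plugin flag and autoscheduler=Anderson2021 from the command line). It assumes the Python bindings forward gpu_single_thread() from the C++ Func scheduling API to the output the same way the C++ generator API does, so treat it as a sketch rather than a verified schedule:

import halide as hl

@hl.generator(name="kernel_manual")
class KernelManual:
    # Same I/O signature as the repro above.
    in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
    in_ptr1 = hl.InputBuffer(hl.Float(32), 1)
    out_ptr0 = hl.OutputBuffer(hl.Float(32), 1)

    def generate(g):
        x = hl.Var("x")
        g.out_ptr0[x] = g.in_ptr0[0] + g.in_ptr1[0]

        # Hand-written schedule: the pipeline does O(1) work, so run the
        # whole thing in a single GPU thread instead of searching for a
        # parallel tiling. (Assumes the output forwards Func scheduling
        # calls, as in the C++ generator API.)
        g.out_ptr0.gpu_single_thread()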

jansel commented 4 months ago

Yeah correct, it should be pretty trivial to schedule -- but it is a corner case the scheduler doesn't handle. This is coming from a unit test, but you occasionally have scalar operations (for example a learning rate update) in real models.

#8256 has a more complicated example (a reduction to a single element) with similar errors. Reductions to a single element often happen in things like layernorm or softmax. Those are harder to schedule, since you have very little parallelism at the very end. You either need atomics, syncs, or multiple kernels.
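
To illustrate the "multiple kernels" option for that single-element-reduction case, here is a hedged sketch of the usual staging trick in Halide: reduce in parallel-friendly chunks first, then finish with a tiny second stage. The names (staged_sum, partial, total) and the chunk size are made up for the example, it assumes n is a multiple of the chunk size, and it is not the schedule the Anderson2021 autoscheduler would produce:

import halide as hl

def staged_sum(in_buf, n, chunk=256):
    # Two-stage sum of an n-element float32 buffer (n assumed to be a
    # multiple of `chunk` to keep the sketch short).
    x = hl.Var("x")
    c = hl.Var("c")

    # Stage 1: one partial sum per chunk. This stage is n // chunk wide,
    # so a scheduler still has parallelism to work with (e.g. one GPU
    # block per chunk).
    rc = hl.RDom([hl.Range(0, chunk)])
    partial = hl.Func("partial")
    partial[c] = hl.cast(hl.Float(32), 0)
    partial[c] = partial[c] + in_buf[c * chunk + rc.x]

    # Stage 2: a small serial reduction over the partial sums; only this
    # final step has essentially no parallelism left.
    rp = hl.RDom([hl.Range(0, n // chunk)])
    total = hl.Func("total")
    total[x] = hl.cast(hl.Float(32), 0)
    total[x] = total[x] + partial[rp.x]
    return total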