halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Internal Error at .../anderson2021/SearchSpace.cpp:486 ... Condition failed: !parallel_tilings.empty(): zero parallel tilings #8246

Open jansel opened 4 months ago

jansel commented 4 months ago

repro.py:

import halide as hl

@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
    in_ptr1 = hl.InputBuffer(hl.Float(32), 1)
    out_ptr0 = hl.OutputBuffer(hl.Float(32), 1)

    def generate(g):
        in_ptr0 = g.in_ptr0
        in_ptr1 = g.in_ptr1
        out_ptr0 = g.out_ptr0
        # Pointwise add of the single element of each input buffer.
        tmp0 = in_ptr0[0]
        tmp1 = in_ptr1[0]
        tmp2 = tmp0 + tmp1
        out_ptr0[hl.Var()] = tmp2

        # Single-element estimates for the autoscheduler.
        assert g.using_autoscheduler()
        in_ptr0.set_estimates([hl.Range(0, 1)])
        in_ptr1.set_estimates([hl.Range(0, 1)])
        out_ptr0.set_estimates([hl.Range(0, 1)])

if __name__ == "__main__":
    import sys, tempfile

    with tempfile.TemporaryDirectory() as out:
        sys.argv = ['repro.py', '-g', 'kernel', '-o', out, '-f', 'halide_kernel', '-e', 'static_library,h,schedule',
                    '-p', '/home/jansel/conda/envs/pytorch/lib/libautoschedule_anderson2021.so',
                    'target=host-cuda-cuda_capability_86-strict_float-no_asserts', 'autoscheduler=Anderson2021']
        hl.main()

Note: you will need to update the path to libautoschedule_anderson2021.so for your system.
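
If you don't know where the plugin lives, something like the following can help track it down. This is purely illustrative, standard-library only, and the search roots are guesses that should be adjusted for your setup:

import glob
import os

# Illustrative only: search a few common install prefixes for the
# Anderson2021 autoscheduler plugin.
roots = [os.environ.get("CONDA_PREFIX", ""), "/usr/local", os.path.expanduser("~/Halide")]
for root in filter(None, roots):
    hits = glob.glob(os.path.join(root, "**", "libautoschedule_anderson2021.*"), recursive=True)
    if hits:
        print(hits[0])
        break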

Output:

Unhandled exception: Internal Error at /home/jansel/Halide/src/autoschedulers/anderson2021/SearchSpace.cpp:486 triggered by user code at : Condition failed: !parallel_tilings.empty():  zero parallel tilings

Traceback (most recent call last):
  File "/home/jansel/pytorch/repro.py", line 32, in <module>
    hl.main()
RuntimeError: Generator failed: -1

This example is just adding two 1-element tensors.

Possible workarounds:

abadams commented 4 months ago

So this pipeline is a single scalar add operation?

I don't think any of us expected anyone to try to autoschedule a pipeline that does O(1) work. I think the appropriate schedule is gpu_single_thread(), but nobody taught the autoscheduler how to use that.
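
For reference, a rough sketch of what that hand-written schedule could look like if the autoscheduler is bypassed (i.e. dropping the -p plugin flag and autoscheduler=Anderson2021 from the command line). It assumes the Python bindings forward gpu_single_thread() from the C++ Func scheduling API to the output the same way the C++ generator API does, so treat it as a sketch rather than a verified schedule:

import halide as hl

@hl.generator(name="kernel_manual")
class KernelManual:
    # Same I/O signature as the repro above.
    in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
    in_ptr1 = hl.InputBuffer(hl.Float(32), 1)
    out_ptr0 = hl.OutputBuffer(hl.Float(32), 1)

    def generate(g):
        x = hl.Var("x")
        g.out_ptr0[x] = g.in_ptr0[0] + g.in_ptr1[0]

        # Hand-written schedule: the pipeline does O(1) work, so run the
        # whole thing in a single GPU thread instead of searching for a
        # parallel tiling. (Assumes the output forwards Func scheduling
        # calls, as in the C++ generator API.)
        g.out_ptr0.gpu_single_thread()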

jansel commented 4 months ago

Yeah correct, it should be pretty trivial to schedule -- but it is a corner case the scheduler doesn't handle. This is coming from a unit test, but you occasionally have scalar operations (for example a learning rate update) in real models.

#8256 has a more complicated example (a reduction to a single element) with similar errors. Reductions to a single element often happen in things like layernorm or softmax. Those are harder to schedule, since you have very little parallelism at the very end. You either need atomics, syncs, or multiple kernels.
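
To illustrate the "multiple kernels" option for that single-element-reduction case, here is a hedged sketch of the usual staging trick in Halide: reduce in parallel-friendly chunks first, then finish with a tiny second stage. The names (staged_sum, partial, total) and the chunk size are made up for the example, it assumes n is a multiple of the chunk size, and it is not the schedule the Anderson2021 autoscheduler would produce:

import halide as hl

def staged_sum(in_buf, n, chunk=256):
    # Two-stage sum of an n-element float32 buffer (n assumed to be a
    # multiple of `chunk` to keep the sketch short).
    x = hl.Var("x")
    c = hl.Var("c")

    # Stage 1: one partial sum per chunk. This stage is n // chunk wide,
    # so a scheduler still has parallelism to work with (e.g. one GPU
    # block per chunk).
    rc = hl.RDom([hl.Range(0, chunk)])
    partial = hl.Func("partial")
    partial[c] = hl.cast(hl.Float(32), 0)
    partial[c] = partial[c] + in_buf[c * chunk + rc.x]

    # Stage 2: a small serial reduction over the partial sums; only this
    # final step has essentially no parallelism left.
    rp = hl.RDom([hl.Range(0, n // chunk)])
    total = hl.Func("total")
    total[x] = hl.cast(hl.Float(32), 0)
    total[x] = total[x] + partial[rp.x]
    return total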