halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Halide Auto-Scheduler Produces same Schedules with Different Parallelism Settings #8356

Open LiQi19970428 opened 2 months ago

LiQi19970428 commented 2 months ago

I encountered an issue while using Halide to accelerate matrix multiplication. I am using the following generator:

```cpp
#include "Halide.h"

using namespace Halide;

class MatrixMultiplyGenerator : public Halide::Generator<MatrixMultiplyGenerator> {
public:
    // Element type assumed double, to match the Expr(0.0) initializer below;
    // the original angle-bracketed type arguments were lost in formatting.
    Input<Buffer<double>> input{"input", 2};
    Input<Buffer<double>> input1{"input1", 2};
    Output<Buffer<double>> output{"output", 2};

    void generate() {
        Var i("i"), i_0("i_0");
        RDom i_1(0, 512);

        Func multiply("multiply");
        Expr zero = Expr(0.0);
        multiply(i, i_0) = zero;
        multiply(i, i_0) += input(i, i_1) * input1(i_1, i_0);

        output(i, i_0) = multiply(i, i_0);
    }

    void schedule() {
        if (using_autoscheduler()) {
            input.set_estimates({{0, 512}, {0, 512}});
            input1.set_estimates({{0, 512}, {0, 512}});
            output.set_estimates({{0, 512}, {0, 512}});
        }
    }
};

HALIDE_REGISTER_GENERATOR(MatrixMultiplyGenerator, mat_mul_gen)
```
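For context, a generator like this is typically compiled and then invoked from the command line; a sketch of such an invocation (the output directory, emitted-file list, and plugin path are assumptions based on the standard Halide generator flags, not taken from the report):

```shell
# Hypothetical invocation: select the Adams2019 autoscheduler and its
# parallelism setting via generator parameters, loading the plugin with -p.
./mat_mul_gen -g mat_mul_gen -o out -e schedule,static_library,h \
    -p libautoschedule_adams2019.so \
    target=host autoscheduler=Adams2019 autoscheduler.parallelism=8
```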

I am also using Halide's automatic scheduling tools. During compilation I tried different configurations, setting `autoscheduler.parallelism=8` and then `autoscheduler.parallelism=2` to control the number of CPU cores used, with the Adams2019 model. However, regardless of the setting, the generated schedule files are identical and the execution times are the same. Why is this happening, and is there a way to resolve it? The generated schedules are:

```cpp
// for target=x86-64-linux-avx-avx2-avx512-avx512_cannonlake-avx512_skylake-f16c-fma-sse41  // NOLINT
// with autoscheduler_params=autoscheduler=Adams2019 autoscheduler.parallelism=8

#include "Halide.h"

inline void apply_schedule_matrix_multiply_true(
    ::Halide::Pipeline pipeline,
    ::Halide::Target target
) {
    using ::Halide::Func;
    using ::Halide::MemoryType;
    using ::Halide::RVar;
    using ::Halide::TailStrategy;
    using ::Halide::Var;
    Func output = pipeline.get_func(3);
    Func multiply = pipeline.get_func(2);
    Var i(output.get_schedule().dims()[0].var);
    Var i_0(output.get_schedule().dims()[1].var);
    Var i_0i("i_0i");
    Var ii("ii");
    Var iii("iii");
    RVar r8_x(multiply.update(0).get_schedule().dims()[0].var);
    output
        .split(i_0, i_0, i_0i, 32, TailStrategy::ShiftInwards)
        .split(i, i, ii, 64, TailStrategy::ShiftInwards)
        .split(ii, ii, iii, 16, TailStrategy::ShiftInwards)
        .vectorize(iii)
        .compute_root()
        .reorder({iii, ii, i_0i, i, i_0})
        .parallel(i_0);
    multiply.update(0)
        .split(i, i, ii, 16, TailStrategy::GuardWithIf)
        .vectorize(ii)
        .reorder({ii, r8_x, i, i_0});
    multiply
        .store_in(MemoryType::Stack)
        .split(i, i, ii, 16, TailStrategy::RoundUp)
        .vectorize(ii)
        .compute_at(output, i)
        .reorder({ii, i, i_0});
}
```

and

```cpp
// for target=x86-64-linux-avx-avx2-avx512-avx512_cannonlake-avx512_skylake-f16c-fma-sse41  // NOLINT
// with autoscheduler_params=autoscheduler=Adams2019 autoscheduler.parallelism=2

#include "Halide.h"

inline void apply_schedule_matrix_multiply_true(
    ::Halide::Pipeline pipeline,
    ::Halide::Target target
) {
    using ::Halide::Func;
    using ::Halide::MemoryType;
    using ::Halide::RVar;
    using ::Halide::TailStrategy;
    using ::Halide::Var;
    Func output = pipeline.get_func(3);
    Func multiply = pipeline.get_func(2);
    Var i(output.get_schedule().dims()[0].var);
    Var i_0(output.get_schedule().dims()[1].var);
    Var i_0i("i_0i");
    Var ii("ii");
    Var iii("iii");
    RVar r8_x(multiply.update(0).get_schedule().dims()[0].var);
    output
        .split(i_0, i_0, i_0i, 32, TailStrategy::ShiftInwards)
        .split(i, i, ii, 64, TailStrategy::ShiftInwards)
        .split(ii, ii, iii, 16, TailStrategy::ShiftInwards)
        .vectorize(iii)
        .compute_root()
        .reorder({iii, ii, i_0i, i, i_0})
        .parallel(i_0);
    multiply.update(0)
        .split(i, i, ii, 16, TailStrategy::GuardWithIf)
        .vectorize(ii)
        .reorder({ii, r8_x, i, i_0});
    multiply
        .store_in(MemoryType::Stack)
        .split(i, i, ii, 16, TailStrategy::RoundUp)
        .vectorize(ii)
        .compute_at(output, i)
        .reorder({ii, i, i_0});
}
```

abadams commented 2 months ago

The Adams2019 autoscheduler just tries to ensure there's enough parallelism available in the schedule to satisfy the parallelism requested. So it's normal that the schedule might not vary. In this case it has decided on a schedule that works for 2 or 8 cores. If you set it to something huge, it might start parallelizing more loops.

LiQi19970428 commented 2 months ago

> The Adams2019 schedule just tries to ensure there's enough parallelism available in the schedule to satisfy the parallelism requested. So it's normal that the schedule might not vary. In this case it has decided on a schedule that works for 2 or 8 cores. If you set it to something huge it might start parallelizing more loops.

Thank you for your answer. I have a few more questions I'd like to ask. Does `parallelism` refer to the number of CPU cores used? When `parallelism` is 2, could it actually use 4 threads? Is there a direct relationship between `parallelism` and the number of threads?

abadams commented 2 months ago

The generated schedule will always attempt to use all cores/threads available on the system. If the parallelism parameter was set too low, there might not be enough work available to do that (but there also might be). Parallelism is the minimum number of threads to try to use, so it's threads, not cores.