halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

How to use and benchmark Halide autoscheduler? #8432

Open Rehanchy opened 1 month ago

Rehanchy commented 1 month ago

Hi Halide developers,

I was trying to use the Halide autoscheduler to generate a schedule for matmul by following the old tutorials. (By the way, they are really old: the arguments last_level_cache_size and balance seem to be no longer in use.)

I found that the schedule produced by the autoscheduler does not perform well, so I would like to ask whether you could help check that I am using the autoscheduler correctly.

My generator (matmul_generator.cpp) looks like this:

```cpp
#include "Halide.h"

class MatMulGenerator : public Halide::Generator<MatMulGenerator> {
public:
    // The float element type is assumed here.
    Input<Buffer<float>> A{"A", 2};   // Input matrix A (m x l)
    Input<Buffer<float>> B{"B", 2};   // Input matrix B (l x n)
    Output<Buffer<float>> C{"C", 2};  // Output matrix C (m x n)

    void generate() {
        Var x("x"), y("y");
        Func result("result");
        // Reduce over the dimension shared by A and B.
        RDom r(0, A.dim(1).extent());

        // A float literal keeps the accumulator the same type as the buffers.
        result(x, y) = 0.0f;
        result(x, y) += A(x, r.x) * B(r.x, y);
        C(x, y) = result(x, y);
    }

    void schedule() {
        if (using_autoscheduler()) {
            // Size estimates that the autoscheduler uses for its cost model.
            A.set_estimates({{0, 4096}, {0, 4096}});
            B.set_estimates({{0, 4096}, {0, 4096}});
            C.set_estimates({{0, 4096}, {0, 4096}});
        } else {
            C.compute_root();
        }
    }
};

HALIDE_REGISTER_GENERATOR(MatMulGenerator, matmul_generator)
```

Then I'm using these commands to generate the schedule, following the tutorial:

```
g++ matmul_generator.cpp /path/to/GenGen.cpp -g -std=c++17 -fno-rtti -I/path/to/halide/include -L/path/to/halide/lib -lHalide -lpthread -ldl -o matmul_generator

./matmul_generator -o . -g matmul_generator -f matmul_autoschedule_true -e static_library,h,schedule -p /path/to/halide/lib/libautoschedule_adams2019.so target=host autoscheduler=Adams2019 autoscheduler.parallelism=8
```

In another cpp file, I use this line of code to call the scheduled matrix multiplication:

```cpp
matmul_autoschedule_true(A.raw_buffer(), B.raw_buffer(), C.raw_buffer());
```

I also have a question about how to benchmark the Halide autoscheduler's performance on a given kernel. I know that in test/performance/matrix_multiplication.cpp, `out.realize(output);` is called twice: the first call includes the overhead of the code generation phase, so Halide's performance has to be measured with the second call.
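A minimal sketch of the kind of measurement in question, using only the C++ standard library. The names `flush_cache` and `benchmark_seconds` are hypothetical helpers, not part of Halide, and the 256 MiB flush size is only a guess at "larger than the last-level cache". The sketch runs one untimed warm-up call, then reports the best of several timed calls, optionally streaming through a large buffer before each one so the kernel starts with a roughly cold cache.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical helper: touch one byte per 64-byte stride of a large buffer so
// that most previously cached data is evicted. The default size is an assumption.
inline void flush_cache(std::size_t bytes = std::size_t(256) << 20) {
    static std::vector<unsigned char> junk;
    if (junk.size() != bytes) junk.assign(bytes, 1);
    unsigned sum = 0;
    for (std::size_t i = 0; i < junk.size(); i += 64) {
        junk[i]++;
        sum += junk[i];
    }
    // Store the result in a volatile so the loop is not optimized away.
    volatile unsigned sink = sum;
    (void)sink;
}

// Hypothetical helper: returns the best wall-clock time in seconds over
// `samples` timed calls of `kernel`. One untimed warm-up call runs first.
// If `cold` is true, the cache is flushed before every timed call.
inline double benchmark_seconds(const std::function<void()> &kernel,
                                int samples, bool cold,
                                std::size_t flush_bytes = std::size_t(256) << 20) {
    kernel();  // warm-up call, excluded from the measurement
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < samples; i++) {
        if (cold) flush_cache(flush_bytes);
        auto t0 = std::chrono::steady_clock::now();
        kernel();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

In my case the kernel would be a lambda wrapping the call to `matmul_autoschedule_true(A.raw_buffer(), B.raw_buffer(), C.raw_buffer())`. Comparing the result with `cold` set to true and to false would show how much the warm cache affects the number.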

To summarize, my questions are

  1. Is my way of using the autoscheduler correct?
  2. When benchmarking Halide with the second realize call, the cache is no longer cold. I am concerned that this may lead to performance being overestimated.
  3. When I use the autoscheduler and call the kernel like this: `matmul_autoschedule_true(A.raw_buffer(), B.raw_buffer(), C.raw_buffer());`, does the call include a code generation phase that could lead to performance being underestimated?

Thanks a lot!