halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

auto schedule for GPU #2852

Closed vmiheer closed 4 years ago

vmiheer commented 6 years ago

I was wondering, can I do auto-scheduling for GPUs? I have this simple Halide code:

  // A is assumed to be a Halide::Buffer<int>(size, size); fill it with random data.
  for (int y = 0; y < size; y++) {
    for (int x = 0; x < size; x++) {
      A(x, y) = rand() & 0xfff;
    }
  }
  Var j("j"), i("i"), k("k");
  Func out4("out4");
  out4(i, j) = A(i, j) * 2;

  Target target = get_target_from_environment();
  target.set_feature(Halide::Target::CUDACapability35);
  target.set_feature(Halide::Target::CUDA);

  Pipeline p(out4);
  out4.estimate(i, 0, size).estimate(j, 0, size);
  cout << p.auto_schedule(target);

I get this schedule:

// Target: x86-64-linux-avx-avx2-cuda-cuda_capability_35-f16c-fma-sse41
// MachineParams: 16,16777216,40
Var i_vi("i_vi");
Var i_vo("i_vo");
Func out4 = pipeline.get_func(0);
{
    Var i = out4.args()[0];
    Var j = out4.args()[1];
    out4
        .compute_root()
        .split(i, i_vo, i_vi, 8)
        .vectorize(i_vi)
        .parallel(j);
}

Now, even though the Target has the cuda feature, the code is not running on the GPU. Am I missing something?

abadams commented 6 years ago

You're not missing anything - the current autoscheduler does not generate GPU schedules. We're working on it, but it won't be soon.
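
In the meantime, a GPU schedule has to be written by hand. A minimal sketch for the out4 pipeline above (same Funcs, Vars, and target as in your snippet; the 16x16 tile size is illustrative, not tuned):

  // gpu_tile maps i and j onto CUDA blocks (io, jo) and threads (ii, ji).
  Var io("io"), jo("jo"), ii("ii"), ji("ji");
  out4.gpu_tile(i, j, io, jo, ii, ji, 16, 16);
  out4.compile_jit(target);  // target must include the cuda feature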

marsupialtail commented 6 years ago

Is this now supported given the August ACM paper?

jrk commented 5 years ago

Apologies for the delinquent replies. The GPU autoscheduler from the differentiable Halide paper [Li et al. 2018] is a much simpler heuristic. It works well enough to be useful in the more constrained space of programs generated by reverse-mode automatic differentiation (its original target), but it isn't powerful enough to be more than a simple baseline for a broader class of programs. It was also not production-quality code at the time of publication.

Some good news since, though:

  1. Our to-appear SIGGRAPH paper, on an all-new, much more powerful learned autoscheduler [Adams et al. 2019], includes preliminary GPU support. That code is now in master here: https://github.com/halide/Halide/tree/master/apps/autoscheduler, and GPU improvements will hopefully land there rapidly over the coming months.

  2. Using the plug-in autoscheduler interface developed for that work, @BachiLi has ported and improved his simple GPU autoscheduler (the first one mentioned above); see the sketch below for how that interface is driven. That should land in apps/ soon as well. Again, it won't give state-of-the-art performance for more complex cases, but it should give decent first-order results very easily (and be quite useful for automatically differentiated programs, which are now also supported in master).
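
For reference, a minimal sketch of driving that plug-in interface from C++ (names are as in current Halide releases; the exact entry points have shifted across versions, so treat this as illustrative, and assume the out4, i, j, size, and target from the original question):

  load_plugin("autoschedule_adams2019");   // registers the Adams2019 autoscheduler
  out4.set_estimate(i, 0, size).set_estimate(j, 0, size);
  Pipeline p(out4);
  AutoSchedulerResults results =
      p.apply_autoscheduler(target, AutoschedulerParams("Adams2019"));
  std::cout << results.schedule_source;    // the generated schedule, as C++ source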

vmiheer commented 5 years ago

It seems the autoscheduler app is not built with CMake; only the Makefile can build it.

steven-johnson commented 5 years ago

CMake support is spotty at best for various apps. It's a longstanding issue; most core developers use only Make, and as a result CMake support gets overlooked. IMHO we should really prioritize properly supporting all our build systems; though I personally have strong reservations about CMake in general, I do suspect we'd be better off in the long run by biting the bullet and standardizing on it as our only build system (as LLVM did a few years ago).

vmiheer commented 5 years ago

@jrk, about

Our to-appear SIGGRAPH paper, on an all new, much more powerful learned autoscheduler [Adams et al. 2019], includes preliminary GPU support. That code is now in master here: https://github.com/halide/Halide/tree/master/apps/autoscheduler

  1. Is there an arXiv version of the paper available (or a preprint somewhere)?
  2. I am building the app as make OPTIMIZE='-O0 -g' HL_TARGET='host-cuda' test, with some changes in the Makefile/test.cpp to set the target to host-cuda. With those changes I see the target variable passed in to generate_schedule as "x86-64-linux-avx-cuda-cuda_capability_35-sse41", but I don't think it is generating a GPU schedule:
    Func h = get_pipeline().get_func(1);
    Func f = get_pipeline().get_func(0);
    Var x(h.get_schedule().dims()[0].var);
    Var xi("xi");
    Var y(h.get_schedule().dims()[1].var);
    Var yi("yi");
    h
    .split(y, y, yi, 64, TailStrategy::ShiftInwards)
    .split(x, x, xi, 4, TailStrategy::ShiftInwards)
    .vectorize(xi)
    .compute_root()
    .reorder(xi, x, yi, y)
    .parallel(y);
    f
    .store_in(MemoryType::Stack)
    .split(x, x, xi, 4, TailStrategy::RoundUp)
    .unroll(x)
    .unroll(y)
    .vectorize(xi)
    .compute_at(h, x)
    .reorder(xi, x, y);
  3. Are the steps I am trying the right way to use the primitive GPU scheduler?
  4. Maybe I need to pass the correct MachineParams? (See the sketch below.)
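
For point 4, I mean passing MachineParams explicitly, along these lines (a sketch; the three fields are parallelism, last-level cache size in bytes, and an arithmetic-to-memory balance factor, matching the 16,16777216,40 header printed above):

  MachineParams params(16, 16777216, 40);  // parallelism, LLC bytes, balance
  std::cout << p.auto_schedule(target, params);
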
abadams commented 5 years ago

The GPU support hasn't landed in master yet. Development is happening in standalone_autoscheduler_gpu, but I wouldn't try to use it yet (e.g. there are no good network weights right now).

jrk commented 5 years ago

Also, @vmiheer, the paper is here: https://halide-lang.org/papers/autoscheduler2019.html

woodknight commented 5 years ago

@steven-johnson

CMake support is spotty at best for various apps. [...] I do suspect we'd be better off in the long run by biting the bullet and standardizing on it as our only build system (as LLVM did a few years ago).

In my humble opinion, CMake is far better and more convenient than Make.

abadams commented 5 years ago

Fighting words! Steven and I have had in-person conversations about how large a nail you would have to drive through your hand for it to be as painful as using CMake. More seriously, it totally depends on what you're doing. Standard C++ binaries or libraries are cleaner in CMake than in Make, but once you start doing unusual things (multi-phase compilation with generated intermediates, weird linker invocations, etc.), Make presents fewer barriers to getting work done.

steven-johnson commented 5 years ago

And here is where Andrew and I disagree: while we both dislike CMake, I suspect at this point that we'd be better off settling on one build system for everyone to use, even if that means holding our noses and dealing with CMake's eccentricities.

woodknight commented 5 years ago

@abadams I see what you mean. May I paraphrase what you said as: CMake is easier for a casual user of a library, and Make is easier for a hardcore developer of a library?

alexreinking commented 4 years ago

We need to refactor the autoschedulers into separate modules and patch the CMake build to distribute them. This is a TODO in #4644.

alexreinking commented 4 years ago

The CMake build now distributes both autoschedulers. Tests forthcoming...

steven-johnson commented 4 years ago

both autoschedulers

This is great news, but technically there are three autoschedulers. (Presumably we'll look into splitting the 'built-in' one into a separate package, like the others, once everything else lands.)

alsrgv commented 4 years ago

Any docs/examples about using them?

alexreinking commented 4 years ago

@alsrgv - when #4644 lands, the documentation will be in README_cmake.md

vakokako commented 3 years ago

Why was this issue closed? Was auto-scheduling for GPU implemented?

alexreinking commented 3 years ago

Why was this issue closed? Was auto-scheduling for GPU implemented?

Li2018 can produce GPU schedules. Improving those schedules is a different issue.
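
For example, through the plugin interface sketched earlier in the thread (again a sketch using current-release names; only the plugin name and the GPU target differ from the earlier example):

  load_plugin("autoschedule_li2018");    // the Li2018 (gradient) autoscheduler
  Target t = get_host_target().with_feature(Target::CUDA);
  AutoSchedulerResults results =
      p.apply_autoscheduler(t, AutoschedulerParams("Li2018"));
  std::cout << results.schedule_source;  // includes GPU scheduling directives for CUDA targets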

vakokako commented 3 years ago

Great, thanks for the info!