halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Runs forever on Hexagon Simulator if using .parallel() #4692

Open andrewjong opened 4 years ago

andrewjong commented 4 years ago

I am testing the provided Hexagon Halide demos to compare performance of int vs float execution time, with .vectorize() enabled and disabled in the schedule respectively.

I am testing with the conv3x3a32 example, whose schedule first tiles the output. I found that if I replace .vectorize(xi) with .parallel(xi), the benchmark never finishes. The same happens if I fuse xo and yo into a single variable and call .parallel(fused), as in the Halide tutorials. It was still running after several hours, so I concluded it must be stuck.

Is it expected behavior that computation gets stuck if .parallel() is used?

Programming Hexagon through the C++ SDK still allows float operations outside of HVX. Using floats through Hexagon without HVX computes just fine and finishes within a minute. While this is slower than it would be with HVX, it certainly does not take hours. Is there a way to match this performance on floats through Halide? The input is of reasonable size at 1920x1080.

andrewjong commented 4 years ago

Never mind, I found out that parallelism on the simulator was broken 2.5 years ago (#2108). Seems like it's still broken :(

steven-johnson commented 4 years ago

We should update the README to warn people about this.

dsharletg commented 4 years ago

I don't think .parallel should hang on the simulator. #2108 is about the fact that we don't actually simulate parallelism, but the program should still be functionally correct and run (just without any speedup from parallelism).

I think this is a new issue.

dpalermo commented 4 years ago

Answers to some of the other questions in the original post:

You might try running your test with a much smaller input (e.g. 128x128 or 128x16) just to see if it is truly hanging, or just taking a really long time (due to the parallelized inner loop).

andrewjong commented 4 years ago

Thanks Dan, I will try that and report back.

Update 4/13/2020: I tried it, and it was not hanging; the parallelized inner loop was just taking a long time. I parallelized the outer loop instead (see comment below) and it finished running. However, I'm confused about why the parallelization didn't make use of Hexagon's 4 hardware threads.

andrewjong commented 4 years ago

@dsharletg @pranavb-ca @dpalermo

I went and learned more about scheduling in Halide and more about Hexagon. The following schedule is for the conv3x3a16 Halide example in the Hexagon SDK 3.5.1.

I want to use all 4 available hardware threads on the Hexagon DSP.

So I tried this

Func(output)
  .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
  .vectorize(xi)
  .parallel(yi);

where vector_size is set to 128 for HVX 128.

My understanding is: I'm making 128 x 4 sized tiles. I'm vectorizing xi to take advantage of HVX. Then I parallelize yi to take advantage of all 4 hardware threads on Hexagon.

However, the simulator is reporting that only one hardware thread is used, not 4, even though I used .parallel(yi).

T0: Insns=18185954 Packets=9686990
T1: Insns=0 Packets=0
T2: Insns=0 Packets=0
T3: Insns=0 Packets=0
Total: Insns=18185954 Pcycles=19838300

In addition, the parallel strategy ran much slower than the original unroll strategy in the example:

Func(output)
  .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
  .vectorize(xi)
  .unroll(yi);

The unroll strategy came to 0.1579 cycles/pixel, whereas the parallel strategy came to 2.8886 cycles/pixel.

Can someone please help me understand 1) why .parallel() did not use all 4 hardware threads, and 2) why .parallel() is slower than .unroll()?

Thanks!

vksnk commented 4 years ago

I think it might be better to parallelize over y and not yi, something like this:

Func(output)
  .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
  .vectorize(xi)
  .parallel(y);

In your schedule, the loops are arranged like this:

for y:
  for x:
    for yi: <-- parallelized
      for xi: <-- vectorized

which is probably too fine-grained to parallelize efficiently, because of the overhead involved.
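A rough way to see the granularity difference is to count how often each schedule starts a parallel loop and how much work each task gets. The sketch below is plain C++ arithmetic, not Halide code; `ParallelCost`, `parallel_over_yi`, and `parallel_over_y` are hypothetical names, and the 1920x1080 size used in the usage note is an assumption taken from the first post, not necessarily the size used for conv3x3a16.

```cpp
#include <cassert>

// Hypothetical summary of how a schedule divides work among parallel tasks.
struct ParallelCost {
    long launches;              // times a parallel loop is started
    long tasks_per_launch;      // iterations of the parallelized loop
    long vector_rows_per_task;  // tile-width rows computed by each task
};

inline long ceil_div(long a, long b) { return (a + b - 1) / b; }

// parallel(yi): the yi loop sits inside y and x, so it starts once per tile.
ParallelCost parallel_over_yi(long w, long h, long tile_w, long tile_h) {
    long tiles_x = ceil_div(w, tile_w);
    long tiles_y = ceil_div(h, tile_h);
    return {tiles_x * tiles_y, tile_h, 1};
}

// parallel(y): y is the outermost loop, so the parallel loop starts once.
ParallelCost parallel_over_y(long w, long h, long tile_w, long tile_h) {
    long tiles_x = ceil_div(w, tile_w);
    long tiles_y = ceil_div(h, tile_h);
    return {1, tiles_y, tiles_x * tile_h};
}
```

For an assumed 1920x1080 image with 128 x 4 tiles, `parallel_over_yi` gives 4050 launches of 4 tasks that each compute a single vector row, while `parallel_over_y` gives one launch of 270 tasks that each compute 60 vector rows. The total work is the same, but the second form pays the task setup cost far less often.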

I think you can get an even better schedule by using both parallel() and unroll(), like this:

Func(output)
  .tile(x, y, xi, yi, vector_size, 4, TailStrategy::RoundUp)
  .vectorize(xi)
  .unroll(yi)
  .parallel(y);

andrewjong commented 4 years ago

@vksnk thanks for your comment! Per your advice, I tried your suggested schedule, using .unroll(yi) and .parallel(y). The Hexagon simulator reports 0.2091 cycles/pixel, which is slower than the 0.1579 cycles/pixel of .unroll(yi) alone.

Once again, the simulator reports that only 1 hardware thread is being used.

T0: Insns=3233950 Packets=1391755
T1: Insns=0 Packets=0
T2: Insns=0 Packets=0
T3: Insns=0 Packets=0
Total: Insns=3233950 Pcycles=3073589

This is my main confusion. Why is .parallel() not activating all 4 hardware threads on the Hexagon?

dsharletg commented 4 years ago

The Hexagon simulator uses a "fake" thread pool that doesn't actually use threads due to undiagnosed issues (#2108).

However, when running on a real Hexagon device, parallel uses a real thread pool and you should see increased performance.
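As an illustration of what a thread pool that "doesn't actually use threads" means, here is a minimal sketch in plain C++. It is not Halide's runtime code: the `Task` signature, `ThreadStats`, and `serial_par_for` are hypothetical names chosen for this example. Every iteration of the "parallel" loop runs in order on the calling thread, which would be consistent with the simulator output above showing work on T0 only.

```cpp
#include <array>
#include <cassert>

constexpr int kHardwareThreads = 4;  // thread count discussed in this issue

// Hypothetical task signature: run iteration `index`, return 0 on success.
using Task = int (*)(int index, void *closure);

// Per-thread iteration counters, loosely analogous to the T0..T3 lines.
struct ThreadStats {
    std::array<long, kHardwareThreads> iterations{};
};

// "Fake" parallel-for: same shape as a threaded one, but all iterations
// run sequentially on the caller, which is counted here as thread 0.
int serial_par_for(Task task, int min, int extent, void *closure,
                   ThreadStats &stats) {
    for (int i = min; i < min + extent; ++i) {
        ++stats.iterations[0];
        int err = task(i, closure);
        if (err != 0) return err;  // stop at the first failing iteration
    }
    return 0;
}

// Example task: add the iteration index to a running sum in the closure.
int add_index(int index, void *closure) {
    *static_cast<long *>(closure) += index;
    return 0;
}
```

Running `serial_par_for(add_index, 0, 270, &sum, stats)` produces the correct sum, but only `stats.iterations[0]` is nonzero. The result is functionally correct with no speedup, and the bookkeeping of the parallel loop can still add cost compared with an unrolled loop.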