halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org
Other
5.9k stars 1.07k forks source link

Pathologically slow Hexagon codegen for certain schedules #4735

Open steven-johnson opened 4 years ago

steven-johnson commented 4 years ago

Steps to repeat. We are going to build and run one of the Generators in apps/nn_ops:

$ cd apps/nn_ops
$ make bin/host/DepthwiseConvolution.generator
$ time bin/host/DepthwiseConvolution.generator -g DepthwiseConvolution -o /tmp -e object -f DepthwiseConvolution_1 target=x86-64-linux-hvx_128-no_runtime depth_multiplier=1

At current master branch, on my stupid fast Linux box, I get:

real  0m9.602s
user  0m9.528s
sys 0m0.078s

To see the pathological schedule, go to apps/nn_ops/DepthwiseConvoution_generator.cpp and change the lines

        output_.split(y, y, yi, y_split_factor).parallel(y);
        output_.vectorize(depth, vector_size_u8, TailStrategy::RoundUp);

to

        Var di("di"), xi("xi");
        output_.split(y, y, yi, y_split_factor, TailStrategy::GuardWithIf)
            .tile(depth, x, di, xi, vector_size_u8, 2, TailStrategy::GuardWithIf)
            .vectorize(di)
            .unroll(xi)
            .parallel(y);

Now, re-run the steps above (still on master). I get

real  45m51.237s
user  45m37.659s
sys 0m13.206s
abadams commented 4 years ago

As usual, nearly all the time is in llvm. So we must be generating pathological .ll

So I'm not sure how to fix this without just not using this schedule - it scalarizes like crazy. I guess I could try to see how to avoid scalarization. It seems to be in the tail case though.

steven-johnson commented 4 years ago

Let me look at the revision history and see if that tells me anything.

steven-johnson commented 4 years ago

Adding @vksnk as he was involved in both the google-specific code and the version going into apps/nn_ops -- maybe he can shed light on whether the simpler version in apps/nn_ops came first (and the more complex version in google was a later rev), or vice versa, and why there is a divergence.

vksnk commented 4 years ago

Yeah, this generator was always the slowest one to compile. I think the one in google is a more recent and, if I remember correctly, has a slightly better schedule for some of the corner cases. That being said, I don't think this particular implementation is actively used right now, so probably should be okay to comment part of the schedule out to unblock the change. I can take care of it.

Although, we still probably want to report this to QC, so they can take a look (especially that there is a public repro case).

dsharletg commented 4 years ago

Was this fixed by #4744?