halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Performance degrading with parallel, unroll etc scheduling methods #4610

Open jsksra1 opened 4 years ago

jsksra1 commented 4 years ago

Hi,

I am working on Windows, and my Intel processor supports AVX2 intrinsics. I tried the scheduling tutorial example (https://halide-lang.org/tutorials/tutorial_lesson_05_scheduling_1.html) and observed that performance degrades when I try to introduce "parallel".

So basically I am using the AOT method, with this code:

Var x("x"), y("y");
ImageParam im1(type_of<FLOATING_PT>(), 2);
ImageParam im2(type_of<FLOATING_PT>(), 2);
Func addition;
addition(x, y) = im1(x, y) + im2(x, y);
addition.compile_to_static_library("addition", { im1, im2 }, "addition");

Time consumed for 10000 iterations over a 600x600 image, with different scheduling options, is as follows:

    No options              : 1.177 sec
    With fuse/tile/parallel : 1.431 sec
    Vectorize x by 8        : 1.198 sec
    Vectorize x by 16       : 1.228 sec

So my main doubt is: why is parallel not helping to improve performance? Do we need to link a threading library explicitly for multi-threading/parallel processing to kick in?

    Var x_i, x_o, y_o, y_i, tile;
    // Tile into 100x100 blocks, fuse the two tile loops into one,
    // and parallelize over the fused tile index
    addition.tile(x, y, x_o, y_o, x_i, y_i, 100, 100)
        .fuse(x_o, y_o, tile)
        .parallel(tile);
    Var x_i_o, y_i_o, x_v, y_p;
    // Sub-tile the inner loops, vectorizing by 4 and unrolling by 2
    addition.tile(x_i, y_i, x_i_o, y_i_o, x_v, y_p, 4, 2)
            .vectorize(x_v)
            .unroll(y_p);
abadams commented 4 years ago

Why parallel isn't speeding things up: You're doing nearly no work (just a single addition) in your kernel, so this is a memory bandwidth test, but your image is small enough to fit in cache, so that's going to be fast too. Basically your routine is so cheap that threading overheads are costing a lot in relative terms. You can reduce some of these overheads by removing the tiling - you only want tiling if you need spatial locality in 2D. Try this schedule instead: addition.parallel(y, 8).vectorize(x, 8);
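The overhead argument above can be sketched outside Halide. The following plain C++ sketch (illustrative only, not Halide-generated code) shows the two strategies being compared: a serial elementwise add versus one that splits the same work across threads, paying a spawn/join cost on every call. For a 600x600 float add there is only one load-add-store per element, so that per-call cost can dominate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Serial elementwise add: one load-add-store per element, no overhead.
std::vector<float> add_serial(const std::vector<float>& a,
                              const std::vector<float>& b) {
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// Threaded version: same arithmetic, but every call pays thread
// creation and join costs, which can exceed the work itself when the
// kernel is this cheap.
std::vector<float> add_threaded(const std::vector<float>& a,
                                const std::vector<float>& b,
                                unsigned nthreads) {
    std::vector<float> out(a.size());
    std::vector<std::thread> pool;
    std::size_t chunk = (a.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(a.size(), lo + chunk);
        pool.emplace_back([&, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) out[i] = a[i] + b[i];
        });
    }
    for (auto& th : pool) th.join();  // join cost paid on every call
    return out;
}
```

(Halide's runtime amortizes this with a persistent thread pool, but task dispatch per call is still not free, which is why the cheap-kernel case sees no benefit.)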

Why vectorize isn't speeding things up: LLVM has a decent autovectorizer, and it can certainly handle this case, so Halide is probably just generating the same vector code with or without vectorization. You can test this hypothesis by adding the target flag disable_llvm_loop_opt and seeing if the no-schedule case gets slower.

jsksra1 commented 4 years ago

Thanks!

Thread overhead matters more at low resolutions (600x600) than at higher ones (1920x1920); hence performance improved with parallel and vectorize at the larger resolution, but degrades or shows no effect at the lower one.

I am experimenting with Halide for low-resolution 1D signal (audio/speech) use cases, as it would be very useful to have this codegen for multiple targets. But my study showed poor Halide performance for frames of 200-250 samples whenever I try scheduling with parallel.

Example: take the convolution of an array A[1000] with kernel[20]. I always compare this against handwritten C AVX2 SIMD code. Halide performance degrades when the kernel length increases from 20 to 40 or 80, and as I add more scheduling the cycle counts shoot up.

Var x("x");
Func Filtering("Discrete_FIR");
Func clamped("clamped");
// Reduction domain spanning the filterLen+1 filter taps
Halide::RDom rk(0, filterLen + 1);

// Boundary condition: treat samples beyond the input as zero
clamped = Halide::BoundaryConditions::constant_exterior(input, 0);

// Convolution expressed as an RDom reduction
Filtering(x) = Halide::sum(kernel(filterLen - rk.x) * clamped(x + rk.x - filterLen));
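For reference, substituting j = filterLen - rk.x in the expression above shows that, with the zero boundary condition, it computes the ordinary causal FIR y[x] = Σⱼ kernel[j] · input[x - j]. A minimal plain C++ sketch of those semantics (illustrative, not Halide's generated code; `fir_reference` is a name chosen here):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics of the Halide expression above: for each output
// sample x, sum kernel[j] * input[x - j] over all taps, treating
// out-of-range input samples as zero (constant_exterior(input, 0)).
std::vector<float> fir_reference(const std::vector<float>& input,
                                 const std::vector<float>& kernel) {
    std::vector<float> out(input.size(), 0.0f);
    for (std::size_t x = 0; x < input.size(); ++x) {
        for (std::size_t j = 0; j < kernel.size(); ++j) {
            // Zero boundary: skip taps that reach before sample 0.
            if (x >= j) out[x] += kernel[j] * input[x - j];
        }
    }
    return out;
}
```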

So my doubts are:

  1. Is the Halide code that I wrote suboptimal?
  2. Does Halide have a limitation with low-resolution inputs?
  3. Do algorithms with more algebraic operations (delay and sum) rather than simple elementwise operations (scaling the input, etc.) result in higher cycle consumption with Halide? Finally, why do all the Halide tutorials/use cases focus mainly on image processing?
jsksra1 commented 4 years ago

Looks like there was an issue with my approach above.

I tweaked the filtering algorithm (the Halide implementation), and now the Halide-generated code performs better than my SIMD code.