Reduction Operation in Halide-HLS

amisal88 commented 7 years ago

I am trying to implement a simple algorithm in Halide-HLS that requires a sum reduction over the whole image to compute the average. The "pipeline.cpp" file is attached below.

Running "make pipeline_hls.cpp" results in the following error:

Internal error at /Halide-HLS directory path/src/ExtractHWKernelDAG.cpp:307 triggered by user code at ./pipeline.cpp:69:
Condition failed: consumer_stencils.size() > 0
Aborted (core dumped)

Changing the definition of the pipeline from:

Input(x,y) = Input_Image(x, y);

image_mean() = Halide::cast<uint32_t>(0);
image_mean() += Halide::cast<uint32_t>(Input(win.x, win.y));

image_mean() = image_mean() >> (W_2_Power + H_2_Power);

hw_output(x,y) = Input(x,y) - (Halide::cast<uint8_t>(image_mean())); 
output(x,y) = hw_output(x,y);

to

Input(x,y) = Input_Image(x, y);

image_mean(x,y) = Halide::cast<uint32_t>(0);
image_mean(x,y) += Halide::cast<uint32_t>(Input(win.x, win.y));

image_mean(x,y) = image_mean(x,y) >> (W_2_Power + H_2_Power);

hw_output(x,y) = Input(x,y) - (Halide::cast<uint8_t>(image_mean(x,y))); 
output(x,y) = hw_output(x,y);

results in the following error:

Internal error at  /Halide-HLS directory path/src/ExtractHWKernelDAG.cpp:275 triggered by user code at ./pipeline.cpp:69:
Condition failed: extent_int
stencil window extent (((max((hw_output$1.s0.x.xi.base + hw_output$1.s0.x.xi), 255) - min((hw_output$1.s0.x.xi.base + hw_output$1.s0.x.xi), 0)) + 1)) is not a const.
Aborted (core dumped)

But changing this line

image_mean(x,y) += Halide::cast<uint32_t>(Input(win.x, win.y));

to

image_mean(x,y) += Halide::cast<uint32_t>(Input(x+win.x, y+win.y));

works, but the extracted "hls_target" function is not efficient, since it computes the average again for each pixel.

Any idea how to compute and use the image average efficiently?

Thanks!

Attachment: Content of "pipeline.cpp" file:

#include "Halide.h"
#include <stdio.h>
#include <iostream>  // std::cout, std::endl
#include <vector>    // std::vector

#define Image_Width 256
#define W_2_Power 8
#define Image_Height 256
#define H_2_Power 8

using namespace Halide;

Var x("x"), y("y"), z("z"), c("c");
Var xo("xo"), yo("yo"), xi("xi"), yi("yi");

class MyPipeline {
public:
    ImageParam Input_Image;
    Func output;
    Func hw_output;
    std::vector<Argument> args;
    Func Input;
    Func image_mean;
    RDom win;

    MyPipeline()
        : Input_Image(UInt(8), 2),
          hw_output("hw_output"),
          output("output"),
          win(0, Image_Width, 0, Image_Height)
    {

    Input(x,y) = Input_Image(x, y);

    image_mean() = Halide::cast<uint32_t>(0);
    image_mean() += Halide::cast<uint32_t>(Input(win.x, win.y));

    image_mean() = image_mean() >> (W_2_Power + H_2_Power);

    hw_output(x,y) = Input(x,y) - (Halide::cast<uint8_t>(image_mean())); 
    output(x,y) = hw_output(x,y);

    // Arguments
    args = {Input_Image};
    }

    void compile_cpu() {
        std::cout << "\ncompiling cpu code..." << std::endl;

        output.tile(x, y, xo, yo, xi, yi, Image_Width, Image_Height);
        output.compile_to_header("pipeline_native.h", args, "pipeline_native");
        output.compile_to_object("pipeline_native.o", args, "pipeline_native");
    }

    void compile_hls() {
        std::cout << "\ncompiling HLS code..." << std::endl;

        output.tile(x, y, xo, yo, xi, yi, Image_Width, Image_Height);
        hw_output.compute_at(output, xo);
        Input.compute_at(output, xo);
        hw_output.tile(x, y, xo, yo, xi, yi, Image_Width, Image_Height).accelerate({Input}, xi, xo);
        Input.fifo_depth(hw_output, Image_Width * Image_Height);

        output.print_loop_nest();
        // Create the target for HLS simulation
        Target hls_target = get_target_from_environment();
        hls_target.set_feature(Target::CPlusPlusMangling);
        output.compile_to_lowered_stmt("pipeline_hls.ir.html", args, HTML, hls_target);
        output.compile_to_hls("pipeline_hls.cpp", args, "pipeline_hls", hls_target);
        output.compile_to_header("pipeline_hls.h", args, "pipeline_hls", hls_target);

        std::vector<Target::Feature> features({Target::Zynq});
        Target target(Target::Linux, Target::ARM, 32, features);
        output.compile_to_zynq_c("pipeline_zynq.c", args, "pipeline_zynq", target);
        output.compile_to_header("pipeline_zynq.h", args, "pipeline_zynq", target);

        output.compile_to_object("pipeline_zynq.o", args, "pipeline_zynq", target);
        output.compile_to_lowered_stmt("pipeline_zynq.ir.html", args, HTML, target);
        output.compile_to_assembly("pipeline_zynq.s", args, "pipeline_zynq", target);
    }
};

int main(int argc, char **argv) {
    MyPipeline p1;
    p1.compile_cpu();

    MyPipeline p2;
    p2.compile_hls();

    return 0;
}

stevenbell commented 7 years ago

Others can give a more authoritative response, but the short answer is that you can't do this with the current paradigm. The Halide-HLS compiler assumes that the image can be streamed through the processing pipeline, which means that only a few lines of the image are stored in the hardware at once. To subtract the mean from the whole image, you need to have read in all the pixels before producing the first output pixel, which would require that the entire image be stored on-chip.
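To illustrate what "streamed" means here, the following is a rough software model of a line-buffered stage, not the code Halide-HLS actually generates. The class name VerticalSum3 and the 3-row vertical sum are hypothetical, chosen only to show that a local stencil can emit output while holding a fixed number of rows, whereas a whole-image mean cannot be known until the last row has arrived.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical model of a streaming stage: a 3-row vertical sum.
// It never holds more than three input rows, regardless of image height.
class VerticalSum3 {
public:
    explicit VerticalSum3(std::size_t width) : width_(width) {}

    // Feeds one input row. Returns true and fills `out` once three rows
    // are buffered; returns false while the buffer is still filling.
    bool push_row(const std::vector<uint8_t>& row, std::vector<uint16_t>& out) {
        assert(row.size() == width_);
        lines_.push_back(row);
        if (lines_.size() > 3) {
            lines_.pop_front();  // discard the oldest row; it is no longer needed
        }
        if (lines_.size() < 3) {
            return false;
        }
        out.assign(width_, 0);
        for (const auto& line : lines_) {
            for (std::size_t x = 0; x < width_; ++x) {
                out[x] += line[x];
            }
        }
        return true;
    }

    // Number of rows currently stored (at most 3).
    std::size_t buffered_rows() const { return lines_.size(); }

private:
    std::size_t width_;
    std::deque<std::vector<uint8_t>> lines_;
};
```

In this model the storage is 3 rows of the image width. Subtracting a global mean has no such bound: the first output pixel depends on every input pixel.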

If you want to do this, you'll need to do it in two passes, one which computes the mean, and a second which subtracts that from each pixel. In practice, both of these operations are going to be memory-bound (rather than compute-bound), so it's very unlikely that you'll see any speedup versus running on CPU.
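As a minimal sketch of the two-pass structure described above, here is a plain C++ reference model (no Halide, no HLS). The function names compute_mean and subtract_mean are hypothetical; the constants mirror the defines in pipeline.cpp, and the arithmetic follows the original pipeline: a 32-bit sum, a right shift by W_2_Power + H_2_Power, and a wrapping 8-bit subtraction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same values as the #defines in pipeline.cpp.
constexpr int kImageWidth = 256;
constexpr int kWPower = 8;   // log2(kImageWidth)
constexpr int kImageHeight = 256;
constexpr int kHPower = 8;   // log2(kImageHeight)

// Pass 1: read every pixel once and reduce to a single mean value.
// The largest possible sum is 256 * 256 * 255, which fits in 32 bits.
uint8_t compute_mean(const std::vector<uint8_t>& image) {
    assert(image.size() ==
           static_cast<std::size_t>(kImageWidth) * kImageHeight);
    uint32_t sum = 0;
    for (uint8_t pixel : image) {
        sum += pixel;
    }
    // Dividing by width * height is a shift because both are powers of two.
    return static_cast<uint8_t>(sum >> (kWPower + kHPower));
}

// Pass 2: read the image again and subtract the precomputed mean.
// The subtraction wraps modulo 256, as 8-bit unsigned arithmetic does.
std::vector<uint8_t> subtract_mean(const std::vector<uint8_t>& image,
                                   uint8_t mean) {
    std::vector<uint8_t> out(image.size());
    for (std::size_t i = 0; i < image.size(); ++i) {
        out[i] = static_cast<uint8_t>(image[i] - mean);
    }
    return out;
}
```

Calling compute_mean first and then passing its result to subtract_mean gives the two passes. Each pass touches every pixel exactly once, so the total work is proportional to the image size rather than to its square.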

amisal88 commented 7 years ago

@stevenbell Thank you very much for your response.

jingpu / Halide-HLS

Reduction Operation in Halide-HLS #15