jingpu / Halide-HLS

HLS branch of Halide

FFT in Halide-HLS #13

Closed amisal88 closed 7 years ago

amisal88 commented 7 years ago

Hi @jingpu

I am trying to implement 2D FFT in Halide-HLS.

My first approach is to use the Halide FFT (provided in apps/fft), even though it is in floating-point format. To start with a 1D FFT, I wrote a function named "My_fft2d_r2c", which is basically the first part of the "fft2d_r2c" function in Halide's "fft.cpp" file. The "My_fft2d_r2c" function is attached below.

I changed all expressions of "target.natural_vector_size()" to 1. I also removed the following scheduling from the end of the "fft_dim1" function in the "fft.cpp" file:

    for (size_t i = 0; i + 1 < stages.size(); i++) {
        Func stage = stages[i].first;
        stage.compute_at(x, group).update().vectorize(n0);
    }

My "pipeline.cpp" is also attached below. Since the ".accelerate" method acts on Funcs only (NOT ComplexFuncs), the "re()" operator is used for "hw_output" (otherwise there is an error complaining about the Stream size).

The PROBLEM is: When I run "make pipeline_hls.cpp", Halide-HLS reports an error like this:

Internal error at /<Halide-HLS-directory path>/src/StreamOpt.cpp:408
Condition failed: produce && consume && produce->name == consume->name
Aborted (core dumped)

I believe this is caused by the following line of code in the "fft_dim1" function:

exchange(A({n0, n1}, args)) = undef_z(V.output_types()[0]);

If I try to bypass the error by changing that line to:

exchange(A({n0, n1}, args)) = x(A({n0, n1}, args));

it reports another error like this:

Internal error at /<Halide-HLS-directory path>/src/ExtractHWKernelDAG.cpp:275 triggered by user code at ./pipeline.cpp:89:
Condition failed: extent_int
stencil window extent (((max((hw_output.s0.y.yi.base + hw_output.s0.y.yi), 15) - min((hw_output.s0.y.yi.base + hw_output.s0.y.yi), 0)) + 1)) is not a const.
Aborted (core dumped)

It reports that error even though the extent of yi is specified in the tile command:

hw_output.tile(x, y, xo, yo, xi, yi, 4, 4).accelerate({In_f},xi, xo);

Can you please help me with this? Thanks!

Attachments:

Attachment No.1: Definition of the "My_fft2d_r2c" function (which is basically the first part of the "fft2d_r2c" function in Halide's "fft.cpp" file):

ComplexFunc My_fft2d_r2c(Func r,
                      const vector<int> &R0,
                      const vector<int> &R1,
                      const Target& target,
                      const Fft2dDesc& desc) {

    string prefix = desc.name.empty() ? "r2c_" : desc.name + "_";

    vector<Var> args(r.args());
    Var n0(args[0]), n1(args[1]);
    args.erase(args.begin());
    args.erase(args.begin());

    // Get the innermost variable outside the FFT.
    Var outer = Var::outermost();
    if (!args.empty()) {
        outer = args.front();
    }

    int N0 = product(R0);
    int N1 = product(R1);

    // Cache of twiddle factors for this FFT.
    TwiddleFactorSet twiddle_cache;

    // The gain requested of the FFT.
    Expr gain = desc.gain;

    ComplexFunc zipped(prefix + "zipped");
    int zip_width = desc.vector_width;
    if (zip_width <= 0) {
        zip_width = 1;
    }
    // Ensure the zip width divides the zipped extent.
    zip_width = gcd(zip_width, N0 / 2);
    Expr zip_n0 = (n0 / zip_width) * zip_width * 2 + (n0 % zip_width);
    zipped(A({n0, n1}, args)) =
        ComplexExpr(r(A({zip_n0, n1}, args)),
                    r(A({zip_n0 + zip_width, n1}, args)));

    // DFT down the columns first.
    ComplexFunc dft1;
    dft1 = fft_dim1(zipped,
                    R1,
                    -1,  // sign
                    std::min(zip_width, N0 / 2),  // extent of dim 0
                    1.0f,
                    false,  // We parallelize unzipped below instead.
                    prefix,
                    target,
                    &twiddle_cache);

    return dft1;
}

ComplexFunc My_fft2d_r2c(Func r,
                      int N0, int N1,
                      const Target& target,
                      const Fft2dDesc& desc) {
    return My_fft2d_r2c(r, radix_factor(N0), radix_factor(N1), target, desc);
}
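
The "zipped" stage in the attachment above packs pairs of real input columns into one complex column. As a rough illustration of that index arithmetic only, the sketch below reproduces the `zip_width` clamp and the `zip_n0` expression in plain C++ without Halide. The helper names `gcd_int`, `effective_zip_width`, and `zip_sources` are hypothetical and do not appear in fft.cpp; `gcd_int` merely stands in for the `gcd` helper that the attachment calls.

```cpp
#include <cassert>
#include <utility>

// Greatest common divisor (Euclid). Stand-in for the gcd() helper
// called in the attachment; this name is hypothetical.
int gcd_int(int a, int b) {
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Hypothetical helper: clamp the requested width the way the attachment
// does, so that it is positive and divides N0 / 2.
int effective_zip_width(int vector_width, int N0) {
    int w = vector_width <= 0 ? 1 : vector_width;
    return gcd_int(w, N0 / 2);
}

// Hypothetical helper: the two columns of the real input r that feed
// zipped column n0. first -> real part, second -> imaginary part.
std::pair<int, int> zip_sources(int n0, int zip_width) {
    int zip_n0 = (n0 / zip_width) * zip_width * 2 + (n0 % zip_width);
    return {zip_n0, zip_n0 + zip_width};
}
```

With N0 = 16 and a width that clamps to 1, `zip_sources(3, 1)` pairs input columns 6 and 7, so even columns become real parts and odd columns become imaginary parts. With a width of 4, columns are paired in blocks of four instead, for example column 9 with column 13.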

Attachment No.2: The content of "pipeline.cpp" file:

#include "Halide.h"
#include "fft.h"
#include "complex.h"

using namespace Halide;

Var x("x"), y("y"), z("z"), c("c");
Var xo("xo"), yo("yo"), xi("xi"), yi("yi");

class MyPipeline {
public:
    ImageParam Input_Image;
    Func In_f;
    Func output, hw_output;
    ComplexFunc tmpfunc;
    std::vector<Argument> args;

    MyPipeline()
        : Input_Image(UInt(8), 2),
          In_f("In_f"),
          tmpfunc("tmpfunc"),
          hw_output("hw_output"),
          output("output")
    {
        Target target = get_jit_target_from_environment();

        Fft2dDesc fwd_desc;

        In_f(x,y) = Halide::cast<float>(Input_Image(x,y));
        tmpfunc = My_fft2d_r2c(In_f, 16, 16, target, fwd_desc);
        hw_output(x,y) = re(tmpfunc(x,y));
        output(x,y) = hw_output(x,y);

        // Arguments
        args = {Input_Image};

    }

    void compile_hls() {
        std::cout << "\ncompiling HLS code..." << std::endl;

        output.tile(x, y, xo, yo, xi, yi, 4, 4);
        hw_output.compute_at(output, xo);
        In_f.compute_root();
        hw_output.tile(x, y, xo, yo, xi, yi, 4, 4).accelerate({In_f},xi, xo);

        // Create the target for HLS simulation
        Target hls_target = get_target_from_environment();
        hls_target.set_feature(Target::CPlusPlusMangling);
        std::cout << "\ncompiling HLS1 code..." << std::endl;
        output.compile_to_lowered_stmt("pipeline_hls.ir.html", args, HTML, hls_target);
        std::cout << "\ncompiling HLS2 code..." << std::endl;
        output.compile_to_hls("pipeline_hls.cpp", args, "pipeline_hls", hls_target);
        std::cout << "\ncompiling HLS3 code..." << std::endl;
        output.compile_to_header("pipeline_hls.h", args, "pipeline_hls", hls_target);

        std::vector<Target::Feature> features({Target::Zynq});
        Target target(Target::Linux, Target::ARM, 32, features);
        output.compile_to_zynq_c("pipeline_zynq.c", args, "pipeline_zynq", target);
        output.compile_to_header("pipeline_zynq.h", args, "pipeline_zynq", target);

        output.compile_to_object("pipeline_zynq.o", args, "pipeline_zynq", target);
        output.compile_to_lowered_stmt("pipeline_zynq.ir.html", args, HTML, target);
        output.compile_to_assembly("pipeline_zynq.s", args, "pipeline_zynq", target);

    }
};

int main(int argc, char **argv) {

    MyPipeline p2;
    p2.compile_hls();

    return 0;
}

jingpu commented 7 years ago

@amisal88 I believe the current framework won't work for FFT. The reason is that the current implementation can only generate a streaming hardware pipeline, more specifically a line-buffered pipeline (see the paper at https://arxiv.org/abs/1610.09405 for more details). However, it is hard to map an FFT to such a pipeline architecture. I apologize that the error message did not make this limitation clear.
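
To make the mismatch a bit more concrete, here is a small plain C++ sketch (no Halide) comparing which inputs an output element depends on. It is a simplified model under stated assumptions, not the analysis Halide-HLS performs: `stencil_deps` models a 3-tap 1D stencil, and `butterfly_deps` models a generic radix-2 butterfly network, whereas the FFT in apps/fft is mixed-radix. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Inputs read by a 3-tap 1D stencil at position i, clamped to [0, n-1].
// The window has a small, fixed size regardless of n.
std::set<int> stencil_deps(int i, int n) {
    std::set<int> deps;
    for (int d = -1; d <= 1; ++d) {
        deps.insert(std::min(std::max(i + d, 0), n - 1));
    }
    return deps;
}

// Inputs that each output of a radix-2 butterfly network over n points
// depends on (n must be a power of two). Each stage pairs element i with
// element i ^ span, and span doubles from stage to stage.
std::vector<std::set<int>> butterfly_deps(int n) {
    std::vector<std::set<int>> deps(n);
    for (int i = 0; i < n; ++i) {
        deps[i].insert(i);
    }
    for (int span = 1; span < n; span *= 2) {
        std::vector<std::set<int>> next(deps);
        for (int i = 0; i < n; ++i) {
            int partner = i ^ span;
            next[i].insert(deps[partner].begin(), deps[partner].end());
        }
        deps = next;
    }
    return deps;
}
```

In this model a stencil output touches at most three neighbouring inputs, which is the kind of fixed window a line buffer can hold. Every butterfly output ends up depending on all n inputs, and the distance between paired elements changes at each stage, so there is no small constant window to buffer.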

amisal88 commented 7 years ago

@jingpu Thank you very much for the clarification.