halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Error message "Unsupported HVX type: float32x32" #8170

Open jxl1080 opened 3 months ago

jxl1080 commented 3 months ago

Hi,

I get the error message below when I run my generator with the Adams2019 autoscheduler. It doesn't happen if I run the generator without any autoscheduler. I don't really understand what it is telling me or what I should do about it:

Unhandled exception: Internal Error at /glnxa64/Halide/src/HexagonOptimize.cpp:105 triggered by user code at : Unsupported HVX type: float32x32

Below is my Halide Generator Class:

#include "Halide.h"

using namespace Halide;

class mMatmul_matmul_out1_fcn_halide_generator : public Halide::Generator<mMatmul_matmul_out1_fcn_halide_generator> {

public:
    Input<Buffer<float>> B1{"B1", 2};
    Input<Buffer<float>> A1{"A1", 2};
    Output<Buffer<float>> matmul_out1_fcn{"matmul_out1_fcn", 2};

    void generate() {
        RDom r(0, 100);
        matmul_out1(d1, d2) = sum(A1(d1, r) * B1(r, d2));
        matmul_out1_fcn(d1, d2) = matmul_out1(d1, d2);
    }

    void schedule() {
    // Schedule is determined by autoscheduler. Need to set estimate on buffer
        if(using_autoscheduler()) {
            B1.dim(1).set_estimate(0, 100);
            B1.dim(0).set_estimate(0, 100);
            A1.dim(1).set_estimate(0, 100);
            A1.dim(0).set_estimate(0, 100);
            matmul_out1_fcn.set_estimate(d1, 0, 100).set_estimate(d2, 0, 100);
        }  else {
            // Default schedule
        }
    }

private:
    Var d1{"d1"};
    Var d2{"d2"};
    Func matmul_out1{"matmul_out1"};

};

HALIDE_REGISTER_GENERATOR(mMatmul_matmul_out1_fcn_halide_generator, mMatmul_matmul_out1_fcn_halide_gen)

Thank you!

abadams commented 3 months ago

The error means you're trying to compile to hvx, but your pipeline uses vectorized floats. I think our hexagon backend doesn't support the newer versions of hvx that support float vectors.

I think it isn't triggering without the autoscheduler, because then the schedule uses scalar floats only, which is fine. The autoscheduler isn't aware of that restriction on hexagon so it's trying to just vectorize everything.
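
For readers unfamiliar with the notation, the type name in the error reads as element type times lane count: float32x32 is a vector of 32 lanes of 32-bit floats, 1024 bits in total. The snippet below is a minimal sketch of that arithmetic, assuming the 128-byte HVX vector mode. The constant `kHvxVectorBytes` and the helper `lanes_per_hvx_vector` are made up for illustration and are not part of Halide.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: one HVX vector register holds 128 bytes (1024 bits).
constexpr std::size_t kHvxVectorBytes = 128;

// Hypothetical helper, not a Halide API: how many lanes of element type T
// fit into a single vector register of kHvxVectorBytes bytes.
template <typename T>
constexpr std::size_t lanes_per_hvx_vector() {
    static_assert(kHvxVectorBytes % sizeof(T) == 0,
                  "element size must divide the vector size");
    return kHvxVectorBytes / sizeof(T);
}
```

Under that assumption, `uint8_t` gives 128 lanes, `uint16_t` gives 64, and a 4-byte `float` gives 32, which is the lane count that appears in the error message.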

jxl1080 commented 3 months ago

@abadams Thank you so much for your quick reply! Is there any suggestion on how to resolve this error message?

abadams commented 3 months ago

Don't try to do a floating point matrix multiply on hexagon. (Or at least the versions of hvx that Halide supports). It's not a good processor for running that algorithm, because you can't vectorize it. Do a fixed-point matrix multiply instead.
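
To make "fixed-point matrix multiply" concrete, here is a plain C++ reference (not Halide code) for a matrix multiply with 8-bit inputs and an integer accumulator. It is only a sketch: the function name `matmul_u8`, the row-major layout, and the choice of accumulator type are assumptions for illustration. The accumulator width matters because 100 products of 8-bit values can reach 100 × 255 × 255 = 6,502,500, which does not fit in 16 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference, not a Halide API: C = A * B with 8-bit inputs.
// A is rows x depth and B is depth x cols, both stored row-major.
// Acc is the accumulator and output type; sums wrap modulo 2^(bits in Acc).
template <typename Acc>
std::vector<Acc> matmul_u8(const std::vector<std::uint8_t> &a,
                           const std::vector<std::uint8_t> &b,
                           std::size_t rows, std::size_t depth, std::size_t cols) {
    assert(a.size() == rows * depth && b.size() == depth * cols);
    std::vector<Acc> c(rows * cols, Acc{0});
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            Acc acc = 0;
            for (std::size_t k = 0; k < depth; ++k) {
                // Widen both operands so the product itself is exact.
                const std::uint32_t prod =
                    std::uint32_t{a[i * depth + k]} * std::uint32_t{b[k * cols + j]};
                // Narrowing back to Acc wraps if Acc is too small for the sum.
                acc = static_cast<Acc>(acc + prod);
            }
            c[i * cols + j] = acc;
        }
    }
    return c;
}
```

With a 1 × 100 row and a 100 × 1 column that are all 255, `matmul_u8<std::uint32_t>` returns 6502500, while `matmul_u8<std::uint16_t>` wraps to 14436. Whether wrapping is acceptable depends on the value range of the real data.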

jxl1080 commented 3 months ago

@abadams Hi Adams, I'm not sure whether I misunderstood what you meant by "don't try to do a floating point matrix multiply". I changed my data type to 'uint8_t', but things got worse when I ran my generator with Adams2019: it now hits a segmentation fault without printing any error message.

abadams commented 3 months ago

Can you share a repro that crashes (including the build commands you're using)?

jxl1080 commented 3 months ago

@abadams Thank you for your help! Below is the code of my Halide Generator Class:

#include "Halide.h"

using namespace Halide;

class mMatmul_matmul_out1_fcn_halide_generator : public Halide::Generator<mMatmul_matmul_out1_fcn_halide_generator> {

public:
    Input<Buffer<uint8_t>> B1{"B1", 2};
    Input<Buffer<uint8_t>> A1{"A1", 2};
    Output<Buffer<uint16_t>> matmul_out1_fcn{"matmul_out1_fcn", 2};

    void generate() {
        RDom r(0, 100);
        matmul_out1(d1, d2) = sum(cast<uint16_t>(A1(d1, r))*cast<uint16_t>(B1(r, d2)));
        matmul_out1_fcn(d1, d2) = matmul_out1(d1, d2);
    }

    void schedule() {
    // Schedule is determined by autoscheduler. Need to set estimate on buffer
        if(using_autoscheduler()) {
            B1.dim(1).set_estimate(0, 100);
            B1.dim(0).set_estimate(0, 100);
            A1.dim(1).set_estimate(0, 100);
            A1.dim(0).set_estimate(0, 100);
            matmul_out1_fcn.set_estimate(d1, 0, 100).set_estimate(d2, 0, 100);
        }  else {
            // Default schedule
        }
    }

private:
    Var d1{"d1"};
    Var d2{"d2"};
    Func matmul_out1{"matmul_out1"};

};

HALIDE_REGISTER_GENERATOR(mMatmul_matmul_out1_fcn_halide_generator, mMatmul_matmul_out1_fcn_halide_gen)

I used binary 'Halide-17.0.1-x86-64-linux-52541176253e74467dabc42eeee63d9a62c199f6.tar.gz' downloaded from: https://github.com/halide/Halide/releases

My command for compiling the Halide Generator class is:

$ g++ mMatmul_matmul_out1_fcn_halide.cpp -std=c++17 ....../Halide-17.0.1-x86-64-linux/share/Halide/tools/GenGen.cpp -L ....../Halide-17.0.1-x86-64-linux/lib -lHalide -I ....../Halide-17.0.1-x86-64-linux/include -o mMatmul_matmul_out1_fcn_halide

My command for running the generator with Adams2019 (which gave me the segmentation fault) is:

$ export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:....../Halide-17.0.1-x86-64-linux/lib
$ ./mMatmul_matmul_out1_fcn_halide -f myPipeline -g mMatmul_matmul_out1_fcn_halide_gen -e h,o target=hexagon-32-noos-hvx-no_runtime autoscheduler.parallelism=2 autoscheduler=Adams2019 -p ....../Halide-17.0.1-x86-64-linux/lib/libautoschedule_adams2019.so -o ./

My command for running the generator with no autoscheduler (which worked for me) is:

$ export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:....../Halide-17.0.1-x86-64-linux/lib
$ ./mMatmul_matmul_out1_fcn_halide -f myPipeline -g mMatmul_matmul_out1_fcn_halide_gen -e h,o target=hexagon-32-noos-hvx-no_runtime -o ./

abadams commented 3 months ago

Looks like it's a compiler bug caused by the adams autoscheduler not really understanding what to do on hexagon, and producing some very strange code that then hit a corner case bug in the simplifier.

Let's use the human Adams autoscheduler instead. A reasonable schedule for this pipeline is:

matmul_out1_fcn.vectorize(d1, 128).parallel(d2, (B1.dim(1).extent() + 3) / 4);

but a more typical matmul schedule (for large matrices) is

    void generate() {
        RDom r(0, 100);
        // Note: changed from sum to += so that I can schedule the reduction var
        matmul_out1(d1, d2) += cast<uint16_t>(A1(d1, r)) * cast<uint16_t>(B1(r, d2));
        matmul_out1_fcn(d1, d2) = matmul_out1(d1, d2);

        Var d1i, d2i, d1o, d2o;
        matmul_out1_fcn.tile(d1, d2, d1o, d2o, d1i, d2i, 3 * 128, 4).vectorize(d1i, 128).unroll(d1i).unroll(d2i).parallel(d2o);
        matmul_out1.compute_at(matmul_out1_fcn, d1o).vectorize(d1, 128).unroll(d1).unroll(d2);
        matmul_out1.update().reorder(d1, d2, r).vectorize(d1, 128).unroll(d1).unroll(d2);
    }
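
As a rough mental model of the tiled schedule above, the loop nest it describes can be written out in plain C++. This is a hand-written sketch, not code generated by Halide: the function name `matmul_tiled`, the memory layout, and the requirement that the extents are exact multiples of the tile sizes are all assumptions made for illustration. Vectorization, unrolling and parallelism are only marked in comments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Tile sizes taken from the schedule: 3 * 128 along d1, 4 along d2.
constexpr std::size_t kTileD1 = 3 * 128;
constexpr std::size_t kTileD2 = 4;

// Hypothetical plain-C++ rendering of the scheduled loop nest.
// Layout assumption: the first coordinate is innermost in memory, so
// A(d1, r) = a[r * w + d1], B(r, d2) = b[d2 * k + r], out(d1, d2) = out[d2 * w + d1].
std::vector<std::uint16_t> matmul_tiled(const std::vector<std::uint8_t> &a,
                                        const std::vector<std::uint8_t> &b,
                                        std::size_t w, std::size_t h, std::size_t k) {
    // Assumption: no tail handling, extents are multiples of the tile sizes.
    assert(w % kTileD1 == 0 && h % kTileD2 == 0);
    assert(a.size() == w * k && b.size() == k * h);
    std::vector<std::uint16_t> out(w * h);
    for (std::size_t d2o = 0; d2o < h / kTileD2; ++d2o) {      // parallel(d2o)
        for (std::size_t d1o = 0; d1o < w / kTileD1; ++d1o) {
            // compute_at(matmul_out1_fcn, d1o): one tile of accumulators,
            // zero-initialized (the pure definition of the reduction).
            std::uint16_t acc[kTileD2][kTileD1] = {};
            // update().reorder(d1, d2, r): r is outermost, d1 innermost.
            for (std::size_t r = 0; r < k; ++r) {
                for (std::size_t d2i = 0; d2i < kTileD2; ++d2i) {      // unrolled
                    const std::size_t d2 = d2o * kTileD2 + d2i;
                    for (std::size_t d1i = 0; d1i < kTileD1; ++d1i) {  // 3 vectors of 128 lanes
                        const std::size_t d1 = d1o * kTileD1 + d1i;
                        acc[d2i][d1i] = static_cast<std::uint16_t>(
                            acc[d2i][d1i] + a[r * w + d1] * b[d2 * k + r]);
                    }
                }
            }
            // Copy the finished tile into the output buffer.
            for (std::size_t d2i = 0; d2i < kTileD2; ++d2i)
                for (std::size_t d1i = 0; d1i < kTileD1; ++d1i)
                    out[(d2o * kTileD2 + d2i) * w + d1o * kTileD1 + d1i] = acc[d2i][d1i];
        }
    }
    return out;
}
```

The point of this shape is that each step of `r` reuses a whole tile of accumulators, so every loaded input value contributes to many outputs before it is needed again.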

I usually do my scheduling inside the generate() method. In this case I needed to, in order to access the RDom. You could also make the RDom a class member instead of a local.

For a great schedule, you need to start worrying about things like managing DMAs into Hexagon's cache.