halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Hexagon HVX uint32 x uint32 vector multiply clarification #7290

Open RafaeNoor opened 1 year ago

RafaeNoor commented 1 year ago

Hello! I'm trying to understand how the Hexagon backend (CodeGen_Hexagon.cpp) lowers uint32 multiplies through its HVX runtime file.

I have the following C++ generator code:

    // Generator inputs and output (class members).
    Input<Buffer<uint32_t>> input1{"input1", 2};
    Input<Buffer<uint32_t>> input2{"input2", 2};
    Output<Buffer<uint32_t>> output{"output", 2};

        // Generator body: element-wise uint32 multiply, vectorized along x.
        Var x("x");
        Var y("y");

        output(x, y) = input1(x, y) * input2(x, y);

        const int vector_size = natural_vector_size<uint32_t>();
        output.vectorize(x, vector_size);

The corresponding stmt contents are:

    output[ramp((output.s0.x.x*32) + ((output.stride.1*t21) + t22), 1, 32)] = input1[ramp((output.s0.x.x*32) + ((input1.stride.1*t21) + t9), 1, 32)]*input2[ramp((output.s0.x.x*32) + ((input2.stride.1*t21) + t10), 1, 32)]

When I look at the generated LLVM bitcode file, the HVX runtime method being invoked is @halide.hexagon.mul.vuw.vuw(<64 x i32> %a, <64 x i32> %b) (see hvx_128.ll). Note that its operands are <64 x i32> instead of <32 x i32>.

The relevant statements in the bitcode file are:

    %71 = tail call <64 x i32> @llvm.hexagon.V6.vcombine.128B(<32 x i32> undef, <32 x i32> %66)
    %72 = tail call <64 x i32> @llvm.hexagon.V6.vcombine.128B(<32 x i32> undef, <32 x i32> %70)
    %a_lo.i.1 = tail call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %71) #11
    %a_hi.i.1 = tail call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %71) #11
    %b_lo.i.1 = tail call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %72) #11
    %b_hi.i.1 = tail call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %72) #11
    %a_e.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vshufeh.128B(<32 x i32> %a_hi.i.1, <32 x i32> %a_lo.i.1) #11
    %a_o.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vshufoh.128B(<32 x i32> %a_hi.i.1, <32 x i32> %a_lo.i.1) #11
    %b_e.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vshufeh.128B(<32 x i32> %b_hi.i.1, <32 x i32> %b_lo.i.1) #11
    %b_o.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vshufoh.128B(<32 x i32> %b_hi.i.1, <32 x i32> %b_lo.i.1) #11
    %ab_e.i.1 = tail call <64 x i32> @llvm.hexagon.V6.vmpyuhv.128B(<32 x i32> %a_e.i.1, <32 x i32> %b_e.i.1) #11
    %ab_o1.i.1 = tail call <64 x i32> @llvm.hexagon.V6.vmpyuhv.128B(<32 x i32> %a_o.i.1, <32 x i32> %b_e.i.1) #11
    %ab_o.i.1 = tail call <64 x i32> @llvm.hexagon.V6.vmpyuhv.acc.128B(<64 x i32> %ab_o1.i.1, <32 x i32> %a_e.i.1, <32 x i32> %b_o.i.1) #11
    %a_lo.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %ab_e.i.1) #11
    %l_lo.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %ab_o.i.1) #11
    %s_lo.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vaslw.acc.128B(<32 x i32> %a_lo.i.i.1, <32 x i32> %l_lo.i.i.1, i32 16) #11
    %a_hi.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %ab_e.i.1) #11
    %l_hi.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %ab_o.i.1) #11
    %s_hi.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vaslw.acc.128B(<32 x i32> %a_hi.i.i.1, <32 x i32> %l_hi.i.i.1, i32 16) #11
    %s.i.i.1 = tail call <64 x i32> @llvm.hexagon.V6.vcombine.128B(<32 x i32> %s_hi.i.i.1, <32 x i32> %s_lo.i.i.1) #11
    %73 = tail call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %s.i.i.1)
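
For readers following the sequence above, here is a minimal scalar sketch of what it appears to compute in each 32-bit lane: the low 32 bits of a 32x32 unsigned product, built from 16x16 multiplies. The helper name mul_u32_via_u16 is hypothetical. The mapping of each line to an intrinsic is an assumption read off the value names in the IR, not something the thread confirms.

```cpp
#include <cstdint>

// Hypothetical helper: low 32 bits of a * b using only 16x16 -> 32 multiplies.
// Variable names mirror the IR values (a_e = even/low halfword, a_o = odd/high).
uint32_t mul_u32_via_u16(uint32_t a, uint32_t b) {
    uint32_t a_e = a & 0xFFFFu;  // low 16 bits of a
    uint32_t a_o = a >> 16;      // high 16 bits of a
    uint32_t b_e = b & 0xFFFFu;  // low 16 bits of b
    uint32_t b_o = b >> 16;      // high 16 bits of b

    // Assumed to correspond to vmpyuhv(a_e, b_e).
    uint32_t ab_e = a_e * b_e;
    // Assumed to correspond to vmpyuhv(a_o, b_e) followed by vmpyuhv.acc(a_e, b_o).
    // The sum may wrap modulo 2^32, which is harmless after the shift below.
    uint32_t ab_o = a_o * b_e + a_e * b_o;
    // Assumed to correspond to vaslw.acc(ab_e, ab_o, 16).
    // The a_o * b_o term would be shifted left by 32 bits, so it never contributes.
    return ab_e + (ab_o << 16);
}
```

The identity used is a*b mod 2^32 = a_e*b_e + ((a_o*b_e + a_e*b_o) << 16), which would explain why only three halfword multiplies appear in the IR.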

%66 and %70 are the loaded vector registers. Given the statements above, wouldn't this be undefined behavior? Both vectors are combined with undef, and the multiplications then operate on the resulting double vectors, undef half included. Is there any description of why this runtime method is correct?
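
One way to probe the question is a small lane-level model that checks whether the contents of the high half can reach the low half of the result. The sketch below is not Halide or LLVM code. The types Vec and Pair and the lane count N are hypothetical, and the intrinsic semantics it encodes (vcombine placing its first operand in the high half, vshufeh/vshufoh interleaving halfwords per word, vmpyuhv putting even-halfword products in the low vector and odd-halfword products in the high vector) are assumptions that should be checked against the HVX reference manual. It also says nothing about how LLVM itself treats undef as an intrinsic operand.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 4;  // lanes per single vector (32 on 128-byte HVX)
using Vec = std::array<uint32_t, N>;
struct Pair { Vec hi, lo; };  // models a <64 x i32> double vector

// Assumed: each result word = (low halfword of u) : (low halfword of v).
Vec vshufeh(const Vec &u, const Vec &v) {
    Vec d{};
    for (std::size_t i = 0; i < N; ++i) d[i] = (u[i] << 16) | (v[i] & 0xFFFFu);
    return d;
}

// Assumed: each result word = (high halfword of u) : (high halfword of v).
Vec vshufoh(const Vec &u, const Vec &v) {
    Vec d{};
    for (std::size_t i = 0; i < N; ++i) d[i] = (u[i] & 0xFFFF0000u) | (v[i] >> 16);
    return d;
}

// Assumed: lo holds even-halfword products, hi holds odd-halfword products.
Pair vmpyuhv(const Vec &u, const Vec &v) {
    Pair p{};
    for (std::size_t i = 0; i < N; ++i) {
        p.lo[i] = (u[i] & 0xFFFFu) * (v[i] & 0xFFFFu);
        p.hi[i] = (u[i] >> 16) * (v[i] >> 16);
    }
    return p;
}

Pair vmpyuhv_acc(Pair acc, const Vec &u, const Vec &v) {
    Pair p = vmpyuhv(u, v);
    for (std::size_t i = 0; i < N; ++i) {
        acc.lo[i] += p.lo[i];
        acc.hi[i] += p.hi[i];
    }
    return acc;
}

// Assumed: acc + (v << s) per 32-bit lane.
Vec vaslw_acc(Vec acc, const Vec &v, int s) {
    for (std::size_t i = 0; i < N; ++i) acc[i] += v[i] << s;
    return acc;
}

// Same sequence of steps as the IR listed in the question.
Pair mul_vuw_vuw(const Pair &a, const Pair &b) {
    Vec a_e = vshufeh(a.hi, a.lo), a_o = vshufoh(a.hi, a.lo);
    Vec b_e = vshufeh(b.hi, b.lo), b_o = vshufoh(b.hi, b.lo);
    Pair ab_e = vmpyuhv(a_e, b_e);
    Pair ab_o = vmpyuhv_acc(vmpyuhv(a_o, b_e), a_e, b_o);
    return Pair{vaslw_acc(ab_e.hi, ab_o.hi, 16), vaslw_acc(ab_e.lo, ab_o.lo, 16)};
}
```

Under these assumptions, calling mul_vuw_vuw with different filler values in the hi halves leaves the lo half of the result unchanged. If the model is accurate, the final V6.lo extraction would discard every lane that the undef operand could influence.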

rootjalex commented 1 year ago

@pranavb-ca might be able to help clarify this?