Hexagon HVX uint32 x uint32 vector multiply clarification

Hello! I'm trying to understand how the Hexagon backend (CodeGen_Hexagon.cpp) lowers uint32 multiplies, and in particular the HVX runtime function it calls.

My C++ generator contains the following:

    Input<Buffer<uint32_t>> input1{"input1", 2};
    Input<Buffer<uint32_t>> input2{"input2", 2};
    Output<Buffer<uint32_t>> output{"output", 2};

    Var x("x");
    Var y("y");

    output(x, y) = input1(x, y) * input2(x, y);

    const int vector_size = natural_vector_size<uint32_t>();
    output.vectorize(x, vector_size);

The corresponding stmt contents are:

    output[ramp((output.s0.x.x*32) + ((output.stride.1*t21) + t22), 1, 32)] = input1[ramp((output.s0.x.x*32) + ((input1.stride.1*t21) + t9), 1, 32)]*input2[ramp((output.s0.x.x*32) + ((input2.stride.1*t21) + t10), 1, 32)]

When I look at the generated LLVM bitcode, the multiply is lowered to a call to the HVX runtime function @halide.hexagon.mul.vuw.vuw(<64 x i32> %a, <64 x i32> %b) from hvx_128.ll. Note that its operands are <64 x i32> rather than <32 x i32>.

The relevant instructions in the bitcode are:

    %71 = tail call <64 x i32> @llvm.hexagon.V6.vcombine.128B(<32 x i32> undef, <32 x i32> %66)
    %72 = tail call <64 x i32> @llvm.hexagon.V6.vcombine.128B(<32 x i32> undef, <32 x i32> %70)
    %a_lo.i.1 = tail call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %71) #11
    %a_hi.i.1 = tail call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %71) #11
    %b_lo.i.1 = tail call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %72) #11
    %b_hi.i.1 = tail call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %72) #11
    %a_e.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vshufeh.128B(<32 x i32> %a_hi.i.1, <32 x i32> %a_lo.i.1) #11
    %a_o.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vshufoh.128B(<32 x i32> %a_hi.i.1, <32 x i32> %a_lo.i.1) #11
    %b_e.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vshufeh.128B(<32 x i32> %b_hi.i.1, <32 x i32> %b_lo.i.1) #11
    %b_o.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vshufoh.128B(<32 x i32> %b_hi.i.1, <32 x i32> %b_lo.i.1) #11
    %ab_e.i.1 = tail call <64 x i32> @llvm.hexagon.V6.vmpyuhv.128B(<32 x i32> %a_e.i.1, <32 x i32> %b_e.i.1) #11
    %ab_o1.i.1 = tail call <64 x i32> @llvm.hexagon.V6.vmpyuhv.128B(<32 x i32> %a_o.i.1, <32 x i32> %b_e.i.1) #11
    %ab_o.i.1 = tail call <64 x i32> @llvm.hexagon.V6.vmpyuhv.acc.128B(<64 x i32> %ab_o1.i.1, <32 x i32> %a_e.i.1, <32 x i32> %b_o.i.1) #11
    %a_lo.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %ab_e.i.1) #11
    %l_lo.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %ab_o.i.1) #11
    %s_lo.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vaslw.acc.128B(<32 x i32> %a_lo.i.i.1, <32 x i32> %l_lo.i.i.1, i32 16) #11
    %a_hi.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %ab_e.i.1) #11
    %l_hi.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.hi.128B(<64 x i32> %ab_o.i.1) #11
    %s_hi.i.i.1 = tail call <32 x i32> @llvm.hexagon.V6.vaslw.acc.128B(<32 x i32> %a_hi.i.i.1, <32 x i32> %l_hi.i.i.1, i32 16) #11
    %s.i.i.1 = tail call <64 x i32> @llvm.hexagon.V6.vcombine.128B(<32 x i32> %s_hi.i.i.1, <32 x i32> %s_lo.i.i.1) #11
    %73 = tail call <32 x i32> @llvm.hexagon.V6.lo.128B(<64 x i32> %s.i.i.1)

%66 and %70 are the loaded vector registers. Given this sequence, wouldn't there be undefined behavior, since each loaded vector is combined with undef and the multiplications then operate on the undef half as well? Is there any description of why this runtime method is correct?
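
For reference, this is my reading of the arithmetic the listing performs on a single 32-bit lane, written as a scalar C++ sketch. It assumes that the "even" halfword selected by vshufeh is the low 16 bits of each word and the "odd" halfword selected by vshufoh is the high 16 bits. The helper names are hypothetical and not part of Halide. It only models the per-lane math, not how the undef half of the vector pair is handled.

```cpp
#include <cstdint>

// Scalar model of one uint32 lane (hypothetical helper, not Halide code).
// Assumption: even halfword = low 16 bits, odd halfword = high 16 bits.
uint32_t mul32_via_halfwords(uint32_t a, uint32_t b) {
    uint32_t a_e = a & 0xFFFFu;  // even (low) halfword of a
    uint32_t a_o = a >> 16;      // odd (high) halfword of a
    uint32_t b_e = b & 0xFFFFu;  // even (low) halfword of b
    uint32_t b_o = b >> 16;      // odd (high) halfword of b

    uint32_t ab_e = a_e * b_e;   // models the first vmpyuhv: 16x16 -> 32
    uint32_t ab_o = a_o * b_e;   // models the second vmpyuhv
    ab_o += a_e * b_o;           // models vmpyuhv.acc (wraps modulo 2^32)

    // Models vaslw.acc with shift 16: the a_o * b_o term would be shifted
    // by 32 bits, so it never contributes to the low 32 bits of the product.
    return ab_e + (ab_o << 16);
}

// Compares the model against the native wrapping 32-bit multiply.
bool matches_native(uint32_t a, uint32_t b) {
    return mul32_via_halfwords(a, b) == static_cast<uint32_t>(a * b);
}
```

If this reading is right, the per-lane math matches a wrapping uint32 multiply, so my question is really only about the lanes that come from the undef half.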

halide/Halide issue #7290