I've been trying to write some Halide code for image warping (somewhat similar to `torch.grid_sample`), but I keep having trouble convincing the compiler to generate `vgather` instructions for the Hexagon DSP.
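For concreteness, a warp of this kind boils down to a data-dependent load per output pixel, which is the access pattern a gather instruction is meant to vectorize. The following is only a simplified scalar sketch, assuming nearest-neighbour sampling, clamped coordinates and 16-bit pixels; it is not the actual Halide pipeline, and `warp_nearest` and its parameters are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for a nearest-neighbour warp (illustrative only).
// Each output pixel reads the source at a coordinate taken from a
// per-pixel map; out-of-range coordinates are clamped to the image.
std::vector<uint16_t> warp_nearest(const std::vector<uint16_t>& src,
                                   int src_w, int src_h,
                                   const std::vector<int>& map_x,
                                   const std::vector<int>& map_y,
                                   int dst_w, int dst_h) {
    const std::size_t n = static_cast<std::size_t>(dst_w) * dst_h;
    assert(src.size() == static_cast<std::size_t>(src_w) * src_h);
    assert(map_x.size() == n && map_y.size() == n);

    std::vector<uint16_t> dst(n);
    for (int y = 0; y < dst_h; ++y) {
        for (int x = 0; x < dst_w; ++x) {
            const std::size_t i = static_cast<std::size_t>(y) * dst_w + x;
            const int sx = std::min(std::max(map_x[i], 0), src_w - 1);
            const int sy = std::min(std::max(map_y[i], 0), src_h - 1);
            // Data-dependent load: the address depends on the map values.
            dst[i] = src[static_cast<std::size_t>(sy) * src_w + sx];
        }
    }
    return dst;
}
```

Vectorizing the inner loop turns the indexed read of `src` into one gather per vector of output pixels.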
1. There seems to be no way to specify that an input buffer is in the VTCM, meaning that Halide always has to do the allocation and copying itself.
2. If the image is copied to the VTCM at root (`compute_root()`), parallelizing the output seems to break the gather instructions. If I'm reading the Hexagon HVX manual right, the gather ops have quite high latency, so not being able to hide the latency with parallelism can have a significant performance penalty.
3. It seems that any "no-op" transformations applied to the input image, such as reinterpreting the values or reshaping, also break the gathers.
If I avoid all three issues, I do manage to generate gather instructions, but I think points 1 and 2 hurt performance quite a bit. If it weren't for point 1, I could implement the parallelism myself by splitting the output into horizontal slices before passing them to Halide, but I'm quite sure the code is memory bound, so having to copy the data on every call defeats the purpose.
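The manual slicing mentioned above could look roughly like the sketch below. It only shows how the output rows would be divided into horizontal strips; `process_rows` is a hypothetical stand-in for a call into the compiled Halide pipeline restricted to a row range, and how the strips get dispatched to threads is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Split the rows [0, height) into at most num_slices contiguous strips.
// Returns (first_row, row_count) pairs that cover every row exactly once.
std::vector<std::pair<int, int>> make_row_slices(int height, int num_slices) {
    assert(height >= 0 && num_slices > 0);
    std::vector<std::pair<int, int>> slices;
    const int rows_per_slice = (height + num_slices - 1) / num_slices;  // ceiling
    for (int y0 = 0; y0 < height; y0 += rows_per_slice) {
        slices.emplace_back(y0, std::min(rows_per_slice, height - y0));
    }
    return slices;
}

// Invoke process_rows(y0, rows) once per strip, in order.
// process_rows is hypothetical: it stands in for running the pipeline
// on output rows [y0, y0 + rows). A real version would hand each strip
// to its own worker thread instead of calling them sequentially.
void for_each_slice(int height, int num_slices,
                    const std::function<void(int, int)>& process_rows) {
    for (const auto& s : make_row_slices(height, num_slices)) {
        process_rows(s.first, s.second);
    }
}
```

With this approach every `process_rows` call is a separate pipeline invocation, so each one would repeat the copy of the input into VTCM, which is exactly the cost described in point 1.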
Am I doing something wrong, or are these current limitations of the compiler? If so, any ideas on how to work around them?