Vectorized reads are aimed at interpolation strategies that need several
consecutive values in memory: linear and cosine need two, others four or more.
For any read not near the border of the circular buffer, this could be a nice
speedup over regular indexing. Near the border the values aren't consecutive in
memory, so a slow path has to handle that case.
This probably requires intrinsics. Should we assume SSE2 (always available on
x86-64) or detect CPU features at runtime?
(thx to @i-Zaak for idea)
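
A minimal sketch of the fast/slow split described above, in plain C without intrinsics. The names `ring_t`, `ring_read` and `ring_lerp` are hypothetical and only for illustration; they are not taken from the existing code, and the float element type is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical circular buffer view; not the project's actual type. */
typedef struct {
    const float *data;  /* backing storage */
    size_t size;        /* number of elements, assumed > 0 */
} ring_t;

/* Copy n logically consecutive samples starting at pos into out. */
static void ring_read(const ring_t *rb, size_t pos, float *out, size_t n)
{
    pos %= rb->size;
    if (pos + n <= rb->size) {
        /* Fast path: the samples are contiguous in memory, so a single
         * block copy (or one vector load) is enough. */
        memcpy(out, rb->data + pos, n * sizeof *out);
    } else {
        /* Slow path: the read crosses the border, so wrap each index. */
        for (size_t k = 0; k < n; ++k)
            out[k] = rb->data[(pos + k) % rb->size];
    }
}

/* Linear interpolation between sample pos and the one after it. */
static float ring_lerp(const ring_t *rb, size_t pos, float frac)
{
    float v[2];
    ring_read(rb, pos, v, 2);
    return v[0] + frac * (v[1] - v[0]);
}
```

For the four-sample case, the `memcpy` in the fast path is the spot where an unaligned vector load such as `_mm_loadu_ps` could be used instead, if intrinsics turn out to be necessary. Compilers may already turn a fixed-size `memcpy` into such a load, so it is worth measuring before committing to intrinsics.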