Vectorized reads for history interpolation

(thanks to @i-Zaak for the idea)

The interpolation strategies all need several consecutive history values: linear & cosine need two, others four or more. For any read that doesn't straddle the wrap-around point of the circular buffer, those values sit next to each other in memory, so a single vector load could be a nice speedup over regular per-element indexing. Near the border the values aren't contiguous in memory, so a slow path is still needed for that case.
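A rough sketch of the fast/slow split for the two-point (linear) case, just to make the idea concrete. `hist_lerp`, the buffer layout (one `double` per time step, next sample at `i + 1`) and the argument names are assumptions for illustration, not existing libtvb API; it falls back to scalar loads when SSE2 isn't enabled at compile time.

```c
#include <stddef.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Hypothetical helper: linear interpolation between buf[i] and the next
 * sample in a circular buffer of length len. Assumes i < len, 0 <= t <= 1. */
static double hist_lerp(const double *buf, size_t len, size_t i, double t)
{
    double a, b;
    if (i + 1 < len) {
        /* Fast path: both samples are contiguous in memory. */
#ifdef __SSE2__
        __m128d v = _mm_loadu_pd(buf + i);        /* v = [buf[i], buf[i+1]] */
        a = _mm_cvtsd_f64(v);                     /* low lane  */
        b = _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)); /* high lane */
#else
        a = buf[i];
        b = buf[i + 1];
#endif
    } else {
        /* Slow path: the pair straddles the wrap-around point. */
        a = buf[i];
        b = buf[0];
    }
    return a + t * (b - a);
}
```

For a single scalar pair like this the compiler may already do as well as the intrinsic; the real gain, if any, would more likely show up with the four-point schemes or when reading several nodes at once, and would need benchmarking.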

This requires (?) using intrinsics, so should we just assume SSE2, or use CPU detection?
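For reference, a sketch of what the two options could look like side by side. `have_sse2` is a made-up name; `__SSE2__` is the macro GCC and Clang define when the build target guarantees SSE2 (which is the case for x86-64 builds, since SSE2 is part of that baseline), and `__builtin_cpu_supports` is a GCC/Clang builtin for x86, so other compilers would need a different run-time check.

```c
/* Hypothetical helper: returns 1 if SSE2 can be used, 0 otherwise. */
static int have_sse2(void)
{
#if defined(__SSE2__)
    /* Compile-time: the target already guarantees SSE2. */
    return 1;
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
    /* Run-time: ask the CPU (GCC/Clang builtin, x86 only). */
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    /* Unknown compiler or non-x86 target: use the scalar path. */
    return 0;
#endif
}
```

If only 64-bit x86 builds matter, the compile-time branch is always taken and run-time detection is unnecessary; it would only matter for 32-bit x86 or for wider instruction sets such as AVX.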

maedoc / libtvb

Vectorized reads for history interpolation #95