Open william-silversmith opened 4 years ago
Not clear if this is possible. You might just end up exchanging latency on the core data for latency in the calculation on range and vertex.
Might be able to make use of `__builtin_prefetch` for the Z axis (for g++ and clang).
In testing, it seems that about 69% of the run time on my test volume was bound by memory latency. This would seem to indicate that a cache-aware version could be ~2-3x faster.
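
As a back-of-the-envelope check on that range, here is an Amdahl-style bound. It assumes the stalled fraction $f = 0.69$ is the only part that improves and that a fraction $h$ of it can be hidden. $h$ is a hypothetical parameter, not something that was measured.

$$
S(h) = \frac{1}{1 - f\,h}, \qquad S(1) = \frac{1}{1 - 0.69} \approx 3.2, \qquad S(0.75) = \frac{1}{1 - 0.5175} \approx 2.1
$$

So ~3.2x is the ceiling if all of the memory stalls disappear, and hiding about three quarters of them would already give roughly 2x.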