manodeep closed this issue 8 years ago
In a scathing indictment of how much I don't understand about processors, caches, and typical particle loads, the correct loop-blocking implementation is slower!
Keeping this issue open for now for further testing.
With the kernel approach slotted for full release with v2.0, this loop-blocking will become a non-issue.
Loop-blocking is not effective simply because the typical particle load per cell is very small, <~ O(1k). For double-precision types, the total data load is (1k particles per cell) * (8 bytes per element) * (3 position fields) * (2 cells in use) ~ 1k*8*3*2 = 48 kB < 64 kB (typical L1 cache), so the working set already fits in cache without any blocking.
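The estimate above can be checked with a few lines of C. This is only a sketch of the arithmetic; the function name `pair_working_set_bytes` and its parameters are hypothetical and do not come from the repository.

```c
#include <stddef.h>

/* Hypothetical helper (not part of the code base): bytes touched when
 * pairing cells. Each cell holds n_per_cell particles, and each particle
 * stores n_fields coordinates of elem_size bytes each. */
static size_t pair_working_set_bytes(size_t n_per_cell, size_t elem_size,
                                     size_t n_fields, size_t n_cells)
{
    return n_per_cell * elem_size * n_fields * n_cells;
}
```

With 1000 particles per cell, 8-byte doubles, 3 position fields, and 2 cells, this gives 48000 bytes, which is below the 64 kB figure quoted above.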
Looking at `countpairs.c` under `xi_theory/xi_of_r`, the quadruple for loop is in the wrong order. It should be `i`, then `j`, then `ii` and then `jj`.
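For reference, here is a minimal sketch of a blocked pair count with the loops nested in the order `i`, `j`, `ii`, `jj`. It assumes that `i` and `j` step over blocks of the two particle lists while `ii` and `jj` run over the particles inside the current blocks; that reading, the function name `count_pairs_blocked`, and the `BLOCK` size are assumptions for illustration, not the actual code in `countpairs.c`.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 64 /* hypothetical block size, not taken from the repository */

/* Counts pairs with separation r < rmax between two particle lists.
 * Loop order: i (block of list 1), j (block of list 2),
 * then ii and jj over the particles inside those blocks. */
static uint64_t count_pairs_blocked(const double *x1, const double *y1,
                                    const double *z1, size_t n1,
                                    const double *x2, const double *y2,
                                    const double *z2, size_t n2,
                                    double rmax)
{
    const double sqr_rmax = rmax * rmax;
    uint64_t npairs = 0;

    for (size_t i = 0; i < n1; i += BLOCK) {
        const size_t iend = (i + BLOCK < n1) ? i + BLOCK : n1;
        for (size_t j = 0; j < n2; j += BLOCK) {
            const size_t jend = (j + BLOCK < n2) ? j + BLOCK : n2;
            for (size_t ii = i; ii < iend; ii++) {
                for (size_t jj = j; jj < jend; jj++) {
                    const double dx = x1[ii] - x2[jj];
                    const double dy = y1[ii] - y2[jj];
                    const double dz = z1[ii] - z2[jj];
                    const double r2 = dx * dx + dy * dy + dz * dz;
                    if (r2 < sqr_rmax) {
                        npairs++;
                    }
                }
            }
        }
    }
    return npairs;
}
```

The result is the same as a plain double loop over all pairs; only the traversal order changes, so any speed difference comes from cache behaviour alone.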