Loop blocking is incorrect

manodeep / Corrfunc

⚡️⚡️⚡️Blazing fast correlation functions on the CPU.

https://corrfunc.readthedocs.io

MIT License

Loop blocking is incorrect #38

Closed manodeep closed 8 years ago

manodeep commented 8 years ago

Looking at countpairs.c under xi_theory/xi_of_r, the quadruple-nested for loop is in the wrong order. The nesting should be i, then j, then ii, and then jj.
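
For illustration, here is a minimal sketch of the i, j, ii, jj ordering. It assumes that i and j step over blocks of the two cells while ii and jj walk the particles inside the current blocks. The function name, signature, and block_size parameter are hypothetical and are not taken from countpairs.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Corrfunc's API): counts pairs, one particle from
 * each cell, whose squared separation is below rmax_sq, using loop blocking. */
int64_t count_pairs_blocked(const double *x1, const double *y1, const double *z1, size_t n1,
                            const double *x2, const double *y2, const double *z2, size_t n2,
                            double rmax_sq, size_t block_size)
{
    assert(block_size > 0);
    int64_t npairs = 0;

    for (size_t i = 0; i < n1; i += block_size) {            /* blocks of cell 1 */
        const size_t i_end = (i + block_size < n1) ? i + block_size : n1;
        for (size_t j = 0; j < n2; j += block_size) {        /* blocks of cell 2 */
            const size_t j_end = (j + block_size < n2) ? j + block_size : n2;
            for (size_t ii = i; ii < i_end; ii++) {          /* within block of cell 1 */
                for (size_t jj = j; jj < j_end; jj++) {      /* within block of cell 2 */
                    const double dx = x1[ii] - x2[jj];
                    const double dy = y1[ii] - y2[jj];
                    const double dz = z1[ii] - z2[jj];
                    const double r2 = dx * dx + dy * dy + dz * dz;
                    if (r2 < rmax_sq) {
                        npairs++;
                    }
                }
            }
        }
    }
    return npairs;
}
```

The pair count does not depend on block_size; only the memory access pattern changes, so any speed difference comes from cache behaviour and loop overhead.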

manodeep commented 8 years ago

In a scathing indictment of how much I don't understand about processors, caches and typical particle loads, the correct loop blocking implementation is slower!

Keeping this issue open for now for further testing.

manodeep commented 8 years ago

With the kernel approach slated for full release with v2.0, this loop-blocking will become a non-issue.

manodeep commented 8 years ago

Loop-blocking is not effective simply because the typical particle load per cell is very small, <~ O(1k). For double precision types, the total data load is (1k particles per cell) * (8 bytes per element) * (3 fields of positions) * (2 cells that are being used) ~ 1k*8*3*2 = 48 kB < 64 kB (typical L1 cache). The position data for both cells already fits in cache without any blocking.
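
The same back-of-the-envelope estimate can be written as a small sketch. The helper names are hypothetical, and the 64 kB cache size is the figure assumed above rather than a measured value for any particular CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: bytes of position data touched when pairing two cells. */
size_t working_set_bytes(size_t particles_per_cell, size_t bytes_per_element)
{
    const size_t n_fields = 3; /* x, y, z positions */
    const size_t n_cells = 2;  /* the two cells being paired */
    return particles_per_cell * bytes_per_element * n_fields * n_cells;
}

/* Hypothetical helper: true if the working set fits in a cache of the given size. */
bool fits_in_cache(size_t bytes, size_t cache_bytes)
{
    assert(cache_bytes > 0); /* a zero-sized cache would be a caller error */
    return bytes <= cache_bytes;
}
```

With 1000 particles per cell and 8-byte doubles, working_set_bytes returns 48000, which fits in a 64 kB cache. At roughly 4000 particles per cell it would be 192000 bytes and would no longer fit, which is the regime where blocking could start to matter.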