patmorin opened this issue 9 years ago
Hi Pat, Thank you for your input.
Actually, I've implemented deep prefetching. Look into, e.g., `sortheapbinaryaheadsimplevarianta.hpp` and you'll see the code:

```cpp
prefetch<1, 0>(a + std::min(left * 8, end));
ssize_t const newIndex = index * 2
    + compOp(a[left], Below, a[right]);
```
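To make the idea concrete, here is a minimal, self-contained sketch of a sift-down with one-level-ahead prefetching over a 1-based max-heap. The function names (`siftDown`, `heapSort`) and the plain `__builtin_prefetch` call are mine, not the repo's actual code, which uses its own `prefetch<>` and `compOp` helpers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hedged sketch (not the repo's exact implementation): sift-down over a
// 1-based max-heap in a[1..size]. Before comparing the children of `index`
// (at 2*index and 2*index+1), prefetch one level ahead: the grandchildren,
// which start at the contiguous index 4*index (clamped to the heap end).
static void siftDown(int* a, std::size_t size, std::size_t index) {
    while (true) {
        std::size_t left = 2 * index;
        if (left > size) break;
        __builtin_prefetch(a + std::min(4 * index, size));
        std::size_t larger = left;
        if (left + 1 <= size && a[left + 1] > a[left]) larger = left + 1;
        if (a[index] >= a[larger]) break;
        std::swap(a[index], a[larger]);
        index = larger;
    }
}

// Plain heapsort built on the prefetching sift-down; a[1..size] is sorted
// ascending (a[0] is unused because the heap is 1-based).
static void heapSort(int* a, std::size_t size) {
    for (std::size_t i = size / 2; i >= 1; --i) siftDown(a, size, i);
    for (std::size_t n = size; n > 1; --n) {
        std::swap(a[1], a[n]);
        siftDown(a, n - 1, 1);
    }
}
```

The prefetch is issued before the comparisons resolve which child we descend into, so the memory fetch for the next level overlaps with the current level's work.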
If the prefetching depth is N levels, then the number of simultaneous prefetches is up to N.
In the cascading variants, for a heap of depth N there can also be up to about N simultaneous prefetches, because cascading heapsort runs about N sift-downs concurrently, each with single-level prefetching.
In practice, prefetching seems to do no good within the levels of the heap that fit in the processor caches (i.e., the top of the heap), which is probably what you've found, and usually the vast majority of heap levels fit in the caches. We're also limited by the memory subsystem in how many simultaneous prefetches it can sustain, so increasing their number doesn't improve performance linearly. In some experiments with random array accesses I found that prefetching improved performance by about 2.5x on a Core 2 Duo, IIRC.
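The random-array-access experiment could look something like the following (a hedged sketch; the original benchmark's code isn't shown in this thread, and the prefetch distance `PF_DIST` is an assumed value that would need tuning per machine):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hedged sketch of a random-access prefetch experiment: sum values gathered
// through an index array, issuing a software prefetch PF_DIST iterations
// ahead so the gather at step i overlaps with the fetch for step i+PF_DIST.
std::uint64_t gatherSum(const std::vector<std::uint64_t>& data,
                        const std::vector<std::size_t>& idx) {
    constexpr std::size_t PF_DIST = 16;  // assumed distance, not from the thread
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + PF_DIST < idx.size())
            __builtin_prefetch(&data[idx[i + PF_DIST]]);
        sum += data[idx[i]];
    }
    return sum;
}
```

Timing this against the same loop with the prefetch removed, on a data array much larger than the last-level cache, is the kind of comparison that produced the ~2.5x figure.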
I was discussing the benchmark at http://encode.su/threads/1914-Sorting-Algorithms-Benchmark; you can find more of my findings there.
I'm currently writing a paper about searching ordered arrays that contains an explanation for why your cascading/prefetching variant of binary heapsort is so fast. It also contains an idea that will probably make it faster: rather than prefetching a child of the current node, prefetch one of its 4 grandchildren, 8 great-grandchildren, or 16 great-great-grandchildren. (Which to choose depends on your data size, in particular on how many elements fit into a cache line.)
You can read about this in the draft paper. Section 3.2 and (to a lesser extent) Section 4.1 contain the relevant information.
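The grandchild-prefetch idea can be sketched as index arithmetic (hedged: the function names and the clamping are mine, not from the draft paper). In a 1-based binary heap, the 2^depth descendants of node `index` that sit `depth` levels below it occupy the contiguous range starting at `index << depth`, so a single prefetch at that base address can pull several of them into cache when they share a cache line:

```cpp
#include <cassert>
#include <cstddef>

// First index of the 2^depth level-`depth` descendants of `index` in a
// 1-based heap: depth=1 gives the 2 children (at 2i, 2i+1), depth=2 the
// 4 grandchildren (4i..4i+3), depth=3 the 8 great-grandchildren, etc.
constexpr std::size_t descendantBase(std::size_t index, unsigned depth) {
    return index << depth;
}

// Hypothetical prefetch step during a sift-down, looking `depth` levels
// ahead; `size` clamps the address so we never prefetch past the heap.
// The right `depth` would be chosen from sizeof(element) vs. cache-line size.
inline void prefetchDescendants(const int* a, std::size_t size,
                                std::size_t index, unsigned depth) {
    std::size_t first = descendantBase(index, depth);
    __builtin_prefetch(a + (first < size ? first : size));
}
```

For 4-byte elements and 64-byte cache lines, 16 elements fit per line, which is what makes the great-great-grandchildren (16 of them, contiguous) an attractive single-prefetch target.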