patmorin opened this issue 9 years ago
Hi Pat, Thank you for your input.
Actually, I've implemented deep prefetching. Look into, e.g., `sortheapbinaryaheadsimplevarianta.hpp` and you'll see the code:

```cpp
prefetch<1, 0>(a + std::min(left * 8, end));
ssize_t const newIndex = index * 2
    + compOp(a[left], Below, a[right]);
```
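To make the idea concrete, here is a minimal, self-contained sketch of a sift-down with one-level-ahead prefetching over a 1-based max-heap. The function names (`siftDown`, `heapSort`) and the plain `__builtin_prefetch` call are mine, not the repo's actual code, which uses its own `prefetch<>` and `compOp` helpers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hedged sketch (not the repo's exact implementation): sift-down over a
// 1-based max-heap in a[1..size]. Before comparing the children of `index`
// (at 2*index and 2*index+1), prefetch one level ahead: the grandchildren,
// which start at the contiguous index 4*index (clamped to the heap end).
static void siftDown(int* a, std::size_t size, std::size_t index) {
    while (true) {
        std::size_t left = 2 * index;
        if (left > size) break;
        __builtin_prefetch(a + std::min(4 * index, size));
        std::size_t larger = left;
        if (left + 1 <= size && a[left + 1] > a[left]) larger = left + 1;
        if (a[index] >= a[larger]) break;
        std::swap(a[index], a[larger]);
        index = larger;
    }
}

// Plain heapsort built on the prefetching sift-down; a[1..size] is sorted
// ascending (a[0] is unused because the heap is 1-based).
static void heapSort(int* a, std::size_t size) {
    for (std::size_t i = size / 2; i >= 1; --i) siftDown(a, size, i);
    for (std::size_t n = size; n > 1; --n) {
        std::swap(a[1], a[n]);
        siftDown(a, n - 1, 1);
    }
}
```

The prefetch is issued before the comparisons resolve which child we descend into, so the memory fetch for the next level overlaps with the current level's work.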
If the prefetching depth is N levels, then the number of simultaneous prefetches is up to N.
In the cascading variants, for a heap of depth N there can also be up to about N simultaneous prefetches, because cascading heapsort runs about N sift-downs concurrently, each with single-level prefetching.
In practice, prefetching seems to do no good within the levels of the heap that fit in the processor caches (i.e., the top of the heap), which is probably what you've found, and usually the vast majority of heap levels fit in the caches. We're also limited by the memory subsystem in how many simultaneous prefetches it can sustain, so increasing their number doesn't improve performance linearly. In some experiments with random array accesses I found that prefetching improved performance by about 2.5x on a Core 2 Duo, IIRC.
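The random-array-access experiment could look something like the following (a hedged sketch; the original benchmark's code isn't shown in this thread, and the prefetch distance `PF_DIST` is an assumed value that would need tuning per machine):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hedged sketch of a random-access prefetch experiment: sum values gathered
// through an index array, issuing a software prefetch PF_DIST iterations
// ahead so the gather at step i overlaps with the fetch for step i+PF_DIST.
std::uint64_t gatherSum(const std::vector<std::uint64_t>& data,
                        const std::vector<std::size_t>& idx) {
    constexpr std::size_t PF_DIST = 16;  // assumed distance, not from the thread
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < idx.size(); ++i) {
        if (i + PF_DIST < idx.size())
            __builtin_prefetch(&data[idx[i + PF_DIST]]);
        sum += data[idx[i]];
    }
    return sum;
}
```

Timing this against the same loop with the prefetch removed, on a data array much larger than the last-level cache, is the kind of comparison that produced the ~2.5x figure.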
I was discussing the benchmark at http://encode.su/threads/1914-Sorting-Algorithms-Benchmark; you can find more of my findings there.
I'm currently writing a paper about searching ordered arrays that contains an explanation for why your cascading/prefetching variant of binary heapsort is so fast. It also contains an idea that will probably make it faster: rather than prefetching a child of the current node, prefetch one of its 4 grandchildren, 8 great-grandchildren, or 16 great-great-grandchildren. (Which to choose depends on your data size, in particular on how many elements fit into a cache line.)
You can read about this in the draft paper. Section 3.2 and (to a lesser extent) Section 4.1 contain the relevant information.
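The grandchild-prefetch idea can be sketched as index arithmetic (hedged: the function names and the clamping are mine, not from the draft paper). In a 1-based binary heap, the 2^depth descendants of node `index` that sit `depth` levels below it occupy the contiguous range starting at `index << depth`, so a single prefetch at that base address can pull several of them into cache when they share a cache line:

```cpp
#include <cassert>
#include <cstddef>

// First index of the 2^depth level-`depth` descendants of `index` in a
// 1-based heap: depth=1 gives the 2 children (at 2i, 2i+1), depth=2 the
// 4 grandchildren (4i..4i+3), depth=3 the 8 great-grandchildren, etc.
constexpr std::size_t descendantBase(std::size_t index, unsigned depth) {
    return index << depth;
}

// Hypothetical prefetch step during a sift-down, looking `depth` levels
// ahead; `size` clamps the address so we never prefetch past the heap.
// The right `depth` would be chosen from sizeof(element) vs. cache-line size.
inline void prefetchDescendants(const int* a, std::size_t size,
                                std::size_t index, unsigned depth) {
    std::size_t first = descendantBase(index, depth);
    __builtin_prefetch(a + (first < size ? first : size));
}
```

For 4-byte elements and 64-byte cache lines, 16 elements fit per line, which is what makes the great-great-grandchildren (16 of them, contiguous) an attractive single-prefetch target.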