Open JobLeonard opened 6 years ago
(apologies for the noise, I accidentally hit submit too early. Deleted the unfinished comment and will work it out in a MarkDown editor first, then post)
So I came up with another potential optimization. I doubt I'm the first to think of this method, but still. This one removes swaps altogether, allowing for direct assignment instead. The approach adds a fixed memory overhead for lifting up to 256 values out of the buckets to create "holes" to move to, and of course adds some extra overhead and quite some complexity to deal with initializing and fixing up those holes in the end.
It took me quite a few hours to work out the idea, but I think I got it. Now, I haven't tested the code I wrote - it's written from scratch and I already spent too much time on it today. Hopefully the code+comments manage to explain the idea regardless of whether it compiles.
The style of the code below will be based on @skarupke's explanation of American Flag Sort (AFS) given in his C++Now 2017 talk about ska_sort (BTW, thanks for a really clear presentation!)
The example uses uint_8_t
, so we don't have to worry about recursion. I have not tested the code below, but hopefully it is correct enough to get the idea across:
/*
* (note: I have not written C++ in years and this looks like it's a
* very finicky algorithm, so expect some dumb mistakes in here)
*/
void american_flag_sort(uint8_t * begin, const uint8_t * end)
{
// This is purely for readability later on, and can be
// optimized away by the compiler, I'm sure.
uint8_t * input = begin;
const size_t input_length = end - begin;
// The (initial) lower bound of each bucket
size_t lower_bound[256] = {0};
// The upper bound of each bucket
size_t upper_bound[256] = {0};
// count how often each item appears in lower_bound
// (two reads × N, one write × N)
for (const uint8_t * it = begin; it != end; ++it)
++lower_bound[*it];
// convert to lower and upper bounds
// (256 reads, 512 writes)
size_t total = 0;
for (size_t i = 0; i < 256; i++)
{
size_t old_count = lower_bound[i];
// only happens if all values are in the same bucket,
// in which case there is nothing to sort.
if (old_count == input_length) return;
lower_bound[i] = total;
total += old_count;
upper_bound[i] = total;
}
// Array to store a value from each non-empty bucket.
uint8_t lifted_values[256] = {0};
// Keep track of whether or not a bucket has a lifted value,
// and whether it has values to move around.
// 0b01 - "has lifted values"-flag
// 0b10 - "values to move"-flag
// So:
// 0b00: has neither lifted values nor values to move
// 0b01: has a lifted value, and all values have finished moving
// 0b10: has no lifted values, and not all values have moved yet
// 0b11: has a lifted value, and not all values have moved yet
uint8_t bucket_state[256] = {0};
// Keep track of how many buckets have values to move around
uint8_t total_moving = 0;
// Value to move, which does double duty storing the bucket
// we are moving the value to (due to how radix sort works).
uint8_t val = 0;
// We go over the buckets from back to front so that val ends
// up pointing to the first non-empty bucket. This ensures
// that for already sorted arrays, we will only do one
// ascending pass from front to back, which theoretically
// is more cache friendly.
for (uint8_t i = 255; i >= 0; i--)
{
size_t lb = lower_bound[i];
size_t ub = upper_bound[i];
// Is the bucket non-empty?
if (lb < ub)
{
// Create a "hole" in the bucket (dear Liza, dear Liza)
lifted_values[i] = input[lb++];
// Since we lifted input[lb], that bucket has one fewer
// value to move. We want lower bound to point to values
// in the bucket, so we add one to it. Note that when the
// bucket has only one value, lb now equals ub.
// Also, whenever a bucket has a hole, that hole's index
// will be at lower_bound[bucket] - 1.
lower_bound[i] = lb;
if (lb < ub)
{
++total_moving;
// lifted value and values to move
bucket_state[i] = 0b11;
}
else
{
// lifted value, but no values to move
bucket_state[i] = 0b01;
}
// val does double duty to store the first bucket
// to load a value from in the next loop
val = i;
}
}
if (total_moving == 0) goto finished_moving;
// We are moving the lifted value to its final position,
// so the "lifted" flag is removed from its bucket.
bucket_state[val] &= 0b10;
// Load the lifted value
val = lifted_values[val];
// The main loop for moving values around
for(;;)
{
size_t next_val_idx = lower_bound[val];
// next_val_idx may already point to the upper bound of
// a bucket, but even then next_val_idx - 1 will still
// point to the hole in that bucket (it was made for val,
// DRR DRR DRR).
input[next_val_idx - 1] = val;
if (next_val_idx < upper_bound[val])
{
// Only reached if next_val_idx points to a new
// value in the current bucket to move
val = input[next_val_idx];
}
else
{
// No more values to move from this bucket.
// Break loop if all buckets are done with moving
if (--total_moving == 0) goto finished_moving;
// Otherwise, unset moving flag for this bucket and
// find another bucket with values to move (which is
// guaranteed to exist at this point, otherwise we
// would have broken out of the loop above)
bucket_state[val] &= 0b01;
do val = (val + 1) & 0xFF;
while ((bucket_state[val] & 0b10) == 0);
}
}
finished_moving:
// At this point, all we have to do is fix the holes in the
// buckets (dear Henry, dear Henry) by moving the remaining
// lifted values to the right buckets. The holes that were
// initially created are now at the back of the buckets,
// and we can insert the lifted values with one pass:
for(uint8_t i = 0; i < 255; i++)
{
if (bucket_state[i]&0b01)
{
uint8_t val = lifted_values[i];
input[--upper_bound[val]] = val;
}
}
}
Let's compare the above AFS without swapping (AFS-NS) to an imagined AFS with swapping (AFS-WS) that matches the explanation of skarupke's lecture linked earlier before (here it is again). We'll make a number of assumptions.
First, that all other overhead is roughly the same under most circumstances. This seems plausible to work out in practice: both versions need to count how often each value occurs, both convert those counts to prefix sums for lower_bound
and upper_bound
. In theory both only have to move a value when it goes to a different bucket, and both can skip over buckets for which lower boundary equals upper boundary (I did not include those optimizations, for brevity).
Assuming size_t
is 64 bits, lower_bound
and upper_bound
take up 16 KiB both, which we can expect to fit into most L2 caches for desktop CPUs, accessing those arrays is probably less of a bottleneck as accessing the input array.
With AFS-NS, moving a value takes one read and one assignment, plus the fixed overhead of the initial 255 lifted values. AFS-NS also does to not have to go back and forth between two parts of the input array for every swap. It might end up being more cache friendly as a result.
For AFS-WS, it is probably safe to expect that the majority of elements require two swaps to reach their final position. Swapping typically requires a third temporary, but that will likely be a register without significant overhead. The real bottleneck will be the two reads and writes to the input array per element.
That suggests the benefit should grow larger as the input array grows larger and needs to access RAM (or worse, disk). So the ballpark figure for the expected benefit would be at most half of however much time is spent on moving values around in AFS-WS. Real perf benefits are likely less, since we already need a pass over the input to count all values, meaning we actually only reduced the number of reads by a third (then again, writes are the slower of the two).
As far as I can see, during the phase where each element gets swapped to its final bucket among 256 for the current byte being compared, swapping an element takes (more or less) two swaps: one out of its initial position, and one to its final position. Since swapping takes three assignments for two elements, that translates to about three swaps per element.
Andrei Alexandrescu introduced the "vacancy" trick for Hoare's partition, that basically turns swaps into "half-swaps", reducing three assignments to two (or one and a half per element to one)
This doesn't directly translate to radix sort, but after a bit of tinkering I came up with a way: basically, use two temporaries and ping-pong between them:
I don't know if this wording is easy to follow, but in essence it reduces the three assignments required to move an element to two, which given the impact of memory access could be quite significant