viceroypenguin / SuperLinq

Extensions to LINQ to Objects
Apache License 2.0

`PriorityQueue` implementation of `SortedMergeBy` #654

Closed danielearwicker closed 5 months ago

danielearwicker commented 5 months ago

I want to do a typical external merge sort over a large-ish number of sorted buckets: I pre-sort the chunks, and SortedMerge then gives me the fully sorted results. For large data sets I have enough memory to do this across a few hundred sorted chunks.

The problem is that SortedMerge slows down linearly with the number of enumerables it merges across, because it does a linear search through the remaining enumerators, performing key comparisons against half of them on average.

An alternative is to use the CLR's PriorityQueue to store the enumerators, which is very fast at returning the item with the lowest key; the priority key here is just the comparison key obtained from each enumerator's Current property.

Implementation and test results (done on a release build) here:

https://github.com/danielearwicker/PriorityQueueMergeSort

Rough idea:

// Seed the queue with every non-empty enumerator, prioritised by the key of
// its current element.
var queue = new PriorityQueue<IEnumerator<TSource>, TKey>(comparer);

foreach (var e in enumerators)
{
    if (e.MoveNext()) queue.Enqueue(e, keySelector(e.Current));
}

// Repeatedly take the enumerator with the smallest pending key, yield its
// element, then re-enqueue it under its next key (if any remain).
while (queue.TryDequeue(out var e, out var _))
{
    yield return e.Current;

    if (e.MoveNext()) queue.Enqueue(e, keySelector(e.Current));
}

Time to sort ints split randomly between N sorted collections, N = 2 to 30:

[benchmark chart comparing the current SortedMerge against the PriorityQueue version]

The current implementation is $O(N)$ per yielded element (roughly $N/2$ key comparisons), while the PQ version is $O(\log N)$. At N = 100, the PQ is over 8x faster; sorting strings it's 6x faster.

But:

  1. For very small N it's slightly slower: the overhead of delete/insert on the PQ outweighs the saving on comparisons.
  2. PriorityQueue makes no guarantees about preserving insertion order among equal keys. So the merge may behave differently where keys are equal, and that might theoretically affect existing apps.
  3. If you have two lists to merge, the left.SortedMerge(right) pattern is natural. If you have 100, sources.First().SortedMerge(sources.Skip(1)) is awkward.

So given 1 and 2 it seems safer to let the user choose the implementation. One natural way to do this would be to solve 3 at the same time and expose it as an extension method on IEnumerable<IEnumerable<T>>, so if you have a sequence of sorted sequences (as you would if there were lots of them), you can call SortedMerge on it directly, and that overload would use the implementation suited to many sequences.
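Concretely, the overload might be shaped something like this (a sketch only; the final name, parameters, and disposal on early termination would be worked out in the PR):

using System.Collections.Generic;

public static class SortedMergeManyExtensions
{
    // Hypothetical shape for the proposed overload on a sequence of sequences;
    // the real SuperLinq API may differ in naming and overloads.
    public static IEnumerable<T> SortedMerge<T>(
        this IEnumerable<IEnumerable<T>> sources,
        IComparer<T>? comparer = null)
    {
        comparer ??= Comparer<T>.Default;
        var queue = new PriorityQueue<IEnumerator<T>, T>(comparer);

        // Seed the queue with every non-empty source, keyed by its first element.
        foreach (var source in sources)
        {
            var e = source.GetEnumerator();
            if (e.MoveNext()) queue.Enqueue(e, e.Current);
            else e.Dispose();
        }

        // Always yield from the enumerator with the smallest pending element,
        // then re-enqueue it under its next element.
        while (queue.TryDequeue(out var e, out _))
        {
            yield return e.Current;
            if (e.MoveNext()) queue.Enqueue(e, e.Current);
            else e.Dispose();
        }
    }
}

With that in place, a call like sortedChunks.SortedMerge() reads naturally even for hundreds of chunks.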

I'd be happy to PR this, or another approach?

viceroypenguin commented 5 months ago

This sounds like a great idea! I'm good with unilaterally replacing it with the PQ. The difference is negligible at N = 2 and the PQ is better at all other sizes; so even though two sources is the most common case, I'm comfortable always using the PQ. Please feel free to do a PR whenever ready.

danielearwicker commented 5 months ago

Had a look last night; PriorityQueue<T> is a .NET 6 thing, so I'll look for a standalone NuGet package that performs well in the same test.

viceroypenguin commented 5 months ago

I mean, UpdatablePriorityQueue<> is already present in the codebase...

danielearwicker commented 5 months ago

LOL, that simplifies things!

Although, having dropped it into the test code (pushed here), it performs significantly worse than the current CLR PriorityQueue, and when merging 20 or fewer sequences it's worse even than the existing linear-search approach in SuperLinq, so you wouldn't want to switch to it:

[benchmark chart: UpdatablePriorityQueue vs. CLR PriorityQueue vs. current SuperLinq]

I guess this is due to the extra stuff in Enqueue relating to _elementIndex and updatability.

Could add a separate copy of PriorityQueue with no changes?

viceroypenguin commented 5 months ago

Sure, but make it internal, so that external consumers don't accidentally rely on our version. While you're at it, check that there are no changes in PriorityQueue that need to be migrated to UpdatablePriorityQueue; I've not reviewed it since I copied it in the net6 days.

viceroypenguin commented 5 months ago

Also, do me a favor, and use BenchmarkDotNet for your benchmarking?
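For example, a minimal BenchmarkDotNet harness for this comparison might look like the following (class and method names are placeholders, and SortedMerge here stands in for whichever merge implementation is under test, e.g. the proposed IEnumerable<IEnumerable<T>> overload):

using System;
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class SortedMergeBenchmarks
{
    // Number of pre-sorted chunks to merge across.
    [Params(2, 10, 100)]
    public int ChunkCount;

    private List<List<int>> _chunks = new();

    [GlobalSetup]
    public void Setup()
    {
        // Split ~1M random ints across ChunkCount chunks and sort each chunk.
        var rng = new Random(42);
        _chunks = Enumerable.Range(0, ChunkCount)
            .Select(_ => Enumerable.Range(0, 1_000_000 / ChunkCount)
                .Select(__ => rng.Next())
                .OrderBy(x => x)
                .ToList())
            .ToList();
    }

    [Benchmark]
    public int MergeAll()
    {
        // Consume the whole merged sequence so the full cost is measured.
        var count = 0;
        foreach (var _ in _chunks.SortedMerge())
            count++;
        return count;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<SortedMergeBenchmarks>();
}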

viceroypenguin commented 5 months ago

Actually, another benchmark to run: instead of using a PQ, just use a sorted list. It's O(1) to grab the first element and O(n) to do an insertion sort, but for a small list it should be fast.

Alternatively, look at SortedSet<>, which is available on netcoreapp3.1.
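For what it's worth, the per-element step of the sorted-list idea might look roughly like this (a fragment only; sorted and comparer are placeholder names, with the list kept ordered largest-first so the smallest element can be removed from the end in O(1)):

// Assumes `sorted` is a List<IEnumerator<T>> kept ordered by each enumerator's
// Current value, largest first, so the smallest sits at the end.
var e = sorted[^1];
sorted.RemoveAt(sorted.Count - 1);

yield return e.Current;

if (e.MoveNext())
{
    // Linear insertion sort (O(n)): find the first position whose Current is
    // not larger than e's new value, and insert just before it.
    var i = 0;
    while (i < sorted.Count && comparer.Compare(sorted[i].Current, e.Current) > 0)
        i++;
    sorted.Insert(i, e);
}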

danielearwicker commented 5 months ago

Did some more work on this over the weekend, moved to BenchmarkDotNet and compared some more approaches.

SortedSet is not directly suitable because the elements have to be unique, so I had to combine it with List to hold buckets of items that share the same key, and it had an awkward API for this situation. SortedDictionary was less awkward so I changed to that, but keys still have to be unique. It is the worst performer in all cases except string keys with lots of chunks, where it beats the current SuperLinq.
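For reference, the bucket arrangement I mean is roughly this (a fragment with placeholder names, not the benchmarked code):

// SortedDictionary requires unique keys, so each key maps to a bucket (a List)
// of the enumerators currently waiting on that key.
var buckets = new SortedDictionary<TKey, List<IEnumerator<TSource>>>(comparer);

// Enqueue: append enumerator e to the bucket for its current key.
if (!buckets.TryGetValue(key, out var bucket))
    buckets[key] = bucket = new List<IEnumerator<TSource>>();
bucket.Add(e);

// Dequeue-min: the first entry holds the smallest key, but reaching it goes
// through First(), which is part of the awkwardness.
var (minKey, minBucket) = buckets.First();
var next = minBucket[^1];
minBucket.RemoveAt(minBucket.Count - 1);
if (minBucket.Count == 0) buckets.Remove(minKey);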

Using a sorted list with linear insertion sort isn't great at large chunk counts, because it basically flips the situation with the current SuperLinq: almost every iteration has to do both a remove and an insert (since a chunk's next value most likely sorts into a different position). So we make the yield/remove faster but the re-insert slower, and remain O(n) overall, so lots of chunks is still hard.

But then I got a bit creative: binary search will find the place to insert in log-time. There's a built-in Array.BinarySearch so by keeping the list of enumerators in an array I can use that (SortedArraySortedMergeExtensions). The remove/yield/insert looks like this:

// Take the enumerator with the smallest Current (always at arr[0]) and close
// the gap by shifting the remaining entries down one slot.
var e = arr[0];
Array.Copy(arr, 1, arr, 0, --count);

yield return e.Current;

if (e.MoveNext())
{
    // Binary search for where e's new Current belongs among the remaining
    // enumerators; a negative result is the complement of the insertion point.
    var index = Array.BinarySearch(arr, 0, count, e, sourceComparer);
    if (index < 0) index = ~index;
    // Shift the tail up one slot to open a hole, then drop e back in.
    if (index < count) Array.Copy(arr, index, arr, index + 1, count - index);
    arr[index] = e;
    count++;
}

Doesn't need any supporting classes, and is comparable to PriorityQueue for string keys (even faster for lots of chunks), but unfortunately is noticeably slower for int keys. I guess the above binary search must do a very good job of limiting key comparisons, so when those are the main cost, it wins.

I also tried LinkedList, which is often poorer than List because of worse locality, but is interesting because in theory it avoids copying chunks of data around in the array, so removes/inserts should be faster. I also combined that idea with binary search, and the combination was not as good as binary search on arrays, so not worth pursuing.

So across the results for int and string, PriorityQueue still seems like the best all-rounder. Can't easily switch implementations automatically (e.g. PQ vs. binary search on arrays), because the best choice depends on the cost of the key comparisons rather than any known scale factor.

viceroypenguin commented 5 months ago

After running some more testing and modifying some of your benchmark code, I agree with your conclusion. Let's use PriorityQueue on net6+. You can gate that with #if NET6_0_OR_GREATER and add an #else that uses the sorted-array algorithm below:

try
{
    while (count > 1)
    {
        // arr[0..count) holds live enumerators sorted by their Current values,
        // so the smallest pending element is always at arr[0].
        var e = arr[0];
        yield return e.Current;

        if (!e.MoveNext())
        {
            // This source is exhausted: dispose it and shift the rest down.
            e.Dispose();
            count--;
            Array.Copy(arr, 1, arr, 0, count);
            continue;
        }

        // Find where e's new Current belongs among the remaining enumerators
        // arr[1..count); a negative result is the complement of the insertion point.
        var index = Array.BinarySearch(arr, 1, count - 1, e, sourceComparer);
        if (index < 0) index = ~index;

        index--;

        // Shift only the entries between the front and the insertion point down
        // one slot, reusing the hole left at arr[0], then drop e into place.
        if (index > 0)
        {
            Array.Copy(arr, 1, arr, 0, index);
        }

        arr[index] = e;
    }

    // One source remains; its Current is still pending from its last MoveNext,
    // so yield it before draining the rest.
    if (count == 1)
    {
        yield return arr[0].Current;
        while (arr[0].MoveNext())
            yield return arr[0].Current;
    }
}
finally
{
    foreach (var e in arr)
    {
        e.Dispose();
    }
}

This alteration to your code reduces the number of copies that need to be done on each loop iteration.

danielearwicker commented 5 months ago

Yeah, that's nifty: since we're always removing the first element, we only need a single move of the upper section to make a hole for the re-insert, and if we're re-inserting at the top, no move is needed at all. Will do a PR asap.