Closed danielearwicker closed 5 months ago
This sounds like a great idea! I'm good with unilaterally replacing it with PQ. The difference is negligible at 2, and better at all others; so even though 2 is most common, I'm comfortable using the PQ always. Please feel free to do a PR whenever ready.
Had a look last night, PriorityQueue<T>
being a dotnet 6 thing, so I'll look for a standalone nuget package that performs well in the same test.
I mean, UpdatablePriorityQueue<>
is already present in the codebase...
LOL, that simplifies things!
Although I dropped it into the test code (have pushed here) and it performs significantly worse than current CLR PriorityQueue
, worse than the existing linear search approach in SuperLinq when merging 20 or fewer sequences, so you wouldn't want to switch to it:
I guess this is due to the extra stuff in Enqueue
relating to _elementIndex
and updatability.
Could add a separate copy of PriorityQueue
with no changes?
Sure, but make it internal
, so that external consumers don't rely on our version accidentally. While you're at it, check that there are no changes in PriorityQueue
that need to be migrated to UpdatablePriorityQueue
- I've not reviewed it since I copied it in net6 days.
Also, do me a favor, and use BenchmarkDotNet
for your benchmarking?
Actually, another benchmark to run: Instead of using a PQ, just use a sorted list. It's a O(1) to grab the first element, and O(n) to do an insertion sort, but for a small list, it should be fast.
Alternatively, look at SortedSet<>
, which is available on netcoreapp3.1
Did some more work on this over the weekend, moved to BenchmarkDotNet
and compared some more approaches.
SortedSet
is not directly suitable because the elements have to be unique, so I had to combine with List
to hold buckets of items with the same key. It had an awkward API for this situation. SortedDictionary
was less awkward so I changed to that, but keys still have to be unique. It is the worst performing in all cases except string keys with lots of chunks where it beats current SuperLinq.
Using a sorted list with linear insertion sort isn't great at large chunk numbers, because it basically flips the situation with current SuperLinq: almost all iterations have to do both remove and an insert (because a chunk's next value most likely sorts into a different position). So we make yield/remove faster but re-insert slower, and remain at O(n) overall, so lots of chunks is still hard.
But then I got a bit creative: binary search will find the place to insert in log-time. There's a built-in Array.BinarySearch
so by keeping the list of enumerators in an array I can use that (SortedArraySortedMergeExtensions
). The remove/yield/insert looks like this:
var e = arr[0];
Array.Copy(arr, 1, arr, 0, --count);
yield return e.Current;
if (e.MoveNext())
{
var index = Array.BinarySearch(arr, 0, count, e, sourceComparer);
if (index < 0) index = ~index;
if (index < count) Array.Copy(arr, index, arr, index + 1, count - index);
arr[index] = e;
count++;
}
Doesn't need any supporting classes, and is comparable to PriorityQueue
for string keys (even faster for lots of chunks), but unfortunately is noticeably slower for int keys. I guess the above binary search must do a very good job of limiting key comparisons, so when those are the main cost, it wins.
I also tried using LinkedList
, which is often poorer than List
because of worse locality, but is interesting because it in theory avoids copying chunks of data around in array, so faster removes/inserts? I also combined that idea with binary search and that combination was not as good as binary search on arrays, so not worth pursuing.
So across the results for int and string, PriorityQueue
still seems like the best all-rounder. Can't easily switch implementations automatically (e.g. PQ vs. binary search on arrays), because the best choice depends on the cost of the key comparisons rather than any known scale factor.
After running some more testing and modifying some of your benchmark code, I agree with your conclusion. Let's use PriorityQueue
on net6+. you can gate that with #if NET6_0_OR_GREATER
. do an #else
and use the sortedarray algorithm below:
try
{
while (count > 1)
{
var e = arr[0];
yield return e.Current;
if (!e.MoveNext())
{
e.Dispose();
count--;
Array.Copy(arr, 1, arr, 0, count);
continue;
}
var index = Array.BinarySearch(arr, 1, count - 1, e, sourceComparer);
if (index < 0) index = ~index;
index--;
if (index > 0)
{
Array.Copy(arr, 1, arr, 0, index);
}
arr[index] = e;
}
while (arr[0].MoveNext())
yield return arr[0].Current;
}
finally
{
foreach (var e in arr)
{
e.Dispose();
}
}
This alteration to your code will reduce the amount of copies that need to be done each loop.
Yeah that's nifty, as we're always removing the first, just do a single move of the upper section to make a hole for the re-insert, and if re-inserting at the top then no move needed. Will do PR asap.
I want to do typical external merge sort over a large-ish number of sorted buckets, and
SortedMerge
gets me the fully sorted results after I presort chunks. For large data sets I have enough memory to do this across a few hundred sorted chunks.Problem is that
SortedMerge
gets slower linearly with the number of enumerables it merges across, because it does a linear search through the remaining enumerators, performing key comparisons with half of them on average.An alternative is to use the CLR's
PriorityQueue
to store the enumerators, which is very fast at returning the item with the lowest key, so the key here is just the comparison key obtained the from enumerators'Current
property.Implementation and test results (done on a release build) here:
https://github.com/danielearwicker/PriorityQueueMergeSort
Rough idea:
Time to sort
int
s split randomly between N sorted collections, N = 2 to 30:The current implementation is $O(N)$, the PQ version is $O(\log N)$. At N = 100, PQ is over 8x faster. Sorting
string
s it's 6x faster.But:
PriorityQueue
makes no guarantees about the ordering of equal keys based on how they were inserted. So the sort may behave differently where keys are equal and that might theoretically affect existing apps.left.SortedMerge(right)
pattern is natural. If you have 100,sources.First().SortedMerge(sources.Skip(1))
is awkward.So given 1 and 2 it seems safer to let the user choose the implementation. Seems like one natural way to do this would be to solve 3 at the same time and expose it as an extension method on
IEnumerable<IEnumerable<T>>
, so if you have a sequence of sorted sequences (as you would if there were lots of them), you can directly callSortedMerge
on it, and the implementation will be suitable for many sequences.I'd be happy to PR this, or another approach?