slices: use dual-pivot quicksort instead of pdqsort

PeterRK commented 1 year ago

Reinplement #52789 based on go 1.21rc2. Here is new benchmark result.

Xeon-8374C

name                        old time/op  new time/op  delta
SlicesSortInts-4            7.67ms ± 0%  6.23ms ± 0%  -18.82%  (p=0.000 n=10+10)
SlicesSortInts_Sorted-4      113µs ± 1%    88µs ± 2%  -22.31%  (p=0.000 n=10+10)
SlicesSortInts_Reversed-4    148µs ± 0%   126µs ± 1%  -15.04%  (p=0.000 n=10+9)
SlicesSortStrings-4         28.0ms ± 1%  27.3ms ± 4%   -2.35%  (p=0.022 n=9+10)
SlicesSortStrings_Sorted-4   628µs ± 1%   567µs ± 0%   -9.80%  (p=0.000 n=10+9)
SortFuncStructs-4           14.1ms ± 2%  13.4ms ± 2%   -5.53%  (p=0.000 n=10+10)

EPYC-7K83

name                        old time/op  new time/op  delta
SlicesSortInts-4            7.10ms ± 0%  5.69ms ± 0%  -19.82%  (p=0.000 n=9+10)
SlicesSortInts_Sorted-4     96.7µs ± 1%  90.8µs ± 1%   -6.06%  (p=0.000 n=10+10)
SlicesSortInts_Reversed-4    133µs ± 2%   113µs ± 2%  -15.20%  (p=0.000 n=10+10)
SlicesSortStrings-4         26.9ms ± 4%  26.0ms ± 2%   -3.36%  (p=0.000 n=10+9)
SlicesSortStrings_Sorted-4   658µs ± 1%   644µs ± 0%   -2.20%  (p=0.000 n=10+8)
SortFuncStructs-4           13.5ms ± 8%  12.9ms ± 2%   -4.42%  (p=0.002 n=10+8)

Dual-pivot quicksort usually has better performance than current pdqsort. It can be even faster with api change.

Xeon-8374C (X86-64)

name               std time/op  new time/op  delta
Int/Small-1K       27.7µs ± 2%  24.1µs ± 3%  -12.92%  (p=0.008 n=5+5)
Int/Small-10K       315µs ± 0%   249µs ± 0%  -20.96%  (p=0.008 n=5+5)
Int/Small-100K     3.76ms ± 0%  2.86ms ± 0%  -24.06%  (p=0.008 n=5+5)
Int/Small-1M       44.6ms ± 0%  35.4ms ± 0%  -20.57%  (p=0.008 n=5+5)
Int/Random-1K      48.4µs ± 1%  38.8µs ± 1%  -19.69%  (p=0.008 n=5+5)
Int/Random-10K      614µs ± 1%   464µs ± 0%  -24.44%  (p=0.008 n=5+5)
Int/Random-100K    7.58ms ± 0%  5.61ms ± 0%  -26.07%  (p=0.008 n=5+5)
Int/Random-1M      90.3ms ± 1%  65.8ms ± 0%  -27.08%  (p=0.008 n=5+5)
Int/Constant-1K    1.33µs ± 1%  0.93µs ± 1%  -30.13%  (p=0.008 n=5+5)
Int/Constant-10K   11.1µs ± 4%   8.0µs ± 2%  -27.36%  (p=0.008 n=5+5)
Int/Constant-100K  98.0µs ± 5%  67.0µs ± 2%  -31.61%  (p=0.008 n=5+5)
Int/Constant-1M     932µs ± 1%   646µs ± 0%  -30.72%  (p=0.008 n=5+5)
Int/Ascent-1K      1.33µs ± 1%  0.93µs ± 1%  -30.03%  (p=0.008 n=5+5)
Int/Ascent-10K     10.8µs ± 3%   8.0µs ± 3%  -26.21%  (p=0.008 n=5+5)
Int/Ascent-100K    99.1µs ± 5%  66.2µs ± 1%  -33.20%  (p=0.008 n=5+5)
Int/Ascent-1M       926µs ± 0%   646µs ± 0%  -30.19%  (p=0.008 n=5+5)
Int/Descent-1K     1.85µs ± 1%  1.46µs ± 0%  -21.19%  (p=0.008 n=5+5)
Int/Descent-10K    14.8µs ± 3%  12.0µs ± 6%  -18.43%  (p=0.008 n=5+5)
Int/Descent-100K    132µs ± 1%   104µs ± 2%  -21.46%  (p=0.008 n=5+5)
Int/Descent-1M     1.32ms ± 0%  1.04ms ± 1%  -21.37%  (p=0.008 n=5+5)
Int/Mixed-1K       19.9µs ± 3%  17.1µs ± 3%  -13.87%  (p=0.008 n=5+5)
Int/Mixed-10K       219µs ± 1%   231µs ± 0%   +5.53%  (p=0.008 n=5+5)
Int/Mixed-100K     2.54ms ± 0%  2.55ms ± 0%     ~     (p=0.310 n=5+5)
Int/Mixed-1M       29.5ms ± 0%  28.0ms ± 0%   -5.11%  (p=0.008 n=5+5)
Hybrid/5%          4.09ms ± 0%  3.06ms ± 0%  -25.07%  (p=0.008 n=5+5)
Hybrid/10%         7.12ms ± 0%  5.33ms ± 0%  -25.04%  (p=0.008 n=5+5)
Hybrid/20%         13.1ms ± 0%   9.9ms ± 0%  -24.73%  (p=0.008 n=5+5)
Hybrid/30%         19.2ms ± 1%  14.4ms ± 0%  -25.00%  (p=0.008 n=5+5)
Hybrid/50%         31.2ms ± 1%  23.5ms ± 0%  -24.70%  (p=0.008 n=5+5)
Float/1K           58.2µs ± 2%  48.6µs ± 1%  -16.58%  (p=0.008 n=5+5)
Float/10K           751µs ± 0%   549µs ± 0%  -26.80%  (p=0.008 n=5+5)
Float/100K         9.28ms ± 0%  6.41ms ± 0%  -30.91%  (p=0.008 n=5+5)
Float/1M            110ms ± 0%    73ms ± 0%  -33.44%  (p=0.008 n=5+5)
Str/1K              130µs ± 0%   125µs ± 0%   -4.00%  (p=0.008 n=5+5)
Str/10K            1.69ms ± 0%  1.64ms ± 0%   -2.73%  (p=0.008 n=5+5)
Str/100K           21.6ms ± 1%  21.1ms ± 1%   -2.41%  (p=0.008 n=5+5)
Str/1M              272ms ± 1%   269ms ± 2%     ~     (p=0.095 n=5+5)
Struct/1K           104µs ± 1%    86µs ± 0%  -17.34%  (p=0.008 n=5+5)
Struct/10K         1.36ms ± 0%  1.07ms ± 0%  -20.93%  (p=0.008 n=5+5)
Struct/100K        16.8ms ± 0%  13.0ms ± 0%  -22.48%  (p=0.008 n=5+5)
Struct/1M           201ms ± 0%   152ms ± 0%  -24.21%  (p=0.008 n=5+5)
Stable/1K           183µs ± 1%    84µs ± 0%  -53.88%  (p=0.008 n=5+5)
Stable/10K         2.71ms ± 0%  1.14ms ± 0%  -58.16%  (p=0.008 n=5+5)
Stable/100K        37.9ms ± 0%  17.4ms ± 0%  -54.15%  (p=0.016 n=5+4)
Stable/1M           498ms ± 1%   175ms ± 1%  -64.74%  (p=0.008 n=5+5)
Pointer/1K         78.7µs ± 1%  66.1µs ± 1%  -15.96%  (p=0.008 n=5+5)
Pointer/10K        1.05ms ± 0%  0.90ms ± 0%  -14.87%  (p=0.008 n=5+5)
Pointer/100K       14.0ms ± 0%  12.1ms ± 0%  -13.52%  (p=0.008 n=5+5)
Pointer/1M          171ms ± 0%   159ms ± 0%   -6.82%  (p=0.008 n=5+5)

Yitian-710 (ARM64)

name               std time/op  new time/op  delta
Int/Small-1K       21.4µs ± 0%  17.4µs ± 0%  -18.95%  (p=0.008 n=5+5)
Int/Small-10K       253µs ± 0%   207µs ± 0%  -18.18%  (p=0.008 n=5+5)
Int/Small-100K     3.02ms ± 0%  2.47ms ± 0%  -18.42%  (p=0.008 n=5+5)
Int/Small-1M       35.7ms ± 0%  28.8ms ± 0%  -19.54%  (p=0.008 n=5+5)
Int/Random-1K      37.7µs ± 0%  27.6µs ± 0%  -26.84%  (p=0.008 n=5+5)
Int/Random-10K      492µs ± 0%   362µs ± 0%  -26.46%  (p=0.008 n=5+5)
Int/Random-100K    6.09ms ± 0%  4.51ms ± 0%  -25.96%  (p=0.008 n=5+5)
Int/Random-1M      72.5ms ± 0%  53.9ms ± 0%  -25.69%  (p=0.008 n=5+5)
Int/Constant-1K    1.08µs ± 1%  0.70µs ± 0%  -35.23%  (p=0.016 n=5+4)
Int/Constant-10K   9.60µs ± 0%  6.55µs ± 0%  -31.77%  (p=0.008 n=5+5)
Int/Constant-100K  93.1µs ± 0%  63.5µs ± 1%  -31.78%  (p=0.008 n=5+5)
Int/Constant-1M     929µs ± 0%   640µs ± 1%  -31.09%  (p=0.008 n=5+5)
Int/Ascent-1K      1.08µs ± 1%  0.71µs ± 4%  -34.36%  (p=0.008 n=5+5)
Int/Ascent-10K     9.61µs ± 0%  6.57µs ± 0%  -31.57%  (p=0.008 n=5+5)
Int/Ascent-100K    93.1µs ± 0%  63.1µs ± 0%  -32.20%  (p=0.008 n=5+5)
Int/Ascent-1M       929µs ± 0%   639µs ± 1%  -31.27%  (p=0.008 n=5+5)
Int/Descent-1K     1.45µs ± 0%  1.07µs ± 1%  -26.66%  (p=0.008 n=5+5)
Int/Descent-10K    13.2µs ± 0%   9.9µs ± 0%  -25.07%  (p=0.016 n=5+4)
Int/Descent-100K    130µs ± 0%   100µs ± 0%  -23.09%  (p=0.008 n=5+5)
Int/Descent-1M     1.30ms ± 0%  1.02ms ± 1%  -22.02%  (p=0.008 n=5+5)
Int/Mixed-1K       16.3µs ± 4%  12.1µs ± 0%  -25.65%  (p=0.008 n=5+5)
Int/Mixed-10K       187µs ± 0%   150µs ± 0%  -19.82%  (p=0.008 n=5+5)
Int/Mixed-100K     2.20ms ± 0%  1.77ms ± 0%  -19.24%  (p=0.008 n=5+5)
Int/Mixed-1M       25.3ms ± 0%  20.5ms ± 0%  -19.21%  (p=0.008 n=5+5)
Hybrid/5%          3.50ms ± 0%  2.59ms ± 0%  -26.14%  (p=0.008 n=5+5)
Hybrid/10%         5.92ms ± 0%  4.36ms ± 0%  -26.33%  (p=0.008 n=5+5)
Hybrid/20%         10.8ms ± 0%   7.9ms ± 0%  -26.48%  (p=0.008 n=5+5)
Hybrid/30%         15.6ms ± 0%  11.5ms ± 0%  -26.43%  (p=0.008 n=5+5)
Hybrid/50%         25.3ms ± 0%  18.6ms ± 0%  -26.49%  (p=0.008 n=5+5)
Float/1K           46.1µs ± 0%  35.6µs ± 0%  -22.75%  (p=0.008 n=5+5)
Float/10K           606µs ± 0%   467µs ± 0%  -22.85%  (p=0.008 n=5+5)
Float/100K         7.51ms ± 0%  5.79ms ± 0%  -22.83%  (p=0.008 n=5+5)
Float/1M           89.5ms ± 0%  69.0ms ± 0%  -22.84%  (p=0.008 n=5+5)
Str/1K              124µs ± 0%   119µs ± 0%   -4.40%  (p=0.008 n=5+5)
Str/10K            1.54ms ± 0%  1.48ms ± 0%   -3.70%  (p=0.008 n=5+5)
Str/100K           20.1ms ± 1%  19.6ms ± 0%   -2.37%  (p=0.008 n=5+5)
Str/1M              290ms ± 1%   286ms ± 1%   -1.10%  (p=0.032 n=5+5)
Struct/1K           100µs ± 0%    85µs ± 0%  -14.55%  (p=0.008 n=5+5)
Struct/10K         1.32ms ± 0%  1.07ms ± 0%  -19.19%  (p=0.008 n=5+5)
Struct/100K        16.4ms ± 0%  12.7ms ± 0%  -22.38%  (p=0.008 n=5+5)
Struct/1M           196ms ± 0%   141ms ± 2%  -28.06%  (p=0.008 n=5+5)
Stable/1K           186µs ± 0%    73µs ± 0%  -60.49%  (p=0.008 n=5+5)
Stable/10K         2.84ms ± 0%  1.22ms ± 0%  -57.16%  (p=0.008 n=5+5)
Stable/100K        40.5ms ± 0%  14.7ms ± 0%  -63.75%  (p=0.008 n=5+5)
Stable/1M           534ms ± 0%   163ms ± 0%  -69.43%  (p=0.016 n=5+4)
Pointer/1K         63.7µs ± 0%  55.3µs ± 0%  -13.18%  (p=0.008 n=5+5)
Pointer/10K         873µs ± 0%   775µs ± 1%  -11.23%  (p=0.008 n=5+5)
Pointer/100K       12.4ms ± 0%  11.3ms ± 0%   -9.58%  (p=0.008 n=5+5)
Pointer/1M          159ms ± 0%   147ms ± 0%   -7.59%  (p=0.008 n=5+5)

By the way, this new implement is also shorter and simpler.

ianlancetaylor commented 1 year ago

CC @eliben

rsc commented 1 year ago

I think we already used this decade's "change to a new sort algorithm" coupon.

eliben commented 1 year ago

cc @zhangyunhao116

PeterRK commented 1 year ago

I think we already used this decade's "change to a new sort algorithm" coupon.

Dual-pivot quicksort is not a new sort algorithm. It can be found in Java standard library many years ago. New algorithm like pdqsort is not always best.

rsc commented 1 year ago

My point is we just put the ecosystem through the churn of a sort change in Go 1.19 (2022). I don't see a strong argument that slices.Sort is somehow fundamentally different from sort.Slice and merits a different algorithm, nor do I see a strong argument for invalidating our decision just last year to switch to pdqsort. We could spend lots of time flip-flopping between different sort algorithms in the standard library, or we could do other work. Having just changed the sort algorithm last year, I think we owe it to ourselves and our users not to change it again for at least five years (that is, Go 1.29 at the earliest) unless we have very good reasons.

We should definitely not make a change for Go 1.21 - it is too late in the cycle.

PeterRK commented 1 year ago

My point is we just put the ecosystem through the churn of a sort change in Go 1.19 (2022). I don't see a strong argument that slices.Sort is somehow fundamentally different from sort.Slice and merits a different algorithm, nor do I see a strong argument for invalidating our decision just last year to switch to pdqsort. We could spend lots of time flip-flopping between different sort algorithms in the standard library, or we could do other work. Having just changed the sort algorithm last year, I think we owe it to ourselves and our users not to change it again for at least five years (that is, Go 1.29 at the earliest) unless we have very good reasons.

We should definitely not make a change for Go 1.21 - it is too late in the cycle.

I agree that it’s not a good idea to make changes frequently. People now can use different sort algorithms with third-party libraries. Just let it fly.

eliben commented 1 year ago

To add to @rsc 's point, we're also putting the community through a "sorting change" right now, in the upcoming Go 1.21, where the slices package with its sorting functionality is entirely new. It provides better performance than the sort package in many cases, so we're going to recommend Go users to shift to it, if possible. This is a fairly major change.

pdqsort was a large performance improvement for some cases. The slices.Sort (vs. sort) is also a significant speedup. Compared to these, the quoted improvements with dual-pivot quicksort are relatively modest (except sorting a slice of ints). So I definitely agree that we should give the community time to settle with the recent changes, and if people require the absolutely fastest sort of an int slice, they can use a 3rd party package.

PeterRK commented 1 year ago

To add to @rsc 's point, we're also putting the community through a "sorting change" right now, in the upcoming Go 1.21, where the slices package with its sorting functionality is entirely new. It provides better performance than the sort package in many cases, so we're going to recommend Go users to shift to it, if possible. This is a fairly major change.

pdqsort was a large performance improvement for some cases. The slices.Sort (vs. sort) is also a significant speedup. Compared to these, the quoted improvements with dual-pivot quicksort are relatively modest (except sorting a slice of ints). So I definitely agree that we should give the community time to settle with the recent changes, and if people require the absolutely fastest sort of an int slice, they can use a 3rd party package.

Patern-defeating techique just makes fastest cases faster. It helps little in real hybrid workload. New apis are needed to achieve significant speedup. But I don't think it's worthy to change standard library apis for speed now.

PeterRK commented 1 year ago

I didn't really care which sort algorithm is used in go standard library until someone claimed pdqsort is fastest algorithm in Go. That's not the truth, but got credit from go tream. @eliben @rsc

zhangyunhao116 commented 1 year ago

@PeterRK

I'm the author of this article, and I don't think I claimed that "PDQSORT is the fastest sorting algorithm in Go".

We can see the summary of this article(translated from Chinese directly):

At present, most of the unstable sorting algorithms used in the industry have basically changed from a single sorting algorithm in textbooks to a hybrid sorting algorithm to deal with various sequences in practical scenarios.

Relying on its performance advantages in common scenarios compared with previous algorithms, pdqsort has gradually become the mainstream implementation of unstable sorting algorithms. Based on the generics brought by Go1.18, the implementation of sorting algorithms is greatly simplified, and it also gives us the possibility to implement new algorithms. But pdqsort is not a panacea. In some cases, other algorithms still maintain advantages (for example, timsort of the Python standard library is better than pdqsort in mixed ascending && descending scenarios). However, in most cases, pdqsort has become a better choice for unstable algorithms depending on its specific optimization for different situations.

The title of this article is "Building the fastest sorting algorithm"(it is ongoing), PDQSORT is just an important milestone on the road to building the fastest sorting algorithm. There are still many other languages or libraries that use this algorithm, and I still haven't found any algorithm that is faster than PDQSORT in all cases for now. PDQSORT is still one of the fastest hybrid sorting algorithms in some cases.

In addition, I am not the author of the PDQSORT algorithm itself. This implementation is first used internally at my work and then contributed to the community.

This article is more to let everyone understand the principles of the algorithm (in fact, this article is part of a course for college students within the company), and to attract more people to participate in community building, if you have a better idea for sorting algorithms, feel free.

PeterRK commented 1 year ago

Pattern-defeating technique can be found in timsort. Branch-eliminating technique comes from blockquicksort. Cache-exploiting technique comes from dual-pivot quicksort. Pdqsort trys to combine those techniques, but fail to handle the conflict between branch-eliminating technique and cache-exploiting technique. The author of pdqsort explains in paper why he choose branch-eliminating technique after trade-off. Cache-exploiting technique should be used in where branch-eliminating technique doesn’t work. Implement in go standard library distorts the idea of pdqsort and exaggerate its power.

Most worryingly, go team seems to support this mistake.

zhangyunhao116 commented 1 year ago

In the process of implementation, we often need some engineering compromises, which means that the optimization of the paper wouldn't always work. For example, the implementation in the standard library is about 10%~ 20% slower than the original implementation I used. This is because it needs to support both interface-sorting and integer-sorting, and these algorithms need to be generated using the same file.

If the algorithm only needs to support integer sorting or doesn't need to be generated via the same file, many other optimizations can be used, but it doesn't mean these engineering compromises are a mistake.

golang / go

slices: use dual-pivot quicksort instead of pdqsort #61027