orlp / glidesort

A Rust implementation of Glidesort, my stable adaptive quicksort/mergesort hybrid sorting algorithm.

Median candidate analysis #10


mlochbaum commented 1 year ago

I've had this half-finished for a while; I'm writing up what I can even though there's more to investigate. CC @scandum, @Voultapher, as I believe ipnsort has picked up glidesort's candidate approach.

I looked into how the choice of pivot candidates affects median accuracy using a simulation on random piecewise-linear inputs. These seem like they capture one type of order that might be expected in an input, and I don't have any other promising ideas for testing that.

I found that the accuracy depends not only on the way candidates are grouped into sets of 3 (what I initially wanted to test) but also on the positions of those candidates (unexpected, but obvious in retrospect). So I measured both, for an even distribution and for Glidesort's recursive selection, using the midpoints of the nine intervals that come out of the recursion. I copied the arithmetic for this by hand; it looks right, but it's possible I made a mistake here. I've also tested various positions not shown here. Error is measured as cost, assuming sorting each of the two partitions takes time proportional to n log(n). For context, the baseline cost is 8965784, so a relatively high cost difference of 30000 adds 0.33% to total sorting cost due to that step (and it compounds at every partition). The tables below show cost for the best and worst groupings after testing every possibility, as well as some others.
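The cost metric can be sanity-checked with a quick sketch (Python rather than BQN; `split_cost` is my own name, and this reconstructs only the per-input scoring, not the full simulation):

```python
import math

def split_cost(k, n=1000):
    """n log n cost of sorting both partitions after a pivot of rank k."""
    f = lambda m: m * math.log2(m) if m > 0 else 0.0
    return f(k) + f(n - k)

# A perfect median splits 1000 elements into 500/500; summed over the
# 1000 random inputs per piece count, this gives the quoted baseline.
print(round(1000 * split_cost(500)))  # → 8965784
```

A cost difference of 30000 against this baseline is 30000 / 8965784 ≈ 0.33%, matching the figure above.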

The arrangement 011201220 that I've used and recommended previously does badly, often worse than 000111222.

Beyond any particular choice of grouping, it doesn't seem like Glidesort's positions do well: the "true of 9" row is the true median of those nine candidates, and it's worse than most pseudomedians with even spacing. The middle offset of 4 makes Glidesort's positions skewed and prevents perfect performance on 1-piece (totally linear) inputs, but changing it to 3.5 isn't much of an improvement (in fact the true median is worse for 2-piece inputs). So I think it's mainly the clustering that weakens it, although I can't say exactly why the effect is so strong. Something that seems a little better is to use 3 as the midpoint offset for one or two intervals and 4 for the others.

I also did some theoretical analysis on the median-of-median-of-... idea. I find that 3^k candidates processed with recursive medians have about as much power as a true median of (π/2)*2.25^k pivots. For example the pseudomedian of 243 for a 32768-element list is worth a true median of 91. The probability distributions for the median value found this way with random candidates are equal at the exact midpoint, and the shapes seem similar (if anything the pseudomedian is wider). I can write the math up if you're interested.
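A sketch of a midpoint-density argument that reproduces these numbers (my own reconstruction, so treat it as an assumption until the write-up): each median-of-3 level multiplies the probability density at the true median by 6·F(½)·(1−F(½)) = 3/2, and a true median of m uniform samples has midpoint density about √(2m/π), so equating 1.5^k = √(2m/π) gives m = (π/2)·2.25^k.

```python
import math

def equivalent_true_median(k):
    """True-median sample size m with the same midpoint density as a
    pseudomedian of 3**k candidates: solve 1.5**k == sqrt(2*m/pi)."""
    return (math.pi / 2) * 2.25**k

print(round(equivalent_true_median(5)))  # 3^5 = 243 candidates → 91
```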

Even spacing (0.5/9, ..., 8.5/9); columns are costs on 1-, 2-, 3-, and 7-piece inputs:

| Partition | Diagram | 1 | 2 | 3 | 7 |
|-----------|---------|--:|--:|--:|--:|
| True of 3 | ~147 | 0 | 13413 | 25575 | 77986 |
| True of 9 | 012345678 | 0 | 1557 | 2940 | 8960 |
| 012102120 | 048/136/257 | 0 | 1557 | 3174 | 16728 |
| 012012012 | 036/147/258 | 0 | 1557 | 4276 | 18077 |
| 000111222 | 012/345/678 | 0 | 13413 | 20636 | 28802 |
| 001112022 | 016/234/578 | 35848 | 31942 | 29785 | 30945 |

Glidesort arrangement (approx 0.008, 0.07, 0.117, 0.508, 0.57, 0.617, 0.883, 0.945, 0.992):

| Partition | Diagram | 1 | 2 | 3 | 7 |
|-----------|---------|--:|--:|--:|--:|
| True of 3 | ~147 | 0 | 13413 | 25575 | 77986 |
| True of 9 | 012345678 | 14184 | 69439 | 77943 | 74393 |
| 001122102 | 017/236/458 | 184 | 26029 | 43807 | 74776 |
| 012102120 | 048/136/257 | 14184 | 69439 | 79653 | 81007 |
| 012012012 | 036/147/258 | 14184 | 69439 | 84625 | 85439 |
| 000111222 | 012/345/678 | 14184 | 79120 | 97432 | 98255 |
| 011120220 | 058/123/467 | 39866 | 141889 | 139266 | 99083 |
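For reference, the fractional positions in the heading above fall out of the offsets in the `glide` definition in the BQN source (outer and inner offsets 0, 4, 7 in eighths, plus 1/16 of an eighth for the midpoint of each 1/64-wide block); a quick check, assuming I've read the BQN right:

```python
# Outer/inner offsets 0, 4, 7 in eighths; each candidate sits at the
# midpoint of its 1/64-wide block, hence the extra (1/16)/8 = 1/128.
t = [o / 8 for o in (0, 4, 7)]
positions = sorted(a + (b + 1 / 16) / 8 for a in t for b in t)
print([round(p, 3) for p in positions])
# → [0.008, 0.07, 0.117, 0.508, 0.57, 0.617, 0.883, 0.945, 0.992]
```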

Source for this below; run it with CBQN. I can translate to some other language on request. The inputs are created with `•rand`, which changes between runs, but the results above don't change significantly.

```bqn
# All partitions of 9 candidates into 3+3+3
part ← (⊐⊸≡∧(3⥊3)≡/⁼)¨⊸/⥊↕3⌊1+↕9
pfmt ← > ('0'⊸+ ⋈ ·∾⟜"/"⊸∾´ '0'+⊔)¨ part  # Display
# Hard-coded partitions of interest
pint ← "000111222"‿"012012012"‿"012102120" #‿"011201220"

# Make random piecewise-linear functions
GetFn ← {
  R ← •rand.Range
  p←0∾1∾˜∧(𝕩-2)R 0 ⋄ q←𝕩 R 0  # x endpoints, y endpoints
  m←q÷○(«⊸-)p ⋄ y←q-m×p       # Slope, y intercept
  {(⊏⟜y+𝕩×⊏⟜m)(1↓p)⍋𝕩}
}
dist ← GetFn¨ 1e3/≍2‿3‿4‿8    # 1e3 sets with each of 2, 3, 4, 8 vertices
Sample ← dist {𝕎𝕩}¨ <         # Sample values given positions

# Candidate sampling
Mid ← {(0.5+↕𝕩)÷𝕩}                  # Midpoints of equal-sized intervals, 𝕩 total
glide ← {t←0‿4‿7÷8 ⋄ ⥊t+⌜(t+÷16)÷8} # Midpoints of glidesort candidate intervals

# Scoring: cost increase relative to true median on 1e3 elements
list ← Sample Mid l←1e3
ScoreAll ← {+˝˘ (2×{𝕩×2⋆⁼𝕩}l÷2) -˜ +○{𝕩×2⋆⁼𝕩}⟜(l⊸-) list +´∘≤¨⎉∞‿¯1 𝕩}
MakeTable ← {
  cand ← Sample 𝕩
  med ← part (1⊑∧){𝔽𝔽¨}∘⊔⌜ cand                # Pseudomedians
  score ← +˝˘ scoremat ← ScoreAll med
  t39 ← > (⌊2÷˜3‿9) ⊑⟜∧¨¨ ⟨Sample Mid 3, cand⟩ # True medians
  ∾⟨
    ["True of 3"‿"147","True of 9"‿('0'+↕9)] ∾˘ ⌊ ScoreAll t39
    ((⌽⊸∨0=↕∘≠)∨pint∊˜⊏˘)⊸/ score ⍋⊸⊏ (⌊scoremat) ∾˘˜ pfmt
  ⟩
}
•Show∘MakeTable¨ ⟨Mid 9, glide⟩
```
orlp commented 1 year ago

One thing I'd like to note is that the 'piecewise-linear' case in glidesort is taken care of by run analysis and merging, and thus is not a deciding factor for the quicksort pivot selection. Instead I wanted to minimize the number of cache lines that the recursive method touches. That's why I don't do even spacing.

It is an interesting idea to change up the grouping though. I would venture to guess that 012 120 201 would perform the best here, which disperses one lo/mid/hi position across each group.

mlochbaum commented 1 year ago

The script tests all combinations, so if 012 120 201 were the absolute best, it'd be in the table. I see that it's just a touch better than 000 111 222: equal for 1- and 2-piece inputs and better on 3- and 7-piece ones, with either set of positions.
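For the record, "all combinations" here means every way to split 9 candidates into three unordered groups of 3, i.e. 9!/(3!³·3!) = 280 (what the BQN `part` enumerates). A sketch of the count in Python, with `triple_partitions` being my own name:

```python
from itertools import combinations

def triple_partitions(items):
    """Yield every split of `items` into unordered groups of 3."""
    items = list(items)
    if not items:
        yield ()
        return
    # Pinning the first element to a group avoids counting group orderings.
    first, rest = items[0], items[1:]
    for pair in combinations(rest, 2):
        remaining = [x for x in rest if x not in pair]
        for tail in triple_partitions(remaining):
            yield ((first,) + pair,) + tail

print(sum(1 for _ in triple_partitions(range(9))))  # → 280
```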

A cache line is 64 bytes, so 16 elements in the 4-byte case, and the smallest gap is between 1 and 8 elements, so groups of 3 will usually fit in one line and adjacent groups might share one? That's better than I realized; I hadn't worked through all the `* 8`s and `/ 8`s to figure out where it stopped. Still, once you get above that size, the only reason to stick with the clumping is code simplicity, or is there something else? Note that a median of 3 with even spacing tests better than any of the clumpy 3-of-3 medians on ≤3-piece inputs, so more samples don't always help even if you can get them from the same cache line.
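To make the cache-line count concrete, a small sketch under the assumptions above (4-byte elements, 64-byte lines, array starting on a line boundary; the position fractions are my reading of the code):

```python
def lines_touched(fracs, n, per_line=16):
    """Distinct cache lines hit by loads at the given fractional positions,
    with `per_line` elements per line (16 for 4-byte elements, 64-byte lines)."""
    return len({int(f * n) // per_line for f in fracs})

even = [(i + 0.5) / 9 for i in range(9)]
t = [o / 8 for o in (0, 4, 7)]
glide = [a + (b + 1 / 16) / 8 for a in t for b in t]

# At n = 128 the clustered layout touches 3 lines, even spacing 8.
print(lines_touched(even, 128), lines_touched(glide, 128))  # → 8 3
```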

Yes, natural run detection catches exact piecewise-linear inputs (if the number of pieces is small enough, as it is here). I expect it's common to have inputs that follow the basic idea of "a value is close to neighboring values" without large natural runs: piecewise-linear with a little noise or a few swaps, or small ordered sections, say.

orlp commented 1 year ago

> Still, once you get above that size the only reason to stick with the clumping is code simplicity, or is there something else?

Code simplicity translates to code size, and I'd like to minimize that where possible. There are definitely experiments to be had, though.

mlochbaum commented 1 year ago

It occurred to me to try a narrower distribution, so I changed the 0, 4, 7 offsets to 1, 4, 6. It's a major improvement! Overall this distribution is much closer to even, so it makes sense that the performance would be more similar. I'm not sure whether this is applicable to the recursive case, but a reason to think so is that the median of 3 seems to do best with an even or very slightly narrower distribution. A way to look at this is that the ends of the list are likely to be atypical, and for a median we don't want a representative picture of the whole list, just a best guess at what's typical. Which seems like a sound principle even outside the piecewise-linear framework.

Even distribution:

| Partition | Diagram | 1 | 2 | 3 | 7 |
|-----------|---------|--:|--:|--:|--:|
| True of 3 | ~147 | 0 | 14681 | 26447 | 77240 |
| True of 9 | 012345678 | 0 | 1655 | 2918 | 8910 |
| 012102120 | 048/136/257 | 0 | 1655 | 3070 | 15718 |
| 012012012 | 036/147/258 | 0 | 1655 | 4274 | 16950 |
| 000111222 | 012/345/678 | 0 | 14681 | 19770 | 29456 |
| 011012220 | 038/124/567 | 35848 | 32760 | 30405 | 29658 |

Glidesort based on 1, 4, 6 (approx 0.148, 0.195, 0.227, 0.523, 0.57, 0.602, 0.773, 0.82, 0.852):

| Partition | Diagram | 1 | 2 | 3 | 7 |
|-----------|---------|--:|--:|--:|--:|
| True of 3 | ~147 | 0 | 14681 | 26447 | 77240 |
| True of 9 | 012345678 | 14184 | 22257 | 27761 | 49349 |
| 001120221 | 015/238/467 | 1526 | 10364 | 18405 | 52902 |
| 012102120 | 048/136/257 | 14184 | 22257 | 27827 | 52224 |
| 012012012 | 036/147/258 | 14184 | 22257 | 28161 | 57044 |
| 000111222 | 012/345/678 | 14184 | 25705 | 32574 | 64075 |
| 000122112 | 012/367/458 | 30231 | 41024 | 49248 | 65698 |
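The positions in this table's heading follow from the same midpoint formula with outer and inner offsets 1, 4, 6 (again my own arithmetic, worth double-checking):

```python
# Offsets 1, 4, 6 in eighths, candidate at the midpoint of each 1/64 block.
t = [o / 8 for o in (1, 4, 6)]
positions = sorted(a + (b + 1 / 16) / 8 for a in t for b in t)
print([round(p, 3) for p in positions])
# → [0.148, 0.195, 0.227, 0.523, 0.57, 0.602, 0.773, 0.82, 0.852]
```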

Glidesort based on 1, 3.5, 6 (approx 0.148, 0.188, 0.227, 0.461, 0.5, 0.539, 0.773, 0.813, 0.852):

| Partition | Diagram | 1 | 2 | 3 | 7 |
|-----------|---------|--:|--:|--:|--:|
| True of 3 | ~147 | 0 | 14681 | 26447 | 77240 |
| True of 9 | 012345678 | 0 | 11615 | 18112 | 44243 |
| 012102120 | 048/136/257 | 0 | 11615 | 18231 | 47409 |
| 012012012 | 036/147/258 | 0 | 11615 | 18504 | 51787 |
| 000111222 | 012/345/678 | 0 | 15424 | 23531 | 59706 |
| 000122112 | 012/367/458 | 4393 | 18546 | 25307 | 58149 |

With this last balanced distribution, which would change `n8 * 4` to `(n8 * 7) / 2`, the difference between 000111222 and other groupings (or the true median, for that matter) isn't that large. So this is something that could be an improvement without changing the order and potentially losing cache lines. And the tighter grouping does make cache-line reuse slightly more likely.

mlochbaum commented 1 year ago

Outer offsets of 1, 3.5, 6 combined with inner offsets of 0, 3.5, 7 do better still, so the better move, going by this model, is to space the initial selection of three regions evenly instead of putting the outer two at the edges. One way to see this: even spacing scales by a factor of 3 at each level instead of 8, so relative to scaling by 8 it keeps getting wider at each step.

Regarding the scheme as a whole, here's the dilemma I see. It does take position into account, as otherwise you'd just pick a bunch of elements from the start of the array and be done with it. The reason for this has to be concern about spatial locality, right? But both empirically and intuitively, I'm finding that the fractal recursion doesn't do much better than a median of 3 at handling locality (it will filter noise fine of course). Tight median-of-3 at the lowest level or two to compute faster and save cache makes sense to me; for higher levels my still-tentative thinking is that it's best to go with more even spacing, and reordering or larger medians if possible.

(I'll stop bothering you. Felt if I'm going to do this study I may as well share it...)

orlp commented 1 year ago

> Regarding the scheme as a whole, here's the dilemma I see. It does take position into account, as otherwise you'd just pick a bunch of elements from the start of the array and be done with it. The reason for this has to be concern about spatial locality, right?

Actually my main justification for spreading out our pivot sample over the input array is distribution drift. The elements at the start of the array might not follow the same distribution as those somewhere else. This is rare in benchmarks but not necessarily so in real world data.

orlp commented 1 year ago

I fat-fingered the close button, oops.

Voultapher commented 1 year ago

Talking about perf, I'll add that in my testing, just median-of-3 below a small-size threshold (e.g. 64) is not ideal for sorting integers. Getting a higher-quality median pays off with a total runtime improvement of a couple percent. I recently simplified pivot selection for ipnsort (https://github.com/Voultapher/sort-research-rs/commit/4c5a8925d29e232836dae3bf3340960f09a2bccc) but I'm still not happy with it.

mlochbaum commented 1 year ago

Random data, I assume? I've been switching to pisort (a simple merge sort, more or less half of piposort) at size 192 and below. It uses O(n) memory, though.

Voultapher commented 1 year ago

> Random data I assume?

Yes, as generated by the rand crate. But it's faster for other patterns too. My theory is that it avoids more ineffective partitions at small sizes and switches more quickly to the specialized small-sort.