orlp / glidesort

A Rust implementation of Glidesort, my stable adaptive quicksort/mergesort hybrid sorting algorithm.

Median candidate analysis #10


mlochbaum commented 1 year ago

I've had this half-finished for a while; I'm writing up what I can even though there's more to investigate. CC @scandum, @Voultapher, as I believe ipnsort has picked up glidesort's candidate approach.

I looked into how the choice of pivot candidates affects median accuracy using a simulation on random piecewise-linear inputs. These seem like they capture one type of order that might be expected in an input, and I don't have any other promising ideas for testing that.

I found that the accuracy depends not only on the way candidates are grouped into sets of 3 (what I initially wanted to test) but also on the positions of those candidates (unexpected, but obvious in retrospect). So I measured both, for an even distribution and for Glidesort's recursive selection, using the midpoints of the nine intervals that come out of the recursion. I copied the arithmetic for this by hand; it looks right, but it's possible I made a mistake here. I've also tested various positions not shown here. Error is measured as cost, assuming sorting each of the two partitions takes time proportional to n log(n). For context, the baseline cost is 8965784, so a relatively high cost difference of 30000 adds 0.33% to total sorting cost due to that step (and it compounds at every partition). The tables below show cost for the best and worst groupings after testing every possibility, as well as some others.
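The cost metric can be sanity-checked with a quick sketch (Python rather than BQN; `split_cost` is my own name, and this reconstructs only the per-input scoring, not the full simulation):

```python
import math

def split_cost(k, n=1000):
    """n log n cost of sorting both partitions after a pivot of rank k."""
    f = lambda m: m * math.log2(m) if m > 0 else 0.0
    return f(k) + f(n - k)

# A perfect median splits 1000 elements into 500/500; summed over the
# 1000 random inputs per piece count, this gives the quoted baseline.
print(round(1000 * split_cost(500)))  # → 8965784
```

A cost difference of 30000 against this baseline is 30000 / 8965784 ≈ 0.33%, matching the figure above.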

The arrangement 011201220 that I've used and recommended previously does badly, often worse than 000111222.

Beyond any particular choice of grouping, it doesn't seem like Glidesort's positions do well: the "true of 9" row is the true median of those nine candidates, and it's worse than most pseudomedians with even spacing. The middle offset of 4 makes Glidesort's positions skewed and prevents perfect performance on 1-piece (totally linear) inputs, but changing it to 3.5 isn't much of an improvement (in fact the true median is worse for 2-piece inputs). So I think it's mainly the clustering that weakens it, although I can't say exactly why the effect is so strong. Something that seems a little better is to use 3 as the midpoint offset for one or two intervals and 4 for the others.

I also did some theoretical analysis on the median-of-median-of-... idea. I find that 3^k candidates processed with recursive medians have about as much power as a true median of (π/2)*2.25^k pivots. For example the pseudomedian of 243 for a 32768-element list is worth a true median of 91. The probability distributions for the median value found this way with random candidates are equal at the exact midpoint, and the shapes seem similar (if anything the pseudomedian is wider). I can write the math up if you're interested.
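A sketch of a midpoint-density argument that reproduces these numbers (my own reconstruction, so treat it as an assumption until the write-up): each median-of-3 level multiplies the probability density at the true median by 6·F(½)·(1−F(½)) = 3/2, and a true median of m uniform samples has midpoint density about √(2m/π), so equating 1.5^k = √(2m/π) gives m = (π/2)·2.25^k.

```python
import math

def equivalent_true_median(k):
    """True-median sample size m with the same midpoint density as a
    pseudomedian of 3**k candidates: solve 1.5**k == sqrt(2*m/pi)."""
    return (math.pi / 2) * 2.25**k

print(round(equivalent_true_median(5)))  # 3^5 = 243 candidates → 91
```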

Even spacing (0.5/9, ..., 8.5/9); columns are costs on 1-, 2-, 3-, and 7-piece inputs:

| Partition | Diagram | 1 | 2 | 3 | 7 |
|-----------|---------|--:|--:|--:|--:|
| True of 3 | ~147 | 0 | 13413 | 25575 | 77986 |
| True of 9 | 012345678 | 0 | 1557 | 2940 | 8960 |
| 012102120 | 048/136/257 | 0 | 1557 | 3174 | 16728 |
| 012012012 | 036/147/258 | 0 | 1557 | 4276 | 18077 |
| 000111222 | 012/345/678 | 0 | 13413 | 20636 | 28802 |
| 001112022 | 016/234/578 | 35848 | 31942 | 29785 | 30945 |

Glidesort arrangement (approx 0.008, 0.07, 0.117, 0.508, 0.57, 0.617, 0.883, 0.945, 0.992):

| Partition | Diagram | 1 | 2 | 3 | 7 |
|-----------|---------|--:|--:|--:|--:|
| True of 3 | ~147 | 0 | 13413 | 25575 | 77986 |
| True of 9 | 012345678 | 14184 | 69439 | 77943 | 74393 |
| 001122102 | 017/236/458 | 184 | 26029 | 43807 | 74776 |
| 012102120 | 048/136/257 | 14184 | 69439 | 79653 | 81007 |
| 012012012 | 036/147/258 | 14184 | 69439 | 84625 | 85439 |
| 000111222 | 012/345/678 | 14184 | 79120 | 97432 | 98255 |
| 011120220 | 058/123/467 | 39866 | 141889 | 139266 | 99083 |
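For reference, the fractional positions in the heading above fall out of the offsets in the `glide` definition in the BQN source (outer and inner offsets 0, 4, 7 in eighths, plus 1/16 of an eighth for the midpoint of each 1/64-wide block); a quick check, assuming I've read the BQN right:

```python
# Outer/inner offsets 0, 4, 7 in eighths; each candidate sits at the
# midpoint of its 1/64-wide block, hence the extra (1/16)/8 = 1/128.
t = [o / 8 for o in (0, 4, 7)]
positions = sorted(a + (b + 1 / 16) / 8 for a in t for b in t)
print([round(p, 3) for p in positions])
# → [0.008, 0.07, 0.117, 0.508, 0.57, 0.617, 0.883, 0.945, 0.992]
```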

Source for this below; run it with CBQN. I can translate to some other language on request. The inputs are created with `•rand`, which changes between runs, but the results above don't change significantly.

```bqn
# All partitions of 9 candidates into 3+3+3
part ← (⊐⊸≡∧(3⥊3)≡/⁼)¨⊸/⥊↕3⌊1+↕9
pfmt ← > ('0'⊸+ ⋈ ·∾⟜"/"⊸∾´ '0'+⊔)¨ part  # Display
# Hard-coded partitions of interest
pint ← "000111222"‿"012012012"‿"012102120" #‿"011201220"

# Make random piecewise-linear functions
GetFn ← {
  R ← •rand.Range
  p←0∾1∾˜∧(𝕩-2)R 0 ⋄ q←𝕩 R 0  # x endpoints, y endpoints
  m←q÷○(«⊸-)p ⋄ y←q-m×p       # Slope, y intercept
  {(⊏⟜y+𝕩×⊏⟜m)(1↓p)⍋𝕩}
}
dist ← GetFn¨ 1e3/≍2‿3‿4‿8    # 1e3 sets with each of 2, 3, 4, 8 vertices
Sample ← dist {𝕎𝕩}¨ <         # Sample values given positions

# Candidate sampling
Mid ← {(0.5+↕𝕩)÷𝕩}                  # Midpoints of equal-sized intervals, 𝕩 total
glide ← {t←0‿4‿7÷8 ⋄ ⥊t+⌜(t+÷16)÷8} # Midpoints of glidesort candidate intervals

# Scoring: cost increase relative to true median on 1e3 elements
list ← Sample Mid l←1e3
ScoreAll ← {+˝˘ (2×{𝕩×2⋆⁼𝕩}l÷2) -˜ +○{𝕩×2⋆⁼𝕩}⟜(l⊸-) list +´∘≤¨⎉∞‿¯1 𝕩}
MakeTable ← {
  cand ← Sample 𝕩
  med ← part (1⊑∧){𝔽𝔽¨}∘⊔⌜ cand                # Pseudomedians
  score ← +˝˘ scoremat ← ScoreAll med
  t39 ← > (⌊2÷˜3‿9) ⊑⟜∧¨¨ ⟨Sample Mid 3, cand⟩ # True medians
  ∾⟨
    ["True of 3"‿"147","True of 9"‿('0'+↕9)] ∾˘ ⌊ ScoreAll t39
    ((⌽⊸∨0=↕∘≠)∨pint∊˜⊏˘)⊸/ score ⍋⊸⊏ (⌊scoremat) ∾˘˜ pfmt
  ⟩
}
•Show∘MakeTable¨ ⟨Mid 9, glide⟩
```
orlp commented 1 year ago

One thing I'd like to note is that the 'piecewise-linear' case in glidesort is taken care of by run analysis and merging, and thus is not a deciding factor for the quicksort pivot selection. Instead I wanted to minimize the number of cache lines that the recursive method touches. That's why I don't do even spacing.

It is an interesting idea to change up the grouping though. I would venture to guess that 012 120 201 would perform the best here, which disperses one lo/mid/hi position across each group.

mlochbaum commented 1 year ago

The script tests all combinations, so if 012 120 201 were the absolute best, it'd be in the table. I see that it's just a touch better than 000 111 222: equal for 1- and 2-piece inputs and better on 3- and 7-piece ones, with either set of positions.
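For the record, "all combinations" here means every way to split 9 candidates into three unordered groups of 3, i.e. 9!/(3!³·3!) = 280 (what the BQN `part` enumerates). A sketch of the count in Python, with `triple_partitions` being my own name:

```python
from itertools import combinations

def triple_partitions(items):
    """Yield every split of `items` into unordered groups of 3."""
    items = list(items)
    if not items:
        yield ()
        return
    # Pinning the first element to a group avoids counting group orderings.
    first, rest = items[0], items[1:]
    for pair in combinations(rest, 2):
        remaining = [x for x in rest if x not in pair]
        for tail in triple_partitions(remaining):
            yield ((first,) + pair,) + tail

print(sum(1 for _ in triple_partitions(range(9))))  # → 280
```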

A cache line is 64 bytes, so 16 elements in the 4-byte case, and the smallest gap is between 1 and 8 elements, so groups of 3 will usually fit in one line and adjacent groups might share one? That's better than I realized; I hadn't worked through all the `* 8`s and `/ 8`s to figure out where it stopped. Still, once you get above that size, the only reason to stick with the clumping is code simplicity, or is there something else? Note that a median of 3 with even spacing tests better than any of the clumpy 3-of-3 medians on ≤3-piece inputs, so more samples don't always help even if you can get them from the same cache line.
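To make the cache-line count concrete, a small sketch under the assumptions above (4-byte elements, 64-byte lines, array starting on a line boundary; the position fractions are my reading of the code):

```python
def lines_touched(fracs, n, per_line=16):
    """Distinct cache lines hit by loads at the given fractional positions,
    with `per_line` elements per line (16 for 4-byte elements, 64-byte lines)."""
    return len({int(f * n) // per_line for f in fracs})

even = [(i + 0.5) / 9 for i in range(9)]
t = [o / 8 for o in (0, 4, 7)]
glide = [a + (b + 1 / 16) / 8 for a in t for b in t]

# At n = 128 the clustered layout touches 3 lines, even spacing 8.
print(lines_touched(even, 128), lines_touched(glide, 128))  # → 8 3
```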

Yes, natural run detection catches exact piecewise-linear inputs (if the number of pieces is small enough, as it is here). I expect it's common to have inputs that follow the basic idea of "a value is close to neighboring values" without large natural runs: piecewise-linear with a little noise or a few swaps, or small ordered sections, say.

orlp commented 1 year ago

> Still, once you get above that size the only reason to stick with the clumping is code simplicity, or is there something else?

Code simplicity translates to code size, and I'd like to minimize that where possible. There are definitely experiments to be had, though.

mlochbaum commented 1 year ago

It occurred to me to try a narrower distribution, so I changed the 0, 4, 7 offsets to 1, 4, 6. It's a major improvement! Overall this distribution is much closer to even, so it makes sense that the performance would be more similar. I'm not sure whether this is applicable to the recursive case, but a reason to think so is that the median of 3 seems to do best with an even or very slightly narrower distribution. A way to look at this is that the ends of the list are likely to be atypical, and for a median we don't want a representative picture of the whole list, just a best guess at what's typical. Which seems like a sound principle even outside the piecewise-linear framework.

Even distribution:

| Partition | Diagram | 1 | 2 | 3 | 7 |
|-----------|---------|--:|--:|--:|--:|
| True of 3 | ~147 | 0 | 14681 | 26447 | 77240 |
| True of 9 | 012345678 | 0 | 1655 | 2918 | 8910 |
| 012102120 | 048/136/257 | 0 | 1655 | 3070 | 15718 |
| 012012012 | 036/147/258 | 0 | 1655 | 4274 | 16950 |
| 000111222 | 012/345/678 | 0 | 14681 | 19770 | 29456 |
| 011012220 | 038/124/567 | 35848 | 32760 | 30405 | 29658 |

Glidesort based on 1, 4, 6 (approx 0.148, 0.195, 0.227, 0.523, 0.57, 0.602, 0.773, 0.82, 0.852):

| Partition | Diagram | 1 | 2 | 3 | 7 |
|-----------|---------|--:|--:|--:|--:|
| True of 3 | ~147 | 0 | 14681 | 26447 | 77240 |
| True of 9 | 012345678 | 14184 | 22257 | 27761 | 49349 |
| 001120221 | 015/238/467 | 1526 | 10364 | 18405 | 52902 |
| 012102120 | 048/136/257 | 14184 | 22257 | 27827 | 52224 |
| 012012012 | 036/147/258 | 14184 | 22257 | 28161 | 57044 |
| 000111222 | 012/345/678 | 14184 | 25705 | 32574 | 64075 |
| 000122112 | 012/367/458 | 30231 | 41024 | 49248 | 65698 |
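The positions in this table's heading follow from the same midpoint formula with outer and inner offsets 1, 4, 6 (again my own arithmetic, worth double-checking):

```python
# Offsets 1, 4, 6 in eighths, candidate at the midpoint of each 1/64 block.
t = [o / 8 for o in (1, 4, 6)]
positions = sorted(a + (b + 1 / 16) / 8 for a in t for b in t)
print([round(p, 3) for p in positions])
# → [0.148, 0.195, 0.227, 0.523, 0.57, 0.602, 0.773, 0.82, 0.852]
```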

Glidesort based on 1, 3.5, 6 (approx 0.148, 0.188, 0.227, 0.461, 0.5, 0.539, 0.773, 0.813, 0.852):

| Partition | Diagram | 1 | 2 | 3 | 7 |
|-----------|---------|--:|--:|--:|--:|
| True of 3 | ~147 | 0 | 14681 | 26447 | 77240 |
| True of 9 | 012345678 | 0 | 11615 | 18112 | 44243 |
| 012102120 | 048/136/257 | 0 | 11615 | 18231 | 47409 |
| 012012012 | 036/147/258 | 0 | 11615 | 18504 | 51787 |
| 000111222 | 012/345/678 | 0 | 15424 | 23531 | 59706 |
| 000122112 | 012/367/458 | 4393 | 18546 | 25307 | 58149 |

With this last balanced distribution, which would change `n8 * 4` to `(n8 * 7) / 2`, the difference between 000111222 and other groupings (or the true median, for that matter) isn't that large. So this is something that could be an improvement without changing the order and potentially losing cache lines. And the tighter grouping does make cache-line reuse slightly more likely.

mlochbaum commented 1 year ago

Outer offsets of 1, 3.5, 6 combined with inner offsets of 0, 3.5, 7 do better still, so the better move, going by this model, is to space the initial selection of three regions evenly instead of putting the outer two at the edges. One way to see this: even spacing scales by a factor of 3 at each level instead of 8, so relative to scaling by 8 it keeps getting wider at each step.

Regarding the scheme as a whole, here's the dilemma I see. It does take position into account, as otherwise you'd just pick a bunch of elements from the start of the array and be done with it. The reason for this has to be concern about spatial locality, right? But both empirically and intuitively, I'm finding that the fractal recursion doesn't do much better than a median of 3 at handling locality (it will filter noise fine of course). Tight median-of-3 at the lowest level or two to compute faster and save cache makes sense to me; for higher levels my still-tentative thinking is that it's best to go with more even spacing, and reordering or larger medians if possible.

(I'll stop bothering you. Felt if I'm going to do this study I may as well share it...)

orlp commented 1 year ago

> Regarding the scheme as a whole, here's the dilemma I see. It does take position into account, as otherwise you'd just pick a bunch of elements from the start of the array and be done with it. The reason for this has to be concern about spatial locality, right?

Actually my main justification for spreading out our pivot sample over the input array is distribution drift. The elements at the start of the array might not follow the same distribution as those somewhere else. This is rare in benchmarks but not necessarily so in real world data.

orlp commented 1 year ago

I fat-fingered the close button, oops.

Voultapher commented 1 year ago

Talking about perf, I'll add that in my testing, just median-of-3 below a small-size threshold (e.g. 64) is not ideal for sorting integers. Getting a higher-quality median pays off with a total runtime improvement of a couple percent. I recently simplified pivot selection for ipnsort (https://github.com/Voultapher/sort-research-rs/commit/4c5a8925d29e232836dae3bf3340960f09a2bccc) but I'm still not happy with it.

mlochbaum commented 1 year ago

Random data, I assume? I've been switching to pisort (a simple merge sort, more or less half of piposort) at size 192 and below. It uses O(n) memory, though.

Voultapher commented 1 year ago

> Random data I assume?

Yes, as generated by the rand crate. But it's faster for other patterns too. My theory is that it avoids more ineffective partitions at small sizes and switches more quickly to the specialized small-sort.