haskell / containers

Assorted concrete container types
https://hackage.haskell.org/package/containers

Speed up unstable sorting #423

Closed: treeowl closed this issue 6 years ago

treeowl commented 7 years ago

I recently figured out how to write a (reasonably) efficient unstable sort for arbitrary Traversable containers. Surprisingly, that seems to be competitive with Data.Sequence.unstableSort, at least for short to medium-length sequences. I think this means we have some room to improve in Data.Sequence. A few possibilities:

  1. Check out what, if anything, should be made stricter (or lazier).

  2. Use a different heap representation.

  3. Reconsider how we rebuild a sequence from a heap. The fact that we first "unroll" the heap to a list looks suspicious to me.
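
For concreteness, here is a rough sketch of the Traversable-sorting idea: pour the elements into a heap, then traverse the container with a State action that pops the minimum each time, so the shape is preserved and only the elements move. The names and the simple pairing heap here are illustrative assumptions, not the actual Data.Sequence code:

module TraversableSortSketch where

import Control.Monad.State.Strict (State, evalState, state)
import Data.Foldable (toList)

-- A plain pairing heap, used only for illustration.
data Heap a = Empty | Node a [Heap a]

insert :: Ord a => a -> Heap a -> Heap a
insert x = merge (Node x [])

merge :: Ord a => Heap a -> Heap a -> Heap a
merge Empty h = h
merge h Empty = h
merge h1@(Node x xs) h2@(Node y ys)
  | x <= y    = Node x (h2 : xs)
  | otherwise = Node y (h1 : ys)

mergePairs :: Ord a => [Heap a] -> Heap a
mergePairs []             = Empty
mergePairs [h]            = h
mergePairs (h1 : h2 : hs) = merge (merge h1 h2) (mergePairs hs)

-- Pop the minimum.  Each element of the container is popped exactly once,
-- so the Empty case is never reached during the traversal.
popMin :: Ord a => State (Heap a) a
popMin = state $ \h -> case h of
  Node x ts -> (x, mergePairs ts)
  Empty     -> error "popMin: empty heap"

-- Unstable sort for any Traversable container.
sortTraversable :: (Traversable t, Ord a) => t a -> t a
sortTraversable t =
  evalState (traverse (const popMin) t) (foldr insert Empty (toList t))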

oisdk commented 7 years ago

I've been playing around with this for a while, and I'm getting some decent speedups with a couple of changes.

Specialise

The main thing is that adding a specialise pragma to replicateA seems to give a 15-30% speedup.
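
To illustrate, here is a simplified stand-in (not the real replicateA from Data.Sequence, and the exact pragma that helps may differ) showing the kind of SPECIALISE pragma I mean: pin the polymorphic Applicative to the State type used during sorting, so GHC compiles a monomorphic loop without dictionary passing.

module ReplicateASketch where

import Control.Monad.State.Strict (State)
import Data.Sequence (Seq)
import qualified Data.Sequence as Seq

-- Simplified stand-in for Data.Sequence.replicateA (the real one builds the
-- tree directly rather than going through a list).
replicateA :: Applicative f => Int -> f a -> f (Seq a)
replicateA n m = Seq.fromList <$> go n
  where
    go k
      | k <= 0    = pure []
      | otherwise = (:) <$> m <*> go (k - 1)

-- The pragma in question: generate a copy of replicateA specialised to the
-- State applicative, so the sorting loop avoids dictionary passing.
{-# SPECIALIZE replicateA :: Int -> State s a -> State s (Seq a) #-}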

Removing intermediate list

The list is only being constructed in order to be deconstructed in the fromList2 function, which only uses unconsing, so it seems like you should be able to use a minView function directly:

pqState = State minView

Or something. However, the heap type can't represent an empty heap:

data PQueue e = PQueue e (PQL e)
data PQL e = Nil | {-# UNPACK #-} !(PQueue e) :& PQL e

This means that if you rewrite the unroll function in the direct State style, it turns out to be too strict. If you replace the implementation of <*> with a handwritten definition (instead of ap), it's no longer too strict:

fs <*> xs =
  State $ \s -> case runState fs s of
    (s',f) -> case runState xs s' of
      (s'',x) -> (s'', f x)

But I don't know whether that affects other code negatively. That lets you remove the intermediate list:

pqState cmp = State unrollPQ' where
  {-# INLINE unrollPQ' #-}
  unrollPQ' (PQueue x ts) = (mergePQs ts,x)
  mergePQs (t :& Nil) = t
  mergePQs (t1 :& t2 :& Nil) = t1 <+> t2
  mergePQs (t1 :& t2 :& ts) = (t1 <+> t2) <+> mergePQs ts
  mergePQs Nil = undefined
  (<+>) = mergePQ cmp

This speedup is less dramatic (possibly even slower on small inputs), but on larger inputs it looks significant.
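
For reference, the mergePQ used above is just the usual pairing-heap merge on the PQueue type shown earlier. A sketch (the real definition may differ in argument order or strictness details):

mergePQ :: (e -> e -> Ordering) -> PQueue e -> PQueue e -> PQueue e
mergePQ cmp q1@(PQueue x xs) q2@(PQueue y ys)
  | cmp x y == GT = PQueue y (q1 :& ys)  -- y is the smaller root; q1 becomes one of its children
  | otherwise     = PQueue x (q2 :& xs)  -- ties go to the left argument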

A Different Heap

In my own testing, skew heaps have been quicker than pairing heaps for the sort-traversable functions, but I haven't run any tests with Data.Sequence yet.
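
For comparison, the skew heap I tried looks roughly like this (a sketch; the version I actually benchmarked may differ in details). The merge always swaps the children of the winning root, which is what keeps it balanced on average:

data Skew a = SkewNil | SkewNode a (Skew a) (Skew a)

mergeSkew :: (a -> a -> Ordering) -> Skew a -> Skew a -> Skew a
mergeSkew _   SkewNil h       = h
mergeSkew _   h       SkewNil = h
mergeSkew cmp h1@(SkewNode x lx rx) h2@(SkewNode y ly ry)
  | cmp x y == GT = SkewNode y (mergeSkew cmp h1 ry) ly  -- y wins: merge h1 into its right spine, swap children
  | otherwise     = SkewNode x (mergeSkew cmp h2 rx) lx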

Benchmarks

All of these benchmarks are on a list of random ints. I haven't looked at nearly-sorted lists, etc.

Old vs. All optimisations

benchmarking 500/old
time                 130.9 μs   (128.4 μs .. 132.8 μs)
                     0.997 R²   (0.996 R² .. 0.999 R²)
mean                 133.0 μs   (131.2 μs .. 134.9 μs)
std dev              6.509 μs   (5.411 μs .. 8.084 μs)
variance introduced by outliers: 50% (moderately inflated)

benchmarking 500/new
time                 100.5 μs   (99.58 μs .. 101.6 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 101.3 μs   (99.82 μs .. 103.0 μs)
std dev              5.435 μs   (4.131 μs .. 7.558 μs)
variance introduced by outliers: 56% (severely inflated)

benchmarking 50000/old
time                 60.96 ms   (59.87 ms .. 63.09 ms)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 60.23 ms   (58.99 ms .. 60.96 ms)
std dev              1.692 ms   (543.6 μs .. 3.016 ms)

benchmarking 50000/new
time                 38.52 ms   (37.78 ms .. 39.57 ms)
                     0.998 R²   (0.994 R² .. 0.999 R²)
mean                 38.75 ms   (38.13 ms .. 39.45 ms)
std dev              1.286 ms   (913.8 μs .. 2.232 ms)

benchmarking 500000/old
time                 1.349 s    (1.344 s .. 1.354 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.350 s    (1.349 s .. 1.350 s)
std dev              933.7 μs   (0.0 s .. 1.010 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking 500000/new
time                 973.4 ms   (931.8 ms .. 1.001 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 969.9 ms   (962.6 ms .. 974.9 ms)
std dev              7.470 ms   (0.0 s .. 8.622 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking 1000000/old
time                 3.479 s    (2.747 s .. 4.043 s)
                     0.993 R²   (0.991 R² .. 1.000 R²)
mean                 3.095 s    (2.849 s .. 3.258 s)
std dev              244.0 ms   (0.0 s .. 281.7 ms)
variance introduced by outliers: 21% (moderately inflated)

benchmarking 1000000/new
time                 2.357 s    (2.263 s .. 2.497 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 2.378 s    (2.358 s .. 2.396 s)
std dev              30.39 ms   (0.0 s .. 31.78 ms)
variance introduced by outliers: 19% (moderately inflated)

Old with SPECIALIZE pragma vs. All optimisations:

benchmarking 500/old
time                 85.07 μs   (83.52 μs .. 86.83 μs)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 85.99 μs   (84.74 μs .. 87.36 μs)
std dev              4.242 μs   (3.673 μs .. 5.049 μs)
variance introduced by outliers: 52% (severely inflated)

benchmarking 500/new
time                 96.67 μs   (95.19 μs .. 98.16 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 96.64 μs   (95.28 μs .. 98.07 μs)
std dev              4.678 μs   (3.986 μs .. 5.547 μs)
variance introduced by outliers: 51% (severely inflated)

benchmarking 50000/old
time                 42.00 ms   (41.46 ms .. 42.74 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 42.57 ms   (42.17 ms .. 42.94 ms)
std dev              756.9 μs   (528.2 μs .. 1.094 ms)

benchmarking 50000/new
time                 34.79 ms   (34.25 ms .. 35.23 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 36.27 ms   (35.77 ms .. 36.80 ms)
std dev              1.081 ms   (758.3 μs .. 1.511 ms)

benchmarking 500000/old
time                 1.126 s    (1.106 s .. 1.142 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.111 s    (1.108 s .. 1.116 s)
std dev              3.888 ms   (0.0 s .. 4.194 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking 500000/new
time                 951.6 ms   (921.9 ms .. NaN s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 950.8 ms   (944.3 ms .. 954.4 ms)
std dev              5.733 ms   (136.0 as .. 6.142 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking 1000000/old
time                 2.722 s    (2.636 s .. 2.789 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.733 s    (2.713 s .. 2.745 s)
std dev              19.28 ms   (0.0 s .. 22.26 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking 1000000/new
time                 2.485 s    (1.926 s .. 2.740 s)
                     0.994 R²   (0.986 R² .. 1.000 R²)
mean                 2.515 s    (2.425 s .. 2.571 s)
std dev              84.18 ms   (0.0 s .. 96.03 ms)
variance introduced by outliers: 19% (moderately inflated)

Both of these versions coexist on my fork.

treeowl commented 7 years ago

It appears that the containers benchmark suite doesn't exercise either sorting function! When you find the implementation you prefer, do you think you could submit a PR making that change and also adding a couple of benchmarks? Thanks a lot for your work so far.
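
Something along these lines would do (a sketch only, with placeholder group names and input sizes; whatever eventually lands in the PR is up to you):

module Main (main) where

import Criterion.Main
import qualified Data.Sequence as Seq
import System.Random (mkStdGen, randoms)

-- Deterministic pseudo-random input so runs are comparable.
randomInts :: Int -> [Int]
randomInts n = take n (randoms (mkStdGen 42))

main :: IO ()
main = defaultMain
  [ bgroup "Seq.unstableSort"
      [ bench (show n) $ nf (Seq.unstableSort . Seq.fromList) (randomInts n)
      | n <- [1000, 10000, 100000]
      ]
  , bgroup "Seq.sort"
      [ bench (show n) $ nf (Seq.sort . Seq.fromList) (randomInts n)
      | n <- [1000, 10000, 100000]
      ]
  ]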

P.S., I'm looking forward to your attempts at fast total Traversable sorting without compiler plugins or rewrite rules.... Alternatively, I can try it myself, but I can't seem to figure out how to prove associativity for zeroless binary addition without a jillion helper functions and a gazillion lines of code. That's probably just because I don't have enough experience with such things, though.

treeowl commented 7 years ago

Also, I don't understand why that definition of <*> would be lazier. What am I missing?

oisdk commented 7 years ago

Huh - looks like I was wrong about <*>. I was hitting bottoms when I used ap, but they went away when I replaced it with a manual definition of <*> - it must have been something else, though, since I can't replicate it now.

I'll include the benchmarks in the pull request.

(still hacking away at the binomial heap - I'm going to try using some of the tricks from Stephanie Weirich's talk at POPL to build up the proofs of each property in the heap GADT, to see if that works. It's hard to go back to handwritten proofs after seeing how easy the plugins make it, though!)

m-renaud commented 6 years ago

Since https://github.com/haskell/containers/pull/425 was merged can we close this issue? Or do you have more performance improvements lined up? :)

treeowl commented 6 years ago

@m-renaud, I think the "use a different heap" option @oisdk mentioned is something we should consider before closing this.

oisdk commented 6 years ago

After testing this with some other heaps, I can't find anything that's better than the pairing heap here. The closest was a skew heap:

benchmarking 1000000/unsorted/pairing
time                 2.005 s    (NaN s .. 2.102 s)
                     1.000 R²   (0.998 R² .. 1.000 R²)
mean                 2.069 s    (2.060 s .. 2.075 s)
std dev              9.340 ms   (0.0 s .. 10.67 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking 1000000/unsorted/skew
time                 2.042 s    (1.637 s .. 2.267 s)
                     0.995 R²   (0.990 R² .. NaN R²)
mean                 2.165 s    (2.065 s .. 2.217 s)
std dev              87.10 ms   (0.0 s .. 91.26 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking 1000000/ascending/pairing
time                 191.4 ms   (187.8 ms .. 193.5 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 197.0 ms   (194.7 ms .. 200.0 ms)
std dev              3.221 ms   (2.441 ms .. 3.924 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking 1000000/ascending/skew
time                 232.3 ms   (227.0 ms .. 238.9 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 233.9 ms   (230.6 ms .. 236.2 ms)
std dev              3.678 ms   (2.790 ms .. 4.777 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking 1000000/descending/pairing
time                 204.6 ms   (190.2 ms .. 214.1 ms)
                     0.998 R²   (0.991 R² .. 1.000 R²)
mean                 208.4 ms   (204.1 ms .. 210.6 ms)
std dev              4.051 ms   (1.299 ms .. 5.288 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking 1000000/descending/skew
time                 229.9 ms   (212.7 ms .. 240.1 ms)
                     0.998 R²   (0.996 R² .. 1.000 R²)
mean                 238.8 ms   (231.3 ms .. 241.4 ms)
std dev              5.006 ms   (269.0 μs .. 6.151 ms)
variance introduced by outliers: 16% (moderately inflated)

treeowl commented 6 years ago

Thanks a lot. Could you possibly add some notes on that to the source, perhaps in a separate file?

oisdk commented 6 years ago

Sure! How should I structure it in a separate file? Would a .md file in the same directory work?

treeowl commented 6 years ago

Yeah, that sounds reasonable. Just show what you've already tried (preferably with code) so no one else has to try the same things again.

treeowl commented 6 years ago

Based on #490, it looks like any further improvements will likely require new inspiration, so I'm closing this for now. Thanks again, @oisdk.