Closed treeowl closed 6 years ago
I've been playing around with this for a while, and I'm getting some decent speedups with a couple of changes.
The main thing is that adding a specialise pragma to replicateA
seems to give a 15-30% speedup.
The list is only being constructed in order to be deconstructed in the fromList2
function, which only uses unconsing, so it seems like you should be able to use a minView
function directly:
pqState = State minView
Or something. However, the heap type can't represent an empty heap:
data PQueue e = PQueue e (PQL e)
data PQL e = Nil | {-# UNPACK #-} !(PQueue e) :& PQL e
Which means that if you rewrite the unroll function into the direct-state style, it turns out to be too strict. If you replace the implementation of <*>
with a handwritten definition (instead of ap
), it's not too strict:
fs <*> xs =
State $ \s -> case runState fs s of
(s',f) -> case runState xs s' of
(s'',x) -> (s'', f x)
But I don't know if that effects other code negatively. That lets you remove the intermediate list:
pqState cmp = State unrollPQ' where
{-# INLINE unrollPQ' #-}
unrollPQ' (PQueue x ts) = (mergePQs ts,x)
mergePQs (t :& Nil) = t
mergePQs (t1 :& t2 :& Nil) = t1 <+> t2
mergePQs (t1 :& t2 :& ts) = (t1 <+> t2) <+> mergePQs ts
mergePQs Nil = undefined
(<+>) = mergePQ cmp
This speedup is less significant, (possible slower on small inputs) but on larger inputs it looks like it's significant.
In my own testing, skew heaps have been quicker than pairing for the sort-traversable functions, but I haven't run any tests with sequence yet.
All of these benchmarks are on a list of random ints. I haven't looked at nearly-sorted lists, etc.
Old vs. All optimisations
benchmarking 500/old
time 130.9 μs (128.4 μs .. 132.8 μs)
0.997 R² (0.996 R² .. 0.999 R²)
mean 133.0 μs (131.2 μs .. 134.9 μs)
std dev 6.509 μs (5.411 μs .. 8.084 μs)
variance introduced by outliers: 50% (moderately inflated)
benchmarking 500/new
time 100.5 μs (99.58 μs .. 101.6 μs)
0.998 R² (0.997 R² .. 0.999 R²)
mean 101.3 μs (99.82 μs .. 103.0 μs)
std dev 5.435 μs (4.131 μs .. 7.558 μs)
variance introduced by outliers: 56% (severely inflated)
benchmarking 50000/old
time 60.96 ms (59.87 ms .. 63.09 ms)
0.998 R² (0.996 R² .. 1.000 R²)
mean 60.23 ms (58.99 ms .. 60.96 ms)
std dev 1.692 ms (543.6 μs .. 3.016 ms)
benchmarking 50000/new
time 38.52 ms (37.78 ms .. 39.57 ms)
0.998 R² (0.994 R² .. 0.999 R²)
mean 38.75 ms (38.13 ms .. 39.45 ms)
std dev 1.286 ms (913.8 μs .. 2.232 ms)
benchmarking 500000/old
time 1.349 s (1.344 s .. 1.354 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.350 s (1.349 s .. 1.350 s)
std dev 933.7 μs (0.0 s .. 1.010 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking 500000/new
time 973.4 ms (931.8 ms .. 1.001 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 969.9 ms (962.6 ms .. 974.9 ms)
std dev 7.470 ms (0.0 s .. 8.622 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking 1000000/old
time 3.479 s (2.747 s .. 4.043 s)
0.993 R² (0.991 R² .. 1.000 R²)
mean 3.095 s (2.849 s .. 3.258 s)
std dev 244.0 ms (0.0 s .. 281.7 ms)
variance introduced by outliers: 21% (moderately inflated)
benchmarking 1000000/new
time 2.357 s (2.263 s .. 2.497 s)
1.000 R² (0.999 R² .. 1.000 R²)
mean 2.378 s (2.358 s .. 2.396 s)
std dev 30.39 ms (0.0 s .. 31.78 ms)
variance introduced by outliers: 19% (moderately inflated)
Old with SPECIALIZE
pragma vs. All optimisations:
benchmarking 500/old
time 85.07 μs (83.52 μs .. 86.83 μs)
0.997 R² (0.996 R² .. 0.998 R²)
mean 85.99 μs (84.74 μs .. 87.36 μs)
std dev 4.242 μs (3.673 μs .. 5.049 μs)
variance introduced by outliers: 52% (severely inflated)
benchmarking 500/new
time 96.67 μs (95.19 μs .. 98.16 μs)
0.998 R² (0.997 R² .. 0.999 R²)
mean 96.64 μs (95.28 μs .. 98.07 μs)
std dev 4.678 μs (3.986 μs .. 5.547 μs)
variance introduced by outliers: 51% (severely inflated)
benchmarking 50000/old
time 42.00 ms (41.46 ms .. 42.74 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 42.57 ms (42.17 ms .. 42.94 ms)
std dev 756.9 μs (528.2 μs .. 1.094 ms)
benchmarking 50000/new
time 34.79 ms (34.25 ms .. 35.23 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 36.27 ms (35.77 ms .. 36.80 ms)
std dev 1.081 ms (758.3 μs .. 1.511 ms)
benchmarking 500000/old
time 1.126 s (1.106 s .. 1.142 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 1.111 s (1.108 s .. 1.116 s)
std dev 3.888 ms (0.0 s .. 4.194 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking 500000/new
time 951.6 ms (921.9 ms .. NaN s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 950.8 ms (944.3 ms .. 954.4 ms)
std dev 5.733 ms (136.0 as .. 6.142 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking 1000000/old
time 2.722 s (2.636 s .. 2.789 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 2.733 s (2.713 s .. 2.745 s)
std dev 19.28 ms (0.0 s .. 22.26 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking 1000000/new
time 2.485 s (1.926 s .. 2.740 s)
0.994 R² (0.986 R² .. 1.000 R²)
mean 2.515 s (2.425 s .. 2.571 s)
std dev 84.18 ms (0.0 s .. 96.03 ms)
variance introduced by outliers: 19% (moderately inflated)
Both of these versions coexist on my fork.
It appears that the containers
benchmark suite doesn't exercise either sorting function! When you find the implementation you prefer, do you think you could submit a PR making that change and also adding a couple benchmarks? Thanks a lot for your work so far.
P.S., I'm looking forward to your attempts at fast total Traversable
sorting without compiler plugins or rewrite rules.... Alternatively, I can try it myself, but I can't seem to figure out how to prove associativity for zeroless binary addition without a jillion helper functions and a gazillion lines of code. That's probably just because I don't have enough experience with such things, though.
Also, I don't understand why that definition of <*>
would be lazier. What am I missing?
Huh - looks like I was wrong about <*>
. I was hitting bottoms when I used ap
, but they went away when I replaced it with a manual definition of <*>
- it must have been something else, though, since I can't replicate it now.
I'll include the benchmarks in the pull request.
(still hacking away at the binomial heap - I'm going to try using some of the tricks from Stephanie Weirich's talk at POPL to build up the proofs of each property in the heap GADT, see if that works. It's hard to go back to handwritten proofs after seeing how easy the plugins make it, though!)
Since https://github.com/haskell/containers/pull/425 was merged can we close this issue? Or do you have more performance improvements lined up? :)
@m-renaud, I think the "use a different heap" option @oisdk mentioned is something we should consider before closing this.
After testing this with some other heaps, I can't find anything that's better than the pairing heap here. The closest was a skew heap:
benchmarking 1000000/unsorted/pairing
time 2.005 s (NaN s .. 2.102 s)
1.000 R² (0.998 R² .. 1.000 R²)
mean 2.069 s (2.060 s .. 2.075 s)
std dev 9.340 ms (0.0 s .. 10.67 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking 1000000/unsorted/skew
time 2.042 s (1.637 s .. 2.267 s)
0.995 R² (0.990 R² .. NaN R²)
mean 2.165 s (2.065 s .. 2.217 s)
std dev 87.10 ms (0.0 s .. 91.26 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking 1000000/ascending/pairing
time 191.4 ms (187.8 ms .. 193.5 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 197.0 ms (194.7 ms .. 200.0 ms)
std dev 3.221 ms (2.441 ms .. 3.924 ms)
variance introduced by outliers: 14% (moderately inflated)
benchmarking 1000000/ascending/skew
time 232.3 ms (227.0 ms .. 238.9 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 233.9 ms (230.6 ms .. 236.2 ms)
std dev 3.678 ms (2.790 ms .. 4.777 ms)
variance introduced by outliers: 14% (moderately inflated)
benchmarking 1000000/descending/pairing
time 204.6 ms (190.2 ms .. 214.1 ms)
0.998 R² (0.991 R² .. 1.000 R²)
mean 208.4 ms (204.1 ms .. 210.6 ms)
std dev 4.051 ms (1.299 ms .. 5.288 ms)
variance introduced by outliers: 14% (moderately inflated)
benchmarking 1000000/descending/skew
time 229.9 ms (212.7 ms .. 240.1 ms)
0.998 R² (0.996 R² .. 1.000 R²)
mean 238.8 ms (231.3 ms .. 241.4 ms)
std dev 5.006 ms (269.0 μs .. 6.151 ms)
variance introduced by outliers: 16% (moderately inflated)
Thanks a lot. Could you possibly add some notes on that to the source, perhaps in a separate file?
On Jan 11, 2018 11:01 AM, "Donnacha Oisín Kidney" notifications@github.com wrote:
After testing this with some other heaps, I can't find anything that's better than the pairing heap here. The closest was a skew heap:
benchmarking 1000000/unsorted/pairing time 2.005 s (NaN s .. 2.102 s) 1.000 R² (0.998 R² .. 1.000 R²) mean 2.069 s (2.060 s .. 2.075 s) std dev 9.340 ms (0.0 s .. 10.67 ms) variance introduced by outliers: 19% (moderately inflated)
benchmarking 1000000/unsorted/skew time 2.042 s (1.637 s .. 2.267 s) 0.995 R² (0.990 R² .. NaN R²) mean 2.165 s (2.065 s .. 2.217 s) std dev 87.10 ms (0.0 s .. 91.26 ms) variance introduced by outliers: 19% (moderately inflated)
benchmarking 1000000/ascending/pairing time 191.4 ms (187.8 ms .. 193.5 ms) 1.000 R² (0.999 R² .. 1.000 R²) mean 197.0 ms (194.7 ms .. 200.0 ms) std dev 3.221 ms (2.441 ms .. 3.924 ms) variance introduced by outliers: 14% (moderately inflated)
benchmarking 1000000/ascending/skew time 232.3 ms (227.0 ms .. 238.9 ms) 0.999 R² (0.997 R² .. 1.000 R²) mean 233.9 ms (230.6 ms .. 236.2 ms) std dev 3.678 ms (2.790 ms .. 4.777 ms) variance introduced by outliers: 14% (moderately inflated)
benchmarking 1000000/descending/pairing time 204.6 ms (190.2 ms .. 214.1 ms) 0.998 R² (0.991 R² .. 1.000 R²) mean 208.4 ms (204.1 ms .. 210.6 ms) std dev 4.051 ms (1.299 ms .. 5.288 ms) variance introduced by outliers: 14% (moderately inflated)
benchmarking 1000000/descending/skew time 229.9 ms (212.7 ms .. 240.1 ms) 0.998 R² (0.996 R² .. 1.000 R²) mean 238.8 ms (231.3 ms .. 241.4 ms) std dev 5.006 ms (269.0 μs .. 6.151 ms) variance introduced by outliers: 16% (moderately inflated)
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/haskell/containers/issues/423#issuecomment-356973089, or mute the thread https://github.com/notifications/unsubscribe-auth/ABzi_VOO54k944SxCQQTJk5RMIdlAjFUks5tJjB3gaJpZM4MzN-J .
Sure! How should I structure it in a separate file? Would a .md file in the same directory work?
Yeah, that sounds reasonable. Just show what you've already tried (preferably with code) so no one else has to try the same things again.
On Jan 11, 2018 12:22 PM, "Donnacha Oisín Kidney" notifications@github.com wrote:
Sure! How should I structure it in a separate file? Would a .md file in the same directory work?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/haskell/containers/issues/423#issuecomment-356998938, or mute the thread https://github.com/notifications/unsubscribe-auth/ABzi_VW-HJxjAEAVlEu4v3_7d_T7K44Rks5tJkM6gaJpZM4MzN-J .
Based on #490, it looks like any further improvements will likely require new inspiration, so I'm closing this for now. Thanks again, @oisdk.
I recently figured out how to write a (reasonably) efficient unstable sort for arbitrary
Traversable
containers. Surprisingly, that seems to be competitive withData.Sequence.unstableSort
, at least for short to medium-length sequences. I think this means we have some room to improve inData.Sequence
. A few possibilities:Check out what if anything should be made stricter (or lazier).
Use a different heap representation.
Reconsider how we rebuild a sequence from a heap. The fact that we first "unroll" the heap to a list looks suspicious to me.