jackfirth / rebellion

A collection of core libraries for Racket
https://pkgs.racket-lang.org/package/rebellion
Apache License 2.0

Sorting transducer is slow #301

Open jackfirth opened 4 years ago

jackfirth commented 4 years ago

Using the sorting transducer is over an order of magnitude slower than using the built-in sort function in racket/base, even when the sorting transducer is followed immediately by a taking transducer. Relevant links:

Even with generic sequences, transducer-based sorting should meet the following performance criteria:

samdphillips commented 4 years ago

I haven't formally worked out the bounds on this, but it has the feel of an O(n log n) algorithm.

I have built a copy of the algorithm without all of the transducer machinery that runs faster.

1000 iterations, 1000 random integers (0-999), taking lowest 10.

transducer: 34260ms
bare bones: 11806ms (34% of transducer version)
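For reference, a minimal version of the baseline being described might look like this (a sketch only; bottom-10 is a hypothetical name, not the code actually benchmarked):

#lang racket/base
(require racket/list)

;; Sketch of the benchmark shape described above: 1000 iterations,
;; 1000 random integers in [0, 999], keeping the 10 smallest.
(define (bottom-10 xs)
  (take (sort xs <) 10))

(time
 (for ([_ (in-range 1000)])
   (bottom-10 (build-list 1000 (λ (_) (random 1000))))))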

The extra 20-ish seconds seem to be split between contract checking, transducer machinery, and sequence handling.

Edit: There is still some contract-checking overhead in the bare-bones version. I was able to shave off 4s by using cons (no contract) instead of list-insert. The profile shows another 5-6s of contract overhead.

jackfirth commented 4 years ago

We could add a no-contract submodule to rebellion/collection/list then. That would make the contracts on list-insert toggleable for the sorting transducer, and if we do it for some of the other modules too then we can investigate the relative costs of each set of contracts.
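One common way to do this in Racket is to export the same binding twice: once through contract-out, and once raw from a submodule. This is a hypothetical sketch of what that could look like for rebellion/collection/list (the list-insert body here is a placeholder, not Rebellion's actual implementation):

#lang racket/base
(require racket/contract)

;; Placeholder body standing in for Rebellion's real list-insert.
(define (list-insert lst v)
  (cons v lst))

;; Checked export for ordinary clients.
(provide (contract-out [list-insert (-> list? any/c list?)]))

;; Unchecked export for performance-sensitive internal code, reachable as
;; (require (submod rebellion/collection/list no-contract)).
(module+ no-contract
  (provide list-insert))

Internal callers like the sorting transducer could then require the no-contract submodule, while the public API keeps its contract boundary.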

samdphillips commented 4 years ago

My thoughts exactly.

rocketnia commented 4 years ago

Summarizing from discussion elsewhere:

About a month ago, when I saw @jackfirth talking about lazy sorting, I started some work on a lazy merge sort using Racket's promises and streams. Over the past few days, I've polished that up just enough to paste it here: https://gist.github.com/rocketnia/5e319a54e4fd980ea5e26d024645628d

It's a stream-to-stream function, but since any sorting transducer will consume the entire input before emitting the first output, I bet a stream-to-stream sorting function can be adapted pretty trivially (using (batching into-list) followed by append-mapping) to make a sorting transducer.

I haven't tested this lazy merge sort yet, but assuming it works the way I expect, its worst-case cost is something like O(max(N, K log N)) time and comparisons to find the first K of N sorted elements. Breaking this down a little: The first O(N) steps turn the given stream into a list of singleton streams, merge all the streams, and take the first element. Then it takes up to O(log N) time to get each subsequent element from the merged stream.
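The lazy core of that merge step can be sketched with Racket's streams (a hypothetical sketch, not the Gist's actual implementation; merge-sorted is a made-up name):

#lang racket/base
(require racket/stream)

;; Lazily merge two already-sorted streams. Each stream-cons suspends the
;; rest of the merge, so elements past the first K are never computed
;; unless the consumer forces them.
(define (merge-sorted lte? s1 s2)
  (cond
    [(stream-empty? s1) s2]
    [(stream-empty? s2) s1]
    [(lte? (stream-first s1) (stream-first s2))
     (stream-cons (stream-first s1)
                  (merge-sorted lte? (stream-rest s1) s2))]
    [else
     (stream-cons (stream-first s2)
                  (merge-sorted lte? s1 (stream-rest s2)))]))

Because the recursive merge is suspended behind each stream-cons, taking only the first K elements forces O(log N) comparisons per element rather than completing the full sort up front.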

A moment ago @jackfirth suggested I check out this issue, so I think in the next couple of days I'll try to hook this up to the benchmarks and see how it goes. 🙂

rocketnia commented 4 years ago

I looked into the performance of in-merge-sorted.

First I verified I was able to run the modified benchmark as-is:

$ racket rebellion-sorting-benchmark.rkt benchmark
'#hash(
  ("Rebellion lazy sort (milliseconds)" .
    #<stats: #:average 96247/1000 #:count 1000 #:max 275 #:min 86 #:sum 96247>)
  ("Standard racket sort (milliseconds)" .
    #<stats: #:average 8907/1000 #:count 1000 #:max 23 #:min 8 #:sum 8907>))

(I've manually formatted the output a little for readability.)

Then I made this change to the "standard racket sort" to simulate how a Rebellion user might implement their own sorting in terms of Racket's sort if sorting weren't provided by the library already:

 (define (bottom-10/racket gems)
-  (take (sort (sequence->list gems) < #:key gemstone-weight) 10))
+  (transduce gems
+             (transducer-pipe
+               (batching into-list)
+               (append-mapping
+                 (λ (lst) (in-list (sort lst < #:key gemstone-weight)))))
+             (taking 10)
+             #:into into-list))

From this point of view, Rebellion's sorting transducer is much better than nothing:

$ racket rebellion-sorting-benchmark.rkt benchmark
'#hash(
  ("Rebellion lazy sort (milliseconds)" .
    #<stats: #:average 12202/125 #:count 1000 #:max 276 #:min 87 #:sum 97616>)
  ("Standard racket sort (milliseconds)" .
    #<stats: #:average 25231/200 #:count 1000 #:max 172 #:min 120 #:sum 126155>))

(The in-list call may be unnecessary, but whatever performance impact it has seems to be negligible.)

Then I added my lazy merge sort, in-merge-sorted, as a contender:

(define (bottom-10/lazy-merge-sort gems)
  (transduce gems
             (transducer-pipe
               (batching into-list)
               (append-mapping
                 (λ (lst)
                   (in-merge-sorted
                     (λ (a b) (<= (gemstone-weight a) (gemstone-weight b)))
                     (in-list lst)))))
             (taking 10)
             #:into into-list))

This uncovered a bug in in-merge-sorted, so I fixed that, and then these were my results:

$ racket rebellion-sorting-benchmark.rkt benchmark
'#hash(
  ("Lazy merge sort (milliseconds)" .
    #<stats: #:average 125869/1000 #:count 1000 #:max 254 #:min 113 #:sum 125869>)
  ("Rebellion lazy sort (milliseconds)" .
    #<stats: #:average 12564/125 #:count 1000 #:max 287 #:min 87 #:sum 100512>)
  ("Standard racket sort (milliseconds)" .
    #<stats: #:average 131687/1000 #:count 1000 #:max 321 #:min 118 #:sum 131687>))

It looks like for this benchmark, in-merge-sorted-as-a-transducer takes about as long as sort-as-a-transducer, and Rebellion's sorting is still in first place.

Here's a new Gist with the modifications I made to the benchmark file as well as the bug fix for in-merge-sorted.

jackfirth commented 4 years ago

It looks like for this benchmark, in-merge-sorted-as-a-transducer takes about as long as sort-as-a-transducer, and Rebellion's sorting is still in first place.

Hmm, that's suspicious. I think transducer-pipe, batching, and append-mapping are introducing a lot of overhead and skewing the results, rather than showing any difference between the sorting algorithms themselves.

jackfirth commented 4 years ago

Idea: you can determine the overhead of pipe/batching/appending by just timing (transducer-pipe (batching into-list) (append-mapping values)).

rocketnia commented 4 years ago

I see what you're saying. It's too bad we can't simply subtract that from the sorting times as well; surely there's some unnecessary overhead in that interface too.

If I add in this code:

(define (bottom-10/passthrough-wrapper gems)
  (transduce gems
             (transducer-pipe (batching into-list) (append-mapping values))
             (taking 10)
             #:into into-list))

(measure-runtime "Passthrough wrapper (milliseconds)"
                 (λ () (bottom-10/passthrough-wrapper gems)))

I get:

$ racket rebellion-sorting-benchmark.rkt benchmark
'#hash(
  ("Lazy merge sort (milliseconds)" .
    #<stats: #:average 60277/500 #:count 1000 #:max 177 #:min 112 #:sum 120554>)
  ("Passthrough wrapper (milliseconds)" .
    #<stats: #:average 117137/1000 #:count 1000 #:max 279 #:min 109 #:sum 117137>)
  ("Rebellion lazy sort (milliseconds)" .
    #<stats: #:average 96781/1000 #:count 1000 #:max 142 #:min 84 #:sum 96781>)
  ("Standard racket sort (milliseconds)" .
    #<stats: #:average 15743/125 #:count 1000 #:max 292 #:min 117 #:sum 125944>))
Algorithm              Time - Wrapper Time (ms)
Lazy merge sort        3417
Rebellion lazy sort    -20356 (if only it made sense to subtract)
Standard racket sort   8807