jackfirth / rebellion

A collection of core libraries for Racket
https://pkgs.racket-lang.org/package/rebellion
Apache License 2.0

Sorting transducer is slow #301

Open jackfirth opened 4 years ago

jackfirth commented 4 years ago

Using the sorting transducer is over an order of magnitude slower than using the built-in sort function in racket/base, even when the sorting transducer is followed immediately by a taking transducer. Relevant links:

Even with generic sequences, transducer-based sorting should meet the following performance criteria:

samdphillips commented 4 years ago

I haven't formally worked out the bounds on this, but it has the feel of an O(n log n) algorithm.

I have built a copy of the algorithm without all of the transducer machinery that runs faster.

1000 iterations, 1000 random integers (0-999), taking lowest 10.

transducer: 34260ms
bare bones: 11806ms (34% of transducer version)
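For reference, a minimal version of the baseline being described might look like this (a sketch only; bottom-10 is a hypothetical name, not the code actually benchmarked):

#lang racket/base
(require racket/list)

;; Sketch of the benchmark shape described above: 1000 iterations,
;; 1000 random integers in [0, 999], keeping the 10 smallest.
(define (bottom-10 xs)
  (take (sort xs <) 10))

(time
 (for ([_ (in-range 1000)])
   (bottom-10 (build-list 1000 (λ (_) (random 1000))))))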

The extra 20-ish seconds seem to be split between contract checking, transducer machinery, and sequence handling.

Edit: There is still some contract-checking overhead in the bare-bones version. I was able to shave off 4s by using cons (no contract) instead of list-insert. The profile shows another 5-6s of contract overhead.

jackfirth commented 4 years ago

We could add a no-contract submodule to rebellion/collection/list then. That would make the contracts on list-insert toggleable for the sorting transducer, and if we do it for some of the other modules too then we can investigate the relative costs of each set of contracts.
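One common way to do this in Racket is to export the same binding twice: once through contract-out, and once raw from a submodule. This is a hypothetical sketch of what that could look like for rebellion/collection/list (the list-insert body here is a placeholder, not Rebellion's actual implementation):

#lang racket/base
(require racket/contract)

;; Placeholder body standing in for Rebellion's real list-insert.
(define (list-insert lst v)
  (cons v lst))

;; Checked export for ordinary clients.
(provide (contract-out [list-insert (-> list? any/c list?)]))

;; Unchecked export for performance-sensitive internal code, reachable as
;; (require (submod rebellion/collection/list no-contract)).
(module+ no-contract
  (provide list-insert))

Internal callers like the sorting transducer could then require the no-contract submodule, while the public API keeps its contract boundary.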

samdphillips commented 4 years ago

My thoughts exactly.

rocketnia commented 4 years ago

Summarizing from discussion elsewhere:

About a month ago, when I saw @jackfirth talking about lazy sorting, I started some work on a lazy merge sort using Racket's promises and streams. Over the past few days, I've polished that up just enough to paste it here: https://gist.github.com/rocketnia/5e319a54e4fd980ea5e26d024645628d

It's a stream-to-stream function, but since any sorting transducer will consume the entire input before emitting the first output, I bet a stream-to-stream sorting function can be adapted pretty trivially (using (batching into-list) followed by append-mapping) to make a sorting transducer.

I haven't tested this lazy merge sort yet, but assuming it works the way I expect, its worst-case cost is something like O(max(N, K log N)) time and comparisons to find the first K of N sorted elements. Breaking this down a little: The first O(N) steps turn the given stream into a list of singleton streams, merge all the streams, and take the first element. Then it takes up to O(log N) time to get each subsequent element from the merged stream.
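The lazy core of that merge step can be sketched with Racket's streams (a hypothetical sketch, not the Gist's actual implementation; merge-sorted is a made-up name):

#lang racket/base
(require racket/stream)

;; Lazily merge two already-sorted streams. Each stream-cons suspends the
;; rest of the merge, so elements past the first K are never computed
;; unless the consumer forces them.
(define (merge-sorted lte? s1 s2)
  (cond
    [(stream-empty? s1) s2]
    [(stream-empty? s2) s1]
    [(lte? (stream-first s1) (stream-first s2))
     (stream-cons (stream-first s1)
                  (merge-sorted lte? (stream-rest s1) s2))]
    [else
     (stream-cons (stream-first s2)
                  (merge-sorted lte? s1 (stream-rest s2)))]))

Because the recursive merge is suspended behind each stream-cons, taking only the first K elements forces O(log N) comparisons per element rather than completing the full sort up front.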

A moment ago @jackfirth suggested I check out this issue, so I think in the next couple of days I'll try to hook this up to the benchmarks and see how it goes. 🙂

rocketnia commented 4 years ago

I looked into the performance of in-merge-sorted.

First I verified I was able to run the modified benchmark as-is:

$ racket rebellion-sorting-benchmark.rkt benchmark
'#hash(
  ("Rebellion lazy sort (milliseconds)" .
    #<stats: #:average 96247/1000 #:count 1000 #:max 275 #:min 86 #:sum 96247>)
  ("Standard racket sort (milliseconds)" .
    #<stats: #:average 8907/1000 #:count 1000 #:max 23 #:min 8 #:sum 8907>))

(I've manually formatted the output a little for readability.)

Then I made this change to the "standard racket sort" to simulate how a Rebellion user might implement their own sorting in terms of Racket's sort if sorting weren't provided by the library already:

 (define (bottom-10/racket gems)
-  (take (sort (sequence->list gems) < #:key gemstone-weight) 10))
+  (transduce gems
+             (transducer-pipe
+               (batching into-list)
+               (append-mapping
+                 (λ (lst) (in-list (sort lst < #:key gemstone-weight)))))
+             (taking 10)
+             #:into into-list))

From this point of view, Rebellion's sorting transducer is much better than nothing:

$ racket rebellion-sorting-benchmark.rkt benchmark
'#hash(
  ("Rebellion lazy sort (milliseconds)" .
    #<stats: #:average 12202/125 #:count 1000 #:max 276 #:min 87 #:sum 97616>)
  ("Standard racket sort (milliseconds)" .
    #<stats: #:average 25231/200 #:count 1000 #:max 172 #:min 120 #:sum 126155>))

(The in-list call may be unnecessary, but whatever performance impact it has seems to be negligible.)

Then I added my lazy merge sort, in-merge-sorted, as a contender:

(define (bottom-10/lazy-merge-sort gems)
  (transduce gems
             (transducer-pipe
               (batching into-list)
               (append-mapping
                 (λ (lst)
                   (in-merge-sorted
                     (λ (a b) (<= (gemstone-weight a) (gemstone-weight b)))
                     (in-list lst)))))
             (taking 10)
             #:into into-list))

This uncovered a bug in in-merge-sorted, so I fixed that, and then these were my results:

$ racket rebellion-sorting-benchmark.rkt benchmark
'#hash(
  ("Lazy merge sort (milliseconds)" .
    #<stats: #:average 125869/1000 #:count 1000 #:max 254 #:min 113 #:sum 125869>)
  ("Rebellion lazy sort (milliseconds)" .
    #<stats: #:average 12564/125 #:count 1000 #:max 287 #:min 87 #:sum 100512>)
  ("Standard racket sort (milliseconds)" .
    #<stats: #:average 131687/1000 #:count 1000 #:max 321 #:min 118 #:sum 131687>))

It looks like for this benchmark, in-merge-sorted-as-a-transducer takes about as long as sort-as-a-transducer, and Rebellion's sorting is still in first place.

Here's a new Gist with the modifications I made to the benchmark file as well as the bug fix for in-merge-sorted.

jackfirth commented 4 years ago

It looks like for this benchmark, in-merge-sorted-as-a-transducer takes about as long as sort-as-a-transducer, and Rebellion's sorting is still in first place.

Hmm, that's suspicious. I think transducer-pipe, batching, and append-mapping are introducing a lot of overhead and skewing the results, rather than showing any difference between the sorting algorithms themselves.

jackfirth commented 4 years ago

Idea: you can determine the overhead of pipe/batching/appending by just timing (transducer-pipe (batching into-list) (append-mapping values)).

rocketnia commented 4 years ago

I see what you're saying. It's too bad we can't simply subtract that from the sorting times as well; surely there's some unnecessary overhead in that interface too.

If I add in this code:

(define (bottom-10/passthrough-wrapper gems)
  (transduce gems
             (transducer-pipe (batching into-list) (append-mapping values))
             (taking 10)
             #:into into-list))

(measure-runtime "Passthrough wrapper (milliseconds)"
                 (λ () (bottom-10/passthrough-wrapper gems)))

I get:

$ racket rebellion-sorting-benchmark.rkt benchmark
'#hash(
  ("Lazy merge sort (milliseconds)" .
    #<stats: #:average 60277/500 #:count 1000 #:max 177 #:min 112 #:sum 120554>)
  ("Passthrough wrapper (milliseconds)" .
    #<stats: #:average 117137/1000 #:count 1000 #:max 279 #:min 109 #:sum 117137>)
  ("Rebellion lazy sort (milliseconds)" .
    #<stats: #:average 96781/1000 #:count 1000 #:max 142 #:min 84 #:sum 96781>)
  ("Standard racket sort (milliseconds)" .
    #<stats: #:average 15743/125 #:count 1000 #:max 292 #:min 117 #:sum 125944>))
Algorithm              Time - Wrapper Time (ms)
Lazy merge sort        3417
Rebellion lazy sort    -20356 (if only it made sense to subtract)
Standard racket sort   8807