Ideas for improving performance of transducers and reducers

jackfirth commented 4 years ago

Make variants intern themselves upon construction when they're wrapping values that are interned, like fixnums, symbols, booleans, keywords, and singletons. Could turn operations like into-nth and into-sum into zero-allocation operations (sometimes), and it will make all stateless transitions (i.e. returning (variant #:<kw> #f) from a state transition handler) zero-allocation.
Consider adding a way for users to force a variant to be interned, so that transducer/reducer implementations that use mutable state can be zero-allocation.
Add special case dispatching to transduce and reduce-all when the input sequence is a list. Possibly add special case dispatching to transduce when the #:into argument is into-list.
Use transducer composition to implement transduce, instead of repeatedly folding over the input sequence with in-transduced. This should remove a bunch of generic interface dynamic dispatch overhead.
Add no-contract submodules and use them in various places.

jackfirth commented 4 years ago

cc @samdphillips and @rocketnia

jackfirth commented 4 years ago

Another idea: make variant a macro that statically expands into a direct call to the internal variant constructor when it's used in a first-order way. So (variant #:foo (some-expression)) would expand into (constructor:variant '#:foo (some-expression)). When used in a higher-order fashion, like with keyword-apply, then fall back to the normal function-like behavior.

This would reduce the need to go through the make-keyword-function slow path, and eliminate the need to check that only one keyword argument is given. Since variants are often constructed by transducers and reducers in tight loops, and the current constructor is never used in a higher-order fashion, that could save a significant chunk of time.

jackfirth commented 4 years ago

See also #287 and #301.

samdphillips commented 4 years ago

Here is a branch (in my fork) where I used no contract versions of rebellion/collection/list and rebellion/base/comparators from the sorting transducer.

I don't have the numbers handy, IIRC in the case of sorting contracts were using about 60% of the time. I'm planning on running the contract profiler again tomorrow on some test programs. Maybe seeing how something like (transduce (in-range 10000) (mapping values) #:into (into-for-each void)) performs.

samdphillips commented 4 years ago

EDIT: the headline of this should probably be Running time is 57.74% contracts

I tested this program with the contract profiler. The changes that I made in the above branch were against a sorting program and don't improve the performance of this baseline case of transducers, so there is still contracts that can be removed or at least talked about. Here are the profile results for this program.

#lang racket/base

(require rebellion/streaming/reducer
         rebellion/streaming/transducer)

(transduce (in-range 1000000)
           (mapping values)
           #:into
           (into-for-each void))

jackfirth commented 4 years ago

Yikes, checking the consumer contract of (mapping values) is 5 seconds on its own, and the emitter contract is another four seconds. And that check would be repeated for every transducer in the chain, so I suspect this:

(transduce (in-range 1000000)
           (mapping values)
           (mapping values)
           (mapping values)
           (mapping values)
           (mapping values)
           #:into
           (into-for-each void))

...would add at least another 40 seconds or so. Maybe we need transducer fusion?

jackfirth commented 4 years ago

Expanded on an idea for transducer fusion in #358.

samdphillips commented 4 years ago

Transducing (mapping values) five times:

Running time is 58.62% contracts
131420/224180 ms

jackfirth commented 4 years ago

My goodness

jackfirth commented 4 years ago

I've written a small microbenchmarking library called Atomichron. Hopefully we can use this to get a better sense of the performance problems in Rebellion.

jackfirth commented 4 years ago

I made an example Atomichron benchmark that measures the performance of (transduce <some vector> #:into into-list) compared to (vector->list <some vector>). For a vector with 1000 elements on my machine:

vector->list takes 30.1 microseconds
The transducer-based implementation takes 61.5 milliseconds (that's 2000x slower!)

samdphillips commented 4 years ago

Linking performance benchmarks Jack has written here, so we can use them for comparisons.

https://gist.github.com/jackfirth/9596dd48536344b4b94c1938a36de4c9

jackfirth / rebellion

Ideas for improving performance of transducers and reducers #351