greglook / clj-cbor

Native Clojure CBOR codec implementation.
The Unlicense

Jump table decoder #11

Closed: aria42 closed this 5 years ago

aria42 commented 5 years ago

Initial Jump table decoder implementation and tests.

The out-of-the-box implementation is pretty fast (46% faster decoding on the benchmark), but there's a lot of room for improvement. I wanted to get feedback before iterating on this. Some tidbits:
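For readers unfamiliar with the approach, here's a minimal sketch of the jump-table idea (the names `jump-entries` and `jump-decode` are illustrative, not the PR's actual API): instead of branching on the major type with nested conditionals, the decoder indexes a 256-entry array by the initial byte and calls the handler stored there.

```clojure
(import 'java.io.DataInputStream 'java.io.ByteArrayInputStream)

;; 256-entry dispatch table keyed by the CBOR initial byte. Only the
;; directly-encoded small integers (initial bytes 0x00-0x17, i.e. the
;; values 0-23) are filled in here; a real table covers all 256 slots.
(def ^"[Ljava.lang.Object;" jump-entries
  (let [entries (object-array 256)]
    (doseq [i (range 0x00 0x18)]
      (aset entries i (fn [_input] i)))
    entries))

(defn jump-decode
  "Read one initial byte and dispatch through the jump table."
  [^DataInputStream input]
  (let [initial (.readUnsignedByte input)]
    (if-let [entry (aget jump-entries initial)]
      (entry input)
      (throw (ex-info "Unhandled initial byte" {:byte initial})))))
```

The win comes from replacing a chain of comparisons per value with a single array lookup, at the cost of building (and covering) the table up front.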

codecov-io commented 5 years ago

Codecov Report

Merging #11 into develop will decrease coverage by 1.57%. The diff coverage is 92.3%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop      #11      +/-   ##
===========================================
- Coverage    99.86%   98.28%   -1.58%     
===========================================
  Files           13       14       +1     
  Lines          748      818      +70     
  Branches         1        5       +4     
===========================================
+ Hits           747      804      +57     
- Misses           0        9       +9     
- Partials         1        5       +4
| Impacted Files | Coverage Δ |
|---|---|
| src/clj_cbor/data/core.clj | 100% <ø> (ø) :arrow_up: |
| src/clj_cbor/tags/clojure.clj | 100% <100%> (ø) :arrow_up: |
| src/clj_cbor/header.clj | 100% <100%> (ø) :arrow_up: |
| src/clj_cbor/bytes.clj | 100% <100%> (ø) |
| src/clj_cbor/core.clj | 98.86% <50%> (-1.14%) :arrow_down: |
| src/clj_cbor/codec.clj | 96.54% <90.9%> (-3.14%) :arrow_down: |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 84c8714...c5e8770.

aria42 commented 5 years ago

Ok, I'll address this; sorry, this was a little sloppy and thrown together too quickly. I think the right way to address everything is

aria42 commented 5 years ago

Ok, I did the optimization and rearranged things so the jump table is used recursively, which meant moving some code out into the bytes and decoder namespaces. I'm not thrilled about that choice, but it seemed like the least bad option.

Surprisingly, making it recursive didn't change performance much (decoding now takes half the time; see the spreadsheet). I also fixed some performance issues across the board, so the baseline moved a bit as well, which is why the jump table alone doesn't show as big a leap as you'd expect. Still, 2x faster is pretty nice, and the jump code itself is small (though it triggers some other changes that are a bit messy).

My todo:

aria42 commented 5 years ago

Ok, I give up on the coverage check. If you look at the coverage on jump.clj, it claims the doseq that populates the jump entries is untested, because that form looks disconnected from the result expression; and since the file is under 100 lines, 2-3 "untested" lines are enough to fail the check. I could write this in a way that looks covered, but that seems wrong. Any suggestions?
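To illustrate the coverage complaint (hypothetical names, not the PR's actual jump.clj), here is the side-effecting shape the tool flags, next to an expression-style rewrite that would register as covered:

```clojure
;; A doseq that fills the table is a side effect inside the let, so a
;; coverage tool can see its body as disconnected from the returned
;; array and report those lines as untested.
(def entries-via-doseq
  (let [entries (object-array 24)]
    (doseq [i (range 24)]
      (aset entries i (fn [] i)))
    entries))

;; The same table built as a single expression, which coverage
;; attributes to the def. Semantically identical, arguably less direct.
(def entries-via-expr
  (object-array (map (fn [i] (fn [] i)) (range 24))))
```

Both produce the same 24-entry table; the difference is purely in how the coverage tool attributes the forms.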

aria42 commented 5 years ago

@greglook Wanted to draw your attention to b631671. I believe the current unsigned long coercion might be wrong (or at least it's not clear to me), since it appears to use the bytes in little-endian order instead of the big-endian order the CBOR spec requires for all values. You can see the byte-order reversal if you compare the binary string representations:

I updated the function and the newer function is also 2x faster.

clj-cbor.bytes> (Long/toBinaryString -10)
"1111111111111111111111111111111111111111111111111111111111110110"
;; this is the old function, which moves "11110110" to the beginning
clj-cbor.bytes> (.toString (.toBigInteger (old-to-unsigned-long -10)) 2)
"1111011011111111111111111111111111111111111111111111111111111111"
;; the newer function preserves the binary representation, which is what
;; you expect since all that's changing is how you interpret the bytes
clj-cbor.bytes> (.toString (.toBigInteger (to-unsigned-long -10)) 2)
"1111111111111111111111111111111111111111111111111111111111110110"
aria42 commented 5 years ago

Also, an update on end-to-end performance. This is based on my 84 samples, looking at the sum of the average decoding times in microseconds (each average comes from a criterium benchmark of decoding the random sample).

Prior to my changes:

aria42 commented 5 years ago

Ok, I took @greglook's suggestion to inline the jump stuff, so the total lines added specifically for the jump table are lower. Bad news: my prior benchmark was wrong, since I accidentally left the handlers off the jump decoder. I decided to use Reddit API data (see EDN here) instead, since the random data has some odd attributes (e.g., very long keywords).

So, how's performance on this data?

;; First one is 84c8714
clj-cbor.core> (c/bench (decode default-codec bs))
Evaluation count : 16020 in 60 samples of 267 calls.
             Execution time mean : 3.809419 ms
    Execution time std-deviation : 151.597668 µs
   Execution time lower quantile : 3.615037 ms ( 2.5%)
   Execution time upper quantile : 4.241055 ms (97.5%)
                   Overhead used : 6.791016 ns
nil
;; Next two use this PR b631671
clj-cbor.core> (c/bench (decode default-codec bs))
Evaluation count : 26280 in 60 samples of 438 calls.
             Execution time mean : 2.400332 ms
    Execution time std-deviation : 142.585442 µs
   Execution time lower quantile : 2.265757 ms ( 2.5%)
   Execution time upper quantile : 2.690157 ms (97.5%)
                   Overhead used : 6.791016 ns
nil
clj-cbor.core> (def jump-decode (with-jump-table default-codec))
clj-cbor.core> (c/bench (decode jump-decode bs))
Evaluation count : 29880 in 60 samples of 498 calls.
             Execution time mean : 2.080512 ms
    Execution time std-deviation : 142.985792 µs
   Execution time lower quantile : 1.941332 ms ( 2.5%)
   Execution time upper quantile : 2.373102 ms (97.5%)
                   Overhead used : 6.791016 ns

Which is a statistically significant difference, but a lot smaller than the 2x reported earlier.