Optimize toByteString and toASCIIBytes

lykahb commented 9 months ago

This PR leverages the bytestring fixed-length builders to simplify and speed up the conversions. The re-implementation of toASCIIBytes is now more high-level and safe.

I bumped the bytestring dependency lower bound to the version that introduces Data.ByteString.Builder.Prim. It was released in 2012, so that is plenty of backwards compatibility.

Here are benchmarks on MacBook M1 Max:

Before

benchmarking conversion/toASCIIBytes
time                 34.23 ns   (34.16 ns .. 34.35 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 34.21 ns   (34.18 ns .. 34.29 ns)
std dev              169.2 ps   (40.23 ps .. 290.2 ps)

benchmarking conversion/toByteString
time                 74.16 ns   (73.96 ns .. 74.43 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 74.07 ns   (73.98 ns .. 74.25 ns)
std dev              398.7 ps   (176.1 ps .. 649.3 ps)

After

benchmarking conversion/toASCIIBytes
time                 28.61 ns   (28.59 ns .. 28.63 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 28.62 ns   (28.60 ns .. 28.65 ns)
std dev              56.85 ps   (26.42 ps .. 117.3 ps)

benchmarking conversion/toByteString
time                 20.87 ns   (20.81 ns .. 20.96 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 20.94 ns   (20.89 ns .. 21.01 ns)
std dev              200.6 ps   (144.6 ps .. 259.8 ps)

lykahb commented 6 months ago

@phadej This PR is waiting for a review.

lykahb commented 6 months ago

It is hard to tell what makes the high-level implementation faster. The toByteString packed the characters from a list, which looks suboptimal. But the toASCIIBytes only does the low-level operations and looks quite simple, with, as you said, hard to imagine anything being faster.

Yet, the benchmarks show that a high-level implementation is faster. It might be that it unlocks some GHC optimizations or that bytestring has some other non-trivial optimizations. If we compare only the functions that do not have any extra overhead (toASCIIBytes from before with toByteString in this PR), it'd be a third faster.

If uuid delegates the low-level work to bytestring, it would benefit from any future performance improvements in bytestring. Also, this might be easier to maintain: this diff mostly removes code and lets this library focus on the format, rather than the low-level optimization.

There is also more room for speedup: toASCIIBytes can be as just fast as toByteString if we remove the copy https://github.com/haskell/bytestring/issues/666. For better compatibility I avoided using an internal module in this PR Data.ByteString.Builder.Prim.Internal.runF.

clyring commented 6 months ago

If I had to guess, the reason the FixedPrim implementation in this patch is faster is probably to do with BBP.word32HexFixed being a branchless two-nibbles-at-a-time lookup instead of the branchy one-nibble-at-a-time calculation in the baseline.

lykahb commented 6 months ago

@clyring clarified that the module Data.ByteString.Builder.Prim.Internal is stable. I rewrote the string conversion without the Builder and got further improvement:

benchmarking conversion/toASCIIBytes
time                 21.56 ns   (21.18 ns .. 22.30 ns)
                     0.995 R²   (0.988 R² .. 1.000 R²)
mean                 21.36 ns   (21.20 ns .. 21.95 ns)
std dev              968.8 ps   (101.2 ps .. 2.048 ns)

benchmarking conversion/toByteString
time                 17.85 ns   (17.82 ns .. 17.89 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.87 ns   (17.84 ns .. 17.95 ns)
std dev              166.1 ps   (89.32 ps .. 297.0 ps)

lykahb commented 6 months ago

With the recent update this branch has toASCIIBytes that is x1.6 faster and toByteString that is x4.15 faster.

@phadej Would you give this another look? This is a significant improvement and it reduces the amount of code to maintain in uuid. It seems that high-performance low-level code better fits the core libraries of the Haskell ecosystem like bytestring.

iand675 commented 6 months ago

I don't believe that anything can be faster than toASCIIBytes uuid = BI.unsafeCreate 36 (pokeASCII uuid). It's literally just allocating exactly 36 bytes and poking stuff at the right place. Maybe there's something GHC doesn't see (missing bang somewhere)? But the current code is literally as little work as possible already.

@phadej I appreciate that you'd like to keep the code uncomplicated, but benchmarking seems to indicate a solid improvement, so I don't understand the resistance to the improvement.

phadej commented 6 months ago

@iand675 the original patch was more complicated. And my review comment made lykahb improve on it.

Your comment is not fair.

iand675 commented 6 months ago

It's true that @lykahb made a further improvement, but I don't think it's unfair to state that he provided benchmarks prior to you saying that you didn't want to "complicate the code" that were a solid performance increase.

phadej commented 6 months ago

@iand675

"complicate the code"

If we don't understand why code works as it does, does it work?

In my opinion it doesn't in the long term. I'm the one who maintaining this, so I'm making the judgment calls.

This PR has a lot of good in it, but it's not perfect, I'll take if from here.

Note to self: benchmark on x86_64

lykahb commented 6 months ago

For the sake of clarity, is your objection about the use Data.ByteString.Builder.Prim in general or do you perceive anything else here as overly complicated? The only other thing I can think of was converting a Builder to a ByteString with toLazyByteStringWith. And that is gone now, inspired by Clyring's comment.

prikhi commented 6 months ago

Note to self: benchmark on x86_64

Is there a trick to reducing these variance values?

On an i9-13900K:

$  cabal bench --project-file cabal.bench.project uuid-types-benchmark --ghc-options='-O2 -threaded' --benchmark-options='-m pattern toASCIIBytes toByteString'
# BEFORE
benchmarking conversion/toASCIIBytes
time                 23.57 ns   (23.45 ns .. 23.68 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 23.55 ns   (23.40 ns .. 23.91 ns)
std dev              732.5 ps   (376.0 ps .. 1.369 ns)
variance introduced by outliers: 51% (severely inflated)

benchmarking conversion/toByteString
time                 49.83 ns   (49.29 ns .. 50.46 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 50.18 ns   (49.82 ns .. 50.54 ns)
std dev              1.218 ns   (1.078 ns .. 1.415 ns)
variance introduced by outliers: 37% (moderately inflated)

# AFTER
benchmarking conversion/toASCIIBytes
time                 13.18 ns   (13.09 ns .. 13.28 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 13.19 ns   (13.09 ns .. 13.33 ns)
std dev              404.7 ps   (295.9 ps .. 574.5 ps)
variance introduced by outliers: 51% (severely inflated)

benchmarking conversion/toByteString
time                 13.03 ns   (12.97 ns .. 13.09 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.01 ns   (12.95 ns .. 13.06 ns)
std dev              184.5 ps   (126.4 ps .. 291.3 ps)
variance introduced by outliers: 18% (moderately inflated)

haskell-hvr / uuid

Optimize toByteString and toASCIIBytes #80