Open fumieval opened 4 years ago
@fumieval I've seen your work and I've been meaning to reach out to you about it; it does look promising but I'd like to better understand its trade-offs; is it really strictly superior in all cases, or can you point out pathological cases where the current builder would be significantly (which may not necessarily be blockers) at a disadvantage?
Is it possible to switch to your implementation while being 100% API compatible with the old impl? (i.e. same type-sigs; no observable semantic differences )
bytestring's Builder tends to be slow when working with small primitives (e.g. Word8). From my small benchmark:
bytestring mean 247.6 ns ( +- 12.06 ns )
bytestring-builder mean 188.7 ns ( +- 34.65 ns )
bytestring-builder/toStrict mean 166.8 ns ( +- 3.603 ns )
fast-builder/lazy mean 67.43 ns ( +- 20.60 ns )
fast-builder/strict mean 41.29 ns ( +- 2.150 ns )
mason/lazy mean 378.2 ns ( +- 5.022 ns )
mason/strict mean 30.44 ns ( +- 916.5 ps )
AFAIK both fast-builder
and mason
have 100% compatible API. However, fast-builder has a [caveat] (http://hackage.haskell.org/package/fast-builder-0.1.2.0/docs/Data-ByteString-FastBuilder.html#v:toLazyByteString) about the performance of toLazyByteString
and mason's toLazyByteString
, implemented in a similar way, is still poor in some cases.
Hrm; about the performance penalty of toLazyByteString
; do you know why it regresses? which cases specifically suffer most? (i.e. are those pathological, or are those cases actually relevant in real-world cases?)
Because toLazyByteString
forks another thread to run the builder action, and the caller thread retrieves the chunks it yields via MVar. fast-builder avoids this problem by delaying forking until it reaches 32K bytes.
@fumieval so are you saying that the difference/penalty is mostly noticeable for small-ish serializations (i.e. in the <= 32KiB range)?
Yes, currently it has a big (constant?) overhead. If the whole process takes more than 100μs, I think mason gets faster than others (as shown in https://github.com/fumieval/mason#performance) in most cases.
@fumieval what if the caller could provide a size-hint of the expected output buffer size (i.e. an integer argument)? would it then be possible to mitigate the constant-overhead and get on par with the current builders performance?
I think the problem really is the overhead of thread creation, not about the buffer size. fast-builder implements a rather complicated trick to avoid this issue: http://hackage.haskell.org/package/fast-builder-0.1.2.0/docs/src/Data.ByteString.FastBuilder.Internal.html#toLazyByteStringWith
I wasn't able to replace Builder with Mason. https://github.com/fumieval/mason/issues/2
Looks like fast builder doesn't provide options to wrap lazy string into builder.
A new approach based on delimited continuations seems promising: https://github.com/ghc-proposals/ghc-proposals/pull/313#issuecomment-648114538
@fumieval haven't looked at a lot of your work (which is rather broad btw, thank you), but have you considered something along the lines of builder or small-bytearray-builder. These could be re-worked to use Ptr rather easily, and I'm interested to see how they stack up.
Additionally, @andrewthad has done a lot of thinking about builders and may have some interesting input here.
In bytebuild, I use GC-managed ByteArray#
for everything instead of Addr#
. Also, I don't use CPS. In bytes-builder-shootout, which uses backpack for perf benchmarking, I have observed that for encoding (recursive) tree-like structures, bytebuild
outperforms bytestring
, but fast-builder
outperforms them both.
@chessai small-bytearray-builder's design looks very interesting, but both builder and small-bytearray-builder produce fully strict byte arrays only, while we need streaming behaviour for toLazyByteString and hPutBuilder.
@andrewthad I didn't know about bytes-builder-shootout, that looks useful! If mason works as intended (primitives are well fused and inlined), producing strict ByteString should be faster than fast-builder. I'll try adding mason to bytes-builder-shootout
I added mason to bytes-builder-shootout. the latest release of mason wasn't very good in this benchmark:
treeToHex-2000/small-bytearray-builder mean 43.27 μs ( +- 167.3 ns )
treeToHex-2000/fast-builder mean 34.22 μs ( +- 95.85 ns )
treeToHex-2000/bytestring mean 60.76 μs ( +- 3.829 μs )
treeToHex-2000/mason mean 60.34 μs ( +- 182.0 ns )
treeToHex-9000/small-bytearray-builder mean 190.1 μs ( +- 761.5 ns )
treeToHex-9000/bytestring mean 289.9 μs ( +- 1.700 μs )
treeToHex-9000/mason mean 267.9 μs ( +- 1.023 μs )
short-text-tree-1000/small-bytearray-builder mean 27.30 μs ( +- 142.7 ns )
short-text-tree-1000/fast-builder mean 37.36 μs ( +- 639.8 ns )
short-text-tree-1000/bytestring mean 37.17 μs ( +- 1.763 μs )
short-text-tree-1000/mason mean 29.42 μs ( +- 422.5 ns )
byte-tree-2000/small-bytearray-builder mean 30.64 μs ( +- 131.9 ns )
byte-tree-2000/fast-builder mean 28.82 μs ( +- 114.9 ns )
byte-tree-2000/bytestring mean 53.86 μs ( +- 314.2 ns )
byte-tree-2000/mason mean 40.45 μs ( +- 1.252 μs )
I investigated the code and it turns out that the lack of worker-wrapper transformation (which isn't necessary in fast-builder and small-bytearray-builder) is slowing it down. Flattening the internal structure drastically improved the speed.
treeToHex-2000/mason mean 37.96 μs ( +- 306.3 ns )
treeToHex-9000/mason mean 167.6 μs ( +- 719.6 ns )
short-text-tree-1000/mason mean 27.48 μs ( +- 151.2 ns )
byte-tree-2000/mason mean 32.12 μs ( +- 170.2 ns )
@fumieval is there anything actionable here?
Unfortunately, mason's performance regressed a lot since GHC 9.2, and I didn't manage to figure out why. I decided to put the project on hold until delimited continuation primops arrives.
@fumieval Delimited Continuations has merged https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7942
Any new progress?
As https://github.com/haskell-perf/strict-bytestring-builders pointed out, current Data.ByteString.Builder tends to be slow especially when primitives are small.
Some alternatives have good benchmark results (cf https://www.reddit.com/r/haskell/comments/e6yg76/mason_faster_bytestring_builder/ ,https://github.com/fumieval/mason#performance, http://hackage.haskell.org/package/fast-builder); we may be able to take some tricks from those.
It also makes sense to expand the API so that
hPutBuilder
I think it's important that the internal API are exposed. At the moment it's difficult to send the contents of Builder over a socket due to
BuildSignal
being opaque.