Closed JoshMeredith closed 1 year ago
Why aren't CPR analysis and worker/wrapper kicking in to unbox these automatically? Are some things failing to inline?
I believe this is due to a problem that SPJ explained in the GHC issue (https://gitlab.haskell.org/ghc/ghc/-/issues/22946#note_480902)
The key thing is that data BufferCodec is a data type, so if you store a function in there, it must have the specified type, and that type does this unnecessary boxing. That makes it very hard for GHC to automate the optimisation you do here.
I'm concerned about the noise in the microbenchmarks. In fact I think there is too much noise to draw a good conclusion. Could you try some of these suggestions or similar if your platform is not Linux, and rerun the benchmarks?
For what its worth I think there is ample evidence that this is a performance win even without the benchmarks. I just don't want to rely on noisy statistics. And I want to make sure that the code's performance stays as predictable as possible. For example, the stddev
4x increase in the handle/10000
concerns me.
In fact I think there is too much noise to draw a good conclusion.
Indeed, e. g., time for handle/1000000
is 282.0 ms .. 299.5 ms in the first case and 236.0 ms .. 296.7 ms in the second case. One cannot conclude that any one is better than another.
I have not used criterion
for a while. If you were using tasty-bench
, passing --stdev 2
or smaller should provide more precise numbers.
Also I would strongly suggest to make this change non-breaking: introduce a new data BufferCodec#
and a pattern synonym BufferCodec
to provide backward compatibility.
Also I would strongly suggest to make this change non-breaking: introduce a new
data BufferCodec#
and a pattern synonymBufferCodec
to provide backward compatibility.
This seems to lose the majority of the performance gains in terms of allocations. My guess is that there's some improvement over the baseline due to the use in a record as SPJ mentioned, but the encoders don't get fully unboxed.
There's only one change needed in all of Hackage, to add one language extension to generics-sop
, since deriveGeneric ''BufferCodec
requires UnboxedTuples
. A pattern synonym won't fix this, so I'm not sure if it's worth it just to maintain compatability on what is really only internals.
Here's the results using tasty-bench --stdev 1
, nice
, and taskset
:
unboxed
:
All
handle
1: OK (2.99s)
22.6 μs ± 238 ns
10: OK (3.25s)
24.3 μs ± 113 ns
100: OK (26.20s)
49.4 μs ± 671 ns
1000: OK (1.25s)
306 μs ± 5.8 μs
10000: OK (2.83s)
2.76 ms ± 33 μs
100000: OK (13.47s)
26.4 ms ± 26 μs
1000000: OK (8.02s)
258 ms ± 1.0 ms
All 7 tests passed (58.02s)
boxed
:
All
handle
1: OK (2.92s)
20.9 μs ± 153 ns
10: OK (3.16s)
23.6 μs ± 460 ns
100: OK (3.32s)
50.5 μs ± 811 ns
1000: OK (10.73s)
325 μs ± 547 ns
10000: OK (11.96s)
2.91 ms ± 40 μs
100000: OK (3.64s)
28.7 ms ± 403 μs
1000000: OK (0.86s)
288 ms ± 5.2 ms
All 7 tests passed (36.57s)
Seems like a modest performance benefit at relatively little cost.
What's a non-microbenchmark use case that would benefit strongly from this?
What's a non-microbenchmark use case that would benefit strongly from this?
Handle usage in general should be affected.
For example, if I change the inner loop (of the test, not the encoder) to write a line, rather than a single character:
loop :: Int -> Handle -> IO ()
loop 0 !_ = pure ()
loop !n !h = do
hPutStr h $! "the quick brown fox jumps over the lazy dog"
loop (n-1) h
I still see fairly significant allocation improvements:
boxed
:
./HandlePerf 1000 +RTS -s
1,416,016 bytes allocated in the heap
5,648 bytes copied during GC
46,776 bytes maximum residency (1 sample(s))
22,856 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)
unboxed
:
./HandlePerf 1000 +RTS -s
1,270,608 bytes allocated in the heap
5,648 bytes copied during GC
46,776 bytes maximum residency (1 sample(s))
22,856 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)
As well as time improvements (--stdev 2
):
boxed
:
All
handle
1: OK (2.86s)
20.9 μs ± 562 ns
10: OK (7.25s)
27.3 μs ± 431 ns
100: OK (0.48s)
114 μs ± 2.7 μs
1000: OK (1.00s)
975 μs ± 31 μs
10000: OK (4.50s)
8.77 ms ± 188 μs
100000: OK (1.30s)
86.6 ms ± 2.1 ms
1000000: OK (2.60s)
868 ms ± 8.8 ms
All 7 tests passed (20.00s)
unboxed
:
All
handle
1: OK (0.64s)
18.7 μs ± 390 ns
10: OK (28.85s)
27.1 μs ± 657 ns
100: OK (0.93s)
110 μs ± 3.8 μs
1000: OK (0.47s)
916 μs ± 26 μs
10000: OK (4.19s)
8.11 ms ± 94 μs
100000: OK (0.24s)
79.0 ms ± 3.1 ms
1000000: OK (2.40s)
801 ms ± 31 ms
All 7 tests passed (37.72s)
I am very interested by this patch!
It could be very nice to reduce allocations for applications that are constantly sending data to a DB.
filepath: slightly more involved changes since this library implements its own encoders and decoders
Since I maintain this library I'm interested in:
That won't guarantee a +1 from me, though.
Also I would strongly suggest to make this change non-breaking: introduce a new data BufferCodec# and a pattern synonym BufferCodec to provide backward compatibility.
As a matter of interest, do we regard the representation of BufferCodec
as part of the public facing API of the IO library? Or merely an implementation detail?
I know that it is currently exposed and so clients could be using it; but I'm interested in what the intent is. Internal, or deliberately exposed?
PS: Anything that shaves time off the inner loop of IO would be fantastic. That benefits everyone
As a matter of interest, do we regard the representation of
BufferCodec
as part of the public facing API of the IO library? Or merely an implementation detail?
I see it as an implementation detail, with filepath being an edge case that is justifiably touching internals for niche reasons.
IMO, it's impractical to consider the entire GHC.
namespace (whether in base or ghc) as part of the public API, and I suspect this would be in an Internal
module if not for historic reasons.
filepath: slightly more involved changes since this library implements its own encoders and decoders
Since I maintain this library I'm interested in:
* a full patch
https://gitlab.haskell.org/ghc/packages/filepath/-/tree/wip/unboxed-codebuffer
* comparison of filepath benchmarks (they use an inlined tasty-bench)
I won't have time to get to these today but I'll be happy to provide them after I've run them
I see it as an implementation detail, with filepath being an edge case that is justifiably touching internals for niche reasons.
I disagree.
How can you define your own encoders/decoders without reaching to those "internals"? How can you use all the APIs that use TextEncoding
without that ability? Do you think base should provide exhaustive support for all of them?
I consider this public API. We should make that more clear. filepath does nothing really fancy. The alternative would have been to not use any of the API that utilizes TextEncoding
, which is a lot, e.g. withCStringLen, and would have required major inlining and duplication of code.
How can you define your own encoders/decoders without reaching to those "internals"? How can you use all the APIs that use
TextEncoding
without that ability? Do you think base should provide exhaustive support for all of them?
My point here is that Internal
modules are generally exposed for the few cases that need them (rather than completely unexposed in the cabal file), and usually in these cases the library authors mark Internal
modules as subject to change.
I do think filepath is correct to be using the implementation details here, but I don't think it's a fair conclusion that one library should be locking an important implementation in place, especially since core libraries are updated in the course of merging incompatible case changes.
This seems to lose the majority of the performance gains in terms of allocations. My guess is that there's some improvement over the baseline due to the use in a record as SPJ mentioned, but the encoders don't get fully unboxed.
Sorry, I don't follow. How providing a pattern synonym for backward compatibility is to affect performance?
Sorry, I don't follow. How providing a pattern synonym for backward compatibility is to affect performance?
What I mean to say is, there's a slight benefit to the encoders using this backwards-compatability pattern, but their inner loop is still over-allocating, so those encoders overall don't benefit much. It's only filepath
that defines its own encoders, which IMO should ideally be switched to gain the increased performance.
I haven't verified this yet, but I wonder as of writing this if haskeline
would still require the addition of UnboxedTuples
since the functions in the view patterns use them.
Here's the implementation I used, please correct me if my understanding of your idea is incorrect:
data BufferCodec from to state = BufferCodec# {...}
pattern BufferCodec e r c g s <- BufferCodec# (unEncode -> e) (unRecover -> r) c g s
where
BufferCodec e r c g s = BufferCodec (mkEncode e) (mkRecover r) c g s
unEncode :: CodeBuffer# from to -> CodeBuffer from to
unEncode e i o = IO $ \st -> let !(# st', prog, i', o' #) = e i o st in (# st', (prog, i', o') #)
unRecover :: (Buffer from -> Buffer to -> State# RealWorld -> (# State# RealWorld, Buffer from, Buffer to #))
-> (Buffer from -> Buffer to -> IO (Buffer from, Buffer to))
unRecover r i o = IO $ \st -> let !(# st', i', o' #) = r i o st in (# st', (i', o') #)
mkEncode :: CodeBuffer from to -> CodeBuffer# from to
mkEncode e i o st = let !(# st', (prog, i', o') #) = unIO (e i o) st in (# st', prog, i', o' #)
mkRecover :: (Buffer from -> Buffer to -> IO (Buffer from, Buffer to))
-> (Buffer from -> Buffer to -> State# RealWorld -> (# State# RealWorld, Buffer from, Buffer to #))
mkRecover r i o st = let !(# st', (i', o') #) = unIO (r i o) st in (# st', i', o' #)
What I mean to say is, there's a slight benefit to the encoders using this backwards-compatability pattern, but their inner loop is still over-allocating, so those encoders overall don't benefit much. It's only
filepath
that defines its own encoders, which IMO should ideally be switched to gain the increased performance.
It's good enough if consumers of the old interface are not broken immediately and have time to migrate. If they want performance gains, they can switch to the new interface indeed.
The thing is that the patch you demonstrated for filepath
is quite involved: the module used to be almost Haskell98, but now it requires proficiency with MagicHash
/ UnboxedTuples
and a good deal of CPP. If someone uses CodeBuffer
interface in proprietary code, they'll broken without an easy way out. That's why I'm very keen to provide pattern synonyms.
The breakage in generics-sop
is less of an issue from my perspective: if TH generates code which requires more language extensions, GHC will prompt user about it, so it is straightforward to fix. I doubt there is any other code (public or private) affected this way.
It's good enough if consumers of the old interface are not broken immediately and have time to migrate. If they want performance gains, they can switch to the new interface indeed.
Yes. This proposal will get a nay from me if it's a breaking change without deprecation period.
It's good enough if consumers of the old interface are not broken immediately and have time to migrate. If they want performance gains, they can switch to the new interface indeed.
Ah I see. In that case, I think the pattern isn't a very high cost to incur.
I'd like to hear comments on the deprecation strategy with the pattern. I wouldn't mind taking it as an opportunity for us to properly define the public API here.
I don't really see much point in deprecation of the pattern synonym from https://github.com/haskell/core-libraries-committee/issues/134#issuecomment-1454781039. It's not wrong to use the boxed implementation, and - if you are happy to sacrifice some performance - it is significantly easier to use, because you don't need MagicHash
and stuff. Not much additional code to maintain either.
@JoshMeredith there are two action points left:
clc-stackage
.@JoshMeredith just a gentle reminder about this. It would be a pity to close such discussion as abandoned, without reaching a conclusion.
Sorry for the delay.
I've managed to get clc-stagage
mostly built today, with the exception of some c dependency issues.
The build hasn't found any additional uses of BufferCodec
in Hackage code, and this combined with head.hackage makes me quite confident in claiming that the addition of UnboxedTuples
to this one module in generics-sop
is the only change required in Hackage.
In my testing with clc-stackage
, I included the pattern synonyms in the bootstrap of GHC, and these allowed me to leave the base
and filepath
codecs unchanged.
I hope to provide the updated MR after the weekend.
I've updated the merge request containing the entire code at https://gitlab.haskell.org/ghc/ghc/-/merge_requests/9948.
The performance test mentioned in the description of this proposal has been merged into GHC in https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10347, and we have some numbers improvements, see:
encodingAllocations
test in above outputNote that the previous changes to filepath
and haskeline
have been removed, as @Bodigrim's pattern synonym suggestion maintains backwards compatability for all cases other than the one module in generics-sop
that requires the addition of UnboxedTuples
.
I'd be happy to request a vote at this point with the updated code - assuming this is still considered a breaking change (depending on the generics-sop
judgement).
Quoting https://gitlab.haskell.org/ghc/ghc/-/jobs/1505952#L6827,
encodingAllocations(normal) run/alloc 504,094,272 392,094,048 -22.2% GOOD
It's GOOD indeed :)
Thanks @JoshMeredith. The remaining breakage in generics-sop
is probably unavoidable. The problematic line is deriveGeneric ''BufferCodec
: while we continue to provide BufferCodec
pattern synonym, the TH-generated code now requires {-# LANGUAGE UnboxedTuples #-}
. In general it's very hard to provide full backwards compatibility in the presence of Template Haskell. generics-sop
has active maintainers and the latest change was just two weeks ago, so I'm sure it will be updated in time.
I'd like to give CLC members a couple of days to review the code before triggering a vote. The crux of the MR is in GHC.IO.Encoding.Types
: we replace the existing data BufferCodec from to state
, which has boxed fields, with a backwards-compatible pattern synonym, wrapping data BufferCodec# from to state
with unboxed fields. The rest of the MR is AFAICT just trivial wrapping-unwrapping.
Dear CLC members, let's vote on the proposal to provide an unboxed interface to BufferCodec
. The complete patch is somewhat technical, but here is a gist. Currently we have
data BufferCodec from to state = BufferCodec {
encode :: CodeBuffer from to,
recover :: Buffer from -> Buffer to -> IO (Buffer from, Buffer to),
...
}
The proposal is to expose an unboxed
data BufferCodec from to state = BufferCodec# {
encode# :: CodeBuffer# from to,
recover# :: Buffer from -> Buffer to -> State# RealWorld -> (# State# RealWorld, Buffer from, Buffer to #),
...
}
and provide a pattern synonym BufferCodec
and compatibility functions encode
/ recover
with current, boxed types atop of it.
This way existing customers such as filepath
continue to work as is, without changes, and new clients (e. g., GHC itself) can choose the unboxed interface to enjoy ~20% decrease in allocations. As @simonpj highlights, "Anything that shaves time off the inner loop of IO would be fantastic. That benefits everyone".
The compatibility layer covers all "normal" customers, except those who introspect BufferCodec
by means of Template Haskell. The only package affected is generics-sop
, because if uses TH to derive instances for BufferCodec
, which can be fixed simply by enabling {-# LANGUAGE UnboxedTuples #-}
.
@tomjaguarpaw @hasufell @mixphix @chshersh @angerman @parsonsmatt
+1 from me. I agree with Simon that anything decreasing allocations in I/O is a great addition to base
.
+1
+1
This sounds like an amazing performance improvement! 👏🏻 I appreciate the backward compatibility effort, especially for something that is considered an internal thing. Solid work! 🏆
Absolutely in favour. Thanks for this great contribution!
+1
Thanks all, 5 votes in favour are enough to approve.
@JoshMeredith please link a PR patching generics-sop
.
Background
If we consider the implementation of codecs within
GHC.IO.Encoding.Types
, the use of regular (boxed) tuples in theencode
andrecover
functions causes unnecessary allocations in the inner loop of encoders/decoders. We have a current implementation of:Proposal
We should unwrap the
IO
to pass aroundState# RealWorld
, allowing for the full unboxing of the return types:Externally, the types of
encode
andrecover
are maintained, so this change is almost entirely internal, except for a few packages that poke into the internals (discussed later in the breakage analysis):Performance Tests
I've used the following example program to isolate the allocations caused by the inner loop:
STG
The STG for this program shows that the
loop
doesn't allocate, so any allocations that occur during the loop will be inhPutChar
, allowing us to benchmark the overhead of this call:Allocations
If we run the test program with
+RTS -s
to extract heap statistics, we can see an overview of the allocations with and without the changes:boxed
:unboxed
:Ticky
Extracting a ticky profile of the program shows the reduced allocations more specifically:
boxed
:unboxed
:Further changes to individual encoders to evaluate the
Buffer from
andBuffer to
outputs ofCodeBuffer#
before returning them also cleans up the ticky allocations, though this doesn't appear to reduce the overall heap numbers:Microbenchmarks
By benchmarking the previous program with Criterion, results were also found to be in favour of these changes:
boxed
:unboxed
:Breakage Analysis
Submodules
Two of GHC's submodules require changes:
haskeline
: minor changes to enableUnboxedTuples
andMagicHash
, changing one use of each ofrecoverEncode
andrecoverDecode
to the unboxed versions, and changing a setter forrecover
to use the renamed fieldfilepath
: slightly more involved changes since this library implements its own encoders and decodersHackage
I've built
head.hackage
using the GHC CI (https://gitlab.haskell.org/ghc/head.hackage/-/jobs/1381683), and other thanfilepath
, onlygenerics-sop
failed to build. I have been able to successfully build this library after addingUnboxedTuples
with no other changes required.