haskell / core-libraries-committee

CodeBuffer should use unboxed tuples for encoders/decoders #134

Closed: JoshMeredith closed this issue 1 year ago

JoshMeredith commented 1 year ago

Background

If we consider the implementation of codecs within GHC.IO.Encoding.Types, the use of regular (boxed) tuples in the encode and recover functions causes unnecessary allocations in the inner loop of encoders/decoders. The current implementation is:

type CodeBuffer from to = Buffer from -> Buffer to -> IO (CodeProgress, Buffer from, Buffer to)

data BufferCodec from to state = BufferCodec {
  encode :: CodeBuffer from to,
  recover :: Buffer from -> Buffer to -> IO (Buffer from, Buffer to),
  ...
}

Proposal

We should unwrap the IO to pass around State# RealWorld, allowing for the full unboxing of the return types:

type CodeBuffer from to = Buffer from -> Buffer to -> IO (CodeProgress, Buffer from, Buffer to)
type CodeBuffer# from to = Buffer from -> Buffer to -> State# RealWorld -> (# State# RealWorld, CodeProgress, Buffer from, Buffer to #)

data BufferCodec from to state = BufferCodec {
  encode# :: CodeBuffer# from to,
  recover# :: Buffer from -> Buffer to -> State# RealWorld -> (# State# RealWorld, Buffer from, Buffer to #),
  ...
}

Externally, the types of encode and recover are maintained, so this change is almost entirely internal, except for a few packages that poke into the internals (discussed later in the breakage analysis):

{-# INLINE encode #-}
encode :: BufferCodec from to state -> CodeBuffer from to
encode codec from to = IO $ \s -> case encode# codec from to s of
  (# s', progress, from', to' #) -> (# s', (progress, from', to') #)

{-# INLINE recover #-}
recover :: BufferCodec from to state -> Buffer from -> Buffer to -> IO (Buffer from, Buffer to)
recover codec from to = IO $ \s -> case recover# codec from to s of
  (# s', from', to' #) -> (# s', (from', to') #)

Performance Tests

I've used the following example program to isolate the allocations caused by the inner loop:

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE MagicHash    #-}

module Main (main) where

import System.IO
import Data.Bits
import GHC.Int
import GHC.Exts
import System.Environment

main :: IO ()
main = do
  [n] <- getArgs
  withFile "/dev/null" WriteMode (loop (read n))

loop :: Int -> Handle -> IO ()
loop 0  !_ = pure ()
loop !n !h = do
  hPutChar h $! dummy_char n
  loop (n-1) h

-- unsafe efficient version of `chr`
my_chr :: Int -> Char
my_chr (I# i) = C# (chr# i)

-- return either a or b
dummy_char :: Int -> Char
dummy_char !i = my_chr ((i .&. 1) + 97)

STG

The STG for this program shows that the loop doesn't allocate, so any allocations that occur during the loop will be in hPutChar, allowing us to benchmark the overhead of this call:

Rec {
Main.$wloop [InlPrag=[2], Occ=LoopBreaker]
  :: GHC.Prim.Int#
     -> GHC.IO.Handle.Types.Handle
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
[GblId, Arity=3, Str=<1L><SL><L>, Unf=OtherCon []] =
    {} \r [ww_s2Mf w_s2Mg void_0E]
        case ww_s2Mf of ds_s2Mi {
          __DEFAULT ->
              case andI# [ds_s2Mi 1#] of sat_s2Mj [Occ=Once1] {
              __DEFAULT ->
              case +# [sat_s2Mj 97#] of sat_s2Mk [Occ=Once1] {
              __DEFAULT ->
              case chr# [sat_s2Mk] of sat_s2Ml [Occ=Once1] {
              __DEFAULT ->
              case
                  GHC.IO.Handle.Text.$whPutChar w_s2Mg sat_s2Ml GHC.Prim.void#
              of
              {
              Solo# _ [Occ=Dead] ->
              case -# [ds_s2Mi 1#] of sat_s2Mp [Occ=Once1] {
              __DEFAULT -> Main.$wloop sat_s2Mp w_s2Mg GHC.Prim.void#;
              };
              };
              };
              };
              };
          0# -> case w_s2Mg of { __DEFAULT -> Solo# [GHC.Tuple.()]; };
        };
end Rec }

Allocations

If we run the test program with +RTS -s to extract heap statistics, we can see an overview of the allocations with and without the changes:

boxed:

./HandlePerf 100 +RTS -s
      140,952 bytes allocated in the heap
./HandlePerf 1000 +RTS -s
      796,440 bytes allocated in the heap
./HandlePerf 10000 +RTS -s
     7,348,944 bytes allocated in the heap
./HandlePerf 100000 +RTS -s
    72,871,608 bytes allocated in the heap

unboxed:

./HandlePerf 100 +RTS -s
         106,792 bytes allocated in the heap
./HandlePerf 1000 +RTS -s
         459,896 bytes allocated in the heap
./HandlePerf 10000 +RTS -s
       3,988,416 bytes allocated in the heap
./HandlePerf 100000 +RTS -s
      39,271,096 bytes allocated in the heap

Ticky

Extracting a ticky profile of the program shows the reduced allocations more specifically:

boxed:

    Entries       Alloc     Alloc'd  Non-void Arguments   STG Name
--------------------------------------------------------------------
     100001    14400144           0  8 PMEiwiiS           GHC.IO.Encoding.UTF8.$wutf8_encode{v rh} (fun)
     100001     9600096           0  2 SS                 GHC.IO.Encoding.UTF8.mkUTF1{v rn} (fun)
     100000     6400000           0  2 cS                 $l$wact1_g5R1{v} (GHC.IO.Handle.Text) (fun)
     100002     4800096           0  4 LM>P               GHC.IO.Handle.Internals.$wdo_operation{v rR} (fun)
     100002     4000080           0  4 LMP>               GHC.IO.Handle.Internals.$wwantWritableHandle'{v r1N} (fun)
    1700034     2400048           0  1 S                  sat_s5F0{v} (GHC.IO.Handle.Internals) (fun) in s5DP
     100000     2400000           0  4 >>SS               GHC.IO.Handle.Internals.$wstreamEncode{v r1c} (fun)
    4100082     1600032           0  0                    io{v s5DP} (GHC.IO.Handle.Internals) (fun) in r1N
     100000     1600000           0  2 Mc                 GHC.IO.Handle.Text.$whPutChar{v rL} (fun)

unboxed:

     Entries       Alloc     Alloc'd  Non-void Arguments   STG Name
--------------------------------------------------------------------
      100001    14400144           0  8 PMEiwiiS           GHC.IO.Encoding.UTF8.$wutf8_encode{v r2mD} (fun)
      100001     6400064           0  2 SS                 GHC.IO.Encoding.UTF8.utf8_encode{v rC3} (fun)
      100000     6400000           0  2 cS                 $l$wact1_g5QY{v} (GHC.IO.Handle.Text) (fun)
      100002     4800096           0  4 LM>P               GHC.IO.Handle.Internals.$wdo_operation{v r4Zo} (fun)
      100002     4000080           0  4 LMP>               GHC.IO.Handle.Internals.$wwantWritableHandle'{v r4Zy} (fun)
     1700034     2400048           0  1 S                  sat_s5Fh{v} (GHC.IO.Handle.Internals) (fun) in s5E6
      100000     2400000           0  4 >>SS               GHC.IO.Handle.Internals.$wstreamEncode{v r4Yn} (fun)
     4100082     1600032           0  0                    io{v s5E6} (GHC.IO.Handle.Internals) (fun) in r4Zy
      100000     1600000           0  2 Mc                 GHC.IO.Handle.Text.$whPutChar{v r51K} (fun)

Further changes to individual encoders to evaluate the Buffer from and Buffer to outputs of CodeBuffer# before returning them also clean up the ticky allocations, though this doesn't appear to reduce the overall heap numbers:

     Entries       Alloc     Alloc'd  Non-void Arguments   STG Name
--------------------------------------------------------------------
      100001           0           0  8 PMEiwiiS           GHC.IO.Encoding.UTF8.$wutf8_encode{v rh} (fun)
      100001    12800128           0  2 SS                 GHC.IO.Encoding.UTF8.utf8_encode{v rz} (fun)
      100000     6400000           0  2 cS                 $l$wact1_g5QY{v} (GHC.IO.Handle.Text) (fun)
      100002     4800096           0  4 LM>P               GHC.IO.Handle.Internals.$wdo_operation{v rR} (fun)
      100002     4000080           0  4 LMP>               GHC.IO.Handle.Internals.$wwantWritableHandle'{v r1M} (fun)
     1700034     2400048           0  1 S                  sat_s5Fh{v} (GHC.IO.Handle.Internals) (fun) in s5E6
      100000     2400000           0  4 >>SS               GHC.IO.Handle.Internals.$wstreamEncode{v r1c} (fun)
     4100082     1600032           0  0                    io{v s5E6} (GHC.IO.Handle.Internals) (fun) in r1M
      100000     1600000           0  2 Mc                 GHC.IO.Handle.Text.$whPutChar{v rL} (fun)
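
For reference, the per-encoder change described above could look something like the following sketch, which assumes a hypothetical inner worker step of the same CodeBuffer# type (an illustration, not the actual GHC patch):

{-# LANGUAGE BangPatterns, MagicHash, UnboxedTuples #-}

-- Evaluate both Buffer results to WHNF before building the returned
-- unboxed tuple, so callers never see (or allocate) thunks for them.
encodeStep :: CodeBuffer# from to
encodeStep input output st =
  case step input output st of
    (# st', progress, input', output' #) ->
      case input' of
        !i -> case output' of
          !o -> (# st', progress, i, o #)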

Microbenchmarks

Benchmarking the previous program with Criterion also produced results in favour of these changes:

import Criterion.Main
import System.IO

-- Reuses `loop` from the test program above.
main :: IO ()
main = do
  defaultMain
    [ bgroup "handle" [bench (show n) $ nfIO (withFile "/dev/null" WriteMode (loop n)) | n <- sizes ]
    ]

sizes :: [Int]
sizes = [1,10,100,1000,10000,100000,1000000]

boxed:

benchmarking handle/1
time                 22.49 μs   (21.71 μs .. 23.32 μs)
                     0.992 R²   (0.988 R² .. 0.995 R²)
mean                 22.13 μs   (21.53 μs .. 22.92 μs)
std dev              2.453 μs   (1.933 μs .. 3.549 μs)
variance introduced by outliers: 88% (severely inflated)

benchmarking handle/10
time                 25.23 μs   (24.66 μs .. 25.77 μs)
                     0.994 R²   (0.991 R² .. 0.997 R²)
mean                 25.02 μs   (24.35 μs .. 25.87 μs)
std dev              2.357 μs   (1.788 μs .. 3.298 μs)
variance introduced by outliers: 83% (severely inflated)

benchmarking handle/100
time                 53.37 μs   (52.59 μs .. 54.14 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 53.08 μs   (52.37 μs .. 53.66 μs)
std dev              2.269 μs   (1.911 μs .. 2.736 μs)
variance introduced by outliers: 47% (moderately inflated)

benchmarking handle/1000
time                 331.3 μs   (323.7 μs .. 340.2 μs)
                     0.994 R²   (0.991 R² .. 0.997 R²)
mean                 330.8 μs   (320.5 μs .. 336.8 μs)
std dev              26.62 μs   (18.41 μs .. 38.96 μs)
variance introduced by outliers: 70% (severely inflated)

benchmarking handle/10000
time                 2.905 ms   (2.699 ms .. 3.059 ms)
                     0.969 R²   (0.937 R² .. 0.991 R²)
mean                 2.899 ms   (2.785 ms .. 2.971 ms)
std dev              292.9 μs   (196.0 μs .. 411.8 μs)
variance introduced by outliers: 67% (severely inflated)

benchmarking handle/100000
time                 25.73 ms   (24.43 ms .. 26.90 ms)
                     0.991 R²   (0.983 R² .. 0.996 R²)
mean                 27.97 ms   (27.16 ms .. 28.65 ms)
std dev              1.660 ms   (1.410 ms .. 1.949 ms)
variance introduced by outliers: 21% (moderately inflated)

benchmarking handle/1000000
time                 290.5 ms   (282.0 ms .. 299.5 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 290.0 ms   (286.3 ms .. 292.6 ms)
std dev              3.897 ms   (1.740 ms .. 5.702 ms)
variance introduced by outliers: 16% (moderately inflated)

unboxed:

benchmarking handle/1
time                 22.85 μs   (22.02 μs .. 23.58 μs)
                     0.991 R²   (0.987 R² .. 0.995 R²)
mean                 22.15 μs   (21.51 μs .. 22.83 μs)
std dev              2.259 μs   (1.894 μs .. 2.799 μs)
variance introduced by outliers: 86% (severely inflated)

benchmarking handle/10
time                 25.04 μs   (24.35 μs .. 25.80 μs)
                     0.994 R²   (0.991 R² .. 0.996 R²)
mean                 24.89 μs   (24.32 μs .. 25.51 μs)
std dev              1.840 μs   (1.396 μs .. 2.485 μs)
variance introduced by outliers: 75% (severely inflated)

benchmarking handle/100
time                 52.83 μs   (51.94 μs .. 53.63 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 51.20 μs   (50.44 μs .. 51.97 μs)
std dev              2.508 μs   (2.092 μs .. 3.138 μs)
variance introduced by outliers: 54% (severely inflated)

benchmarking handle/1000
time                 307.6 μs   (292.3 μs .. 318.0 μs)
                     0.985 R²   (0.971 R² .. 0.995 R²)
mean                 299.1 μs   (288.3 μs .. 307.8 μs)
std dev              33.62 μs   (26.10 μs .. 41.03 μs)
variance introduced by outliers: 82% (severely inflated)

benchmarking handle/10000
time                 2.395 ms   (2.102 ms .. 2.679 ms)
                     0.941 R²   (0.909 R² .. 0.983 R²)
mean                 2.611 ms   (2.514 ms .. 2.677 ms)
std dev              268.7 μs   (195.7 μs .. 370.0 μs)
variance introduced by outliers: 68% (severely inflated)

benchmarking handle/100000
time                 24.82 ms   (24.29 ms .. 25.31 ms)
                     0.996 R²   (0.988 R² .. 0.999 R²)
mean                 25.34 ms   (24.84 ms .. 25.66 ms)
std dev              917.3 μs   (532.7 μs .. 1.574 ms)

benchmarking handle/1000000
time                 261.3 ms   (236.0 ms .. 296.7 ms)
                     0.996 R²   (0.989 R² .. 1.000 R²)
mean                 237.0 ms   (224.0 ms .. 247.5 ms)
std dev              14.99 ms   (8.919 ms .. 21.73 ms)
variance introduced by outliers: 17% (moderately inflated)

Breakage Analysis

Submodules

Two of GHC's submodules require changes:

* filepath: slightly more involved changes since this library implements its own encoders and decoders
* haskeline: requires the addition of UnboxedTuples where it touches the codec internals

Hackage

I've built head.hackage using the GHC CI (https://gitlab.haskell.org/ghc/head.hackage/-/jobs/1381683), and, other than filepath, only generics-sop failed to build. I was able to build this library successfully after adding UnboxedTuples, with no other changes required.
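
For concreteness, the generics-sop fix amounts to one extra pragma in the module containing the splice (the module layout below is a sketch; other extensions the real module already enables are elided):

{-# LANGUAGE TemplateHaskell #-}
{-# LANGUAGE UnboxedTuples   #-} -- the newly required extension

import Generics.SOP.TH (deriveGeneric)
import GHC.IO.Encoding.Types (BufferCodec)

-- The TH-generated Generic instance now mentions unboxed tuples,
-- hence the extra pragma.
deriveGeneric ''BufferCodec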

treeowl commented 1 year ago

Why aren't CPR analysis and worker/wrapper kicking in to unbox these automatically? Are some things failing to inline?

JoshMeredith commented 1 year ago

I believe this is due to a problem that SPJ explained in the GHC issue (https://gitlab.haskell.org/ghc/ghc/-/issues/22946#note_480902)

The key thing is that data BufferCodec is a data type, so if you store a function in there, it must have the specified type, and that type does this unnecessary boxing. That makes it very hard for GHC to automate the optimisation you do here.
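
A minimal sketch (with made-up names, not the actual codec types) of the situation described:

-- The field's type is fixed by the data declaration, so any function
-- stored in it must return a boxed pair inside IO.
data Codec = Codec { step :: Int -> IO (Int, Int) }

-- Worker/wrapper can unbox calls to a *known* function, but `step c`
-- is a call to an unknown function of the declared boxed type, so the
-- (Int, Int) allocation cannot be eliminated at the call site.
useCodec :: Codec -> Int -> IO Int
useCodec c x = do
  (a, b) <- step c x
  pure (a + b)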

doyougnu commented 1 year ago

I'm concerned about the noise in the microbenchmarks. In fact I think there is too much noise to draw a good conclusion. Could you try some of these suggestions (or similar, if your platform is not Linux) and rerun the benchmarks?

For what it's worth, I think there is ample evidence that this is a performance win even without the benchmarks. I just don't want to rely on noisy statistics, and I want to make sure that the code's performance stays as predictable as possible. For example, the 4x stddev increase in handle/10000 concerns me.

Bodigrim commented 1 year ago

In fact I think there is too much noise to draw a good conclusion.

Indeed, e. g., the time for handle/1000000 is 282.0 ms .. 299.5 ms in the first case and 236.0 ms .. 296.7 ms in the second case. One cannot conclude that either is better than the other.

I have not used criterion for a while. If you were using tasty-bench, passing --stdev 2 or smaller should provide more precise numbers.
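
A minimal tasty-bench setup along these lines might look as follows (a sketch; tasty-bench deliberately mirrors the criterion API, and the target precision is passed on the command line, e.g. ./bench --stdev 2):

import System.IO
import Test.Tasty.Bench

-- `loop` is the writer loop from the test program above.
main :: IO ()
main = defaultMain
  [ bgroup "handle"
      [ bench (show n) $ nfIO (withFile "/dev/null" WriteMode (loop n))
      | n <- [1, 10, 100, 1000, 10000, 100000, 1000000 :: Int]
      ]
  ]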

Bodigrim commented 1 year ago

Also I would strongly suggest to make this change non-breaking: introduce a new data BufferCodec# and a pattern synonym BufferCodec to provide backward compatibility.

JoshMeredith commented 1 year ago

Also I would strongly suggest to make this change non-breaking: introduce a new data BufferCodec# and a pattern synonym BufferCodec to provide backward compatibility.

This seems to lose the majority of the performance gains in terms of allocations. My guess is that there's some improvement over the baseline due to the use in a record as SPJ mentioned, but the encoders don't get fully unboxed.

There's only one change needed in all of Hackage: adding one language extension to generics-sop, since deriveGeneric ''BufferCodec requires UnboxedTuples. A pattern synonym won't fix this, so I'm not sure it's worth it just to maintain compatibility for what is really only internals.

JoshMeredith commented 1 year ago

Here are the results using tasty-bench --stdev 1, nice, and taskset:

unboxed:

All
  handle
    1:       OK (2.99s)
      22.6 μs ± 238 ns
    10:      OK (3.25s)
      24.3 μs ± 113 ns
    100:     OK (26.20s)
      49.4 μs ± 671 ns
    1000:    OK (1.25s)
      306  μs ± 5.8 μs
    10000:   OK (2.83s)
      2.76 ms ±  33 μs
    100000:  OK (13.47s)
      26.4 ms ±  26 μs
    1000000: OK (8.02s)
      258  ms ± 1.0 ms

All 7 tests passed (58.02s)

boxed:

All
  handle
    1:       OK (2.92s)
      20.9 μs ± 153 ns
    10:      OK (3.16s)
      23.6 μs ± 460 ns
    100:     OK (3.32s)
      50.5 μs ± 811 ns
    1000:    OK (10.73s)
      325  μs ± 547 ns
    10000:   OK (11.96s)
      2.91 ms ±  40 μs
    100000:  OK (3.64s)
      28.7 ms ± 403 μs
    1000000: OK (0.86s)
      288  ms ± 5.2 ms

All 7 tests passed (36.57s)

parsonsmatt commented 1 year ago

Seems like a modest performance benefit at relatively little cost.

What's a non-microbenchmark use case that would benefit strongly from this?

JoshMeredith commented 1 year ago

What's a non-microbenchmark use case that would benefit strongly from this?

Handle usage in general should be affected.

For example, if I change the inner loop (of the test, not the encoder) to write a line, rather than a single character:

loop :: Int -> Handle -> IO ()
loop 0  !_ = pure ()
loop !n !h = do
  hPutStr h $! "the quick brown fox jumps over the lazy dog"
  loop (n-1) h

I still see fairly significant allocation improvements:

boxed:

./HandlePerf 1000 +RTS -s
       1,416,016 bytes allocated in the heap
           5,648 bytes copied during GC
          46,776 bytes maximum residency (1 sample(s))
          22,856 bytes maximum slop
               6 MiB total memory in use (0 MiB lost due to fragmentation)

unboxed:

./HandlePerf 1000 +RTS -s
       1,270,608 bytes allocated in the heap
           5,648 bytes copied during GC
          46,776 bytes maximum residency (1 sample(s))
          22,856 bytes maximum slop
               6 MiB total memory in use (0 MiB lost due to fragmentation)

As well as time improvements (--stdev 2):

boxed:

All
  handle
    1:       OK (2.86s)
      20.9 μs ± 562 ns
    10:      OK (7.25s)
      27.3 μs ± 431 ns
    100:     OK (0.48s)
      114  μs ± 2.7 μs
    1000:    OK (1.00s)
      975  μs ±  31 μs
    10000:   OK (4.50s)
      8.77 ms ± 188 μs
    100000:  OK (1.30s)
      86.6 ms ± 2.1 ms
    1000000: OK (2.60s)
      868  ms ± 8.8 ms

All 7 tests passed (20.00s)

unboxed:

All
  handle
    1:       OK (0.64s)
      18.7 μs ± 390 ns
    10:      OK (28.85s)
      27.1 μs ± 657 ns
    100:     OK (0.93s)
      110  μs ± 3.8 μs
    1000:    OK (0.47s)
      916  μs ±  26 μs
    10000:   OK (4.19s)
      8.11 ms ±  94 μs
    100000:  OK (0.24s)
      79.0 ms ± 3.1 ms
    1000000: OK (2.40s)
      801  ms ±  31 ms

All 7 tests passed (37.72s)

Kleidukos commented 1 year ago

I am very interested in this patch!

It could be very nice to reduce allocations for applications that are constantly sending data to a DB.

hasufell commented 1 year ago

filepath: slightly more involved changes since this library implements its own encoders and decoders

Since I maintain this library I'm interested in:

* a full patch
* comparison of filepath benchmarks (they use an inlined tasty-bench)

That won't guarantee a +1 from me, though.

simonpj commented 1 year ago

Also I would strongly suggest to make this change non-breaking: introduce a new data BufferCodec# and a pattern synonym BufferCodec to provide backward compatibility.

As a matter of interest, do we regard the representation of BufferCodec as part of the public-facing API of the IO library? Or merely an implementation detail?

I know that it is currently exposed and so clients could be using it; but I'm interested in what the intent is. Internal, or deliberately exposed?

simonpj commented 1 year ago

PS: Anything that shaves time off the inner loop of IO would be fantastic. That benefits everyone.

JoshMeredith commented 1 year ago

As a matter of interest, do we regard the representation of BufferCodec as part of the public-facing API of the IO library? Or merely an implementation detail?

I see it as an implementation detail, with filepath being an edge case that is justifiably touching internals for niche reasons.

IMO, it's impractical to consider the entire GHC.* namespace (whether in base or ghc) as part of the public API, and I suspect this would be in an Internal module if not for historical reasons.

JoshMeredith commented 1 year ago

filepath: slightly more involved changes since this library implements its own encoders and decoders

Since I maintain this library I'm interested in:

* a full patch

https://gitlab.haskell.org/ghc/packages/filepath/-/tree/wip/unboxed-codebuffer

* comparison of filepath benchmarks (they use an inlined tasty-bench)

I won't have time to get to these today, but I'll be happy to provide them after I've run them.

hasufell commented 1 year ago

I see it as an implementation detail, with filepath being an edge case that is justifiably touching internals for niche reasons.

I disagree.

How can you define your own encoders/decoders without reaching into those "internals"? How can you use all the APIs that use TextEncoding without that ability? Do you think base should provide exhaustive support for all of them?

I consider this public API. We should make that more clear. filepath does nothing really fancy. The alternative would have been to not use any of the API that utilizes TextEncoding, which is a lot, e.g. withCStringLen, and would have required major inlining and duplication of code.
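
For instance, one of the TextEncoding-consuming APIs in base can be used like this (a minimal usage sketch):

import GHC.Foreign (withCStringLen)
import GHC.IO.Encoding (utf8)

-- Encode a String with an explicit TextEncoding and hand the resulting
-- CStringLen to a continuation.
encodedLength :: String -> IO Int
encodedLength s = withCStringLen utf8 s (\(_ptr, len) -> pure len)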

JoshMeredith commented 1 year ago

How can you define your own encoders/decoders without reaching into those "internals"? How can you use all the APIs that use TextEncoding without that ability? Do you think base should provide exhaustive support for all of them?

My point here is that Internal modules are generally exposed for the few cases that need them (rather than completely unexposed in the cabal file), and usually in these cases the library authors mark Internal modules as subject to change.

I do think filepath is correct to be using the implementation details here, but I don't think it's a fair conclusion that one library should be locking an important implementation in place, especially since core libraries are updated in the course of merging incompatible base changes.

Bodigrim commented 1 year ago

This seems to lose the majority of the performance gains in terms of allocations. My guess is that there's some improvement over the baseline due to the use in a record as SPJ mentioned, but the encoders don't get fully unboxed.

Sorry, I don't follow. How would providing a pattern synonym for backward compatibility affect performance?

JoshMeredith commented 1 year ago

Sorry, I don't follow. How would providing a pattern synonym for backward compatibility affect performance?

What I mean to say is: there's a slight benefit to the encoders using this backwards-compatibility pattern, but their inner loop is still over-allocating, so those encoders don't benefit much overall. It's only filepath that defines its own encoders, which IMO should ideally be switched to gain the increased performance.

I haven't verified this yet, but as of writing I wonder whether haskeline would still require the addition of UnboxedTuples, since the functions in the view patterns use them.

Here's the implementation I used; please correct me if my understanding of your idea is incorrect:

data BufferCodec from to state = BufferCodec# {...}

pattern BufferCodec e r c g s <- BufferCodec# (unEncode -> e) (unRecover -> r) c g s
  where
    BufferCodec e r c g s = BufferCodec# (mkEncode e) (mkRecover r) c g s

unEncode :: CodeBuffer# from to -> CodeBuffer from to
unEncode e i o = IO $ \st -> let !(# st', prog, i', o' #) = e i o st in (# st', (prog, i', o') #)

unRecover :: (Buffer from -> Buffer to -> State# RealWorld -> (# State# RealWorld, Buffer from, Buffer to #))
          -> (Buffer from -> Buffer to -> IO (Buffer from, Buffer to))
unRecover r i o = IO $ \st -> let !(# st', i', o' #) = r i o st in (# st', (i', o') #)

mkEncode :: CodeBuffer from to -> CodeBuffer# from to
mkEncode e i o st = let !(# st', (prog, i', o') #) = unIO (e i o) st in (# st', prog, i', o' #)

mkRecover :: (Buffer from -> Buffer to -> IO (Buffer from, Buffer to))
          -> (Buffer from -> Buffer to -> State# RealWorld -> (# State# RealWorld, Buffer from, Buffer to #))
mkRecover r i o st = let !(# st', (i', o') #) = unIO (r i o) st in (# st', i', o' #)

Bodigrim commented 1 year ago

What I mean to say is: there's a slight benefit to the encoders using this backwards-compatibility pattern, but their inner loop is still over-allocating, so those encoders don't benefit much overall. It's only filepath that defines its own encoders, which IMO should ideally be switched to gain the increased performance.

It's good enough if consumers of the old interface are not broken immediately and have time to migrate. If they want performance gains, they can switch to the new interface indeed.

The thing is that the patch you demonstrated for filepath is quite involved: the module used to be almost Haskell98, but now it requires proficiency with MagicHash / UnboxedTuples and a good deal of CPP. If someone uses the CodeBuffer interface in proprietary code, they'll be broken without an easy way out. That's why I'm very keen to provide pattern synonyms.

The breakage in generics-sop is less of an issue from my perspective: if TH generates code which requires more language extensions, GHC will prompt the user about it, so it is straightforward to fix. I doubt there is any other code (public or private) affected this way.

hasufell commented 1 year ago

It's good enough if consumers of the old interface are not broken immediately and have time to migrate. If they want performance gains, they can switch to the new interface indeed.

Yes. This proposal will get a nay from me if it's a breaking change without a deprecation period.

JoshMeredith commented 1 year ago

It's good enough if consumers of the old interface are not broken immediately and have time to migrate. If they want performance gains, they can switch to the new interface indeed.

Ah, I see. In that case, I think the pattern isn't a very high cost to incur.

I'd like to hear comments on the deprecation strategy with the pattern. I wouldn't mind taking it as an opportunity for us to properly define the public API here.

Bodigrim commented 1 year ago

I don't really see much point in deprecating the pattern synonym from https://github.com/haskell/core-libraries-committee/issues/134#issuecomment-1454781039. It's not wrong to use the boxed implementation, and, if you are happy to sacrifice some performance, it is significantly easier to use, because you don't need MagicHash and stuff. Not much additional code to maintain either.

Bodigrim commented 1 year ago

@JoshMeredith there are two action points left:

* finish the clc-stackage build to check for other affected packages
* update the GHC MR with the pattern-synonym approach

Bodigrim commented 1 year ago

@JoshMeredith just a gentle reminder about this. It would be a pity to close such a discussion as abandoned, without reaching a conclusion.

JoshMeredith commented 1 year ago

Sorry for the delay.

I've managed to get clc-stackage mostly built today, with the exception of some C dependency issues.

The build hasn't found any additional uses of BufferCodec in Hackage code, and this, combined with head.hackage, makes me quite confident in claiming that the addition of UnboxedTuples to this one module in generics-sop is the only change required in Hackage.

In my testing with clc-stackage, I included the pattern synonyms in the bootstrap of GHC, and these allowed me to leave the base and filepath codecs unchanged.

I hope to provide the updated MR after the weekend.

JoshMeredith commented 1 year ago

I've updated the merge request containing the entire code at https://gitlab.haskell.org/ghc/ghc/-/merge_requests/9948.

The performance test mentioned in the description of this proposal has been merged into GHC in https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10347, and we see improvements in the numbers (quoted below).

Note that the previous changes to filepath and haskeline have been removed, as @Bodigrim's pattern synonym suggestion maintains backwards compatibility for all cases other than the one module in generics-sop that requires the addition of UnboxedTuples.

I'd be happy to request a vote at this point with the updated code - assuming this is still considered a breaking change (depending on the generics-sop judgement).

Bodigrim commented 1 year ago

Quoting https://gitlab.haskell.org/ghc/ghc/-/jobs/1505952#L6827,

encodingAllocations(normal) run/alloc 504,094,272 392,094,048 -22.2% GOOD

It's GOOD indeed :)

Thanks @JoshMeredith. The remaining breakage in generics-sop is probably unavoidable. The problematic line is deriveGeneric ''BufferCodec: while we continue to provide the BufferCodec pattern synonym, the TH-generated code now requires {-# LANGUAGE UnboxedTuples #-}. In general it's very hard to provide full backwards compatibility in the presence of Template Haskell. generics-sop has active maintainers and its latest change was just two weeks ago, so I'm sure it will be updated in time.

I'd like to give CLC members a couple of days to review the code before triggering a vote. The crux of the MR is in GHC.IO.Encoding.Types: we replace the existing data BufferCodec from to state, which has boxed fields, with a backwards-compatible pattern synonym, wrapping data BufferCodec# from to state with unboxed fields. The rest of the MR is AFAICT just trivial wrapping-unwrapping.

Bodigrim commented 1 year ago

Dear CLC members, let's vote on the proposal to provide an unboxed interface to BufferCodec. The complete patch is somewhat technical, but here is the gist. Currently we have

data BufferCodec from to state = BufferCodec {
  encode :: CodeBuffer from to,
  recover :: Buffer from -> Buffer to -> IO (Buffer from, Buffer to),
  ...
  }

The proposal is to expose an unboxed

data BufferCodec from to state = BufferCodec# {
  encode# :: CodeBuffer# from to,
  recover# :: Buffer from -> Buffer to -> State# RealWorld -> (# State# RealWorld, Buffer from, Buffer to #),
  ...
  }

and provide a pattern synonym BufferCodec and compatibility functions encode / recover with the current, boxed types on top of it.

This way existing customers such as filepath continue to work as is, without changes, and new clients (e. g., GHC itself) can choose the unboxed interface to enjoy ~20% decrease in allocations. As @simonpj highlights, "Anything that shaves time off the inner loop of IO would be fantastic. That benefits everyone".
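
For instance, a hypothetical downstream accessor keeps compiling unchanged, because matching on the BufferCodec pattern synonym still works even though the real constructor is now BufferCodec# (field order as in the five-field sketch earlier in the thread):

-- Extract the boxed recover function via the compatibility pattern.
getRecover :: BufferCodec from to state
           -> (Buffer from -> Buffer to -> IO (Buffer from, Buffer to))
getRecover (BufferCodec _encode rec _close _getState _setState) = rec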

The compatibility layer covers all "normal" customers, except those who introspect BufferCodec by means of Template Haskell. The only package affected is generics-sop, because it uses TH to derive instances for BufferCodec, which can be fixed simply by enabling {-# LANGUAGE UnboxedTuples #-}.

@tomjaguarpaw @hasufell @mixphix @chshersh @angerman @parsonsmatt


+1 from me. I agree with Simon that anything decreasing allocations in I/O is a great addition to base.

hasufell commented 1 year ago

+1

chshersh commented 1 year ago

+1


This sounds like an amazing performance improvement! 👏🏻 I appreciate the backward compatibility effort, especially for something that is considered an internal thing. Solid work! 🏆

mixphix commented 1 year ago

Absolutely in favour. Thanks for this great contribution!

tomjaguarpaw commented 1 year ago

+1

Bodigrim commented 1 year ago

Thanks all, 5 votes in favour are enough to approve.

@JoshMeredith please link a PR patching generics-sop.