input-output-hk / typed-protocols

Session types framework with support of protocol pipelining.

Redesign of typed-protocols #3

Closed coot closed 4 weeks ago

coot commented 2 years ago

A major redesign of typed-protocols.

Missing Todo Items:

coot commented 2 years ago

@dcoutts My thoughts around SingI.

  1. Where SingI is definitely useful is in Peer. It hides unnecessary detail of the implementation and simplifies writing clients and servers.

  2. The only place where an explicit Sing st parameter is more useful than a SingI st constraint is the decode method. When writing a codec one needs the singleton anyway, to pattern match on it and do something different depending on the state. When the singleton is already in scope, one does not need let tok = sing :: Sing st (which must come with a type annotation).

Currently we pass the singleton using a type class dictionary rather than an explicit argument, except where the library user needs it anyway (which is only in decode and TerminalStates).

Note that one can always translate one into the other with withSing and withSingI.
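
To make the two styles concrete, here is a minimal, self-contained sketch. It does not use the singletons library or the actual typed-protocols types: PingPong, SPingPong and the describe functions are hypothetical stand-ins, and withSing / withSingI are hand-written for this one kind only, mirroring the shape of the library functions.

```haskell
{-# LANGUAGE AllowAmbiguousTypes #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeApplications #-}

-- Hypothetical protocol states (not the real typed-protocols example).
data PingPong = StIdle | StBusy | StDone

-- Hand-written singleton for the states above.
data SPingPong (st :: PingPong) where
  SingIdle :: SPingPong 'StIdle
  SingBusy :: SPingPong 'StBusy
  SingDone :: SPingPong 'StDone

-- Implicit style: the singleton travels in a type class dictionary.
class SingI (st :: PingPong) where
  sing :: SPingPong st

instance SingI 'StIdle where sing = SingIdle
instance SingI 'StBusy where sing = SingBusy
instance SingI 'StDone where sing = SingDone

-- Explicit -> implicit: matching on the singleton brings the instance in scope.
withSingI :: SPingPong st -> (SingI st => r) -> r
withSingI SingIdle k = k
withSingI SingBusy k = k
withSingI SingDone k = k

-- Implicit -> explicit: materialise the singleton from the dictionary.
withSing :: SingI st => (SPingPong st -> r) -> r
withSing k = k sing

-- Explicit style, as in a decoder: we pattern match on the state anyway.
describeExplicit :: SPingPong st -> String
describeExplicit SingIdle = "idle"
describeExplicit SingBusy = "busy"
describeExplicit SingDone = "done"

-- Implicit style: needs `sing` with a type annotation to get the token.
describeImplicit :: forall st. SingI st => String
describeImplicit = describeExplicit (sing :: SPingPong st)

-- Round trip through both directions.
roundTrip :: forall st. SPingPong st -> String
roundTrip s = withSingI s (describeImplicit @st)
```

In the explicit version the token is simply an argument, while the implicit version has to recover it with an annotated sing, which is the inconvenience mentioned in point 2.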

The current design choice makes it easier to write a Peer or a Driver. For example, a Driver does not need to depend on the protocol state: receiving or sending a message involves some IO and running a codec. We pass the singleton to the codec through the type class dictionary, and the codec unpacks it for the decoder.

Some time ago I tried a different approach: removing the singletons dependency altogether and instead having:

type family StateToken ...

class StateTokenI ...

but then one would need to provide withSing and withSingI, which are useful and somewhat non-trivial to implement, so I decided against it. We would also need to use StateToken (and StateTokenI) for things that are not states per se, like StNextKind in ChainSync.

coot commented 1 year ago

Rebased on top of current main branch.

coot commented 11 months ago

Below are two memory profiles of two versions of cardano-node syncing an initial section of the chain.

In both cases we use cardano-node-8.4.0, with:

Both cases ran with the -N4 RTS option, with access to 4 CPUs (cores 8-11 & 20-23). In both cases we synced the first 4492800 slots (the Byron & Shelley eras). The following RTS options were used in both cases:

+RTS -N4 -qg -qa -hT -l-agu -ol<file> -RTS

typed-protocols-0.1

image

typed-protocols-0.2

image

analysis

typed-protocols-0.2 consumes less memory, but syncing took more time.

coot commented 11 months ago

Accumulated downloaded block sizes over time:

This is measured with CompletedBlockFetch msgs.

block-acc-size-8 4 0 n6

coot commented 11 months ago

Hypothesis

The difference between the new design (tp-0.2) and the previous one (tp-0.1) is in the use of parallelism: tp-0.2 is a concurrent design while tp-0.1 is a parallel one. tp-0.1 deserialises the received bytes of responses skipped due to pipelining and acts on them in a separate thread, while the new design might do more work in the current thread. The exception is the CollectSTM primitive, which starts a new thread that reads from the network until a full message is deserialised and then makes it available to the main thread. The parallel design is nicer for cache locality and might use more of the available parallelism, which might result in better performance.
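
As a toy model of the two styles (not the typed-protocols driver itself), the sketch below contrasts decoding in a separate receiver thread with decoding in the calling thread. The function names are hypothetical, decode stands in for the codec, and a Chan from base replaces whatever queue the real implementation uses.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Exception (evaluate)
import Control.Monad (forM_, replicateM)

-- tp-0.1-like toy: a separate thread decodes the pipelined responses
-- and hands the decoded messages over to the main thread.
-- (If decode throws, the receiver dies and the reader blocks; a real
-- implementation would have to propagate the error.)
collectInReceiverThread :: (raw -> msg) -> [raw] -> IO [msg]
collectInReceiverThread decode raws = do
  results <- newChan
  _ <- forkIO $ forM_ raws $ \raw -> do
         msg <- evaluate (decode raw)   -- forced to WHNF in the receiver thread
         writeChan results msg
  -- The main thread only collects already decoded messages, in order.
  replicateM (length raws) (readChan results)

-- tp-0.2-like toy: the main thread does the decoding work itself.
collectInline :: (raw -> msg) -> [raw] -> IO [msg]
collectInline decode = mapM (evaluate . decode)
```

Both functions return the same messages in the same order; they differ only in which thread pays for the decoding, which is the effect the hypothesis is about.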

Validation

To verify the hypothesis I ran demo-ping-pong, which uses pingPongClientMin; this increases the number of Collect primitives. When interpreting it, the main thread checks whether there are bytes available from the network and moves forward with decoding (in ping-pong, this really means decoding a message). This slightly limits the effect described in the hypothesis above, but we can still see a small performance regression. In both runs we used +RTS -N2 (-N1 doesn't show a significant difference). Moreover, we adjusted the decoder so that it is forced to compute fib x for x in 25, 30, 33, 35; this simulates extra work done when deserialising data. 10000 refers to the value passed to the pingPongPipelinedMin function (i.e. how many messages will be sent).

10000 -N2

fib 25

<<ghc: 693252552 bytes, 131 GCs, 1111345/1936152 avg/max bytes residency (6 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.317 MUT (0.396 elapsed), 0.011 GC (0.008 elapsed) :ghc>>
<<ghc: 1139944592 bytes, 198 GCs, 1409272/3842616 avg/max bytes residency (9 samples), 18M in use, 0.001 INIT (0.001 elapsed), 0.367 MUT (0.329 elapsed), 0.025 GC (0.019 elapsed) :ghc>>
<<ghc: 785498504 bytes, 156 GCs, 1333489/2512072 avg/max bytes residency (7 samples), 16M in use, 0.001 INIT (0.000 elapsed), 0.371 MUT (0.433 elapsed), 0.019 GC (0.014 elapsed) :ghc>>
<<ghc: 718707584 bytes, 139 GCs, 1024858/1831496 avg/max bytes residency (7 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.326 MUT (0.381 elapsed), 0.013 GC (0.010 elapsed) :ghc>>
<<ghc: 780890272 bytes, 153 GCs, 1347072/3076032 avg/max bytes residency (6 samples), 16M in use, 0.001 INIT (0.001 elapsed), 0.359 MUT (0.473 elapsed), 0.015 GC (0.011 elapsed) :ghc>>
<<ghc: 778972584 bytes, 158 GCs, 1226083/2364832 avg/max bytes residency (7 samples), 17M in use, 0.001 INIT (0.000 elapsed), 0.312 MUT (0.491 elapsed), 0.015 GC (0.012 elapsed) :ghc>>

fib 30

<<ghc: 722929720 bytes, 146 GCs, 1142140/1857728 avg/max bytes residency (7 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.298 MUT (0.399 elapsed), 0.012 GC (0.009 elapsed) :ghc>>
<<ghc: 740097944 bytes, 142 GCs, 1106096/2128864 avg/max bytes residency (7 samples), 16M in use, 0.001 INIT (0.000 elapsed), 0.306 MUT (0.389 elapsed), 0.012 GC (0.009 elapsed) :ghc>>
<<ghc: 719787376 bytes, 141 GCs, 1164262/2200912 avg/max bytes residency (7 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.303 MUT (0.507 elapsed), 0.012 GC (0.009 elapsed) :ghc>>
<<ghc: 676511144 bytes, 131 GCs, 936612/1920512 avg/max bytes residency (6 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.300 MUT (0.379 elapsed), 0.010 GC (0.008 elapsed) :ghc>>
<<ghc: 692115544 bytes, 140 GCs, 1098595/1968048 avg/max bytes residency (7 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.295 MUT (0.394 elapsed), 0.013 GC (0.010 elapsed) :ghc>>
<<ghc: 701682960 bytes, 139 GCs, 1182961/1953512 avg/max bytes residency (6 samples), 16M in use, 0.001 INIT (0.001 elapsed), 0.304 MUT (0.389 elapsed), 0.012 GC (0.009 elapsed) :ghc>>

fib 33

<<ghc: 707937552 bytes, 140 GCs, 1236272/2036200 avg/max bytes residency (7 samples), 15M in use, 0.001 INIT (0.001 elapsed), 0.333 MUT (0.484 elapsed), 0.015 GC (0.011 elapsed) :ghc>>
<<ghc: 700573304 bytes, 136 GCs, 1106250/1856232 avg/max bytes residency (6 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.339 MUT (0.420 elapsed), 0.013 GC (0.010 elapsed) :ghc>>
<<ghc: 692944704 bytes, 133 GCs, 1006560/1717544 avg/max bytes residency (8 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.348 MUT (0.395 elapsed), 0.014 GC (0.011 elapsed) :ghc>>
<<ghc: 702206016 bytes, 142 GCs, 1254130/2019408 avg/max bytes residency (7 samples), 15M in use, 0.001 INIT (0.001 elapsed), 0.359 MUT (0.495 elapsed), 0.015 GC (0.011 elapsed) :ghc>>
<<ghc: 696117248 bytes, 132 GCs, 1072453/2318576 avg/max bytes residency (6 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.325 MUT (0.451 elapsed), 0.012 GC (0.009 elapsed) :ghc>>
<<ghc: 723454704 bytes, 146 GCs, 1174185/2107224 avg/max bytes residency (6 samples), 15M in use, 0.001 INIT (0.001 elapsed), 0.319 MUT (0.522 elapsed), 0.012 GC (0.009 elapsed) :ghc>>

fib 35

<<ghc: 741803056 bytes, 142 GCs, 1041963/2139168 avg/max bytes residency (7 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.357 MUT (0.474 elapsed), 0.012 GC (0.009 elapsed) :ghc>>
<<ghc: 790232208 bytes, 125 GCs, 787934/1603760 avg/max bytes residency (6 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.381 MUT (0.244 elapsed), 0.011 GC (0.008 elapsed) :ghc>>
<<ghc: 691418616 bytes, 131 GCs, 1116128/2237456 avg/max bytes residency (6 samples), 16M in use, 0.001 INIT (0.001 elapsed), 0.340 MUT (0.433 elapsed), 0.012 GC (0.009 elapsed) :ghc>>
<<ghc: 712248712 bytes, 137 GCs, 1382013/2915256 avg/max bytes residency (8 samples), 17M in use, 0.001 INIT (0.001 elapsed), 0.346 MUT (0.400 elapsed), 0.015 GC (0.011 elapsed) :ghc>>
<<ghc: 732023544 bytes, 147 GCs, 1076276/2302472 avg/max bytes residency (8 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.338 MUT (0.491 elapsed), 0.013 GC (0.010 elapsed) :ghc>>
<<ghc: 674865520 bytes, 127 GCs, 839364/2055960 avg/max bytes residency (7 samples), 13M in use, 0.001 INIT (0.000 elapsed), 0.365 MUT (0.385 elapsed), 0.010 GC (0.008 elapsed) :ghc>>

10000 -N2

fib 25

<<ghc: 651765456 bytes, 138 GCs, 767406/1816936 avg/max bytes residency (6 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.274 MUT (0.436 elapsed), 0.008 GC (0.007 elapsed) :ghc>>
<<ghc: 717163552 bytes, 149 GCs, 996149/2066728 avg/max bytes residency (6 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.277 MUT (0.445 elapsed), 0.008 GC (0.006 elapsed) :ghc>>
<<ghc: 672408816 bytes, 141 GCs, 842413/1682152 avg/max bytes residency (6 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.288 MUT (0.420 elapsed), 0.008 GC (0.007 elapsed) :ghc>>
<<ghc: 742142336 bytes, 152 GCs, 1009322/2108504 avg/max bytes residency (6 samples), 13M in use, 0.001 INIT (0.000 elapsed), 0.298 MUT (0.515 elapsed), 0.009 GC (0.007 elapsed) :ghc>>
<<ghc: 673896976 bytes, 142 GCs, 931796/1770528 avg/max bytes residency (6 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.264 MUT (0.437 elapsed), 0.008 GC (0.006 elapsed) :ghc>>
<<ghc: 707061800 bytes, 149 GCs, 926605/1572712 avg/max bytes residency (6 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.326 MUT (0.474 elapsed), 0.009 GC (0.006 elapsed) :ghc>>

fib 30

<<ghc: 666647440 bytes, 139 GCs, 848067/1745440 avg/max bytes residency (5 samples), 14M in use, 0.001 INIT (0.001 elapsed), 0.309 MUT (0.430 elapsed), 0.008 GC (0.006 elapsed) :ghc>>
<<ghc: 664194784 bytes, 142 GCs, 910480/1619376 avg/max bytes residency (6 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.292 MUT (0.516 elapsed), 0.009 GC (0.007 elapsed) :ghc>>
<<ghc: 697683712 bytes, 147 GCs, 961426/1852536 avg/max bytes residency (6 samples), 13M in use, 0.001 INIT (0.000 elapsed), 0.311 MUT (0.448 elapsed), 0.009 GC (0.008 elapsed) :ghc>>
<<ghc: 720565536 bytes, 150 GCs, 937760/2062552 avg/max bytes residency (6 samples), 14M in use, 0.001 INIT (0.001 elapsed), 0.335 MUT (0.428 elapsed), 0.009 GC (0.007 elapsed) :ghc>>
<<ghc: 726204456 bytes, 152 GCs, 860138/1553024 avg/max bytes residency (7 samples), 14M in use, 0.001 INIT (0.001 elapsed), 0.290 MUT (0.464 elapsed), 0.010 GC (0.008 elapsed) :ghc>>
<<ghc: 674834184 bytes, 143 GCs, 864544/1689600 avg/max bytes residency (7 samples), 13M in use, 0.001 INIT (0.001 elapsed), 0.278 MUT (0.492 elapsed), 0.009 GC (0.008 elapsed) :ghc>>

fib 33

<<ghc: 810412512 bytes, 165 GCs, 926561/1434184 avg/max bytes residency (7 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.332 MUT (0.471 elapsed), 0.010 GC (0.008 elapsed) :ghc>>
<<ghc: 6902868328 bytes, 941 GCs, 4392287/9398040 avg/max bytes residency (14 samples), 28M in use, 0.001 INIT (0.000 elapsed), 0.976 MUT (0.580 elapsed), 0.139 GC (0.107 elapsed) :ghc>>
<<ghc: 771473104 bytes, 159 GCs, 1071444/2346952 avg/max bytes residency (7 samples), 15M in use, 0.001 INIT (0.001 elapsed), 0.328 MUT (0.464 elapsed), 0.010 GC (0.008 elapsed) :ghc>>
<<ghc: 731049584 bytes, 153 GCs, 948261/2082032 avg/max bytes residency (6 samples), 15M in use, 0.001 INIT (0.000 elapsed), 0.306 MUT (0.466 elapsed), 0.009 GC (0.007 elapsed) :ghc>>
<<ghc: 742410944 bytes, 158 GCs, 897226/1611784 avg/max bytes residency (7 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.326 MUT (0.465 elapsed), 0.009 GC (0.007 elapsed) :ghc>>
<<ghc: 781055256 bytes, 166 GCs, 868950/1614024 avg/max bytes residency (8 samples), 13M in use, 0.001 INIT (0.000 elapsed), 0.310 MUT (0.389 elapsed), 0.010 GC (0.009 elapsed) :ghc>>

fib 35

<<ghc: 670043352 bytes, 142 GCs, 918360/1664168 avg/max bytes residency (6 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.327 MUT (0.528 elapsed), 0.008 GC (0.006 elapsed) :ghc>>
<<ghc: 5716566488 bytes, 790 GCs, 3594732/7988856 avg/max bytes residency (12 samples), 25M in use, 0.001 INIT (0.000 elapsed), 0.914 MUT (0.565 elapsed), 0.111 GC (0.086 elapsed) :ghc>>
<<ghc: 674450040 bytes, 139 GCs, 794592/1570720 avg/max bytes residency (5 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.324 MUT (0.460 elapsed), 0.007 GC (0.006 elapsed) :ghc>>
<<ghc: 678670240 bytes, 142 GCs, 848784/1488592 avg/max bytes residency (5 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.337 MUT (0.475 elapsed), 0.008 GC (0.006 elapsed) :ghc>>
<<ghc: 748257632 bytes, 149 GCs, 757854/1348048 avg/max bytes residency (6 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.332 MUT (0.466 elapsed), 0.009 GC (0.007 elapsed) :ghc>>
<<ghc: 732409224 bytes, 150 GCs, 912068/1780680 avg/max bytes residency (6 samples), 14M in use, 0.001 INIT (0.000 elapsed), 0.330 MUT (0.474 elapsed), 0.009 GC (0.007 elapsed) :ghc>>
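
The lines above look like the one-line summaries GHC's RTS prints with the -t flag. To compare the two sets of runs without reading every line, one could extract the elapsed MUT time with a small helper like the following. This is only a convenience sketch that assumes the line layout shown above; the function names are made up.

```haskell
import Text.Read (readMaybe)

-- Extract the elapsed mutator time (in seconds) from one
-- "<<ghc: ... :ghc>>" line, i.e. the number in "MUT (N elapsed)".
mutElapsed :: String -> Maybe Double
mutElapsed line =
  case dropWhile (/= "MUT") (words line) of
    (_mut : paren : _) -> readMaybe (filter (`elem` "0123456789.") paren)
    _                  -> Nothing

-- Mean elapsed mutator time over all lines that could be parsed.
meanMutElapsed :: [String] -> Maybe Double
meanMutElapsed ls =
  case [t | Just t <- map mutElapsed ls] of
    [] -> Nothing
    ts -> Just (sum ts / fromIntegral (length ts))
```

Feeding each fib group of each run to meanMutElapsed gives one number per group, which makes the comparison between the two runs easier to read.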
coot commented 11 months ago

Possible Approaches

There might be two possible approaches, with various pros & cons.

Lazy deserialisation.

If we provide the length of messages (which would require modifying our codecs, or possibly making mux aware of message boundaries), then we could read data from the network and pass it to another thread, which would deserialise it and act on it, e.g. add a block to ChainDB. The drawback is that the new thread will have to fetch the data from the heap, as it is not available in its CPU cache; however, the deserialised data it acts on will be available in the local cache.
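
A rough sketch of what knowing the message length would buy us: the network thread can cut the byte stream into whole messages without running the decoder, and hand each raw payload to another thread. The framing below (a 4-byte big-endian length prefix) and the function names are assumptions made for illustration; this is not the current codec or mux format.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import qualified Data.ByteString as BS
import Data.Word (Word32)

-- Assumed framing: 4-byte big-endian payload length, then the payload.
frame :: BS.ByteString -> BS.ByteString
frame payload = BS.pack [byte 24, byte 16, byte 8, byte 0] <> payload
  where
    n = fromIntegral (BS.length payload) :: Word32
    byte s = fromIntegral (n `shiftR` s)   -- keeps the low 8 bits

-- Split a buffer into complete payloads plus the undecoded leftover.
-- No deserialisation happens here; payloads can be passed to another
-- thread which decodes them and acts on the result.
splitFrames :: BS.ByteString -> ([BS.ByteString], BS.ByteString)
splitFrames bs
  | BS.length bs < 4     = ([], bs)   -- header incomplete
  | BS.length rest < len = ([], bs)   -- payload incomplete
  | otherwise            =
      let (payload, rest') = BS.splitAt len rest
          (more, leftover) = splitFrames rest'
      in (payload : more, leftover)
  where
    (hdr, rest) = BS.splitAt 4 bs
    len :: Int
    len = BS.foldl' (\acc w -> (acc `shiftL` 8) .|. fromIntegral w) 0 hdr
```

The leftover bytes would be prepended to the next chunk read from the network before calling splitFrames again.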

Extension of Peer

We could also explore how to extend a peer with a different pipelined send command, which would be interpreted in a similar style to the current tp-0.1 approach.