mycroes / Sally7

C# implementation of Siemens S7 connections with a focus on performance

Fixed latent bug in SocketTpktReader and optimized size of ConcurrentRequestExecutor #25

Closed · gfoidl closed this 2 years ago

gfoidl commented 2 years ago

First off I wish you and your child all the best -- get well soon! (So sorry that this text got long...)

This PR addresses the points brought up in https://github.com/mycroes/Sally7/pull/23#discussion_r730095872.


Abstract

ValueTask vs. custom SocketAwaitable

To investigate which option seems better, as a first step there's a reduced benchmark that sends a buffer via loopback to a server, which in turn reads the sent data.

Benchmark with simple read and write to loopback

It's really not so easy to get reliable results from the benchmarks, due to all the latencies involved. So I've run the benchmarks and profiling sessions multiple times, and tried to pick the results from a "representative" run (i.e. not the best, not the worst ones).

With some debugging (and hacky console output) it can be seen that send completes synchronously most of the time, while receive completes asynchronously most of the time. So I hope this benchmark / comparison is fair, and doesn't bias the results (too much).

|              Method | BufferSize |          Mean |        Error |       StdDev | Ratio | RatioSD |  Gen 0 | Allocated |
|-------------------- |----------- |--------------:|-------------:|-------------:|------:|--------:|-------:|----------:|
| WithSocketAwaitable |        960 |      56.34 μs |     1.125 μs |     2.629 μs |  1.00 |    0.00 | 0.0610 |     248 B |
|       WithValueTask |        960 |      44.00 μs |     0.859 μs |     1.614 μs |  0.78 |    0.04 |      - |     175 B |
|                     |            |               |              |              |       |         |        |           |
| WithSocketAwaitable |  104857600 | 127,997.08 μs | 2,088.934 μs | 1,851.785 μs |  1.00 |    0.00 |      - |   3,126 B |
|       WithValueTask |  104857600 | 125,761.26 μs | 2,224.399 μs | 2,080.704 μs |  0.98 |    0.02 |      - |   1,226 B |

When bench.WithSocketAwaitable() is run in a loop of 100,000 iterations with BufferSize set to 960, a profiler shows

WithSocketAwaitable

```
Type                                                                                                                                   Allocations
+ System.Runtime.CompilerServices.AsyncTaskMethodBuilder<>.AsyncStateMachineBox<>                                                          68 403
| - System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>.AsyncStateMachineBox<<WithSocketAwaitable>d__14>    68 402
| - System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>.AsyncStateMachineBox<<<Main>$>d__0>           1
- System.Action                                                                                                                             68 403
```

[screenshot: allocation graph]

So even though we do 100,000 iterations, there are fewer allocations than that for the AsyncStateMachineBox, meaning that some operations (send, receive) complete synchronously. When the profiling session is run multiple times, these numbers change quite a bit. That was already alluded to above.

For the SocketAwaitable I also tried a version with ICriticalNotifyCompletion, which avoids the ceremony with the ExecutionContext, but the results didn't change much (neither positively nor negatively).

The AsyncStateMachineBox is needed to let the locals of the async method survive across the await, i.e. they need to be allocated on the heap.
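To make that concrete with a tiny stand-alone sketch (hypothetical code, not taken from Sally7): any local that is still needed after an await has to be hoisted into the compiler-generated state machine, and when the awaited operation completes asynchronously that state machine is boxed on the heap -- the AsyncStateMachineBox showing up in the profiler.

```c#
using System;
using System.Buffers.Binary;
using System.Net.Sockets;
using System.Threading.Tasks;

static class HoistingSketch
{
    // 'buffer' is alive across the await, so it becomes a field of the generated
    // state machine; when ReceiveAsync completes asynchronously, the whole state
    // machine is boxed on the heap (the AsyncStateMachineBox).
    public static async ValueTask<int> ReadLengthAsync(Socket socket)
    {
        Memory<byte> buffer = new byte[4];
        int read = await socket.ReceiveAsync(buffer, SocketFlags.None);
        return read == 4 ? BinaryPrimitives.ReadInt32BigEndian(buffer.Span) : -1;
    }
}
```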

The same with bench.WithValueTask() yields

WithValueTask

```
Type                                                                                                                                   Allocations
+ System.Runtime.CompilerServices.AsyncTaskMethodBuilder<>.AsyncStateMachineBox<>                                                          19 494
| - System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>.AsyncStateMachineBox<<WithValueTask>d__18>          19 493
| - System.Runtime.CompilerServices.AsyncTaskMethodBuilder<System.Threading.Tasks.VoidTaskResult>.AsyncStateMachineBox<<<Main>$>d__0>           1
```

[screenshot: allocation graph]

Some runs show a saw-tooth pattern like above too, but for several runs it looks more like the allocation graph shown in the image. I have no explanation for why the behavior is the way it is.

The AsyncStateMachineBox is needed for the same reason, to let the locals survive by allocating them on the heap.

In the VT-based version the allocations for the Action never showed up. I'd count this as a positive point. So if we assume that send / receive complete asynchronously about as often as in the SocketAwaitable-version, then we save the allocations for the action-delegate.

But why isn't this delegate needed in the VT-version? Because for the ValueTask-based socket operations .NET uses a special AwaitableSocketAsyncEventArgs, which implements the IValueTaskSource interfaces backing the ValueTask, and (most importantly) these instances are cached. So there's no need to allocate the action delegate for the continuation over and over again.
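For reference, here is a condensed sketch of the ValueTask-based receive loop used in the benchmark code below (the helper names are mine): the Memory&lt;byte&gt;-based overload returns a ValueTask&lt;int&gt; backed by the socket's cached AwaitableSocketAsyncEventArgs, so apart from the state-machine box nothing has to be allocated per call, and a pre-pinned buffer additionally avoids pinning the memory for every single operation.

```c#
using System;
using System.Net.Sockets;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

static class ValueTaskReceiveSketch
{
    // Receives exactly buffer.Length bytes. Each await uses the Memory<byte>-based
    // overload, whose ValueTask<int> is backed by a cached AwaitableSocketAsyncEventArgs
    // inside the Socket, so no continuation delegate is allocated per call.
    public static async ValueTask ReceiveExactlyAsync(Socket socket, Memory<byte> buffer)
    {
        int received = 0;
        while (received < buffer.Length)
        {
            int read = await socket.ReceiveAsync(buffer.Slice(received), SocketFlags.None)
                                   .ConfigureAwait(false);
            if (read <= 0) throw new SocketException((int)SocketError.ConnectionReset);
            received += read;
        }
    }

    // A pre-pinned buffer (what _sendBufferFromPinnedArray / _receiveBufferFromPinnedArray
    // in the benchmark are) never has to be pinned again for the native socket call.
    public static Memory<byte> AllocatePinnedBuffer(int size)
    {
        byte[] array = GC.AllocateUninitializedArray<byte>(size, pinned: true);
        return MemoryMarshal.CreateFromPinnedArray(array, 0, array.Length);
    }
}
```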

In the following figure, focus on the marked area of the call tree.

[screenshot: call stacks side by side]

Both start with the IO completion from the OS by executing PerformIOCompletionCallback. Then the difference begins. For the VT (right-hand side) there's the AwaitableSocketAsyncEventArgs plumbing things together for the ValueTask via the IValueTaskSource-methods, and only the allocation for the state-machine box is needed. This is because the ValueTask is backed by the IValueTaskSource, and not by a Task, for the async completion. I'll skip over the exact details of how this works, the code is on GitHub :wink:. On the other (left) side there's the call tree for the SocketAwaitable, allocating the state-machine box, and also the "MoveNextAction" for AsyncTaskMethodBuilder<T>.AwaitOnCompleted.

So based on that investigation I'd go with the ValueTask where available.

Benchmark code:

```c#
using System;
using System.Diagnostics;
using System.Net;
using System.Net.Sockets;
using System.Runtime.InteropServices;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
#if !DEBUG
using BenchmarkDotNet.Running;
#endif

#nullable disable

Bench bench = new();
bench.Setup();
await bench.WithSocketAwaitable();
await bench.WithValueTask();

#if !DEBUG
BenchmarkRunner.Run<Bench>();
#endif

[MemoryDiagnoser]
public class Bench
{
    [Params(960, 104857600)]
    public int BufferSize { get; set; } = 104857600;    // 100 MiB

    private Socket _listener;
    private Socket _server;
    private Socket _client;

    private Memory<byte> _sendBuffer;
    private Memory<byte> _receiveBuffer;
    private SocketAwaitable _sendAwaitable;
    private SocketAwaitable _receiveAwaitable;

    private Memory<byte> _sendBufferFromPinnedArray;
    private Memory<byte> _receiveBufferFromPinnedArray;

    [GlobalSetup]
    public void Setup()
    {
        _listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        _listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
        _listener.Listen(1);

        _client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        _client.Connect(_listener.LocalEndPoint);
        _server = _listener.Accept();

        byte[] buffer = new byte[this.BufferSize];
        Random.Shared.NextBytes(buffer);
        _sendBuffer = buffer;
        _receiveBuffer = new byte[this.BufferSize];

        _sendAwaitable = new SocketAwaitable(new SocketAsyncEventArgs());
        _receiveAwaitable = new SocketAwaitable(new SocketAsyncEventArgs());

        byte[] pinnedBuffer = GC.AllocateUninitializedArray<byte>(this.BufferSize, pinned: true);
        Random.Shared.NextBytes(pinnedBuffer);
        _sendBufferFromPinnedArray = MemoryMarshal.CreateFromPinnedArray(pinnedBuffer, 0, pinnedBuffer.Length);

        pinnedBuffer = GC.AllocateUninitializedArray<byte>(this.BufferSize, pinned: true);
        _receiveBufferFromPinnedArray = MemoryMarshal.CreateFromPinnedArray(pinnedBuffer, 0, pinnedBuffer.Length);
    }

    [Benchmark(Baseline = true)]
    public async ValueTask WithSocketAwaitable()
    {
        static void Throw() => throw new Exception();

        if (!MemoryMarshal.TryGetArray(_sendBuffer, out ArraySegment<byte> segment))
        {
            Throw();
        }

        _sendAwaitable.EventArgs.SetBuffer(segment.Array, segment.Offset, segment.Count);
        await _client.SendAsync(_sendAwaitable);
        Debug.Assert(_sendAwaitable.EventArgs.BytesTransferred == this.BufferSize);

        if (!MemoryMarshal.TryGetArray(_receiveBuffer, out segment))
        {
            Throw();
        }

        _receiveAwaitable.EventArgs.SetBuffer(segment.Array, segment.Offset, segment.Count);

        int received = 0;
        int readCount = 0;
        do
        {
            readCount++;

            if (received > 0)
            {
                _receiveAwaitable.EventArgs.SetBuffer(segment.Offset + received, this.BufferSize - received);
            }

            await _server.ReceiveAsync(_receiveAwaitable);

            if (_receiveAwaitable.EventArgs.BytesTransferred <= 0)
            {
                Throw();
            }

            received += _receiveAwaitable.EventArgs.BytesTransferred;
        } while (received < this.BufferSize);

#if DEBUG
        Console.WriteLine($"Is OK: {_sendBuffer.Span.SequenceEqual(_receiveBuffer.Span)}");
        Console.WriteLine($"Read count: {readCount}");
#endif
    }

    [Benchmark]
    public async ValueTask WithValueTask()
    {
        int sent = await _client.SendAsync(_sendBufferFromPinnedArray, SocketFlags.None).ConfigureAwait(false);
        Debug.Assert(sent == this.BufferSize);

        Memory<byte> buffer = _receiveBufferFromPinnedArray;
        int received = 0;
        int readCount = 0;
        do
        {
            readCount++;

            if (received > 0)
            {
                buffer = _receiveBufferFromPinnedArray.Slice(received);
            }

            int read = await _server.ReceiveAsync(buffer, SocketFlags.None).ConfigureAwait(false);

            if (read <= 0)
            {
                Throw();
                static void Throw() => throw new Exception();
            }

            received += read;
        } while (received < this.BufferSize);

#if DEBUG
        Console.WriteLine($"Is OK: {_sendBufferFromPinnedArray.Span.SequenceEqual(_receiveBufferFromPinnedArray.Span)}");
        Console.WriteLine($"Read count: {readCount}");
#endif
    }
}
```

Benchmark against a soft-plc

Let's see how it looks when reading from a PLC with Sally7. As PLC, softplc is run in a Linux VM with some test data applied to it. To have a benchmark that runs both variants (default and with ValueTasks) in "one run", I duplicated the SocketTpktReader to have the default version and one that uses ValueTasks. Other classes are duplicated as needed. See commit 126259d8f9d57a52515837e8188f6fa277dc3d31. The benchmark just reads three data items from the PLC. So results may and will vary depending on how much is read, and especially on what the network does (latencies).

Side note: I believe there's some room for improvement in the handling of the Requests. I will address this in a separate PR, so we can discuss it over there (I like PRs that focus on one point, although I'm aware that I didn't follow my own rule here...).

Again a note on the flakiness of the results: I tried hard to get stable results; this is what looks the most realistic:

|                 Method |     Mean |   Error |   StdDev |   Median | Ratio | RatioSD |  Gen 0 | Allocated |
|----------------------- |---------:|--------:|---------:|---------:|------:|--------:|-------:|----------:|
| ReadViaSocketAwaitable | 220.5 μs | 4.36 μs | 10.10 μs | 216.9 μs |  1.00 |    0.00 | 0.2441 |     864 B |
|       ReadViaValueTask | 207.9 μs | 4.14 μs |  8.55 μs | 207.2 μs |  0.94 |    0.05 | 0.2441 |     800 B |

TBH this matches my expectations, though I'm a bit surprised that it matches :wink:. For perf / time I guess it's the fewer allocations and the pre-pinned memory. For a deeper analysis I could look into this with PerfView (but not yet).

Let's look into it with a profiler, by doing 20,000 reads.

SocketAwaitable:

[screenshot: allocation profile]

ValueTask:

[screenshot: allocation profile]

For the AsyncStateMachineBox it's the same* count of allocations. But the allocation count for the Action-delegate halved from 40,000 to 20,000.

* The tiny difference results from sync vs. async completion of the socket operations, but the numbers are pretty close, and sometimes it's on one side, sometimes on the other.

20,000 action allocations can be eliminated from both versions. When we complete the request, then https://github.com/mycroes/Sally7/blob/e250d37adb2926a6c4e92084d9979edd93b10edd/Sally7/RequestExecutor/ConcurrentRequestExecutor.cs#L235-L237 the length is set, but IsCompleted is left at false. Hence the async machinery has to go the route GetAwaiter() -> IsCompleted -> OnCompleted() -> continuation() -> GetResult(), and therefore an Action-delegate needs to be allocated for the continuation. By updating IsCompleted the route can be shortened to GetAwaiter() -> IsCompleted -> GetResult(), and no delegate for the continuation needs to be allocated.

Note: this is a best-effort solution, where the request being awaited is the same one handled by the receive (I believe this is the common case). Otherwise the async route is taken.
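As a stand-alone illustration of that shortcut (a minimal sketch with made-up member names, ignoring the thread-safety the real request type in ConcurrentRequestExecutor has to deal with):

```c#
using System;
using System.Runtime.CompilerServices;

// Minimal custom awaitable: Complete() now also flips IsCompleted, so an await that
// happens afterwards goes GetAwaiter() -> IsCompleted (true) -> GetResult() and never
// reaches OnCompleted -- hence no Action delegate has to be allocated for the continuation.
public class RequestSketch : INotifyCompletion
{
    private Memory<byte> _buffer;
    private int _length;
    private Action? _continuation;

    public void Reset()
    {
        IsCompleted = false;
        _continuation = null;
    }

    public void SetBuffer(Memory<byte> buffer) => _buffer = buffer;

    public void Complete(int length)
    {
        _length = length;
        IsCompleted = true;            // the fix: mark completion right here
        _continuation?.Invoke();       // still serves an awaiter that registered first
    }

    // awaitable / awaiter plumbing
    public RequestSketch GetAwaiter() => this;
    public bool IsCompleted { get; private set; }
    public void OnCompleted(Action continuation) => _continuation = continuation;
    public Memory<byte> GetResult() => _buffer.Slice(0, _length);
}
```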

SocketAwaitable (after reduction of action allocations):

[screenshot: allocation profile]

ValueTask (after elimination of action allocations):

[screenshot: allocation profile]


Side note (and a bit of a reminder for myself):

Before ValueTasks came into .NET I used "await anything" quite extensively, but then somehow I forgot about this nice technique. The code in Sally7, and especially this investigation, reminded me to use it a bit more in the future. There's no need to artificially complicate things with ValueTasks, especially when the "awaitable" can be pooled / cached easily.
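For readers who haven't seen the technique: "await anything" just means giving a type a GetAwaiter() -- even via an extension method -- that returns a valid awaiter. The classic toy example (unrelated to Sally7) is awaiting a TimeSpan:

```c#
using System;
using System.Runtime.CompilerServices;
using System.Threading.Tasks;

// Any type becomes awaitable once a suitable GetAwaiter() is in scope.
public static class TimeSpanAwaiterExtensions
{
    public static TaskAwaiter GetAwaiter(this TimeSpan delay) => Task.Delay(delay).GetAwaiter();
}

public static class AwaitAnythingDemo
{
    public static async Task RunAsync()
    {
        await TimeSpan.FromMilliseconds(100);   // compiles thanks to the extension GetAwaiter
    }
}
```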

Out of interest I tried different versions of the Request-type. Default is the one used in the current main branch. DefaultFixed is the same as Default, but with the IsCompleted-fix -- so the one used in this PR. DefaultFixed2 is more or less for reference, and completely synchronous -- this won't work with concurrent requests though, hence "for reference". ValueTaskBased is based on IValueTaskSource instead of the custom awaiter.

|         Method |      Mean |    Error |   StdDev | Ratio | RatioSD |  Gen 0 | Allocated |
|--------------- |----------:|---------:|---------:|------:|--------:|-------:|----------:|
|        Default | 192.39 ns | 3.634 ns | 7.423 ns |  1.00 |    0.00 | 0.0663 |     208 B |
|   DefaultFixed |  58.14 ns | 0.896 ns | 0.794 ns |  0.30 |    0.01 |      - |         - |
|  DefaultFixed2 |  26.31 ns | 0.457 ns | 0.427 ns |  0.13 |    0.01 |      - |         - |
| ValueTaskBased |  94.46 ns | 1.903 ns | 2.790 ns |  0.48 |    0.02 |      - |         - |

DefaultFixed with synchronous completion is pretty fast. ValueTaskBased shows more overhead than the simple custom awaiter, so if possible* use custom awaiters over ValueTasks (of course measure before drawing a conclusion).

* One also needs to take the ExecutionContext into account, if needed; then the way via IValueTaskSource and ManualResetValueTaskSourceCore is preferable, as otherwise things get complicated pretty quickly.

Code:

```c#
using System;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

Bench bench = new();
Memory<byte> mem0 = await bench.Default();
Memory<byte> mem1 = await bench.DefaultFixed();
Memory<byte> mem2 = await bench.DefaultFixed2();
Memory<byte> mem3 = await bench.ValueTaskBased();
Console.WriteLine(mem0.Length);
Console.WriteLine(mem1.Length);
Console.WriteLine(mem2.Length);
Console.WriteLine(mem3.Length);

#if !DEBUG
BenchmarkRunner.Run<Bench>();
#endif

[MemoryDiagnoser]
public class Bench
{
    private readonly Request _request = new();
    private readonly Request1 _request1 = new();
    private readonly Request2 _request2 = new();
    private readonly RequestValueTask _requestValueTask = new();
    private byte[] _buffer = new byte[100];
    private int _length = 42;

    [Benchmark(Baseline = true)]
    public async ValueTask<Memory<byte>> Default()
    {
        _request.Reset();
        _request.SetBuffer(_buffer);
        _request.Complete(_length);
        return await _request;
    }

    [Benchmark]
    public async ValueTask<Memory<byte>> DefaultFixed()
    {
        _request1.Reset();
        _request1.SetBuffer(_buffer);
        _request1.Complete(_length);
        return await _request1;
    }

    [Benchmark]
    public ValueTask<Memory<byte>> DefaultFixed2()
    {
        _request2.SetBuffer(_buffer);
        _request2.Complete(_length);
        return ValueTask.FromResult(_request2.Buffer);
    }

    [Benchmark]
    public async ValueTask<Memory<byte>> ValueTaskBased()
    {
        ValueTask<Memory<byte>> vt = _requestValueTask.SetBuffer(_buffer);
        _requestValueTask.Complete(_length);
        return await vt;
    }
}
```

```c#
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

public class RequestValueTask : IValueTaskSource<Memory<byte>>
{
    private ManualResetValueTaskSourceCore<Memory<byte>> _mrvts = new();
    private Memory<byte> _buffer;

    public Memory<byte> Buffer => _buffer;

    public ValueTask<Memory<byte>> SetBuffer(Memory<byte> buffer)
    {
        _mrvts.Reset();
        _buffer = buffer;
        return new ValueTask<Memory<byte>>(this, _mrvts.Version);
    }

    public void Complete(int length) => _mrvts.SetResult(_buffer.Slice(0, length));

    public Memory<byte> GetResult(short token) => _mrvts.GetResult(token);

    public ValueTaskSourceStatus GetStatus(short token) => _mrvts.GetStatus(token);

    public void OnCompleted(Action<object?> continuation, object? state, short token, ValueTaskSourceOnCompletedFlags flags)
        => _mrvts.OnCompleted(continuation, state, token, flags);
}
```

Things look different if await req doesn't complete synchronously.

|         Method |     Mean |     Error |    StdDev | Ratio |  Gen 0 | Allocated |
|--------------- |---------:|----------:|----------:|------:|-------:|----------:|
|        Default | 1.155 μs | 0.0410 μs | 0.0022 μs |  1.00 | 0.0744 |     235 B |
| ValueTaskBased | 1.447 μs | 0.2077 μs | 0.0114 μs |  1.25 | 0.0610 |     198 B |

The ValueTaskBased-version (backed by IValueTaskSource) doesn't need the allocations for the continuation (I believe the C# compiler / runtime has special handling for the built-in awaiter types). Here the time needed isn't relevant, as the socket operations will be dominant. It's solely about allocations here.

Benchmark code:

```c#
using System;
using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

Bench bench = new();
Memory<byte> mem0 = await bench.Default();
Memory<byte> mem1 = await bench.ValueTaskBased();
Console.WriteLine(mem0.Length);
Console.WriteLine(mem1.Length);

#if !DEBUG
BenchmarkRunner.Run<Bench>();
#endif

[MemoryDiagnoser]
[ShortRunJob]
public class Bench
{
    private readonly Request _request = new();
    private readonly RequestValueTask _requestValueTask = new();
    private byte[] _buffer = new byte[100];
    private int _length = 42;

    [Benchmark(Baseline = true)]
    public async ValueTask<Memory<byte>> Default()
    {
        _request.Reset();
        _request.SetBuffer(_buffer);

        ThreadPool.UnsafeQueueUserWorkItem(b =>
        {
            //Thread.Sleep(10_000);
            b._request.Complete(b._length);
            //Console.WriteLine("complete");
        }, this, preferLocal: false);

        return await _request;
    }

    [Benchmark]
    public async ValueTask<Memory<byte>> ValueTaskBased()
    {
        ValueTask<Memory<byte>> vt = _requestValueTask.SetBuffer(_buffer);

        ThreadPool.UnsafeQueueUserWorkItem(b =>
        {
            //Thread.Sleep(10_000);
            b._requestValueTask.Complete(b._length);
        }, this, preferLocal: false);

        return await vt;
    }
}
```

I implemented this after writing the rest of this description, with commit c8b312321120c031415d4427b40a607dd1676342. So this isn't written in chronological order. Looking at the IL and the allocation profile, this is a regression, as the AsyncStateMachineBox grows by a field for the ConfiguredValueTaskAwaitable<Memory<byte>>: 4,479,552 / 19,998 = 224 bytes per instance. So after the size reduction (next section) we would increase it again. Thus this change is reverted in


Then I got the idea to (try to) reduce the size of the AsyncStateMachineBox by eliminating the field for the req.

Before:

[screenshot]

[screenshot]

4,479,104 / 19,996 = 224 bytes per instance

After (commit c718f72cb76ae033eeaf7be3713dd541191de77b):

[screenshot]

[screenshot]

4,318,488 / 19,993 = 216 bytes per instance

The difference of 8 bytes is exactly the size of the reference we save (in a 64-bit process). Not a huge saving, but still a little bit.

The size of the AsyncStateMachineBox can be further reduced by letting the C# compiler re-use some fields for the TaskAwaiters. In the ILSpy-images above <>u__1 and <>u__2 get collapsed into only one field.

After (commit da12e690b4ad34d4d9dd68c15107f407d3a897bf):

3,999,200 / 19,996 = 200 bytes per instance

So in total we saved 24 bytes per send / receive operation.
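To illustrate the field re-use with a stand-alone, hypothetical example (not the actual Sally7 change): the C# compiler keeps one awaiter field per distinct awaiter type, so once all awaits in a method yield awaiters of the same type they can share a single field. That's the idea behind giving the first call an int return value, and it's the 216 -> 200 bytes step in the numbers above.

```c#
using System;
using System.Threading.Tasks;

public static class FieldReuseSketch
{
    // The two awaiters have different types (ValueTaskAwaiter vs. ValueTaskAwaiter<int>),
    // so the generated state machine gets two awaiter fields (<>u__1 and <>u__2 in ILSpy).
    public static async ValueTask<int> TwoAwaiterFields(Func<ValueTask> writeAsync, Func<ValueTask<int>> readAsync)
    {
        await writeAsync();
        return await readAsync();
    }

    // Letting the first call return an int as well makes both awaiters ValueTaskAwaiter<int>,
    // so the compiler re-uses a single awaiter field and the box shrinks accordingly.
    public static async ValueTask<int> OneAwaiterField(Func<ValueTask<int>> writeAsync, Func<ValueTask<int>> readAsync)
    {
        await writeAsync();
        return await readAsync();
    }
}
```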


> Also quite funny to see that BenchmarkDotNet immediately needs a larger amount of cycles to establish a 'reliable' result.

BenchmarkDotNet (BDN) has a heuristic to run as many iterations as needed until the results become stable. You see that effect here. But with MaxIterations (or the like) this can be limited -- it's sometimes quite handy, otherwise it may take really long...
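If I recall correctly BDN exposes this both on the job (WithMaxIterationCount) and as an attribute; roughly like this (names from memory, so double-check against the BDN docs):

```c#
using BenchmarkDotNet.Attributes;

// Caps the number of measured iterations so BDN stops even if the results stay noisy.
[MaxIterationCount(20)]
[MemoryDiagnoser]
public class CappedBench
{
    [Benchmark]
    public int Work() => 40 + 2;
}
```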

> my benchmarking against a sample project that I only have locally or against a real PLC at a customer, both aren't pretty

Yeah, unfortunately it's quite hard to come up with reliable and reproducible benchmarks for this. I don't have any good idea either. Maybe create a soft-PLC as a Docker image, which can be used to benchmark against? (Also maybe create a separate issue tracking this, so that other people can chime in.)

> I'm not that familiar with the internal S7 communication

The code here in Sally7 is very well arranged, so I was able to learn about the protocol by skimming through the code :+1:.

PS: I didn't proof-read the text, so please excuse any / all mistakes.

mycroes commented 2 years ago

Again a long read, and again very impressive. I didn't even get to the actual code yet, but all your research is really nice. Love the tiny details like the int return for field reuse. Will start reviewing this on my phone, so I might not finish it today. May I ask how much time you spent on all this? Again, an impressive feat; I feel honored by all the effort you're putting in.

gfoidl commented 2 years ago

> Well it wasn't much to review

Lots of text for a little bit of code change, haha 😄

> May I ask how much time you spent on all this?

I didn't work on it in one rush, but started on Saturday after lunch, and the findings are quite interesting. The thing with the "int-return" re-use was also a TIL moment for me... I didn't monitor the time taken, but it's a couple of hours (full of passion).