justinstenning / SharedMemory

C# shared memory classes for sharing data between processes (Array, Buffer and Circular Buffer)
https://www.nuget.org/packages/SharedMemory

Avoid Allocations for High Performance #62

Open 1fabi0 opened 3 years ago

1fabi0 commented 3 years ago

If the shared memory is used in a high-performance scenario, e.g. called 50+ times per second, quite a lot of allocations take place. To avoid this, each allocation should be done once and then reused. If you want, I can take care of this and try to reduce the repeated allocations to a single allocation in as many places as possible, increasing throughput for high-performance use. It may also make sense to use ArraySegments, because they give access to a part of an array and can already save a lot of allocations.
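
For illustration, the kind of reuse I mean looks roughly like this (a sketch only; the names are mine, not from the library):

```csharp
using System;

// Illustrative sketch (not the library's API): reuse one backing array
// for reads instead of allocating a new byte[] per message.
class ReusableReadBuffer
{
    // Allocated once, sized for the largest expected message.
    private readonly byte[] _buffer = new byte[64 * 1024];

    // 'read' is any routine that copies the incoming message into the
    // supplied array and returns the number of bytes written.
    public ArraySegment<byte> Read(Func<byte[], int> read)
    {
        int length = read(_buffer);
        // Hand out a view over the populated region; no new allocation.
        return new ArraySegment<byte>(_buffer, 0, length);
    }
}
```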

justinstenning commented 3 years ago

You are welcome to submit a pull request. I just ask you do it on the develop branch.

shuebner commented 2 years ago

I have not benchmarked it, but it seems like using a threadpool Task for every single message sent in RpcBuffer is also detrimental to performance. AFAIK, it could even lead to threadpool starvation should the response take a long time. There is ThreadPool.RegisterWaitForSingleObject specifically for objects like the unfortunately synchronous-only WaitHandles. Stephen Cleary's AsyncEx library uses it too. This may have a very good benefit-cost-ratio.
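
For reference, a minimal sketch of the pattern I mean (the WaitOneAsync extension is illustrative, not taken from SharedMemory or AsyncEx):

```csharp
// Sketch: await a WaitHandle via ThreadPool.RegisterWaitForSingleObject
// instead of blocking a thread-pool thread in WaitOne.
using System;
using System.Threading;
using System.Threading.Tasks;

static class WaitHandleExtensions
{
    public static Task<bool> WaitOneAsync(this WaitHandle handle, TimeSpan timeout)
    {
        var tcs = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        // The callback runs only when the handle is signalled or the timeout
        // elapses; no thread sits blocked while waiting.
        var registration = ThreadPool.RegisterWaitForSingleObject(
            handle,
            (state, timedOut) => tcs.TrySetResult(!timedOut),
            null,
            timeout,
            executeOnlyOnce: true);

        // Clean up the registration once the wait has completed.
        tcs.Task.ContinueWith(_ => registration.Unregister(null), TaskScheduler.Default);
        return tcs.Task;
    }
}
```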

justinstenning commented 2 years ago

@shuebner thanks for the input. I'm not sure when I will have time to start taking a look at this.

shuebner commented 2 years ago

Maybe I will do so myself. It seems like an innocuous change. As long as we are confident that the automated tests discover errors, this should be quick. Maybe threadpool behavior is also behind #52 . Who knows.

As a side-effect, it will also enable regular CancellationToken-based request cancellation instead of timeout-only. And this is what I am really after :-)
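
Roughly what I have in mind (names are made up for illustration, not the RpcBuffer API):

```csharp
// Sketch only: tie a caller-supplied CancellationToken to the pending request
// so cancellation takes effect immediately instead of waiting for the timeout.
using System.Threading;
using System.Threading.Tasks;

static class CancellationSketch
{
    public static Task<byte[]> SendRequestAsync(
        TaskCompletionSource<byte[]> pending, CancellationToken token)
    {
        // In real code the registration should be disposed once the request completes.
        token.Register(() => pending.TrySetCanceled(token));
        return pending.Task;
    }
}
```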

justinstenning commented 2 years ago

@shuebner yeah, have a go. I'll have a go also, but it's better for you to try with your use case in mind. And yes, the unit tests will catch any deadlocks etc.

shuebner commented 2 years ago

@spazzarama did you notice my PR from 3 days ago?

justinstenning commented 2 years ago

@shuebner there is an implementation on the develop branch that looks to remove the Tasks. It still doesn't add the CancellationTokens, so could you please check whether you can get your changes to work there?

Feel free to replace the current implementation if you see your version achieves the same thing as it does appear to be a simpler implementation.

shuebner commented 2 years ago

> Feel free to replace the current implementation if you see your version achieves the same thing as it does appear to be a simpler implementation.

The changes @Christian-Schl made seem to be superior to mine alone (in terms of performance): even though RPC_LoadTest_5k_Small_Multi_Thread would have to be converted into a proper benchmark (e.g. Benchmark.NET) to give reliable results, his approach seems to be roughly 2.5x as fast as my small change.

It looks like we have to somehow merge the two approaches to get both cancellation and performance.

I will look into it next week. @Christian-Schl maybe you could have a look as well as I suspect that you have an easier time understanding my small change than I have with your many changes 😉

Christian-Schl commented 2 years ago

@shuebner Hello, I don't have much time this week to check the changes. Maybe during the holidays between Christmas and New Year.

From what I can remember, before my changes the allocations were not a problem from a performance point of view. Most of the time was consumed by thread context switches caused by the await calls.

There is also the MR https://github.com/spazzarama/SharedMemory/pull/59 pending, which can also have side effects on performance. Do you have any benchmark showing how many allocations happen for one request/response?

shuebner commented 2 years ago

@Christian-Schl I never claimed that allocations are a problem. I just answered a question. It seems to me that the actual change in my PR (using ThreadPool.RegisterWaitForSingleObject instead of creating a regular ThreadPool Task) is overshadowed by the implementation detail of reusing completed task objects. Disregard if I misunderstood. Let's continue code-specific discussions in the PR. This issue started as something different. Sorry for hijacking it.

justinstenning commented 2 years ago

@1fabi0 this thread got a bit off topic. Back to your original comments around allocations:

I have played with a non-async version and have it down to 38us vs 44us and 1KB vs 3KB allocations per call. It isn't compatible with the existing timeout/cancellation approach, so I'm not sure if/when I would look at implementing it. I'll probably just post it as another branch for now and come back to it in the future. This also addresses some of the ThreadPool exhaustion issues.

Here is the branch: https://github.com/spazzarama/SharedMemory/tree/non-async-benchmarks

|                   Method |     Mean |    Error |   StdDev |  Gen 0 |  Gen 1 | Allocated |
|------------------------- |---------:|---------:|---------:|-------:|-------:|----------:|
|           ClientToServer | 38.81 us | 0.245 us | 0.229 us | 0.1221 | 0.0610 |      1 KB |
|      ServerToClientAsync | 54.46 us | 0.339 us | 0.318 us | 0.1831 | 0.0610 |      1 KB |
| ClientBiderectionalAsync | 53.89 us | 0.440 us | 0.390 us | 0.1831 | 0.0610 |      1 KB |

vs current develop branch

|                   Method |     Mean |    Error |   StdDev |  Gen 0 | Allocated |
|------------------------- |---------:|---------:|---------:|-------:|----------:|
|           ClientToServer | 44.41 us | 0.847 us | 0.792 us | 0.4883 |      3 KB |
|      ServerToClientAsync | 43.72 us | 0.411 us | 0.385 us | 0.4883 |      3 KB |
| ClientBiderectionalAsync | 43.24 us | 0.526 us | 0.492 us | 0.4883 |      3 KB |

1fabi0 commented 2 years ago

Yeah, thanks. I originally opened the issue because of a project where we had such high throughput that every recurring allocation put pressure on the GC. In that project I resolved a lot of it by reusing the final byte array that data was copied into inside the shared memory library. But the changes I made were in the end very specific, which is why I didn't open a PR; I think they would have caused problems for others using the RpcBuffers.

I will take a look at your changes, and I will review PRs in the future and comment if I see something that is not clear to me, as I did with the changes discussed here.

The changes that have already been made sound very good to me, but I also think there is a lot more that can still be done.

justinstenning commented 2 years ago

@1fabi0 the non-async-benchmarks branch has been updated to reduce allocations further and to completely remove the async methods.

I think in the next major release I will split RpcBuffer into two implementations, one with async support and another without. The consumer can then be responsible for deciding how best to process messages.

I also plan to use ReadOnlySpan rather than byte[]; again, this allows consumers to decide whether they need to copy the data or not.
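
For example, a span-based handler could look something like this (a sketch only; the delegate and names are illustrative, not a committed API):

```csharp
// Illustrative only: a span-based handler lets the consumer decide whether
// the payload must be copied or can be processed in place.
using System;

public delegate void SpanMessageHandler(ulong msgId, ReadOnlySpan<byte> payload);

public static class SpanHandlerExample
{
    public static void Handle(ulong msgId, ReadOnlySpan<byte> payload)
    {
        // Process in place: no allocation, the span points into the shared buffer.
        int header = BitConverter.ToInt32(payload.Slice(0, 4));

        // Copy only if the data must outlive the callback.
        byte[] copy = payload.ToArray();
    }
}
```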