Closed mkosieradzki closed 3 years ago
Came here to say the same thing. Byte array copying inside Protobuf3 is one of the biggest sources of GC pressure in Akka.NET; we would like to be able to use pooled byte arrays (or Span
s when they become part of the framework) and not have ByteString
do a deep clone of them each time we use it.
Yes, we get it - it's dangerous. Understood, but let the end-users take the risk in order to put Protobuf to work in high performance contexts.
Protobuf Span API could look like this
namespace Google.Protobuf
{
public class CodedInputStream
{
public CodedInputStream(ReadOnlySpan<byte> buffer);
…
}
public class CodedOutputStream
{
public CodedOutputStream(Span<byte> buffer);
…
}
public class MessageParser
{
public IMessage ParseFrom(ReadOnlySpan<byte> input);
…
}
public static class MessageExtensions
{
public static void MergeFrom(this IMessage message, ReadOnlySpan<byte> data);
public static void WriteTo(this IMessage message, Span<byte> output);
…
}
}
@Aaronontheweb At least this class should be public https://github.com/google/protobuf/blob/master/csharp/src/Google.Protobuf/ByteString.cs#L60
Thanks for backing me up :). If this proposal is accepted I am happy to create a proper PR. Problem with Span/ReadOnlySpan is that AFAIK it is not available yet and also that @jskeet will not be very keen on accepting PR requiring very-new compiler support and runtime support.
So I would be probably rather for the unsafe approach which can be widely used today.
@mkosieradzki if they don't want to go forward with just exposing the Unsafe
methods they have today, adding some overloads which use ArraySegment<byte>
would also be a workable option in lieu of Span
@mkosieradzki System.Memory
package supports netstandard1.0
@alexvaluyskiy Thanks. I have just checked that this package has been released in preview 3 months ago.
Are you aware whether it requires some specific compiler version? Or are you aware of any roadmap/official documentation?
@Aaronontheweb ArraySegment is not comparable to Span, because ArraySegment requires byte[]
(does not support byte *
) and Span can work on everything including byte[]
and byte*
or any other generic kind of memory.
I'm in two minds about this. I do see the point - but I'm definitely concerned about the level of complexity we end up with, in terms of lots of different builds for different scenarios. (Adding a dependency for Span also makes me nervous, but I'm gradually warming to the idea of ValueTask for the async part...) The async aspect adds another layer of complexity here - by the time we've got a matrix of managed/unmanaged, async/non-async, multiple versions of .NET supported, I think it's going to get tricky. It makes it really easy to break customers with what may seem like a trivial change.
There's also the aspect of time commitment from Google to maintain all of this. I'm not on the protobuf team, and have a full plate already - so @anandolee would need to be very comfortable with it all.
In terms of ByteString
- no, the Unsafe
class should not be made public, in the same way that I would never expect Microsoft to release a public String.Unsafe
class that made it trivial to mutate strings. ByteString
s are intended to be immutable; ByteString.Unsafe
exists to allow us to be more efficient where the library can guarantee that sharing a byte array will not exhibit mutability.
One option you may want to consider is forking the library entirely, creating an OptimizedAndVeryUnsafe.Protobuf
library which is wire compatible but exposes everything you've ever wanted to, pays less attention to being massively backward compatible all the time, takes whatever dependencies it wants and generally gives the user more of a loaded shotgun. I've always tried to optimize within the bounds of making the code easy to use safely and hard to shoot yourself in the foot. If you want to remove those restrictions, that's fine - but it shouldn't be in the same library.
@jskeet Thanks a lot for your response. My opinion would be to take the System.Memory path (without explicit unsafe). Span<byte>
and ReadOnlySpan<byte>
are incredibly great and safe wrappers for both byte[]
and byte *
.
If we could use System.Memory
it would do all the heavy lifting (without requiring any UnmanagedCodedInputStream
). We could just replace internal buffer with Span
So I would suggest waiting for System.Memory
release or I can try to create a fork using the preview version and we can merge it into the main library when ready.
Knowing that System.Memory
is going to be released in a predictable future I think that special handling for byte *
makes no sense.
I also think it's very similiar case as with ValueTask... (and System.Buffers for a shared buffer pool) - it's yet another dependency required to achieve optimal performance.
BTW. For more distant future I am also researching different ways to achieve Arena allocations for C#. There is a promising project called Snowflake https://www.microsoft.com/en-us/research/publication/project-snowflake-non-blocking-safe-manual-memory-management-net/# .
Yes, I'm definitely happier with Span
than anything to do with unsafe code. However, I wouldn't want to do anything with it until it's had a full (non-beta) release. We could create PRs before then of course, but we shouldn't merge them.
@jskeet Thanks! I will try to create a PR in August so we can preview this.
@mkosieradzki @jskeet I think the new Span
API should be similar to java's protobuf ByteBuffer
Api
Exactly as I was afraid - Span requires new compiler version: https://github.com/aspnet/Announcements/issues/268 ...
Let's see where it lands, in terms of requirements. It may be that it'll be harmless to expose it so long as Google.Protobuf is compiled with C# 7.2, which I'd be comfortable with (after that's released).
Tooling support for the C# 7.2 Span framework types was added with VS2017 15.5, which was RTMed on Dec 4. Span is now well documented and there are a couple of good collateral pieces, including one by Stephen Toub, linked here.
https://blogs.msdn.microsoft.com/dotnet/2017/11/15/welcome-to-c-7-2-and-span/
A couple of issues for discussion:
I'd be willing to work on this if there were to be a clear interest (or acceptance criteria) with respect to number two. and we're within about six weeks on number one (@anandolee & @jskeet).
I think it's reasonable to require that anyone developing the library uses VS2017 15.5 or the equivalent .NET Core SDK. There's a chunk of work required in order to update the SDK for continuous integration though.
Whether the Span methods are included in all output libraries is a different matter though. We'll need to look at the dependencies and whether there are issues consuming Span from older versions of VS/C#. (For example, if VS2015 users could easily end up using Span in a dangerous way, we may want to add a netstandard2.0 target and only expose Span there. We'll have to see.)
I think the first port of call should be some prototyping and benchmarking.
I think there's a good potential for using Span optimizations in gRPC C# (general idea: there's the grpc_csharp_ext native library and the messages need to be moved between managed and native layer and copying can be expensive. Being able to transparently address both managed/unmanaged memory with the same code seems useful.) - but we haven't really looked into these optimizations in detail (and there's complications so it's not possible to just say if it is going to be worth it without proper analysis and some experimenting).
I have started some experimentation in #3530 . I need to revisit my experiments since 15.5 is out. Before proper tooling support Span
was super-slow (as expected).
Also it's important to remember that Span
is a stack-only type - what has a lot of serious implications:
For example we might need to:
ref struct
(doesn't seem right)Memory
instead of Span
This is especially difficult in context of #3166 - I have a lot of doubts related to the proper async-support.
I would really love to start a high level design discussion.
Perhaps we should first discuss where we believe the value for C# protobuf could originate with the new framework types and then try to align on a preliminary approach to assess whether that value actually exists. I'm mindful that this is already a well-crafted code-base and we also face the barrier/cost @jskeet noted of changing the SDK. No one wishes to waste their time developing PRs that never see the light of day, so we should try to fail sooner rather than later in the case that there isn't much value.
The rationale for the first wave of these framework types can be boiled down to a few major categories, based on what I'm seeing. Feel free to add or subtract from this picture:
I did a little preliminary mechanical code analysis/triage and here is what I found. Outside of the context of unit tests, there are about 150 instances in the code of obvious allocations as indicated from the following starter list of search terms: "new byte["; ".Copy"; ".BlockCopy"; ".ToArray("; ".Substring("; "new MemoryStream"; ".Clone()." If you are interested, you can see the list in the attached spreadsheet.
Of these, probably only a small subset would be worth attacking, at least initially, and specifically those that:
(As we know, there is no unsafe code presently in the code base.)
Purely for the purposes of initial discussion, a generic plan for the first couple of iterations in this type of situation, where we're skeptical of the benefits and want to proceed cautiously to avoid time wastage, might be as follows:
Let's have a good round of discussion on these details, alternative approaches, and any other thoughts. If something like the above turns out to make sense, I've also included a list of logistical questions in the attached file, which someone from Google could address. There are also some resources on this type of code-optimization effort.
@mkosieradzki can you clarify the backwards compatibility of Span:
@jtattermusch That's topic worth researching - I don't know exactly, and the answer might be very difficult, because it consist of 3 elements:
System.Memory
requires .NET Standard 1.1ref struct
concept and safety checks)According to: https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md
Runtimes that support by-ref fields and returns will get the fast ref-field
Span<T>
. Other runtimes will get the slower three-fieldSpan<T>
.
From benchmarks at http://adamsitnik.com/Span/ we can see 5-10% performance degradation of Spans vs Arrays on reads (on pre .NET Core 2.0 runtimes), however I believe that introducing safe stackallocs might help regain the performance even on older runtimes.
The biggest benefit from Span<T>
is ability to have a single API for unmanaged and managed. However it will definitely cost us a huge API redesign as Span<T>
cannot be member of non stack-only type. Let me emphasize this: You cannot make Span
To be able to benefit from using Span<T>
we need to talk about designing a completely different API.
And having this in mind. We have 2 possible approaches:
1) Let's go fully unmanaged - Use pointers
pros: fastest, reverse compatible, we can consider keeping buffer as a CodedInputStream
class member, compatible with async approach
cons: will not work on partial trust environments (this is a legacy concept), require more serious code reviews, when using unmanaged memory we need to do managed pointer pinning.
Span<T>
pros: safe, modern, requires pointer pinning, might work on some fully managed enviroments (I am not sure)
cons: slower on legacy runtimes, need really serious redesign, will not work with asyncI believe that Span<T>
is a lot about using call stack to guarantee lifetime of the pinned memory pointer.
Also to address problems about async support:
This manner should be also compatible with gRPC streaming approach (and this is a level where I believe asynchrony fits the best).
This minimizes the time buffer needs to be pinned and strongly simplifies the code.
As I have mentioned before I have created a prototype for unsafe version of protobuf introducing even arena allocator approach to protobuf (see #3530 ).
I believe that introducing arena allocation together with unsafe approach will bring more and more benefits.
The last but not least: using unmanaged memory can allow implementing faster parsing by unsafe casting buggers into primitive types like int or long. Both protobuf and x86 are little-endian what should significantly improve parsing speed.
On more point about the arena allocations support: It could be reasonable to use the memory from the original buffer in that case it makes parsing even faster. The most important thing to remember about this parsing approach is that arena-allocated + buffer-sourced data should be short-lived and the API should provide a way to transform it into a long-living heap-allocated C# objects if required by the developer.
In many scenarios long-living heap-allocated objects are not required or helpful at all.
The fact that we are using C# should not force us to downgrade to a Java-like approach with zero control over the allocations.
I'm thinking of Span as useful primarily in the narrow case of discrete, single threaded, synchronous functions and their helpers. In that case, the ability to potentially put a short array on the stack or to encapsulate neatly a reference to a slice or substring that can be passed to a helper without incurring an allocation or muddying the water with offsets is a nice (but limited) gain.
As far as the (different) topic of a public interface for core memory-encapsulating buffer objects, I'll defer to the prior participants in that conversation except to note that in my own work, which heavily depends on these, I almost always use such objects now in the context of various (multi-threaded and often parallelized) pipeline patterns in which exposing (heap-based) arrays as public properties is indeed indispensable. I try to avoid GC-pressure from a ton of small buffer or string allocations in those patterns; size buffers based on what's found to be optimal for a machine architecture (not individual document/message/cell size); and I reuse managed arrays in rotating, multi-segment buffers (with coarse-grained write vs. read gating across stages). For what it's worth, my own benchmarking has not demonstrated performance benefits for sequential access scenarios from using pointers with or without native memory regions in that particular context (though I do use pointers, unsafe code, and native memory/file handles for other use cases). All the same, a reusable core buffer strategy requires considerable bookkeeping. A preference for managed core buffer objects with ordinary indexed access may come down simply to managed lifetime as has been noted and cross-platform portability.
If folks ultimately conclude that Spans vs. core buffer refactoring are different topics, perhaps we should spin off the Spans to a separate issue?
@mkosieradzki I looked a bit more into what would the dependency on Span<> mean for Protobuf and gRPC:
System.Memory depends on netstandard1.0 (I believe it will use the unoptimized "fallback" version unless you target netstandard2.0), which means you can use it in net45 projects, but you will need a newer version of nuget and a newer IDE to be able to build those projects (they need to have knowledge of what "netstandard1.0" is)
gRPC currently targets net45 and netstandard1.5, Protobuf targets net45 and netstandard1.0.
As both gRPC and Protobuf currently target net45 explicitly, adding a dependency on System.Memory would have this effect:
@jtattermusch I might be wrong, but my understanding is:
Span
or not.Span
as a stack only type and to be able to use ref struct
concept. If you are using older compiler you can create incorrect library: I mean something that will throw an exception when an optimized version of Span is used. It also allows you to use type-safe stackallocs etc without using /unsafe. It might also introduce some compile-time optimizations, but I am not sure about this one.System.Memory
package.Coming to this discussion late.
Having gathered a little bit experience working on a Span
From a user persective the simplest things would be to add an overload of WriteTo like this:
public void WriteTo(Span<byte> output)
and one for parsing, like this:
public T ParseFrom(ReadOnlySpan<byte> data)
Benefits are two-fold:
@kwaclaw the API you are suggesting sounds reasonable (but would also mean that Google.Protobuf would need to depend on System.Memory nuget and we should think twice before adding any new dependencies)
There's one more thing to consider for interop with native libraries: for RPC libraries that are using a native layer (such as gRPC), it won't be uncommon that the native buffers delivered on the wire will be fragmented - there's no guarantee that a single protobuf message delivered by gRPC will be present in memory in a single continguous area (gRPC has a concept of byte buffer which consists of N "slices" that represent continguous spans in memory) - so being able to parse a single message from a collection of Span
@jtattermusch Good point! (on non-contiguous memory slices)
Regarding CodedInputStream
we need to wait for @jskeet to validate my points about it.
IMO CodedInputStream
cannot be used together with ReadOnlySpan<byte>
.
Regarding non-contiguous memory slices from native library maybe it's a good idea to transition from pull approach and CodedInputStream and switch to the push approach and use Pipelines instead?
See: https://github.com/dotnet/corefxlab/blob/master/docs/roadmap.md
and:
https://github.com/dotnet/corefxlab/blob/master/docs/specs/pipelines.md
and:
https://github.com/dotnet/corefxlab/blob/master/docs/specs/pipelines-io.md
@jtattermusch regarding non-contiguous memory slices: .NET Core 2.1 preview 1 contains new class ReadOnlySequence<T>
(https://github.com/dotnet/corefx/blob/master/src/System.Memory/src/System/Buffers/ReadOnlySequence.cs) which should be helpful in this scenario.
@mkosieradzki I've been recently looking into some gRPC/Protobuf optimizations and the ReadOnlySequence<T>
type is looking pretty good. here are some advantages that caught my eye:
System.Buffers
which supports a large range of target frameworks.Overall, I think experimenting with adding a CodedInputStream constructor that consumes ReadOnlySequence would definitely be worth it.
Btw, since Span<> and Memory<> now exist (which allows accessing both managed and unmanaged memory), I think providing support for parsing from byte*
not relevant anymore (and we should be discussing Span<> and Memory<> support instead).
@jtattermusch Awesome, thanks for having a look. I have now some spare time (I think a week or two) and I would love to spend it on contributing to protobuf/gRPC.
I have done a lot of experimentation with ReadOnlySequence<T>
, Span<T>
10 weeks ago and even created some protoypes - please see: https://github.com/google/protobuf/issues/3166#issuecomment-386865048
@mkosieradzki very interesting prototypes - thanks for sharing!
I had a few questions:
@jtattermusch
The code I did most of the experimentation (in the other thread) uses PipeReader
(actually very well described in the blog post you have mentioned) instead of Stream
.
I think that idea of providing a CodedInputStream
constructor with fixed ReadOnlySequence<T>
is one of the solutions however it has some limitations.
ReadOnlySequence<T>
then we do not use streaming at all - so it's a pretty neat idea and makes it easy to retain compatibility.But honestly from my perspective if we assume that for gRPC all messages can be prefetched into non-contiguous buffer it's a perfectly good solution (assuming 4MB limit). It also helps to avoid uncontrolled working set. This decision is up to you. I am OK with both solutions prefeteched and async streamed (without gRPC streaming)..
@jtattermusch
Is it because of the async or because of the non-contiguous? It seems to me that non-continguous buffers as input should have virtually no extra overhead (assuming the buffer segments have reasonable size - e.g. sth in the ~8KB ballbark)
Conceptually you are right. But the problem is that Span<T>
can live only on stack and as a consequence you cannot keep it alive inside CodedInputStream
unless you make CodedInputStream a stack-only type… as well breaking a lot of compatibility. So the pain is that you can optimize ReadOnlySequence<T> -> ReadOnlyMemory<T>
, but you cannot optimize ReadOnlyMemory<T> -> ReadOnlySpan<T>
which must be done every call to CodedInputStream… and that was somehing I was trying to overcome with ultra-fast void MergeFrom(ReadOnlySpan<T>)
so the span was always on stack.
Unfortunately I don't have figures for this and I apologize for that… I am pretty sure I had a prototype but it wasn't very fast. But also not as slow as the async version.
could the CodedInputStream here take a ReadOnlySequence directly instead of a PipeReader? ReadOnlySequence is from System.Buffers, while PipeReader is one level abstraction higher and requires System.IO.Pipelines (which doesn't have such good platform support as System.Buffers)
I see your point here, but I really can't promise that ReadOnlySequence<T>
variant is feasible without adding the generated void MergeFrom(ReadOnlySpan<T>)
and I honestly regret I have not saved the numbers to prove this.
@mkosieradzki I see your point and the tradeoffs, but right now I'm driven mostly by what could be made useful in the near future (I believe incremental improvements are the only way forward for big problems like this).
If you want to take a look at my (very much in progress) code: https://github.com/grpc/grpc/compare/master...jtattermusch:performance_slicing (the point is basically to retrieve the buffer segments for a received message without any copying and make them available to the C# layer - right now the grpc_byte_buffer gets extracted to what's called ReadOnlySliceBuffer
but it's basically the same idea as ReadOnlySequence
On Thu, Jul 12, 2018 at 2:34 PM mkosieradzki notifications@github.com wrote:
@jtattermusch https://github.com/jtattermusch
Is it because of the async or because of the non-contiguous? It seems to me that non-continguous buffers as input should have virtually no extra overhead (assuming the buffer segments have reasonable size - e.g. sth in the ~8KB ballbark)
Conceptually you are right. But the problem is that Span
can live only on stack and as a consequence you cannot keep it alive inside CodedInputStream unless you make CodedInputStream a stack-only type… as well breaking a lot of compatibility. So the pain is that you can optimize ReadOnlySequence -> ReadOnlyMemory , but you cannot optimize ReadOnlyMemory -> ReadOnlySpan which must be done every call to CodedInputStream… and that was somehing I was trying to overcome with ultra-fast void MergeFrom(ReadOnlySpan ) so the span was always on stack. That's a very good point - thanks for bringing that up. So is ReadOnlyMemory
-> ReadOnlySpan slow? I assume it's pretty fast, but I understand that with the way CodedInputStream is now written, we would be performing the conversion practically for each byte of input. Unfortunately I don't have figures for this and I apologize for that… I am pretty sure I had a prototype but it wasn't very fast. But also not as slow as the async version.
could the CodedInputStream here take a ReadOnlySequence directly instead of a PipeReader? ReadOnlySequence is from System.Buffers, while PipeReader is one level abstraction higher and requires System.IO.Pipelines (which doesn't have such good platform support as System.Buffers)
I see your point here, but I really can't promise that ReadOnlySequence
variant is feasible without adding the generated void MergeFrom(ReadOnlySpan ) and I honestly regret I have not saved the numbers to prove this. Do you still have the MergeFrom(ReadOnlySpan
) prototype? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/google/protobuf/issues/3431#issuecomment-404496467, or mute the thread https://github.com/notifications/unsubscribe-auth/AJeq5D3daUceQn_om5vh_UB2FlEVc-Ivks5uF0JegaJpZM4OoqYG .
--
Jan
The benchmarks are basically here: https://github.com/mkosieradzki/protobuf/blob/spans/csharp/src/TestProtoPiper/Benchmarks/ParseAddressBook.cs
Here is the hand-crafted MergeFrom(ReadOnlySpan<T>
- prototype
https://github.com/mkosieradzki/protobuf/blob/spans/csharp/src/TestProtoPiper/Addressbook.cs#L336
That gives me an idea that it might be a good idea to compare MergeFrom(ReadOnlySpan<T>
with MergeFrom(ReadOnlyMemory<T>
- I think I will give it a shot. That result should be somewhat conclusive :).
@jtattermusch I have created two new benchmarks to capture the overhead of ReadOnlyMemory<T> -> ReadOnlySpan<T>
and as I expected the overhead is signifant (because it involves memory pinning etc)…
Method | Mean | Error | StdDev | Rank |
------------------------------------------------ |---------:|----------:|----------:|-----:|
ParseUsingCodedInputReader | 37.56 us | 0.1789 us | 0.1674 us | 5 |
ParseUsingClassic | 18.13 us | 0.0575 us | 0.0509 us | 3 |
ParseUsingCodedInputParser | 72.76 us | 0.4153 us | 0.3884 us | 6 |
ParseUsingCodedInputSpanParser | 15.40 us | 0.0761 us | 0.0675 us | 2 |
ParseUsingCodedInputSpanPosParser | 15.11 us | 0.0513 us | 0.0480 us | 1 |
ParseUsingCodedInputSpanPosParserReadOnlyMemory | 18.81 us | 0.0824 us | 0.0688 us | 4 |
ParseUsingCodedInputSpanPosParser
is passing using pos instead of ref… and is the winner today ;) I will push my current benchmarks to the repo so you can reproduce it on your side.
The comparison you want to look at is ParseUsingCodedInputSpanPosParserReadOnlyMemory
vs ParseUsingCodedInputSpanPosParser
- the only difference is conversion between ReadOnlyMemory<T>
and ReadOnlySpan<T>
on each call - what simulates what you wanted to do initially.
@jtattermusch
I have pushed my latest benchmarks to the repo. Please give it a try. I hope I have proven my point that void MergeFrom(ref ReadOnlySpan<T>)
or void MergeFrom(in ReadOnlySpan<T>, ref int pos)
MUST be generated to ensure the optimal performance.
Next point to learn is whether we want to add async Pipes support OR stick to synchronous parsing. I am perfectly OK with synchronous parsing.
If that's the case, we need to figure out what is the size threshold when copying non-contiguous memory to contiguous (and pinned memory) for parsing time (can be stack memory for small buffers or buffer pool for huge buffers) is efficient and depending on the result we need to decide whether it makes sense to create a void MergeFrom(in ReadOnlySequence<T>, ref int pos)
or not i.e. whether threshold is above or significantly below 4MB.
Alternatively we can start with your idea of ReadOnlySequence
void MergeFrom(in ReadOnlySpan<T>, ref int pos)
.Does it make sense to you?
Honestly, I don't think it makes any sense to start parsing before the entire message has arrived. because it makes the RPC server more prone to a slow-sender class of DoS attacks while giving no real performance benefits except slightly decreased latency, but I really trust in your opinion here. What do you think?
@jtattermusch I have started work on branch https://github.com/mkosieradzki/protobuf/tree/read-only-sequence - this is basically implementation of CodedInputStream using ReadOnlySequence<byte>
instead of byte[]
as buffer. Let's see what comes out of this.
@jtattermusch I have restarted my work; on branch https://github.com/mkosieradzki/protobuf/tree/netcore2_1
Currently I have added the least intrusive version that adds support for .NET Core 2.1 (to make place for special optimizations), creates a place for passing ReadOnlySequence<byte>
and ReadOnlyMemory<byte>
, but does not break compatibility.
Now I am going to make sure I do not introduce any performance regressions by switching to the ReadOnlyMemory<T>
abstraction instead of array of T.
If I won't be able to ensure this (especially on the pre-.NET Core 2.1 platforms) this I will need to add a fast path using compatBuffer
which I am already using on pre-.NET Core 2.1 for Stream.Read.
I have run a bunch of further tests and I am... a bit disappointed, but have some important conclusions,
ValueTask
-based variant of ParseUsingCodedInputSpanPosParser
just by wrapping results into ValueTask<T>
- it's slow beyond all recognition. So async
as the only variant is NOT an option.ReadOnlyMemory<T>
and T[]
in context of CodedInputStream
think it's unfeasible to implement a ReadOnlyMemory<T>
-based variant without significant performance regressions. VarInt and Tag parsing is so slow… despite multiple optimization attempts including pinning the memory once and creating span from the already pinned pointer.We are talking about performance regressions orders of magnitude larger than just copying entire non-contiguous memory to a single array.
Maybe I am missing something.
Just to sum up. IMO the array-based buffer should stay in the CodedInputParser.
I think it would be worth productizing ParseUsingCodedInputSpanPosParser
for its simplicity, efficiency and elasticity (you can use it with any contiguous piece of memory). I would also recommend adding codegen for void MergeFrom(in ReadOnlySpan<T>, ref int pos)
- because without this codegen ParseUsingCodedInputSpanPosParser
is useless.
Also this codegen should be conditionally used by the standard MergeFrom(CodedInputStream)
if the entire message is already either in contiguous buffer or available from a ReadOnlySequence<byte>
and is reasonably small.
@jtattermusch Please let me know whether I should create a PR (I will need to productize my PoC) as described, or if you need to do more investigation or you simply disagree.
I was playing a bit with aggressive inlining… and I went from:
Method | Mean | Error | StdDev | Rank |
---------------------------------- |---------:|----------:|----------:|-----:|
ParseUsingClassic | 18.39 us | 0.0996 us | 0.0932 us | 2 |
ParseUsingCodedInputSpanPosParser | 15.19 us | 0.0998 us | 0.0934 us | 1 |
down to:
Method | Mean | Error | StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
ParseUsingClassic | 16.75 us | 0.1096 us | 0.1025 us | 4 |
ParseUsingClassicInline | 15.97 us | 0.0528 us | 0.0494 us | 3 |
ParseUsingCodedInputSpanPosParser | 12.47 us | 0.0864 us | 0.0808 us | 2 |
ParseUsingCodedInputSpanPosParserInline | 12.31 us | 0.0304 us | 0.0254 us | 1 |
only by adding [MethodImpl(MethodImplOptions.AggressiveInlining)]
on some methods and improving constructors (to avoid using Func<T>
factories)...
I think that this optimization alone deserves a PR. In general I see a huge potential in improved codegen. I mean: what's the point of generating (de-)serialization code if it still uses indirect calls (interfaces, callbacks, etc. on the hot path) and does not excercise inlining? IMO it's a good idea to generate a predictable code and help the compiler/jitter as much as you can...,
So this is my high level optimization plan proposal:
FieldCoded
abstraction).ReadOnlySequence<byte>
as an alternative to Stream
in CodedInputStream
without introducing any significant changes (just refill array-based buffer from the sequence instead of Stream
if sequence is available) - but to be honest I don't see any huge advantages over implementing a ReadOnlySequence<T>
-based Stream… I really tried.ReadOnlySpan<byte>
and use this path conditionally if the entire message fits the buffer.ArrayPool
/MemoryPool
where applicable,Utf8String
to avoid UTF8 decoding which is super slow and accounts for around 5us in my example.All of this together should result in a drop on my benchmark from 18us to around 7.5us in parsing time.
@mkosieradzki
Ad the other optimizations you are suggesting (AggresiveInlining, using ArrayPool/MemoryPool and Utf8String) - I think they are orthogonal changes and we should handle them separately (e.g. the agressive inlining can be done for both existing CodedInputStream as for any of its future variants regardless what exactly they are).
@mkosieradzki I spent some time looking at your code and also experimenting myself. Some results:
ref ReadOnlySpan<T>
- you need to be able to update the position the the current buffer. That's definitely one of the possible approaches, but my impression was that the CodedInputStream actually maintains several extra fields of state that have important security consequences (e.g. the tracking max recursion depth etc.) and it might be difficult to pass all of these along as function parameters (and if we need to add a few more context variables in the future, we might be in trouble).someMessage.MergeFrom(ParsingContext context, ReadOnlySpan<byte> immediateBuffer)
, where ParsingContext
is not necessarily a class called like that, but it is something that maintains the parsing state as immediateBuffer
is processed (it tracks the position, recursion depth and other context variables). It also should have the ability to provide the next buffer segment in form of a ReadOnlySpanParsingContext
concept from above basically corresponds to the existing class CodedInputStream
(where the immediate buffer would be used instead of the existing private field byte[] buffer
). So the new more general parsing API could even look like someMessage.MergeFrom(CodedInputStream cis, ReadOnlySpan<byte> immediateBuffer)
and the older backward-compatible API would be pretty simple to implement in terms of the newer (hopefully faster) parsing API.ParsingContext
(open to better naming) explicitly as an actual class (or a struct which would reference CodedInputStream internally)What do you think? Btw, thanks again for spending time prototyping various approaches, it is useful to have an actual code to look at.
Because the changes with the approach I'm suggesting would be relatively significant I'd suggest creating a new experimental nuget package with this new implementation of parsing first (and see how that goes), rather than trying to merge these changes to upstream.
@jtattermusch Dropping the async requirement enables us to create a ref struct
to be used as ParsingContext
.
So, in that case we can declare ref struct CodedInputParsingContext
which is a stack-only type and is allowed to contain ReadOnlySpan<T>
etc.
This means we can probably create an implicit conversion between CodedInputStream
and CodedInputParsingContext
and use a single code path for CodedInputParsingContext
.
Do you like the idea of stack-only parsing context? I think I can try to prototype it.
Regarding the ref ReadOnlySpan<T>
: the version with in ReadOnlySpan<T>
and ref int pos
is slightly faster - it is called ParseUsingCodedInputSpanPosParser .
@mkosieradzki I've thought about the ref struct
and I'm not sure if it's a good idea. The good thing is the ReadOnlySpan can be part of this struct, but you'll need a reference to some class with context anyway (as non-immutable struct seems like a bad idea in general). I'm also unsure about the backward compatibility of ref struct
(which is a C# 7.2 concept) - how would the code look like to someone using an older runtime (e.g. net45)?
In a way, the difference between using ref struct CodedInputParsingContext { SomeContextClass context, ReadOnlySpan buffer}
and using two separate parsing parameters SomeContextClass context, ReadOnlySpan buffer
in the API seems only like syntactic sugar - both implementations would probably look almost exactly the same, so I don't quite see the added value of using ref struct
and potentially risking backward compatibility issues with older runtimes.
@jtattermusch I agree.
@mkosieradzki to prevent doing unnecessary changes, I'd say a prototype that parses from 2 arguments:
CodedInputStream context, ReadOnlySpan<byte> immediateBuffer
would be useful - that should be a sufficient proof of concept that minimizes the amount of changes needed. (Basically every ReadXXX method on CodedInputStream would get an overload that accepts those two args).
@jtattermusch I was rather thinking about ref ReadOnlySpan<T> immediateBuffer
so while iterating over ReadOnlySequence we can move pointer to the next buffer.
When iterating over a Stream
or an array based buffer it's enough to use immutable reference ReadOnlySpan<T>
, but our goal is to handle ReadOnlySequence<T>
efficiently.
BTW. My preliminary results
Method | Mean | Error | StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
ParseUsingClassic | 16.97 us | 0.1073 us | 0.1004 us | 5 |
ParseUsingClassicInline | 15.97 us | 0.1102 us | 0.0977 us | 4 |
ParseUsingNew | 14.97 us | 0.0939 us | 0.0878 us | 3 |
ParseUsingCodedInputSpanPosParser | 13.06 us | 0.0493 us | 0.0461 us | 2 |
ParseUsingCodedInputSpanPosParserInline | 12.49 us | 0.0803 us | 0.0751 us | 1 |
ParseUsingNew is a heavy-inlined version of classic (classic already benefits from some inlinining so it is not a true baseline) version optimized with an additional parameter ref ReadOnlySpan<T>
(as you suggested). I need to track down where the 2 us are lost.
The current experimental code is available: https://github.com/mkosieradzki/protobuf/tree/spans/csharp/src/TestProtoPiper/Benchmarks - it looks pretty bad, but it's only to prototype.
IMO C# protobuf implementation could strongly benefit performance-wise from allowing some unsafe code.
Background
First of all I am not saying that unsafe should be enabled in every build/every platform, having 100% managed library is always a nice feature. All I am suggesting is again some conditional feature on some platform.
However there is a trend to move towards unsafe mixing we can observe in .NET Core. For example a lot of .NET Core libraries use unsafe code due to need to interop with different kind of unmanaged libraries or just to handle AOT compiled code.
API change
Today CodedInputStream works in two modes: streaming with byte[] buffer and with a fixed buffer.
It would be really nice to be able to provide byte * and length to CodedInputStream constructor.
Alternatively we could create an UnmanagedCodedInputStream which would work on unmanaged memory.
This would require to abstracting CodedInputStream as an interface
Affected APIs
Async API #3166 makes absolutely no sense for UnmanagedCodedInputStream because it assumes everything have been already read.
Benefits
Scenarios
Scenario 1. NoSQL Database
User is using native NoSQL database (for example RocksDB/LevelDB) and is using protobuf for data persistence. Database returns native allocated pointer with buffer. Instead of copying entire buffer to the memory user can deserialize record straight from the returned memory native memory and then free the pointer. Very little GC is involved and there is practically no overhead.
Scenario 2. Stackalloc-ated buffers
For performance reasons to decrease heap allocations and allocate buffer on stack. This scenario can be alternatively handled by Buffer Pooling, however stackalloc seems to be more efficient.
Affected by
https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md - this work in progress improvement can allow to achieve the same goals in a managed way by providing some platform improvements.