mkosieradzki commented 7 years ago

IMO C# protobuf implementation could strongly benefit performance-wise from allowing some unsafe code.

Background

First of all, I am not saying that unsafe should be enabled in every build on every platform; having a 100% managed library is always a nice feature. All I am suggesting is a conditional feature, enabled on some platforms.

However, there is a trend in .NET Core towards mixing in unsafe code. For example, a lot of .NET Core libraries use unsafe code because they need to interop with various kinds of unmanaged libraries or just to handle AOT-compiled code.

API change

Today CodedInputStream works in two modes: streaming with byte[] buffer and with a fixed buffer.

It would be really nice to be able to provide byte * and length to CodedInputStream constructor.

Alternatively we could create an UnmanagedCodedInputStream which would work on unmanaged memory.

This would require abstracting CodedInputStream as an interface.

Affected APIs

Async API #3166 makes absolutely no sense for UnmanagedCodedInputStream because it assumes everything has already been read.

Benefits

It's extremely useful for interop (See scenario 1)
It can work with stackalloc (See scenario 2)
It allows even quicker deserialization: fixed-size primitive types, for example, can be read from a byte * by pointer dereference. (Endianness might be a problem here.)
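To make the last point concrete, here is a minimal sketch of reading a protobuf fixed32 straight from a byte *. The class and method names are hypothetical and not part of Google.Protobuf. It assumes the project is compiled with /unsafe and that System.Runtime.CompilerServices.Unsafe is available (a separate package on older frameworks).

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical helper, for illustration only.
public static unsafe class UnmanagedReader
{
    // Reads a fixed32 (little-endian on the wire) from raw memory.
    public static uint ReadFixed32(byte* data, int length, int offset)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (offset < 0 || length - offset < 4)
            throw new ArgumentOutOfRangeException(nameof(offset));

        byte* p = data + offset;
        if (BitConverter.IsLittleEndian)
        {
            // Fast path: host byte order matches wire order.
            // ReadUnaligned avoids assuming that p is 4-byte aligned.
            return Unsafe.ReadUnaligned<uint>(p);
        }

        // Big-endian host: assemble the value byte by byte.
        return (uint)(p[0] | (p[1] << 8) | (p[2] << 16) | (p[3] << 24));
    }
}
```

The endianness concern shows up as the explicit BitConverter.IsLittleEndian branch: only the little-endian path is a plain memory load.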

Scenarios

Scenario 1. NoSQL Database

The user runs a native NoSQL database (for example RocksDB/LevelDB) and uses protobuf for data persistence. The database returns a pointer to a natively allocated buffer. Instead of copying the entire buffer into managed memory, the user can deserialize the record straight from the returned native memory and then free the pointer. Very little GC is involved and there is practically no overhead.

Scenario 2. Stackalloc-ated buffers

For performance reasons, the user wants to decrease heap allocations and allocate the buffer on the stack. This scenario can alternatively be handled by buffer pooling; however, stackalloc seems to be more efficient.
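A minimal sketch of this scenario, assuming C# 7.2 or later, where a stackalloc result can be assigned to Span<byte> without an unsafe context. The names are hypothetical. The example encodes a varint into a stack buffer, so nothing is allocated on the heap.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class StackBufferExample
{
    // Encodes value as a base-128 varint; returns the number of bytes written.
    public static int WriteVarint32(Span<byte> destination, uint value)
    {
        int i = 0;
        while (value >= 0x80)
        {
            // Low seven bits plus the continuation flag.
            destination[i++] = (byte)(value | 0x80);
            value >>= 7;
        }
        destination[i++] = (byte)value;
        return i;
    }

    public static int EncodedLengthOf300()
    {
        // A 32-bit varint needs at most 5 bytes, so a small stack buffer is enough.
        Span<byte> scratch = stackalloc byte[5];
        return WriteVarint32(scratch, 300);
    }
}
```

The same WriteVarint32 accepts a pooled byte[] as well, because arrays convert implicitly to Span<byte>.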

Affected by

https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md - this work-in-progress platform improvement may make it possible to achieve the same goals in a managed way.

Aaronontheweb commented 7 years ago

Came here to say the same thing. Byte array copying inside Protobuf3 is one of the biggest sources of GC pressure in Akka.NET; we would like to be able to use pooled byte arrays (or Spans when they become part of the framework) and not have ByteString do a deep clone of them each time we use it.

Yes, we get it - it's dangerous. Understood, but let the end-users take the risk in order to put Protobuf to work in high performance contexts.

alexvaluyskiy commented 7 years ago

Protobuf Span API could look like this

namespace Google.Protobuf
{
    public class CodedInputStream
    {
        public CodedInputStream(ReadOnlySpan<byte> buffer);
        …
    }

    public class CodedOutputStream
    {
        public CodedOutputStream(Span<byte> buffer);
        …
    }

    public class MessageParser
    {
        public IMessage ParseFrom(ReadOnlySpan<byte> input);
        …
    }

    public static class MessageExtensions
    {
        public static void MergeFrom(this IMessage message, ReadOnlySpan<byte> data);
        public static void WriteTo(this IMessage message, Span<byte> output);
        …
    }
}

alexvaluyskiy commented 7 years ago

@Aaronontheweb At least this class should be public https://github.com/google/protobuf/blob/master/csharp/src/Google.Protobuf/ByteString.cs#L60

mkosieradzki commented 7 years ago

Thanks for backing me up :). If this proposal is accepted I am happy to create a proper PR. Problem with Span/ReadOnlySpan is that AFAIK it is not available yet and also that @jskeet will not be very keen on accepting PR requiring very-new compiler support and runtime support.

So I would be probably rather for the unsafe approach which can be widely used today.

Aaronontheweb commented 7 years ago

@mkosieradzki if they don't want to go forward with just exposing the Unsafe methods they have today, adding some overloads which use ArraySegment<byte> would also be a workable option in lieu of Span

alexvaluyskiy commented 7 years ago

@mkosieradzki System.Memory package supports netstandard1.0

mkosieradzki commented 7 years ago

@alexvaluyskiy Thanks. I have just checked that this package has been released in preview 3 months ago.

Are you aware whether it requires some specific compiler version? Or are you aware of any roadmap/official documentation?

@Aaronontheweb ArraySegment is not comparable to Span, because ArraySegment requires byte[] (does not support byte *) and Span can work on everything including byte[] and byte* or any other generic kind of memory.
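A small sketch of that difference, using the System.Memory types as they eventually shipped. The class and method names are hypothetical, and the native part needs /unsafe. One ReadOnlySpan<byte> routine serves a managed array, a stack buffer and native memory, whereas ArraySegment<byte> could only describe the first.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, for illustration only.
public static class SpanSources
{
    // A single routine for every kind of memory.
    public static int Sum(ReadOnlySpan<byte> data)
    {
        int total = 0;
        foreach (byte b in data) total += b;
        return total;
    }

    public static unsafe int SumAll()
    {
        // 1. Managed array (the only case ArraySegment<byte> can describe).
        byte[] managed = { 1, 2, 3 };
        int total = Sum(managed);

        // 2. Stack memory.
        Span<byte> onStack = stackalloc byte[2];
        onStack[0] = 4;
        onStack[1] = 5;
        total += Sum(onStack);

        // 3. Native memory behind a raw pointer.
        IntPtr native = Marshal.AllocHGlobal(2);
        try
        {
            Marshal.WriteByte(native, 0, 6);
            Marshal.WriteByte(native, 1, 7);
            total += Sum(new ReadOnlySpan<byte>((void*)native, 2));
        }
        finally
        {
            Marshal.FreeHGlobal(native);
        }
        return total;
    }
}
```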

jskeet commented 7 years ago

I'm in two minds about this. I do see the point - but I'm definitely concerned about the level of complexity we end up with, in terms of lots of different builds for different scenarios. (Adding a dependency for Span also makes me nervous, but I'm gradually warming to the idea of ValueTask for the async part...) The async aspect adds another layer of complexity here - by the time we've got a matrix of managed/unmanaged, async/non-async, multiple versions of .NET supported, I think it's going to get tricky. It makes it really easy to break customers with what may seem like a trivial change.

There's also the aspect of time commitment from Google to maintain all of this. I'm not on the protobuf team, and have a full plate already - so @anandolee would need to be very comfortable with it all.

In terms of ByteString - no, the Unsafe class should not be made public, in the same way that I would never expect Microsoft to release a public String.Unsafe class that made it trivial to mutate strings. ByteStrings are intended to be immutable; ByteString.Unsafe exists to allow us to be more efficient where the library can guarantee that sharing a byte array will not exhibit mutability.

One option you may want to consider is forking the library entirely, creating an OptimizedAndVeryUnsafe.Protobuf library which is wire compatible but exposes everything you've ever wanted to, pays less attention to being massively backward compatible all the time, takes whatever dependencies it wants and generally gives the user more of a loaded shotgun. I've always tried to optimize within the bounds of making the code easy to use safely and hard to shoot yourself in the foot. If you want to remove those restrictions, that's fine - but it shouldn't be in the same library.

mkosieradzki commented 7 years ago

@jskeet Thanks a lot for your response. My opinion would be to take the System.Memory path (without explicit unsafe). Span<byte> and ReadOnlySpan<byte> are incredibly great and safe wrappers for both byte[] and byte *.

If we could use System.Memory it would do all the heavy lifting (without requiring any UnmanagedCodedInputStream). We could just replace internal buffer with Span probably without changing a lot of code...

So I would suggest waiting for System.Memory release or I can try to create a fork using the preview version and we can merge it into the main library when ready.

Knowing that System.Memory is going to be released in a predictable future I think that special handling for byte * makes no sense.

I also think it's a very similar case to ValueTask... (and System.Buffers for a shared buffer pool) - it's yet another dependency required to achieve optimal performance.

BTW. For more distant future I am also researching different ways to achieve Arena allocations for C#. There is a promising project called Snowflake https://www.microsoft.com/en-us/research/publication/project-snowflake-non-blocking-safe-manual-memory-management-net/# .

jskeet commented 7 years ago

Yes, I'm definitely happier with Span than anything to do with unsafe code. However, I wouldn't want to do anything with it until it's had a full (non-beta) release. We could create PRs before then of course, but we shouldn't merge them.

mkosieradzki commented 7 years ago

@jskeet Thanks! I will try to create a PR in August so we can preview this.

alexvaluyskiy commented 7 years ago

@mkosieradzki @jskeet I think the new Span API should be similar to java's protobuf ByteBuffer Api

mkosieradzki commented 7 years ago

Exactly as I was afraid - Span requires new compiler version: https://github.com/aspnet/Announcements/issues/268 ...

jskeet commented 7 years ago

Let's see where it lands, in terms of requirements. It may be that it'll be harmless to expose it so long as Google.Protobuf is compiled with C# 7.2, which I'd be comfortable with (after that's released).

dchennells commented 6 years ago

Tooling support for the C# 7.2 Span framework types was added with VS2017 15.5, which was RTMed on Dec 4. Span is now well documented and there are a couple of good collateral pieces, including one by Stephen Toub, linked here.

https://blogs.msdn.microsoft.com/dotnet/2017/11/15/welcome-to-c-7-2-and-span/

A couple of issues for discussion:

1. At what point would it make sense to introduce the requirement for C# Google.Protobuf developers to be on VS2017 15.5+?
2. For the project owner: would adding this to the mix be seen as an enhancement or rather as an "optimization" seeking a problem?

I'd be willing to work on this if there were a clear interest (or acceptance criteria) with respect to number two, and we're within about six weeks on number one (@anandolee & @jskeet).

jskeet commented 6 years ago

I think it's reasonable to require that anyone developing the library uses VS2017 15.5 or the equivalent .NET Core SDK. There's a chunk of work required in order to update the SDK for continuous integration though.

Whether the Span methods are included in all output libraries is a different matter though. We'll need to look at the dependencies and whether there are issues consuming Span from older versions of VS/C#. (For example, if VS2015 users could easily end up using Span in a dangerous way, we may want to add a netstandard2.0 target and only expose Span there. We'll have to see.)

I think the first port of call should be some prototyping and benchmarking.

jtattermusch commented 6 years ago

I think there's a good potential for using Span optimizations in gRPC C# (general idea: there's the grpc_csharp_ext native library and the messages need to be moved between managed and native layer and copying can be expensive. Being able to transparently address both managed/unmanaged memory with the same code seems useful.) - but we haven't really looked into these optimizations in detail (and there's complications so it's not possible to just say if it is going to be worth it without proper analysis and some experimenting).

mkosieradzki commented 6 years ago

I have started some experimentation in #3530 . I need to revisit my experiments since 15.5 is out. Before proper tooling support Span was super-slow (as expected).

Also, it's important to remember that Span is a stack-only type, which has a lot of serious implications. For example, we might need to:

make CodedInputStream a ref struct (doesn't seem right)
or use heap-friendly Memory instead of Span
or something else
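The first two options can be sketched side by side. These are hypothetical reader types, not CodedInputStream itself. The ref struct may hold a span but can never be a class field, be boxed, or be captured by an async method; the class stores ReadOnlyMemory<byte> instead and obtains a span on every call.

```csharp
using System;

// Option 1 (hypothetical): a stack-only reader that holds the span directly.
public ref struct SpanReader
{
    private ReadOnlySpan<byte> _buffer;

    public SpanReader(ReadOnlySpan<byte> buffer) { _buffer = buffer; }

    public bool IsAtEnd => _buffer.IsEmpty;

    public byte ReadByte()
    {
        byte b = _buffer[0];
        _buffer = _buffer.Slice(1);
        return b;
    }
}

// Option 2 (hypothetical): a heap-friendly reader built on ReadOnlyMemory<byte>.
public sealed class MemoryReader
{
    private ReadOnlyMemory<byte> _buffer;

    public MemoryReader(ReadOnlyMemory<byte> buffer) { _buffer = buffer; }

    public bool IsAtEnd => _buffer.IsEmpty;

    public byte ReadByte()
    {
        // The Memory -> Span conversion happens on every call.
        byte b = _buffer.Span[0];
        _buffer = _buffer.Slice(1);
        return b;
    }
}
```

MemoryReader can live in a field and survive an await; SpanReader cannot, which is where the difficulty with async support comes from.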

This is especially difficult in context of #3166 - I have a lot of doubts related to the proper async-support.

I would really love to start a high level design discussion.

dchennells commented 6 years ago

Perhaps we should first discuss where we believe the value for C# protobuf could originate with the new framework types and then try to align on a preliminary approach to assess whether that value actually exists. I'm mindful that this is already a well-crafted code-base and we also face the barrier/cost @jskeet noted of changing the SDK. No one wishes to waste their time developing PRs that never see the light of day, so we should try to fail sooner rather than later in the case that there isn't much value.

The rationale for the first wave of these framework types can be boiled down to a few major categories, based on what I'm seeing. Feel free to add or subtract from this picture:

Reducing cost of allocations
- Allocations resulting from slicing/substring operations
- Allocations that could now conditionally be put on the stack and in a safe/managed manner (when they are local arrays of primitives that turn out to be small)
Increasing managed code goodness
- Reducing the unsafe footprint in the use and passing around of unsafe pointers in our library code

I did a little preliminary mechanical code analysis/triage and here is what I found. Outside of the context of unit tests, there are about 150 instances in the code of obvious allocations as indicated from the following starter list of search terms: "new byte["; ".Copy"; ".BlockCopy"; ".ToArray("; ".Substring("; "new MemoryStream"; ".Clone()." If you are interested, you can see the list in the attached spreadsheet.

Of these, probably only a small subset would be worth attacking, at least initially, and specifically those that:

can be avoided with the new types/overloads;
are in hot code regions (i.e. can deliver significant improvement in the benchmarks)

(As we know, there is no unsafe code presently in the code base.)

Purely for the purposes of initial discussion, a generic plan for the first couple of iterations in this type of situation, where we're skeptical of the benefits and want to proceed cautiously to avoid time wastage, might be as follows:

Establish benchmark details (procedures, platform(s) and artifacts)
Generate benchmarks for the current C# release (tagged as 3.5.1).
Seek code sites for proof of concept ("POC") interventions by looking to the intersection of: (a) hot code regions (as identified by a tool such as PerfView while running the benchmark cases); and, (b) code regions that have buffer-oriented allocations that might benefit from the new framework types
Select the most promising 2 - 8 code sites for preliminary interventions with span etc.
Within a fork of the same release used above, complete a quick-and-dirty POC iteration.
Obtain preliminary comparative benchmarks for the POC and come back up for air to the discussion in order to evaluate the value of proceeding with a more systematic/disciplined effort.
Decide whether to proceed, on what timing, and who to assign as reviewer(s) etc.
If it's a "Go", begin working on the actual PR(s)...

Let's have a good round of discussion on these details, alternative approaches, and any other thoughts. If something like the above turns out to make sense, I've also included a list of logistical questions in the attached file, which someone from Google could address. There are also some resources on this type of code-optimization effort.

Allocations Inventory, Questions and Resources.xlsx

jtattermusch commented 6 years ago

@mkosieradzki can you clarify the backwards compatibility of Span:

What is the lowest version of .NET where Span can be used in the code? (not necessarily with seeing performance gains, not making things slower is enough - my understanding is that Span has a fallback implementation in case it's not supported by the runtime). Google.Protobuf needs to support at least net45+
What is the oldest runtime where one can see performance gains from Span (= where Span is properly supported)?

mkosieradzki commented 6 years ago

@jtattermusch That's a topic worth researching. I don't know exactly, and the answer might be complicated, because it consists of 3 elements:

library (compatibility) - I believe System.Memory requires .NET Standard 1.1
compiler (ref struct concept and safety checks)
runtime optimizations, which are distributed among different versions of the runtime starting with .NET Core 2.0 (http://adamsitnik.com/Span/); I would not be surprised if there were another batch of significant improvements in .NET Core 2.1 and later.

According to: https://github.com/dotnet/corefxlab/blob/master/docs/specs/span.md

Runtimes that support by-ref fields and returns will get the fast ref-field Span<T>. Other runtimes will get the slower three-field Span<T>.

From benchmarks at http://adamsitnik.com/Span/ we can see 5-10% performance degradation of Spans vs Arrays on reads (on pre .NET Core 2.0 runtimes), however I believe that introducing safe stackallocs might help regain the performance even on older runtimes.

The biggest benefit of Span<T> is the ability to have a single API for unmanaged and managed memory. However, it will definitely cost us a huge API redesign, as Span<T> cannot be a member of a non-stack-only type. Let me emphasize this: you cannot make Span a member of CodedInputStream, and this defeats all simple refactorings.

To be able to benefit from using Span<T> we need to talk about designing a completely different API.

And having this in mind, we have 2 possible approaches:

1) Let's go fully unmanaged - use pointers
pros: fastest, reverse compatible, we can consider keeping the buffer as a CodedInputStream class member, compatible with the async approach
cons: will not work in partial trust environments (this is a legacy concept), requires more serious code reviews, when using unmanaged memory we need to do managed pointer pinning

2) Let's go Span<T>
pros: safe, modern, requires pointer pinning, might work in some fully managed environments (I am not sure)
cons: slower on legacy runtimes, needs a really serious redesign, will not work with async

I believe that Span<T> is a lot about using call stack to guarantee lifetime of the pinned memory pointer.

Also to address problems about async support:

Although I am the author of the async support (see #3173 ), I am not 100% convinced this is the way to go. I believe it can be reasonable to use protocol buffers in the following manner:
1. Pre-fetch entire buffer (handling delimiters)
2. Schedule parsing using fast path only on a pre-fetched buffer.

This manner should be also compatible with gRPC streaming approach (and this is a level where I believe asynchrony fits the best).

This minimizes the time buffer needs to be pinned and strongly simplifies the code.
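The two steps above could look roughly like this. It is a sketch with hypothetical names: the length delimiter is assumed to be decoded already, and the parser is passed in as a delegate rather than tied to any Google.Protobuf API. All asynchrony stays in the prefetch; the parse itself is synchronous and only sees fully buffered data.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class PrefetchThenParse
{
    // Step 1: asynchronously fetch exactly `length` bytes.
    public static async Task<byte[]> PrefetchAsync(Stream source, int length)
    {
        var buffer = new byte[length];
        int read = 0;
        while (read < length)
        {
            int n = await source.ReadAsync(buffer, read, length - read)
                                .ConfigureAwait(false);
            if (n == 0) throw new EndOfStreamException();
            read += n;
        }
        return buffer;
    }

    // Step 2: run the synchronous fast-path parser on the prefetched buffer.
    public static async Task<T> ReadMessageAsync<T>(
        Stream source, int length, Func<byte[], T> parse)
    {
        byte[] payload = await PrefetchAsync(source, length).ConfigureAwait(false);
        return parse(payload);
    }
}
```

A pooled buffer could replace the new byte[length] allocation; that is left out to keep the sketch short.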

As I have mentioned before I have created a prototype for unsafe version of protobuf introducing even arena allocator approach to protobuf (see #3530 ).

I believe that introducing arena allocation together with unsafe approach will bring more and more benefits.

Last but not least: using unmanaged memory can allow faster parsing by unsafely casting buffers into primitive types like int or long. Both protobuf and x86 are little-endian, which should significantly improve parsing speed.
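For comparison, fixed-width reads can also be written without an unsafe cast. The sketch below swaps in System.Buffers.Binary.BinaryPrimitives (from System.Memory) for the pointer cast described above; the class name is hypothetical. On a little-endian host these calls are expected to reduce to a plain load, and on a big-endian host they swap bytes.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper, for illustration only.
public static class FixedWidthReader
{
    // fixed32: four little-endian bytes.
    public static uint ReadFixed32(ReadOnlySpan<byte> source) =>
        BinaryPrimitives.ReadUInt32LittleEndian(source);

    // double: eight little-endian bytes reinterpreted as an IEEE 754 value.
    public static double ReadDouble(ReadOnlySpan<byte> source) =>
        BitConverter.Int64BitsToDouble(
            BinaryPrimitives.ReadInt64LittleEndian(source));
}
```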

mkosieradzki commented 6 years ago

One more point about the arena allocations support: it could be reasonable to use the memory from the original buffer, which makes parsing even faster. The most important thing to remember about this parsing approach is that arena-allocated + buffer-sourced data should be short-lived, and the API should provide a way to transform it into long-living heap-allocated C# objects if required by the developer.

In many scenarios long-living heap-allocated objects are not required or helpful at all.

The fact that we are using C# should not force us to downgrade to a Java-like approach with zero control over the allocations.

dchennells commented 6 years ago

I'm thinking of Span as useful primarily in the narrow case of discrete, single threaded, synchronous functions and their helpers. In that case, the ability to potentially put a short array on the stack or to encapsulate neatly a reference to a slice or substring that can be passed to a helper without incurring an allocation or muddying the water with offsets is a nice (but limited) gain.

As far as the (different) topic of a public interface for core memory-encapsulating buffer objects, I'll defer to the prior participants in that conversation except to note that in my own work, which heavily depends on these, I almost always use such objects now in the context of various (multi-threaded and often parallelized) pipeline patterns in which exposing (heap-based) arrays as public properties is indeed indispensable. I try to avoid GC-pressure from a ton of small buffer or string allocations in those patterns; size buffers based on what's found to be optimal for a machine architecture (not individual document/message/cell size); and I reuse managed arrays in rotating, multi-segment buffers (with coarse-grained write vs. read gating across stages). For what it's worth, my own benchmarking has not demonstrated performance benefits for sequential access scenarios from using pointers with or without native memory regions in that particular context (though I do use pointers, unsafe code, and native memory/file handles for other use cases). All the same, a reusable core buffer strategy requires considerable bookkeeping. A preference for managed core buffer objects with ordinary indexed access may come down simply to managed lifetime as has been noted and cross-platform portability.

If folks ultimately conclude that Spans vs. core buffer refactoring are different topics, perhaps we should spin off the Spans to a separate issue?

jtattermusch commented 6 years ago

@mkosieradzki I looked a bit more into what would the dependency on Span<> mean for Protobuf and gRPC:

System.Memory depends on netstandard1.0 (I believe it will use the unoptimized "fallback" version unless you target netstandard2.0), which means you can use it in net45 projects, but you will need a newer version of nuget and a newer IDE to be able to build those projects (they need to have knowledge of what "netstandard1.0" is)

gRPC currently targets net45 and netstandard1.5, Protobuf targets net45 and netstandard1.0.

As both gRPC and Protobuf currently target net45 explicitly, adding a dependency on System.Memory would have this effect:

users using older versions of Visual Studio (e.g. 2013 and I think some older versions of 2015) and mono would be unable to build their projects (as they wouldn't be able to resolve the System.Memory dependency). Users would be able to build their legacy projects if they upgraded their IDE / toolchain.

Also see: http://adamsitnik.com/Span/#how-to-use-it-today

mkosieradzki commented 6 years ago

@jtattermusch I might be wrong, but my understanding is:

Framework target has no impact on performance at compile time. Neither package target framework version nor current compilation target version. Specifically if you target for example .NET Standard 1.1 you have a single assembly that does not need any recompilation - no matter what the compile target of the entire application is, and moreover no matter whether target runtime will support in-boxed Span or not.
If I understand correctly fast version of Span is provided by the runtime (as an in-boxed type) (newer .NET Framework version installed on the machine or newer .NET Core runtime). My understanding is that if a compatible runtime is used, then bindings are redirected to the in-box version of Span.
Newer version of compiler is required to enforce safety checks for Span as a stack only type and to be able to use ref struct concept. If you are using older compiler you can create incorrect library: I mean something that will throw an exception when an optimized version of Span is used. It also allows you to use type-safe stackallocs etc without using /unsafe. It might also introduce some compile-time optimizations, but I am not sure about this one.
https://www.nuget.org/packages/System.Memory/4.4.0-preview1-25305-02 - if you look at the dependencies for this package - it has a reverse compatible target for net45, so you should still be able to compile with double targets net45;netstandard15 and net45;netstandard10 - and use System.Memory package.

kwaclaw commented 6 years ago

Coming to this discussion late.

Having gathered a little bit of experience working on a Span-based API for LMDB (https://github.com/LMDB/lmdb), I would agree that an entirely new API is necessary.

From a user perspective the simplest thing would be to add an overload of WriteTo like this: public void WriteTo(Span<byte> output)

and one for parsing, like this: public T ParseFrom(ReadOnlySpan<byte> data)

Benefits are two-fold:

Span can wrap native pointers, it really becomes a universal buffer API.
This makes it possible to have zero-copy interop with native libraries.
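To illustrate the shape of these two overloads, here is a sketch with a hypothetical hand-written message type (not generated code and not the Google.Protobuf API). It has a single fixed32 field and, unlike real proto3 serialization, always writes the field even when it holds the default value.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical message with one fixed32 field (field number 1).
public sealed class Ping
{
    // Tag = (field number << 3) | wire type; wire type 5 is 32-bit.
    private const byte IdTag = (1 << 3) | 5;

    public uint Id { get; set; }

    public int CalculateSize() => 1 + 4;

    // Serializes straight into caller-provided memory.
    public void WriteTo(Span<byte> output)
    {
        output[0] = IdTag;
        BinaryPrimitives.WriteUInt32LittleEndian(output.Slice(1), Id);
    }

    // Parses straight from caller-provided memory, without copying the input.
    public static Ping ParseFrom(ReadOnlySpan<byte> data)
    {
        if (data.Length < 5 || data[0] != IdTag)
            throw new FormatException("Unexpected input.");
        return new Ping { Id = BinaryPrimitives.ReadUInt32LittleEndian(data.Slice(1)) };
    }
}
```

The caller decides where the bytes live: a stackalloc buffer, a pooled array, or a span wrapped around a pointer returned by a native library.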

jtattermusch commented 6 years ago

@kwaclaw the API you are suggesting sounds reasonable (but would also mean that Google.Protobuf would need to depend on System.Memory nuget and we should think twice before adding any new dependencies)

There's one more thing to consider for interop with native libraries: for RPC libraries that are using a native layer (such as gRPC), it won't be uncommon that the native buffers delivered on the wire will be fragmented - there's no guarantee that a single protobuf message delivered by gRPC will be present in memory in a single contiguous area (gRPC has a concept of byte buffer which consists of N "slices" that represent contiguous spans in memory) - so being able to parse a single message from a collection of Span objects might be useful (which would probably result in some new APIs added to the CodedInputStream object).

mkosieradzki commented 6 years ago

@jtattermusch Good point! (on non-contiguous memory slices)

Regarding CodedInputStream we need to wait for @jskeet to validate my points about it. IMO CodedInputStream cannot be used together with ReadOnlySpan<byte>.

Regarding non-contiguous memory slices from native library maybe it's a good idea to transition from pull approach and CodedInputStream and switch to the push approach and use Pipelines instead?

See: https://github.com/dotnet/corefxlab/blob/master/docs/roadmap.md

and:

https://github.com/dotnet/corefxlab/blob/master/docs/specs/pipelines.md

and:

https://github.com/dotnet/corefxlab/blob/master/docs/specs/pipelines-io.md

mkosieradzki commented 6 years ago

@jtattermusch regarding non-contiguous memory slices: .NET Core 2.1 preview 1 contains new class ReadOnlySequence<T> (https://github.com/dotnet/corefx/blob/master/src/System.Memory/src/System/Buffers/ReadOnlySequence.cs) which should be helpful in this scenario.

jtattermusch commented 6 years ago

@mkosieradzki I've been recently looking into some gRPC/Protobuf optimizations and the ReadOnlySequence<T> type is looking pretty good. here are some advantages that caught my eye:

1. It allows parsing non-contiguous buffers, which is very useful, because most IO-related operations that don't do copying will probably end up with buffers that are not contiguous (file and network access tends to read data in chunks).
2. ReadOnlySequence is a standard API used by System.IO.Pipelines (and they seem to be the way to go for high performance IO in .NET Core - see a very well written blog here: https://blogs.msdn.microsoft.com/dotnet/2018/07/09/system-io-pipelines-high-performance-io-in-net/)
3. gRPC internally uses non-contiguous buffers as well
4. ReadOnlySequence itself is a struct (=no allocations) and it has allocation-free constructors from more basic types like Memory or array[] - so by supporting parsing from ReadOnlySequence, parsing from array[] or contiguous memory buffers becomes super trivial. For multi-segment buffers, the ReadOnlySequenceSegment instances are classes, but the type is abstract, so custom implementations that use pooling are possible IMHO.
5. Supporting parsing from ReadOnlySequence gives us support for parsing from Memory for free (see 4).
6. ReadOnlySequence is in the nuget package System.Buffers which supports a large range of target frameworks.

Overall, I think experimenting with adding a CodedInputStream constructor that consumes ReadOnlySequence would definitely be worth it.
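As a sketch of point 4, this is roughly how a single-segment and a multi-segment ReadOnlySequence<byte> can be built. BufferSegment and SequenceFactory are hypothetical names; a pooled implementation would reuse segment instances instead of allocating them.

```csharp
using System;
using System.Buffers;

// Hypothetical segment type; ReadOnlySequenceSegment<T> is abstract.
public sealed class BufferSegment : ReadOnlySequenceSegment<byte>
{
    public BufferSegment(ReadOnlyMemory<byte> memory) { Memory = memory; }

    // Links a new segment after this one and returns it.
    public BufferSegment Append(ReadOnlyMemory<byte> memory)
    {
        var next = new BufferSegment(memory)
        {
            // Offset of the new segment from the start of the whole sequence.
            RunningIndex = RunningIndex + Memory.Length
        };
        Next = next;
        return next;
    }
}

public static class SequenceFactory
{
    // Contiguous case: wraps the array without copying or extra segments.
    public static ReadOnlySequence<byte> FromArray(byte[] data) =>
        new ReadOnlySequence<byte>(data);

    // Fragmented case: two chunks exposed as one logical buffer.
    public static ReadOnlySequence<byte> FromTwoChunks(byte[] first, byte[] second)
    {
        var head = new BufferSegment(first);
        BufferSegment tail = head.Append(second);
        return new ReadOnlySequence<byte>(head, 0, tail, tail.Memory.Length);
    }
}
```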

Btw, since Span<> and Memory<> now exist (which allow accessing both managed and unmanaged memory), I think providing support for parsing from byte* is not relevant anymore (and we should be discussing Span<> and Memory<> support instead).

mkosieradzki commented 6 years ago

@jtattermusch Awesome, thanks for having a look. I have now some spare time (I think a week or two) and I would love to spend it on contributing to protobuf/gRPC.

I did a lot of experimentation with ReadOnlySequence<T> and Span<T> 10 weeks ago and even created some prototypes - please see: https://github.com/google/protobuf/issues/3166#issuecomment-386865048

jtattermusch commented 6 years ago

@mkosieradzki very interesting prototypes - thanks for sharing!

I had a few questions:

You're mentioning that "Asynchronous parser using non-contiguous memory from Pipe is roughly 3x slower than the original parser. It is also zero alloc." 3x slower sounds prohibitively high. Do we understand why that is - is it because of the async or because of the non-contiguous? It seems to me that non-contiguous buffers as input should have virtually no extra overhead (assuming the buffer segments have reasonable size - e.g. sth in the ~8KB ballpark)
could the CodedInputStream here take a ReadOnlySequence directly instead of a PipeReader? ReadOnlySequence is from System.Buffers, while PipeReader is one level abstraction higher and requires System.IO.Pipelines (which doesn't have such good platform support as System.Buffers)

mkosieradzki commented 6 years ago

@jtattermusch The code I did most of the experimentation with (in the other thread) uses PipeReader (actually very well described in the blog post you have mentioned) instead of Stream.

I think that idea of providing a CodedInputStream constructor with fixed ReadOnlySequence<T> is one of the solutions however it has some limitations.

We need not to use Stream. However my understanding is that if we use ReadOnlySequence<T> then we do not use streaming at all - so it's a pretty neat idea and makes it easy to retain compatibility.
The only limitation is that it cannot be used efficiently with streams, which might be the case with large message parsing. I was thinking about messages in the 1-4MB range, which might be considered a bit too large to buffer, or about some customers using gRPC messages of 4MB-100MB because they didn't move to streaming yet.

But honestly, from my perspective, if we assume that for gRPC all messages can be prefetched into a non-contiguous buffer, it's a perfectly good solution (assuming a 4MB limit). It also helps to avoid an uncontrolled working set. This decision is up to you. I am OK with both solutions: prefetched and async streamed (without gRPC streaming).

mkosieradzki commented 6 years ago

@jtattermusch

Is it because of the async or because of the non-contiguous? It seems to me that non-contiguous buffers as input should have virtually no extra overhead (assuming the buffer segments have reasonable size - e.g. sth in the ~8KB ballpark)

Conceptually you are right. But the problem is that Span<T> can live only on the stack and as a consequence you cannot keep it alive inside CodedInputStream unless you make CodedInputStream a stack-only type as well… breaking a lot of compatibility. So the pain is that you can optimize ReadOnlySequence<T> -> ReadOnlyMemory<T>, but you cannot optimize ReadOnlyMemory<T> -> ReadOnlySpan<T>, which must be done on every call to CodedInputStream… and that was something I was trying to overcome with an ultra-fast void MergeFrom(ReadOnlySpan<T>), so the span was always on the stack.
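A sketch of what keeping the span on the stack can mean in practice (hypothetical names, not the prototype's code): the ReadOnlyMemory<byte> -> ReadOnlySpan<byte> conversion is done once per segment inside a single method, and the inner loop then works on the span only. The example decodes a varint that may straddle a segment boundary; for brevity it treats anything longer than five bytes as malformed.

```csharp
using System;
using System.Buffers;

// Hypothetical helper, for illustration only.
public static class SegmentedVarint
{
    public static bool TryReadVarint32(
        ReadOnlySequence<byte> input, out uint value, out int consumed)
    {
        value = 0;
        consumed = 0;
        int shift = 0;

        foreach (ReadOnlyMemory<byte> segment in input)
        {
            // Converted once per segment; the span never leaves this stack frame.
            ReadOnlySpan<byte> span = segment.Span;
            for (int i = 0; i < span.Length; i++)
            {
                byte b = span[i];
                consumed++;
                value |= (uint)(b & 0x7F) << shift;
                if ((b & 0x80) == 0) return true;   // last byte of the varint
                shift += 7;
                if (shift >= 35) return false;      // more than 5 bytes: give up
            }
        }
        return false;                               // ran out of input
    }
}
```

A reader class that stores the ReadOnlyMemory<byte> and exposes per-field methods cannot hoist the conversion this way, because the span cannot be kept in a field between calls.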

Unfortunately I don't have figures for this and I apologize for that… I am pretty sure I had a prototype but it wasn't very fast. But also not as slow as the async version.

could the CodedInputStream here take a ReadOnlySequence directly instead of a PipeReader? ReadOnlySequence is from System.Buffers, while PipeReader is one level abstraction higher and requires System.IO.Pipelines (which doesn't have such good platform support as System.Buffers)

I see your point here, but I really can't promise that ReadOnlySequence<T> variant is feasible without adding the generated void MergeFrom(ReadOnlySpan<T>) and I honestly regret I have not saved the numbers to prove this.

jtattermusch commented 6 years ago

@mkosieradzki I see your point and the tradeoffs, but right now I'm driven mostly by what could be made useful in the near future (I believe incremental improvements are the only way forward for big problems like this).

For gRPC, when you receive a message you basically get a segmented buffer (in the native code it's grpc_byte_buffer, which can be read segment by segment by grpc_byte_buffer_reader) that maps relatively well to the concept of the C# ReadOnlySequence. By the time the message is received (when the C# layer gets notified about receiving a message), the number of segments is already known.
It looks like taking a dependency on System.IO.Pipelines is more problematic than taking one on System.Buffers.

If you want to take a look at my (very much in progress) code: https://github.com/grpc/grpc/compare/master...jtattermusch:performance_slicing (the point is basically to retrieve the buffer segments for a received message without any copying and make them available to the C# layer - right now the grpc_byte_buffer gets extracted to what's called ReadOnlySliceBuffer, but it's basically the same idea as ReadOnlySequence). Another thing I'm trying to accomplish is to avoid allocating, so the ReadOnlySliceBuffer is getting reused (I can replace my custom ReadOnlySliceBuffer with ReadOnlySequence if I can avoid allocations).

jtattermusch commented 6 years ago

On Thu, Jul 12, 2018 at 2:34 PM mkosieradzki notifications@github.com wrote:

@jtattermusch https://github.com/jtattermusch

Is it because of the async or because of the non-contiguous? It seems to me that non-contiguous buffers as input should have virtually no extra overhead (assuming the buffer segments have reasonable size - e.g. something in the ~8KB ballpark)

Conceptually you are right. But the problem is that Span<T> can live only on the stack, and as a consequence you cannot keep it alive inside CodedInputStream unless you make CodedInputStream a stack-only type as well… which would break a lot of compatibility. So the pain is that you can optimize ReadOnlySequence<T> -> ReadOnlyMemory<T>, but you cannot optimize ReadOnlyMemory<T> -> ReadOnlySpan<T>, which must be done on every call to CodedInputStream… and that was something I was trying to overcome with an ultra-fast void MergeFrom(ReadOnlySpan<T>), so that the span was always on the stack.

That's a very good point - thanks for bringing that up. So is ReadOnlyMemory -> ReadOnlySpan slow? I assume it's pretty fast, but I understand that with the way CodedInputStream is now written, we would be performing the conversion practically for each byte of input.

Unfortunately I don't have figures for this and I apologize for that… I am pretty sure I had a prototype but it wasn't very fast. But also not as slow as the async version.

Could the CodedInputStream here take a ReadOnlySequence directly instead of a PipeReader? ReadOnlySequence is from System.Buffers, while PipeReader is one level of abstraction higher and requires System.IO.Pipelines (which doesn't have platform support as good as System.Buffers)

I see your point here, but I really can't promise that a ReadOnlySequence<T> variant is feasible without adding the generated void MergeFrom(ReadOnlySpan<T>), and I honestly regret that I did not save the numbers to prove this.

Do you still have the MergeFrom(ReadOnlySpan) prototype?


mkosieradzki commented 6 years ago

The benchmarks are basically here: https://github.com/mkosieradzki/protobuf/blob/spans/csharp/src/TestProtoPiper/Benchmarks/ParseAddressBook.cs

Here is the hand-crafted MergeFrom(ReadOnlySpan<T>) prototype: https://github.com/mkosieradzki/protobuf/blob/spans/csharp/src/TestProtoPiper/Addressbook.cs#L336

That gives me an idea: it might be worth comparing MergeFrom(ReadOnlySpan<T>) with MergeFrom(ReadOnlyMemory<T>) - I think I will give it a shot. That result should be somewhat conclusive :).

mkosieradzki commented 6 years ago

@jtattermusch I have created two new benchmarks to capture the overhead of ReadOnlyMemory<T> -> ReadOnlySpan<T> and, as I expected, the overhead is significant (because it involves memory pinning etc)…

                                          Method |     Mean |     Error |    StdDev | Rank |
------------------------------------------------ |---------:|----------:|----------:|-----:|
                      ParseUsingCodedInputReader | 37.56 us | 0.1789 us | 0.1674 us |    5 |
                               ParseUsingClassic | 18.13 us | 0.0575 us | 0.0509 us |    3 |
                      ParseUsingCodedInputParser | 72.76 us | 0.4153 us | 0.3884 us |    6 |
                  ParseUsingCodedInputSpanParser | 15.40 us | 0.0761 us | 0.0675 us |    2 |
               ParseUsingCodedInputSpanPosParser | 15.11 us | 0.0513 us | 0.0480 us |    1 |
 ParseUsingCodedInputSpanPosParserReadOnlyMemory | 18.81 us | 0.0824 us | 0.0688 us |    4 |

ParseUsingCodedInputSpanPosParser passes the position (in ReadOnlySpan<T> plus ref int pos) instead of a ref ReadOnlySpan<T>… and is the winner today ;) I will push my current benchmarks to the repo so you can reproduce it on your side.

The comparison you want to look at is ParseUsingCodedInputSpanPosParserReadOnlyMemory vs ParseUsingCodedInputSpanPosParser - the only difference is the conversion between ReadOnlyMemory<T> and ReadOnlySpan<T> on each call, which simulates what you wanted to do initially.

mkosieradzki commented 6 years ago

@jtattermusch

I have pushed my latest benchmarks to the repo. Please give it a try. I hope I have proven my point that void MergeFrom(ref ReadOnlySpan<T>) or void MergeFrom(in ReadOnlySpan<T>, ref int pos) MUST be generated to ensure the optimal performance.
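For readers who have not opened the prototype, this is roughly the shape such a span-plus-position reader takes. It is a minimal sketch, not the code from my branch: the class name SpanVarint is invented here, and it only shows a base-128 varint read with the in ReadOnlySpan<byte>, ref int pos signature discussed above.

```csharp
using System;

public static class SpanVarint
{
    // Decodes one base-128 varint from buffer starting at pos and advances pos.
    public static ulong ReadVarint64(in ReadOnlySpan<byte> buffer, ref int pos)
    {
        ulong result = 0;
        for (int shift = 0; shift < 64; shift += 7)
        {
            if (pos >= buffer.Length)
                throw new InvalidOperationException("Truncated varint.");
            byte b = buffer[pos++];
            // Low 7 bits carry payload, the high bit marks continuation.
            result |= (ulong)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)
                return result;
        }
        throw new InvalidOperationException("Malformed varint.");
    }
}
```

Because the span and the position are both plain arguments, nothing in this path needs to be stored on the heap between calls.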

The next question is whether we want to add async Pipes support OR stick to synchronous parsing. I am perfectly OK with synchronous parsing.

If that's the case, we need to figure out the size threshold up to which copying non-contiguous memory into contiguous (and pinned) memory for the duration of parsing is efficient (this can be stack memory for small buffers or a buffer pool for huge buffers). Depending on the result, we need to decide whether it makes sense to create a void MergeFrom(in ReadOnlySequence<T>, ref int pos) or not, i.e. whether the threshold is above or significantly below 4MB.

Alternatively we can start with your idea of ReadOnlySequence in CodedInputStream and add two optimizations:

1. If message is in contiguous memory use void MergeFrom(in ReadOnlySpan<T>, ref int pos).
2. If message is in non-contiguous memory but is below memory size limit - stackalloc a buffer and use solution from point 1.
3. Otherwise use classic parsing (CodedInputStream) and if there will be a submessage small enough it will fall into cases 1 or 2.
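The three cases above could be dispatched roughly like this. This is a sketch under assumptions: SequenceParsing, StackLimit and ParseFromSpan are hypothetical names, the 1024-byte limit is arbitrary, ParseFromSpan merely sums bytes as a stand-in for a generated span-based MergeFrom, and the last branch copies to an array instead of falling back to CodedInputStream.

```csharp
using System;
using System.Buffers;

public static class SequenceParsing
{
    // Hypothetical limit for copying a segmented message to the stack.
    private const int StackLimit = 1024;

    // Stand-in for a generated span-based MergeFrom; it just sums the
    // bytes so the example stays self-contained.
    private static int ParseFromSpan(ReadOnlySpan<byte> buffer, ref int pos)
    {
        int sum = 0;
        while (pos < buffer.Length) sum += buffer[pos++];
        return sum;
    }

    public static int Parse(ReadOnlySequence<byte> message)
    {
        int pos = 0;
        if (message.IsSingleSegment)
        {
            // Case 1: already contiguous, parse in place.
            return ParseFromSpan(message.First.Span, ref pos);
        }
        if (message.Length <= StackLimit)
        {
            // Case 2: small and segmented, copy to a stack buffer first.
            Span<byte> scratch = stackalloc byte[(int)message.Length];
            message.CopyTo(scratch);
            return ParseFromSpan(scratch, ref pos);
        }
        // Case 3: large and segmented. The proposal is classic parsing here;
        // this sketch simply flattens to an array to stay short.
        return ParseFromSpan(message.ToArray(), ref pos);
    }
}
```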

Does it make sense to you?

Honestly, I don't think it makes any sense to start parsing before the entire message has arrived, because it makes the RPC server more prone to a slow-sender class of DoS attacks while giving no real performance benefits except slightly decreased latency. But I really trust your opinion here. What do you think?

mkosieradzki commented 6 years ago

@jtattermusch I have started work on branch https://github.com/mkosieradzki/protobuf/tree/read-only-sequence - this is basically an implementation of CodedInputStream using ReadOnlySequence<byte> instead of byte[] as the buffer. Let's see what comes out of this.

mkosieradzki commented 6 years ago

@jtattermusch I have restarted my work on branch https://github.com/mkosieradzki/protobuf/tree/netcore2_1. Currently I have added the least intrusive version: it adds support for .NET Core 2.1 (to make room for special optimizations) and creates a place for passing ReadOnlySequence<byte> and ReadOnlyMemory<byte>, but does not break compatibility.

Now I am going to make sure I do not introduce any performance regressions by switching to the ReadOnlyMemory<T> abstraction instead of an array of T.

If I can't ensure this (especially on the pre-.NET Core 2.1 platforms), I will need to add a fast path using compatBuffer, which I am already using on pre-.NET Core 2.1 for Stream.Read.

mkosieradzki commented 6 years ago

I have run a bunch of further tests and I am... a bit disappointed, but I have some important conclusions:

I have implemented yet another alternative, a ValueTask-based variant of ParseUsingCodedInputSpanPosParser, just by wrapping results into ValueTask<T> - it's slow beyond all recognition. So async as the only variant is NOT an option.
After experimenting and making a direct comparison between ReadOnlyMemory<T> and T[] in the context of CodedInputStream, I think it's infeasible to implement a ReadOnlyMemory<T>-based variant without significant performance regressions. VarInt and Tag parsing is so slow… despite multiple optimization attempts, including pinning the memory once and creating a span from the already pinned pointer.

We are talking about performance regressions orders of magnitude larger than just copying entire non-contiguous memory to a single array.

Maybe I am missing something.

Just to sum up. IMO the array-based buffer should stay in the CodedInputParser.

I think it would be worth productizing ParseUsingCodedInputSpanPosParser for its simplicity, efficiency and elasticity (you can use it with any contiguous piece of memory). I would also recommend adding codegen for void MergeFrom(in ReadOnlySpan<T>, ref int pos) - because without this codegen ParseUsingCodedInputSpanPosParser is useless.

Also, this codegen should be conditionally used by the standard MergeFrom(CodedInputStream) if the entire message is already either in a contiguous buffer or available from a ReadOnlySequence<byte> and is reasonably small.

@jtattermusch Please let me know whether I should create a PR (I will need to productize my PoC) as described, or if you need to do more investigation or you simply disagree.

mkosieradzki commented 6 years ago

I was playing a bit with aggressive inlining… and I went from:

                            Method |     Mean |     Error |    StdDev | Rank |
---------------------------------- |---------:|----------:|----------:|-----:|
                 ParseUsingClassic | 18.39 us | 0.0996 us | 0.0932 us |    2 |
 ParseUsingCodedInputSpanPosParser | 15.19 us | 0.0998 us | 0.0934 us |    1 |

down to:

                                  Method |     Mean |     Error |    StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
                       ParseUsingClassic | 16.75 us | 0.1096 us | 0.1025 us |    4 |
                 ParseUsingClassicInline | 15.97 us | 0.0528 us | 0.0494 us |    3 |
       ParseUsingCodedInputSpanPosParser | 12.47 us | 0.0864 us | 0.0808 us |    2 |
 ParseUsingCodedInputSpanPosParserInline | 12.31 us | 0.0304 us | 0.0254 us |    1 |

only by adding [MethodImpl(MethodImplOptions.AggressiveInlining)] on some methods and improving constructors (to avoid using Func<T> factories)...
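For reference, the attribute usage looks like this. The helper below is an illustrative sketch (InlinedReads.ReadFixed32 is a name made up for this example, not the actual method I annotated); it reads a little-endian 32-bit value, which is the kind of tiny hot-path method where the hint tends to matter.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class InlinedReads
{
    // Small hot-path helper; the attribute asks the JIT to inline it if possible.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static uint ReadFixed32(ReadOnlySpan<byte> buffer, ref int pos)
    {
        // Assemble four little-endian bytes into one 32-bit value.
        uint value = buffer[pos]
            | ((uint)buffer[pos + 1] << 8)
            | ((uint)buffer[pos + 2] << 16)
            | ((uint)buffer[pos + 3] << 24);
        pos += 4;
        return value;
    }
}
```

The attribute is only a hint, so any gain should be confirmed with a benchmark rather than assumed.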

I think that this optimization alone deserves a PR. In general I see huge potential in improved codegen. I mean: what's the point of generating (de-)serialization code if it still uses indirect calls (interfaces, callbacks, etc. on the hot path) and does not exercise inlining? IMO it's a good idea to generate predictable code and help the compiler/jitter as much as you can.

So this is my high-level optimization plan proposal:

Improve the codegen and utilize more inlining (this might mean getting rid of the entire FieldCodec abstraction).
Consider adding support for ReadOnlySequence<byte> as an alternative to Stream in CodedInputStream without introducing any significant changes (just refill the array-based buffer from the sequence instead of from the Stream if a sequence is available) - but to be honest I don't see any huge advantages over implementing a ReadOnlySequence<T>-based Stream… I really tried.
Add codegen for ReadOnlySpan<byte> and use this path conditionally if the entire message fits the buffer.
Avoid allocating temporary buffers and use ArrayPool/MemoryPool where applicable.
Eventually, in .NET Core 2.2/3.0, consider using the upcoming Utf8String to avoid UTF8 decoding, which is super slow and accounts for around 5us in my example.
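As an illustration of the ArrayPool item, a segmented message can be flattened into a rented array and handed to a span-based parser without a fresh allocation per message. This is only a sketch: PooledCopy.SumBytes is a hypothetical name and the byte sum stands in for real parsing.

```csharp
using System;
using System.Buffers;

public static class PooledCopy
{
    // Flattens a (possibly segmented) sequence into a pooled array,
    // processes it as one contiguous span, and returns the array.
    public static int SumBytes(ReadOnlySequence<byte> message)
    {
        int length = checked((int)message.Length);
        byte[] rented = ArrayPool<byte>.Shared.Rent(length);
        try
        {
            message.CopyTo(rented); // the rented array may be longer than needed
            ReadOnlySpan<byte> contiguous = rented.AsSpan(0, length);
            int sum = 0;
            foreach (byte b in contiguous) sum += b; // stand-in for real parsing
            return sum;
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```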

All of this together should result in a drop on my benchmark from 18us to around 7.5us in parsing time.

jtattermusch commented 6 years ago

@mkosieradzki

I'd leave async aside for now as it makes things more complicated than they need to be.
I'd be generally open to adding logic that allows parsing from Spans, but IMHO we do want to avoid having two almost identical generated methods (MergeFrom(CodedInputStream) and MergeFrom(ReadOnlySpan)), so I'd still want to think about some trick to express both variants using a single generated method.

As for the other optimizations you are suggesting (AggressiveInlining, using ArrayPool/MemoryPool and Utf8String) - I think they are orthogonal changes and we should handle them separately (e.g. the aggressive inlining can be done both for the existing CodedInputStream and for any of its future variants, regardless of what exactly they are).

jtattermusch commented 6 years ago

@mkosieradzki I spent some time looking at your code and also experimenting myself. Some results:

I finally understood why for the ReadOnlySpan-based methods you are using ref ReadOnlySpan<T> - you need to be able to update the position in the current buffer. That's definitely one of the possible approaches, but my impression was that the CodedInputStream actually maintains several extra fields of state that have important security consequences (e.g. tracking the max recursion depth etc.) and it might be difficult to pass all of these along as function parameters (and if we need to add a few more context variables in the future, we might be in trouble).
The ReadOnlySpan-based approach can only achieve parsing from a single ReadOnlySpan (if I have a non-contiguous buffer in which message boundaries don't match the buffer segment boundaries, I can't really use that approach).
The abstract API that is the most powerful is something like someMessage.MergeFrom(ParsingContext context, ReadOnlySpan<byte> immediateBuffer), where ParsingContext is not necessarily a class called like that, but it is something that maintains the parsing state as immediateBuffer is processed (it tracks the position, recursion depth and other context variables). It also should have the ability to provide the next buffer segment in the form of a ReadOnlySpan (which would be used as the new immediateBuffer once the current immediateBuffer becomes empty).
As you can notice, the ParsingContext concept from above basically corresponds to the existing class CodedInputStream (where the immediate buffer would be used instead of the existing private field byte[] buffer). So the new, more general parsing API could even look like someMessage.MergeFrom(CodedInputStream cis, ReadOnlySpan<byte> immediateBuffer), and the older backward-compatible API would be pretty simple to implement in terms of the newer (hopefully faster) parsing API.
Another option is to define the ParsingContext (open to better naming) explicitly as an actual class (or a struct which would reference CodedInputStream internally).
In any case, there would only be one implementation of the methods doing the actual parsing and only one generated implementation of MergeFrom, and the other variants/overloads would be implemented via this single implementation.
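A rough sketch of the two-argument shape described above, assuming a heap-allocated context plus a span passed alongside it. ParsingContext, ContextReader and the single-byte length prefix are all invented for illustration, the recursion limit of 64 is an arbitrary value, and fetching the next buffer segment is left out.

```csharp
using System;

// Hypothetical context class; in the proposal this role could be played
// by CodedInputStream itself.
public sealed class ParsingContext
{
    public int Position;            // offset into the current immediate buffer
    public int RecursionDepth;      // current nesting level
    public int RecursionLimit = 64; // illustrative value
}

public static class ContextReader
{
    public static byte ReadByte(ParsingContext ctx, ReadOnlySpan<byte> immediateBuffer)
    {
        if (ctx.Position >= immediateBuffer.Length)
            throw new InvalidOperationException("Immediate buffer exhausted.");
        return immediateBuffer[ctx.Position++];
    }

    // Reads a nested payload whose length is given by a single leading byte,
    // enforcing the recursion limit tracked on the context.
    public static ReadOnlySpan<byte> ReadNested(ParsingContext ctx, ReadOnlySpan<byte> immediateBuffer)
    {
        if (++ctx.RecursionDepth > ctx.RecursionLimit)
            throw new InvalidOperationException("Recursion limit exceeded.");
        int length = ReadByte(ctx, immediateBuffer);
        ReadOnlySpan<byte> payload = immediateBuffer.Slice(ctx.Position, length);
        ctx.Position += length;
        ctx.RecursionDepth--;
        return payload;
    }
}
```

All mutable state sits on the context object, so adding another tracked variable later would not change any method signature.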

What do you think? Btw, thanks again for spending time prototyping various approaches, it is useful to have actual code to look at.

Because the changes with the approach I'm suggesting would be relatively significant, I'd suggest creating a new experimental nuget package with this new implementation of parsing first (and see how that goes), rather than trying to merge these changes upstream.

mkosieradzki commented 6 years ago

@jtattermusch Dropping the async requirement enables us to create a ref struct to be used as ParsingContext.

So, in that case we can declare ref struct CodedInputParsingContext which is a stack-only type and is allowed to contain ReadOnlySpan<T> etc.

This means we can probably create an implicit conversion between CodedInputStream and CodedInputParsingContext and use a single code path for CodedInputParsingContext.
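To show what I mean by a stack-only context, here is a minimal sketch. Only the name CodedInputParsingContext comes from the proposal; the fields and members are assumptions for illustration, and the implicit conversion from CodedInputStream is omitted because it would need access to that class's internals.

```csharp
using System;

// Stack-only context: a ref struct may hold a span as a field,
// which an ordinary class or struct cannot do.
public ref struct CodedInputParsingContext
{
    private ReadOnlySpan<byte> _buffer;
    private int _pos;

    public CodedInputParsingContext(ReadOnlySpan<byte> buffer)
    {
        _buffer = buffer;
        _pos = 0;
    }

    public bool IsAtEnd => _pos >= _buffer.Length;

    public byte ReadByte() => _buffer[_pos++];
}
```

Since the struct is mutable, it would have to be passed by ref to any method that advances it.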

Do you like the idea of stack-only parsing context? I think I can try to prototype it.

Regarding the ref ReadOnlySpan<T>: the version with in ReadOnlySpan<T> and ref int pos is slightly faster - it is called ParseUsingCodedInputSpanPosParser.

jtattermusch commented 6 years ago

@mkosieradzki I've thought about the ref struct and I'm not sure if it's a good idea. The good thing is that the ReadOnlySpan can be part of this struct, but you'll need a reference to some class with context anyway (as a mutable struct seems like a bad idea in general). I'm also unsure about the backward compatibility of ref struct (which is a C# 7.2 concept) - what would the code look like to someone using an older runtime (e.g. net45)? In a way, the difference between using ref struct CodedInputParsingContext { SomeContextClass context, ReadOnlySpan buffer } and using two separate parsing parameters SomeContextClass context, ReadOnlySpan buffer in the API seems only like syntactic sugar - both implementations would probably look almost exactly the same, so I don't quite see the added value of using ref struct and potentially risking backward compatibility issues with older runtimes.

mkosieradzki commented 6 years ago

@jtattermusch I agree.

jtattermusch commented 6 years ago

@mkosieradzki to prevent doing unnecessary changes, I'd say a prototype that parses from 2 arguments: CodedInputStream context, ReadOnlySpan<byte> immediateBuffer would be useful - that should be a sufficient proof of concept that minimizes the amount of changes needed. (Basically every ReadXXX method on CodedInputStream would get an overload that accepts those two args).

mkosieradzki commented 6 years ago

@jtattermusch I was rather thinking about ref ReadOnlySpan<T> immediateBuffer so while iterating over ReadOnlySequence we can move pointer to the next buffer.

When iterating over a Stream or an array based buffer it's enough to use immutable reference ReadOnlySpan<T>, but our goal is to handle ReadOnlySequence<T> efficiently.
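A sketch of why the ref matters for ReadOnlySequence<T>: when the immediate buffer runs out, the reader can replace it with the next segment in place. SequenceContext, SegmentedReader and Chunk are hypothetical names; Chunk exists only to build a two-segment sequence for the usage example.

```csharp
using System;
using System.Buffers;

// Hypothetical context that remembers where we are in the sequence.
public sealed class SequenceContext
{
    private readonly ReadOnlySequence<byte> _sequence;
    private SequencePosition _next;

    public SequenceContext(ReadOnlySequence<byte> sequence)
    {
        _sequence = sequence;
        _next = sequence.Start;
    }

    // Hands out the next non-empty segment; returns false at end of input.
    public bool TryGetNextSegment(out ReadOnlySpan<byte> segment)
    {
        while (_sequence.TryGet(ref _next, out ReadOnlyMemory<byte> memory))
        {
            if (!memory.IsEmpty) { segment = memory.Span; return true; }
        }
        segment = default;
        return false;
    }
}

public static class SegmentedReader
{
    // immediateBuffer is passed by ref so it can be swapped for the next segment.
    public static bool TryReadByte(SequenceContext ctx, ref ReadOnlySpan<byte> immediateBuffer, out byte value)
    {
        if (immediateBuffer.IsEmpty && !ctx.TryGetNextSegment(out immediateBuffer))
        {
            value = 0;
            return false;
        }
        value = immediateBuffer[0];
        immediateBuffer = immediateBuffer.Slice(1);
        return true;
    }
}

// Minimal segment type, only needed to build a multi-segment sequence.
public sealed class Chunk : ReadOnlySequenceSegment<byte>
{
    public Chunk(byte[] data, Chunk previous = null)
    {
        Memory = data;
        if (previous != null)
        {
            previous.Next = this;
            RunningIndex = previous.RunningIndex + previous.Memory.Length;
        }
    }
}
```

With an in ReadOnlySpan<T> parameter the callee could not hand the new segment back to its caller, which is the reason for preferring ref here.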

BTW, my preliminary results:

                                  Method |     Mean |     Error |    StdDev | Rank |
---------------------------------------- |---------:|----------:|----------:|-----:|
                       ParseUsingClassic | 16.97 us | 0.1073 us | 0.1004 us |    5 |
                 ParseUsingClassicInline | 15.97 us | 0.1102 us | 0.0977 us |    4 |
                           ParseUsingNew | 14.97 us | 0.0939 us | 0.0878 us |    3 |
       ParseUsingCodedInputSpanPosParser | 13.06 us | 0.0493 us | 0.0461 us |    2 |
 ParseUsingCodedInputSpanPosParserInline | 12.49 us | 0.0803 us | 0.0751 us |    1 |

ParseUsingNew is a heavily inlined version of classic (classic already benefits from some inlining, so it is not a true baseline), optimized with an additional parameter ref ReadOnlySpan<T> (as you suggested). I need to track down where the 2 us are lost.

The current experimental code is available: https://github.com/mkosieradzki/protobuf/tree/spans/csharp/src/TestProtoPiper/Benchmarks - it looks pretty bad, but it's only to prototype.

protocolbuffers / protobuf

[CSharp] Allow Span<byte>-based parsing in CodedInputStream #3431
