Comments Requested: Reducing FlatSharp Allocations

jamescourtney commented 1 year ago

Over the years, FlatSharp has gotten better at reducing allocations. Initially, everything was a class: Tables, Structs, Unions, and Vectors. Now, structs are optionally value types and unions are always value types. However, Tables and Vectors are still always reference types.

The tension with Tables and Vectors is that FlatSharp uses virtual property overrides to fulfill many of the FlatBuffers contracts. This gives tremendous flexibility, and is really the reason that FlatSharp can provide 4 different deserialization modes. However, it also means that FlatSharp is stuck with classes for Tables (virtual properties) and Vectors (IList<T> implementations).

This post is to gather some feedback about what users would like to see in a potential "allocation-reduction" feature.

Current State & Background

FlatSharp's stated goals are usability and speed, in that order. FlatSharp tries to be idiomatic, which is to say that it tries to look like a "normal" .NET serialization library, which is why the current abstractions exist. For better or worse, the idiomatic thing to do in C# is to create new objects and let the GC deal with the cleanup. IDisposable can also be used, but the real intent there is to free up native resources.

However, many users of FlatBuffers have incredibly high performance cases in mind. In such cases, even small stops for GC can be problematic. So reducing allocations has a double benefit: 1) Reduces the frequency of GC 2) Reduces the time per GC

Object Pooling

Object pooling is often a solution for avoiding allocations. If FlatSharp had an object pool, the general idea is that these managed objects could be "reset" to point at a new buffer or a new spot in an existing buffer. However, there are some problems with object pooling generally, and some FlatSharp specific problems:

Object pools are generally not faster than just creating a new object, unless the object represents a limited resource or is otherwise expensive to produce. FlatSharp's objects are very lightweight.
Object pools generally have some concurrency concerns. Even lock-free implementations can significantly slow down code like FlatSharp, where performance is often measured in nanoseconds. Alternatively, we could pool per thread, but some patterns like "always Parse on thread A and return the objects to the pool on Thread B" would render the idea moot.

FlatSharp has some concerns as well, since pooling and object recycling looks very different depending upon the deserialization mode:

Mode	Parent -> Child Reference	Child -> Parent References	Can Recreate Child?
Lazy	No	No	Yes
Progressive	Yes	No	Yes
Greedy	Yes	No	No
GreedyMutable	Yes	No	No

For example, returning a Lazy object to a pool is no problem at all, since there are no internal dependencies. The next time that object is needed, it can just be reconstituted from the underlying buffer.

Returning a Progressive object to a pool is more difficult; In Progressive mode, there are Parent --> Child references, which means that when a Child is returned to a pool, the parent references also need to be invalidated. However, Parents can also be returned to a pool, so now references need to be invalidated in both directions. Over-invalidating is not ideal, but does not break Progressive mode since it still has the capability to consult the underlying buffer to reconstitute the correct nodes.

Greedy is even more challenging, since those objects do not maintain a reference to the underlying buffer, which means that it would be incorrect to return a child to the pool before its parent.

One could imagine a "top-down" approach to object pooling:

protected override void Dispose(bool disposing)
{
    if (!disposing)
    {
        return;
    }

    if (this.parentLink is null)
    {
         this.child0.parentLink = null;
         this.child1.parentLink = null;

         this.child0.Dispose();
         this.child1.Dispose();

         this.Reset();
         Pool<Something>.Return(this);
    }
}

This code would implicitly work for all serialization approaches:

Since Lazy never has a parentLink, returning to the pool is always immediate
Since Progressive and Greedy always have parentLinks, return to the pool happens when the root node is disposed, recursively. Dispose would be a No-Op for all but the root node, which would return the full graph to the pool.

However, there are some concerns for this:

Is the cost of the pool greater than the cost of GC?
How expensive is the traversal logic?
Should we use .Dispose()?
How should vectors work? IList<T> does not implement IDisposable.

Value Type Traversal Options

Expose some value-type semantics that allow traversing a FlatBuffer using value types. There is a lot to unpack here:

Does this mean a completely separate Value-Type parsing track?
What about serialization?
The official FlatBuffer repo already models everything as Value types. What's the benefit of FlatSharp doing it too?
Could we come up with something approaching a query language that can jump to a specific spot in a buffer before actualizing any values? Basically, a smart pointer:

// each of these returns a different value type?
string nameOfItem7 = new OuterTableReference(someArray)
                                    .GetVector()
                                    .GetItem(7)
                                    .GetName()
                                    .Actualize(); // Invokes normal FlatSharp parsing logic?

Questions

If you have an opinion here, please respond to this issue. I'd like to get a good amount of feedback before implementing a solution. Both of these are interesting. I'd like feedback on:

How expensive GC is in your scenarios. Does your runtime have generational GC? Does FlatSharp contribute meaningfully to your GC workload?
Which of these options (or another option of your own creation) sounds better to you?
Answers to any of the questions above.

pattymack commented 1 year ago

Hello James, Yes, GC is real bad - for our game's client especially. We try to keep temporary allocations to an absolute minimum, which in C# is a lot of work. So, maybe just let us do the allocations and change the Parse API to accept a reference to an object and fill that out. That way, no need for Flatsharp to get into the pooling business. We and others probably already have a pooling solution. Thanks

jamescourtney commented 1 year ago

That way, no need for Flatsharp to get into the pooling business.

It is a little more nuanced than that, unfortunately. I'm not sure how familiar you are, but FlatSharp returns subclasses of your data definitions when you deserialize. These are sometimes not public and/or otherwise have mangled names.

We could certainly do something along the lines of:

public interface IObjectPool
{
     bool TryGet<T>(out T? item);
     void Return<T>(T item);
}

If the methods were generic, then FlatSharp could supply the concrete type in terms of T when invoking the methods. I'm still not sure that I want FlatSharp to accept an instance of IObjectPool in the parse method -- I am leaning towards a static singleton because it means less state to plumb through the stack and smaller memory overhead by not having to store that IObjectPoolreference everywhere. This is not dissimilar to ArrayPool<T>.Shared, which FlatSharp will use for pooling internal array allocations.

The other thing I'm not sure about is how configurable the IObjectPool instance needs to be aside from the number of objects to retain. I have a prototype working with ConcurrentQueue<T> as the source and haven't been able to do better, though I still need to test ConcurrentBag<T>.

Astn commented 1 year ago

Zero allocation would be amazing!

I have a use case where I merge (add) objects. For simplicity I'll demonstrate in JSON.

{
    "who": "cat"
}

mergeWith()

{
    "where": [{ "address" : "a street" }]
}

mergeWith()

{
    "where": [{ "phone" : "1234" }]
}

produces:

{
    "who": "cat"
    "where": [{ "phone" : "1234"}, {"address" : "a street" }]
}

So I find my self scanning through all the properties to see where they differ and building a new object. All my objects are provided through their own byte[] but some times the object doesn't use all the available bytes, and I can serialized the object back into the same byte[] to avoid allocating a new byte[].

Being able to perform that merge operation without allocating for each entry in the object would be great! And is something valuable enough that I've already contemplated a few times on how such an extension to FlatBuffers could be done.

I was imagining it could work in a similar way to a ReadOnlySpan<T> and Slice() if there was only a way to traverse the object through it's metadata. Maybe something along the lines of foreach(ref ... ?

This isn't working, but close as I could get in the time I had. I'm sure I'm miss-using ref structs here:

  /*
  SharpLab tools in Run mode:
    • value.Inspect()
    • Inspect.Heap(object)
    • Inspect.Stack(value)
    • Inspect.MemoryGraph(value1, value2, …)
*/
using System;

 readonly ref struct TraverseNode{
        public readonly int type;
        public readonly ReadOnlySpan<byte> value;
        public TraverseNode(int type, ReadOnlySpan<byte> value){
            this.type = type;
            this.value = value;
        }
 }
 struct Thing {
     byte[] root;

     public static Thing GetRootAsThing(byte[] buffer){
         Thing t;
         t.root = buffer;
         return t;
     }
     public ThingEnumerable GetRefEnumerator() {
         return new ThingEnumerable(root.AsSpan());
     }
     public ref struct ThingEnumerable {
         int currentPos;
         ReadOnlySpan<byte> root;
         TraverseNode traverseNode;
         public ThingEnumerable(ReadOnlySpan<byte> root){
             currentPos = 0;
             this.root = root;
         }
         public bool MoveNext(){
             // todo: fix
             traverseNode = new TraverseNode(0, root.Slice(0));
             return false;
         }
         public TraverseNode Current() { return this.traverseNode; }
     }
 }

class Example{
    static void Main(){
        Merge(new byte[10], new byte[10]);
    }
    static byte[] Merge(Byte[] current, Byte[] extra){
         var thing1 = Thing.GetRootAsThing(current);
         var thing2 = Thing.GetRootAsThing(extra);
         var itr1 = thing1.GetRefEnumerator();
         var itr2 = thing2.GetRefEnumerator();
        // todo: don't alloc if current.legth can fit all entries we need to merge in
        // instead shift some bytes around inside current and return that
        var merged = new byte[current.Length];
        while(itr1.MoveNext() && itr2.MoveNext()){
           var entry1 = itr1.Current();
           var entry2 = itr2.Current();
           if(entry1.type != entry2.type || !entry1.value.SequenceEqual(entry2.value)){
               // do fun stuff like copy both
           } else {
               // only copy one
           }
        }
        if(false){
            return current;
        }
        return merged;

    }
}

jamescourtney commented 1 year ago

Long time, @Astn! Glad to see that you're still around :)

Zero allocation would be amazing!

Let's not go too crazy. I'm thinking of this as substantially-reduced allocations. For example, I can't keep from allocating in some situations like reading a string.

I can serialized the object back into the same byte[] to avoid allocating a new byte[].

Are you using FlatSharp for this? I sincerely have no idea how it will behave if you try to overwrite the same buffer that's in use by the object. I assume the serialization and the parse would race with each other (unless you're not using Lazy, in which case I can see it).

Being able to perform that merge operation without allocating for each entry in the object would be great!

Are the allocations you're seeing coming from the traversal of the object? Or from the serialization to a new object? I think the latter case is something you can handle today if you pool lists, buffers, and FlatSharp "top level" objects. The allocate-on-parse is the scenario that I'm really trying to think about, since those happen internal to the library and are often of different types than you know about as the user, since they're generally subclasses of the objects you actually work with.

I think I'm leaning the direction of pooling rather than traversal APIs at this time, since it seems to gel a bit better with what C#/Unity devs expect and fits in better with what FlatSharp already does. Do you think that a pooling solution here would effectively address your problems?

From prototypes, object pooling does add considerable overhead, often approaching half the speed of non-pooled mode (on .NET 7 at least...), which is really a bummer. However, for Unity scenarios where GC is non-generational, I get the sense that they'd probably rather have predictable performance with minimal GC rather than absolute highest straight-line speed.

pattymack commented 1 year ago

It is a little more nuanced than that, unfortunately. I'm not sure how familiar you are, but FlatSharp returns subclasses of your data definitions when you deserialize. These are sometimes not public and/or otherwise have mangled names.

Could there be static "Create" function that allocates the internal sub-class and returns parent-class?

Astn commented 1 year ago

I can serialized the object back into the same byte[] to avoid allocating a new byte[].

Are you using FlatSharp for this?

I'm actually cheating a bit right now cause I couldn't get it to perform like I wanted with just FlatSharp. Right now I'm taking the objects that need to be merged and if they both can fit in the byte[] already allocated, then I serialize them both to it one after the other. This way I can do the merge lazily during a later read operation instead of incurring that cost during save. Makes the read operation slower, but that's better when writes are more frequent than reads.

Then during a later read operation I merge them before returning them to the caller. And if I did a merge, then I take that byte[] and then store it back into FASTER with an in-place update.

FASTER RMW operation using InPlaceUpdater.

jamescourtney commented 1 year ago

It is a little more nuanced than that, unfortunately. I'm not sure how familiar you are, but FlatSharp returns subclasses of your data definitions when you deserialize. These are sometimes not public and/or otherwise have mangled names.

Could there be static "Create" function that allocates the internal sub-class and returns parent-class?

The other thing I've observed after some prototyping with Pooling is that the difference between static IObjectPool? { get; set; } and static DefaultObjectPool? { get; set; } is about 50%:

~1000ns using the concrete type
~1500ns using the interface
~400ns just allocating new instances each time.

This obviously scales with the number of items going in and out of the pool. FlatSharp generally tries really hard not to have any virtual method calls that aren't necessary. Frequent virtual dispatches can tank performance. Note that DefaultObjectPool in this case still exposes some configuration knobs, such as changing the limit, etc. You just can't swap in your own implementation.

This is a tricky call for me. Is there a super compelling case for plugging in a custom object pool? What is different about it than, say, ConcurrentQueue<T>?

jamescourtney commented 1 year ago

I can serialized the object back into the same byte[] to avoid allocating a new byte[].

Are you using FlatSharp for this?

I'm actually cheating a bit right now cause I couldn't get it to perform like I wanted with just FlatSharp. Right now I'm taking the objects that need to be merged and if they both can fit in the byte[] already allocated, then I serialize them both to it one after the other. This way I can do the merge lazily during a later read operation instead of incurring that cost during save. Makes the read operation slower, but that's better when writes are more frequent than reads.

Then during a later read operation I merge them before returning them to the caller. And if I did a merge, then I take that byte[] and then store it back into FASTER with an in-place update.

FASTER RMW operation using InPlaceUpdater.

@Astn -- that makes some degree of sense. Thanks! I'm trying to parse your responses, but I haven't gotten a clear read on whether you have a preference for object pooling or for some other option? Object Pooling is definitely more natural for FlatSharp since I can just add .ReturnToPool() on generated objects and have it do the right thing depending on the context.

Astn commented 1 year ago

I don't think I would use an object pool in my case. I would look for a completely stack based solution. If I'm force to allocate then I'd prefer to do it only once to resize the underlying byte[]. I think I would lean towards some kind of visitor approach (for ex: ExpressionVisitor).

jamescourtney commented 1 year ago

I don't think I would use an object pool in my case. I would look for a completely stack based solution. If I'm force to allocate then I'd prefer to do it only once to resize the underlying byte[]. I think I would lean towards some kind of visitor approach (for ex: ExpressionVisitor).

The official Google C# implementation is entirely stack-based if you want it to be. While FlatSharp might be able to do some things better, I'm quite skeptical of reinventing that approach here. What benefits over the official library do you see FlatSharp being able to provide if it were to go down that road?

Astn commented 1 year ago

I started with the Google implementation, and eventually switched to FlatSharp because I felt like 90% of my code was dealing building buffers and that was getting quite tiring. I really haven't looked at the Google library recently though.

Some things I think would be useful to add as some kind of extension. I'm not suggesting you change the way your library works. I think it's really great for 98% of what I want to do. I'm thinking of an side car type of feature set.

I'd like to validate a buffer in a fast forward only zero allocation pass. In this case I don't want to parse it and manipulate it.
I'd like to be able to extract (visit, view) or get a Span<byte> or Memory to a specific entry in the buffer to pick out some value.
I think some sort of traversal feature that worked on the a buffer directly and let you somehow find the thing your looking for would be amazing.

jamescourtney commented 1 year ago

I'd like to be able to extract (visit, view) or get a Span or Memory to a specific entry in the buffer to pick out some value.

I think some sort of traversal feature that worked on the a buffer directly and let you somehow find the thing your looking for would be amazing.

So, building buffers with the official library can be tedious. But actually using them might do what you're hoping. I'd encourage you to give it a shot, because you'll get an API that looks how you're expecting. They're also all value types. So you could do something where you use FlatSharp for serialization and "general purpose" traversal, but fall back to Flatc for cases where you need to zoom to a specific field or whatnot. I'm not trying to drive you away from my library of course, but I don't want to reinvent what they've done.

One thing I could consider doing would be emitting "Flatc" code as well from FlatSharp, since Flatsharp already uses Flatc, this wouldn't be difficult. Addtionally, FlatSharp could probably go through and adjust the namespaces for you so there aren't conflicts.

I'd like to validate a buffer in a fast forward only zero allocation pass. In this case I don't want to parse it and manipulate it.

That's something I can definitely think about. Today, validation happens "on demand" mainly through array bounds checks as you actually use the buffer.

jamescourtney commented 1 year ago

@pattymack and @joncham -- I have a prototype of object pooling if you'd like to provide some feedback. The nuget packages are attached here: FlatSharp.Runtime.7.0.0.zip FlatSharp.Compiler.7.0.0.zip

There are a few things to keep in mind:

Object pooling is opt-in. Use --gen-poolable true from the command line to generate Poolable code. Or add <FlatSharpPoolable>true</FlatSharpPoolable> to your .csproj file.
With --gen-poolable true specified, Unions become reference types (but they're poolable!)
Once you have a poolable object, use .ReturnToPool() to return the object. Specifying true as the parameter forces the object to return. I recommend never specifying this as doing so can break internal links between objects.
In Lazy mode, all objects need to be .ReturnToPool'ed individually. In non-lazy modes, all calls to .ReturnToPool() are a no-op unless invoked from the root of the parse tree or using force.
The object pool is not swappable. This is for performance reasons. The only knob isFlatSharp.DefaultObjectPool.MaxToRetain. This is a "soft max", and it applies for each distinct type.
The object pool only applies for deserialized object graphs. Anything you new() yourself is your job to pool.
Never use an object after it's been returned to a pool. It may fail or may give you results from another parsed FlatBuffer. Objects are resilient to adding themselves to the pool twice, but thread safety isn't something FlatSharp guarantees.
Object Pooling is slower than allocate-and-GC (on .NET 7 at least). Results in Unity may vary.
FlatSharp 7 is a prerelease build. Don't use this in production unless you want to live dangerously.
Sample usage: https://github.com/jamescourtney/FlatSharp/blob/objectPools/src/Tests/FlatSharpPoolableEndToEndTests/PoolingTests.cs

Finally, some benchmarks:

FlatSharp 7 without pooling:	Method	Mean	Error	StdDev	P25	P95	Code Size	Gen0	Gen1
Parse_StringTable_SingleString	58.50 ns	1.833 ns	0.476 ns	58.33 ns	58.87 ns	305 B	0.0100	-	168 B
Parse_StringTable_SingleString_Repeated	98.38 ns	2.791 ns	0.995 ns	97.74 ns	99.75 ns	241 B	0.0100	-	168 B
Parse_StringTable_Vector	835.12 ns	11.538 ns	2.996 ns	834.29 ns	838.43 ns	305 B	0.2069	0.0019	3472 B
Parse_StringTable_Empty	36.37 ns	0.219 ns	0.078 ns	36.34 ns	36.45 ns	305 B	0.0043	-	72 B
Parse_PrimitivesTable_Empty	57.49 ns	0.176 ns	0.063 ns	57.47 ns	57.55 ns	382 B	0.0081	-	136 B
Parse_PrimitivesTable_Full	65.17 ns	0.284 ns	0.101 ns	65.09 ns	65.28 ns	382 B	0.0081	-	136 B
Parse_StructTable_SingleRef	41.06 ns	1.363 ns	0.486 ns	40.69 ns	41.73 ns	228 B	0.0086	-	144 B
Parse_StructTable_SingleRef_WriteThrough	41.23 ns	0.932 ns	0.332 ns	41.03 ns	41.73 ns	233 B	0.0086	-	144 B
Parse_StructTable_SingleValue	31.71 ns	0.508 ns	0.132 ns	31.63 ns	31.89 ns	215 B	0.0062	-	104 B
Parse_StructTable_SingleValue_WriteThrough	31.64 ns	0.761 ns	0.271 ns	31.50 ns	31.97 ns	220 B	0.0062	-	104 B
Parse_StructTable_VecRef	507.35 ns	9.698 ns	2.518 ns	506.78 ns	509.63 ns	291 B	0.1087	-	1824 B
Parse_StructTable_VecRef_WriteThrough	543.90 ns	15.351 ns	5.474 ns	540.46 ns	551.73 ns	304 B	0.1087	-	1824 B
Parse_StructTable_VecValue	177.64 ns	5.425 ns	1.935 ns	176.28 ns	180.26 ns	279 B	0.0296	-	496 B
Parse_StructTable_VecValue_WriteThrough	279.93 ns	7.217 ns	2.574 ns	278.29 ns	283.65 ns	312 B	0.0296	-	496 B
ParseAndTraverse_SafeUnionVector	770.63 ns	78.869 ns	28.125 ns	747.46 ns	806.52 ns	1,370 B	0.1011	-	1704 B
ParseAndTraverse_UnsafeUnionVector	784.03 ns	29.085 ns	7.553 ns	781.39 ns	791.53 ns	1,370 B	0.0992	-	1672 B
ParseAndTraverse_MixedUnionVector	708.00 ns	12.702 ns	3.299 ns	706.03 ns	712.23 ns	1,462 B	0.0992	-	1664 B

FlatSharp 7 with pooling:

Method	Mean	Error	StdDev	P25	P95	Code Size	Gen0	Allocated
Parse_StringTable_SingleString	67.90 ns	0.411 ns	0.147 ns	67.91 ns	68.00 ns	331 B	0.0057	96 B
Parse_StringTable_SingleString_Repeated	114.85 ns	2.160 ns	0.561 ns	114.46 ns	115.58 ns	259 B	0.0057	96 B
Parse_StringTable_Vector	1,075.70 ns	69.748 ns	24.873 ns	1,057.87 ns	1,107.13 ns	331 B	0.1717	2880 B
Parse_StringTable_Empty	45.21 ns	1.838 ns	0.477 ns	44.88 ns	45.81 ns	331 B	-	-
Parse_PrimitivesTable_Empty	67.69 ns	2.552 ns	0.910 ns	67.32 ns	68.84 ns	406 B	-	-
Parse_PrimitivesTable_Full	75.10 ns	2.761 ns	0.985 ns	74.67 ns	75.80 ns	406 B	-	-
Parse_StructTable_SingleRef	61.17 ns	0.347 ns	0.090 ns	61.08 ns	61.26 ns	256 B	-	-
Parse_StructTable_SingleRef_WriteThrough	61.09 ns	1.145 ns	0.408 ns	60.74 ns	61.52 ns	256 B	-	-
Parse_StructTable_SingleValue	42.90 ns	1.508 ns	0.538 ns	42.64 ns	43.34 ns	243 B	-	-
Parse_StructTable_SingleValue_WriteThrough	43.19 ns	1.020 ns	0.364 ns	42.96 ns	43.70 ns	242 B	-	-
Parse_StructTable_VecRef	1,055.11 ns	376.613 ns	134.304 ns	963.82 ns	1,231.70 ns	321 B	-	-
Parse_StructTable_VecRef_WriteThrough	995.01 ns	22.320 ns	5.796 ns	993.14 ns	1,000.43 ns	330 B	-	-
Parse_StructTable_VecValue	237.21 ns	8.514 ns	3.036 ns	235.68 ns	240.68 ns	308 B	-	-
Parse_StructTable_VecValue_WriteThrough	349.37 ns	9.803 ns	3.496 ns	346.63 ns	353.89 ns	336 B	-	-
ParseAndTraverse_SafeUnionVector	1,348.77 ns	172.590 ns	61.547 ns	1,316.41 ns	1,389.31 ns	1,311 B	-	-
ParseAndTraverse_UnsafeUnionVector	1,379.63 ns	46.598 ns	16.617 ns	1,370.82 ns	1,402.75 ns	1,311 B	-	-
ParseAndTraverse_MixedUnionVector	1,398.65 ns	26.543 ns	6.893 ns	1,393.00 ns	1,407.32 ns	1,324 B	-	-

The allocated column tells the story. Pooling definitely works, and it's also definitely slower (on .NET 7 at least). The allocations you do see are for reading strings (which are not poolable and do still need allocations).

jamescourtney commented 1 year ago

Version 7.0.0 is published with experimental support for object pools. Feedback welcome. https://github.com/jamescourtney/FlatSharp/wiki/Object-Pooling

jamescourtney / FlatSharp