Open jamescourtney opened 1 year ago
Hello James, Yes, GC is real bad - for our game's client especially. We try to keep temporary allocations to an absolute minimum, which in C# is a lot of work. So, maybe just let us do the allocations and change the Parse API to accept a reference to an object and fill that out. That way, no need for Flatsharp to get into the pooling business. We and others probably already have a pooling solution. Thanks
That way, no need for Flatsharp to get into the pooling business.
It is a little more nuanced than that, unfortunately. I'm not sure how familiar you are, but FlatSharp returns subclasses of your data definitions when you deserialize. These are sometimes not public and/or otherwise have mangled names.
We could certainly do something along the lines of:
public interface IObjectPool
{
bool TryGet<T>(out T? item);
void Return<T>(T item);
}
If the methods were generic, then FlatSharp could supply the concrete type in terms of T
when invoking the methods. I'm still not sure that I want FlatSharp to accept an instance of IObjectPool
in the parse method -- I am leaning towards a static singleton because it means less state to plumb through the stack and smaller memory overhead by not having to store that IObjectPool
reference everywhere. This is not dissimilar to ArrayPool<T>.Shared
, which FlatSharp will use for pooling internal array allocations.
The other thing I'm not sure about is how configurable the IObjectPool
instance needs to be aside from the number of objects to retain. I have a prototype working with ConcurrentQueue<T>
as the source and haven't been able to do better, though I still need to test ConcurrentBag<T>
.
Zero allocation would be amazing!
I have a use case where I merge (add) objects. For simplicity I'll demonstrate in JSON.
{
"who": "cat"
}
mergeWith()
{
"where": [{ "address" : "a street" }]
}
mergeWith()
{
"where": [{ "phone" : "1234" }]
}
produces:
{
"who": "cat"
"where": [{ "phone" : "1234"}, {"address" : "a street" }]
}
So I find my self scanning through all the properties to see where they differ and building a new object.
All my objects are provided through their own byte[]
but some times the object doesn't use all the available bytes, and I can serialized the object back into the same byte[]
to avoid allocating a new byte[]
.
Being able to perform that merge operation without allocating for each entry in the object would be great! And is something valuable enough that I've already contemplated a few times on how such an extension to FlatBuffers could be done.
I was imagining it could work in a similar way to a ReadOnlySpan<T>
and Slice()
if there was only a way to traverse the object through it's metadata. Maybe something along the lines of foreach(ref ...
?
This isn't working, but close as I could get in the time I had. I'm sure I'm miss-using ref structs here:
/*
SharpLab tools in Run mode:
• value.Inspect()
• Inspect.Heap(object)
• Inspect.Stack(value)
• Inspect.MemoryGraph(value1, value2, …)
*/
using System;
readonly ref struct TraverseNode{
public readonly int type;
public readonly ReadOnlySpan<byte> value;
public TraverseNode(int type, ReadOnlySpan<byte> value){
this.type = type;
this.value = value;
}
}
struct Thing {
byte[] root;
public static Thing GetRootAsThing(byte[] buffer){
Thing t;
t.root = buffer;
return t;
}
public ThingEnumerable GetRefEnumerator() {
return new ThingEnumerable(root.AsSpan());
}
public ref struct ThingEnumerable {
int currentPos;
ReadOnlySpan<byte> root;
TraverseNode traverseNode;
public ThingEnumerable(ReadOnlySpan<byte> root){
currentPos = 0;
this.root = root;
}
public bool MoveNext(){
// todo: fix
traverseNode = new TraverseNode(0, root.Slice(0));
return false;
}
public TraverseNode Current() { return this.traverseNode; }
}
}
class Example{
static void Main(){
Merge(new byte[10], new byte[10]);
}
static byte[] Merge(Byte[] current, Byte[] extra){
var thing1 = Thing.GetRootAsThing(current);
var thing2 = Thing.GetRootAsThing(extra);
var itr1 = thing1.GetRefEnumerator();
var itr2 = thing2.GetRefEnumerator();
// todo: don't alloc if current.legth can fit all entries we need to merge in
// instead shift some bytes around inside current and return that
var merged = new byte[current.Length];
while(itr1.MoveNext() && itr2.MoveNext()){
var entry1 = itr1.Current();
var entry2 = itr2.Current();
if(entry1.type != entry2.type || !entry1.value.SequenceEqual(entry2.value)){
// do fun stuff like copy both
} else {
// only copy one
}
}
if(false){
return current;
}
return merged;
}
}
Long time, @Astn! Glad to see that you're still around :)
Zero allocation would be amazing!
Let's not go too crazy. I'm thinking of this as substantially-reduced allocations. For example, I can't keep from allocating in some situations like reading a string.
I can serialized the object back into the same byte[] to avoid allocating a new byte[].
Are you using FlatSharp for this? I sincerely have no idea how it will behave if you try to overwrite the same buffer that's in use by the object. I assume the serialization and the parse would race with each other (unless you're not using Lazy
, in which case I can see it).
Being able to perform that merge operation without allocating for each entry in the object would be great!
Are the allocations you're seeing coming from the traversal of the object? Or from the serialization to a new object? I think the latter case is something you can handle today if you pool lists, buffers, and FlatSharp "top level" objects. The allocate-on-parse is the scenario that I'm really trying to think about, since those happen internal to the library and are often of different types than you know about as the user, since they're generally subclasses of the objects you actually work with.
I think I'm leaning the direction of pooling rather than traversal APIs at this time, since it seems to gel a bit better with what C#/Unity devs expect and fits in better with what FlatSharp already does. Do you think that a pooling solution here would effectively address your problems?
From prototypes, object pooling does add considerable overhead, often approaching half the speed of non-pooled mode (on .NET 7 at least...), which is really a bummer. However, for Unity scenarios where GC is non-generational, I get the sense that they'd probably rather have predictable performance with minimal GC rather than absolute highest straight-line speed.
It is a little more nuanced than that, unfortunately. I'm not sure how familiar you are, but FlatSharp returns subclasses of your data definitions when you deserialize. These are sometimes not public and/or otherwise have mangled names.
Could there be static "Create" function that allocates the internal sub-class and returns parent-class?
I can serialized the object back into the same byte[] to avoid allocating a new byte[].
Are you using FlatSharp for this?
I'm actually cheating a bit right now cause I couldn't get it to perform like I wanted with just FlatSharp. Right now I'm taking the objects that need to be merged and if they both can fit in the byte[] already allocated, then I serialize them both to it one after the other. This way I can do the merge lazily during a later read operation instead of incurring that cost during save. Makes the read operation slower, but that's better when writes are more frequent than reads.
Then during a later read operation I merge them before returning them to the caller. And if I did a merge, then I take that byte[] and then store it back into FASTER with an in-place update.
FASTER RMW operation using InPlaceUpdater
.
It is a little more nuanced than that, unfortunately. I'm not sure how familiar you are, but FlatSharp returns subclasses of your data definitions when you deserialize. These are sometimes not public and/or otherwise have mangled names.
Could there be static "Create" function that allocates the internal sub-class and returns parent-class?
The other thing I've observed after some prototyping with Pooling is that the difference between static IObjectPool? { get; set; }
and static DefaultObjectPool? { get; set; }
is about 50%:
This obviously scales with the number of items going in and out of the pool. FlatSharp generally tries really hard not to have any virtual method calls that aren't necessary. Frequent virtual dispatches can tank performance. Note that DefaultObjectPool
in this case still exposes some configuration knobs, such as changing the limit, etc. You just can't swap in your own implementation.
This is a tricky call for me. Is there a super compelling case for plugging in a custom object pool? What is different about it than, say, ConcurrentQueue<T>
?
I can serialized the object back into the same byte[] to avoid allocating a new byte[].
Are you using FlatSharp for this?
I'm actually cheating a bit right now cause I couldn't get it to perform like I wanted with just FlatSharp. Right now I'm taking the objects that need to be merged and if they both can fit in the byte[] already allocated, then I serialize them both to it one after the other. This way I can do the merge lazily during a later read operation instead of incurring that cost during save. Makes the read operation slower, but that's better when writes are more frequent than reads.
Then during a later read operation I merge them before returning them to the caller. And if I did a merge, then I take that byte[] and then store it back into FASTER with an in-place update.
FASTER RMW operation using
InPlaceUpdater
.
@Astn -- that makes some degree of sense. Thanks! I'm trying to parse your responses, but I haven't gotten a clear read on whether you have a preference for object pooling or for some other option? Object Pooling is definitely more natural for FlatSharp since I can just add .ReturnToPool()
on generated objects and have it do the right thing depending on the context.
I don't think I would use an object pool in my case. I would look for a completely stack based solution. If I'm force to allocate then I'd prefer to do it only once to resize the underlying byte[]. I think I would lean towards some kind of visitor approach (for ex: ExpressionVisitor).
I don't think I would use an object pool in my case. I would look for a completely stack based solution. If I'm force to allocate then I'd prefer to do it only once to resize the underlying byte[]. I think I would lean towards some kind of visitor approach (for ex: ExpressionVisitor).
The official Google C# implementation is entirely stack-based if you want it to be. While FlatSharp might be able to do some things better, I'm quite skeptical of reinventing that approach here. What benefits over the official library do you see FlatSharp being able to provide if it were to go down that road?
I started with the Google implementation, and eventually switched to FlatSharp because I felt like 90% of my code was dealing building buffers and that was getting quite tiring. I really haven't looked at the Google library recently though.
Some things I think would be useful to add as some kind of extension. I'm not suggesting you change the way your library works. I think it's really great for 98% of what I want to do. I'm thinking of an side car type of feature set.
Span<byte> or Memory
to a specific entry in the buffer to pick out some value.I'd like to be able to extract (visit, view) or get a Span
or Memory to a specific entry in the buffer to pick out some value. I think some sort of traversal feature that worked on the a buffer directly and let you somehow find the thing your looking for would be amazing.
So, building buffers with the official library can be tedious. But actually using them might do what you're hoping. I'd encourage you to give it a shot, because you'll get an API that looks how you're expecting. They're also all value types. So you could do something where you use FlatSharp for serialization and "general purpose" traversal, but fall back to Flatc for cases where you need to zoom to a specific field or whatnot. I'm not trying to drive you away from my library of course, but I don't want to reinvent what they've done.
One thing I could consider doing would be emitting "Flatc" code as well from FlatSharp, since Flatsharp already uses Flatc, this wouldn't be difficult. Addtionally, FlatSharp could probably go through and adjust the namespaces for you so there aren't conflicts.
I'd like to validate a buffer in a fast forward only zero allocation pass. In this case I don't want to parse it and manipulate it.
That's something I can definitely think about. Today, validation happens "on demand" mainly through array bounds checks as you actually use the buffer.
@pattymack and @joncham -- I have a prototype of object pooling if you'd like to provide some feedback. The nuget packages are attached here: FlatSharp.Runtime.7.0.0.zip FlatSharp.Compiler.7.0.0.zip
There are a few things to keep in mind:
--gen-poolable true
from the command line to generate Poolable code. Or add <FlatSharpPoolable>true</FlatSharpPoolable>
to your .csproj file.--gen-poolable true
specified, Unions become reference types (but they're poolable!).ReturnToPool()
to return the object. Specifying true
as the parameter forces the object to return. I recommend never specifying this as doing so can break internal links between objects.Lazy
mode, all objects need to be .ReturnToPool
'ed individually. In non-lazy modes, all calls to .ReturnToPool()
are a no-op unless invoked from the root of the parse tree or using force
.FlatSharp.DefaultObjectPool.MaxToRetain
. This is a "soft max", and it applies for each distinct type. new()
yourself is your job to pool. Finally, some benchmarks:
FlatSharp 7 without pooling: | Method | Mean | Error | StdDev | P25 | P95 | Code Size | Gen0 | Gen1 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|
Parse_StringTable_SingleString | 58.50 ns | 1.833 ns | 0.476 ns | 58.33 ns | 58.87 ns | 305 B | 0.0100 | - | 168 B | |
Parse_StringTable_SingleString_Repeated | 98.38 ns | 2.791 ns | 0.995 ns | 97.74 ns | 99.75 ns | 241 B | 0.0100 | - | 168 B | |
Parse_StringTable_Vector | 835.12 ns | 11.538 ns | 2.996 ns | 834.29 ns | 838.43 ns | 305 B | 0.2069 | 0.0019 | 3472 B | |
Parse_StringTable_Empty | 36.37 ns | 0.219 ns | 0.078 ns | 36.34 ns | 36.45 ns | 305 B | 0.0043 | - | 72 B | |
Parse_PrimitivesTable_Empty | 57.49 ns | 0.176 ns | 0.063 ns | 57.47 ns | 57.55 ns | 382 B | 0.0081 | - | 136 B | |
Parse_PrimitivesTable_Full | 65.17 ns | 0.284 ns | 0.101 ns | 65.09 ns | 65.28 ns | 382 B | 0.0081 | - | 136 B | |
Parse_StructTable_SingleRef | 41.06 ns | 1.363 ns | 0.486 ns | 40.69 ns | 41.73 ns | 228 B | 0.0086 | - | 144 B | |
Parse_StructTable_SingleRef_WriteThrough | 41.23 ns | 0.932 ns | 0.332 ns | 41.03 ns | 41.73 ns | 233 B | 0.0086 | - | 144 B | |
Parse_StructTable_SingleValue | 31.71 ns | 0.508 ns | 0.132 ns | 31.63 ns | 31.89 ns | 215 B | 0.0062 | - | 104 B | |
Parse_StructTable_SingleValue_WriteThrough | 31.64 ns | 0.761 ns | 0.271 ns | 31.50 ns | 31.97 ns | 220 B | 0.0062 | - | 104 B | |
Parse_StructTable_VecRef | 507.35 ns | 9.698 ns | 2.518 ns | 506.78 ns | 509.63 ns | 291 B | 0.1087 | - | 1824 B | |
Parse_StructTable_VecRef_WriteThrough | 543.90 ns | 15.351 ns | 5.474 ns | 540.46 ns | 551.73 ns | 304 B | 0.1087 | - | 1824 B | |
Parse_StructTable_VecValue | 177.64 ns | 5.425 ns | 1.935 ns | 176.28 ns | 180.26 ns | 279 B | 0.0296 | - | 496 B | |
Parse_StructTable_VecValue_WriteThrough | 279.93 ns | 7.217 ns | 2.574 ns | 278.29 ns | 283.65 ns | 312 B | 0.0296 | - | 496 B | |
ParseAndTraverse_SafeUnionVector | 770.63 ns | 78.869 ns | 28.125 ns | 747.46 ns | 806.52 ns | 1,370 B | 0.1011 | - | 1704 B | |
ParseAndTraverse_UnsafeUnionVector | 784.03 ns | 29.085 ns | 7.553 ns | 781.39 ns | 791.53 ns | 1,370 B | 0.0992 | - | 1672 B | |
ParseAndTraverse_MixedUnionVector | 708.00 ns | 12.702 ns | 3.299 ns | 706.03 ns | 712.23 ns | 1,462 B | 0.0992 | - | 1664 B |
FlatSharp 7 with pooling:
Method | Mean | Error | StdDev | P25 | P95 | Code Size | Gen0 | Allocated |
---|---|---|---|---|---|---|---|---|
Parse_StringTable_SingleString | 67.90 ns | 0.411 ns | 0.147 ns | 67.91 ns | 68.00 ns | 331 B | 0.0057 | 96 B |
Parse_StringTable_SingleString_Repeated | 114.85 ns | 2.160 ns | 0.561 ns | 114.46 ns | 115.58 ns | 259 B | 0.0057 | 96 B |
Parse_StringTable_Vector | 1,075.70 ns | 69.748 ns | 24.873 ns | 1,057.87 ns | 1,107.13 ns | 331 B | 0.1717 | 2880 B |
Parse_StringTable_Empty | 45.21 ns | 1.838 ns | 0.477 ns | 44.88 ns | 45.81 ns | 331 B | - | - |
Parse_PrimitivesTable_Empty | 67.69 ns | 2.552 ns | 0.910 ns | 67.32 ns | 68.84 ns | 406 B | - | - |
Parse_PrimitivesTable_Full | 75.10 ns | 2.761 ns | 0.985 ns | 74.67 ns | 75.80 ns | 406 B | - | - |
Parse_StructTable_SingleRef | 61.17 ns | 0.347 ns | 0.090 ns | 61.08 ns | 61.26 ns | 256 B | - | - |
Parse_StructTable_SingleRef_WriteThrough | 61.09 ns | 1.145 ns | 0.408 ns | 60.74 ns | 61.52 ns | 256 B | - | - |
Parse_StructTable_SingleValue | 42.90 ns | 1.508 ns | 0.538 ns | 42.64 ns | 43.34 ns | 243 B | - | - |
Parse_StructTable_SingleValue_WriteThrough | 43.19 ns | 1.020 ns | 0.364 ns | 42.96 ns | 43.70 ns | 242 B | - | - |
Parse_StructTable_VecRef | 1,055.11 ns | 376.613 ns | 134.304 ns | 963.82 ns | 1,231.70 ns | 321 B | - | - |
Parse_StructTable_VecRef_WriteThrough | 995.01 ns | 22.320 ns | 5.796 ns | 993.14 ns | 1,000.43 ns | 330 B | - | - |
Parse_StructTable_VecValue | 237.21 ns | 8.514 ns | 3.036 ns | 235.68 ns | 240.68 ns | 308 B | - | - |
Parse_StructTable_VecValue_WriteThrough | 349.37 ns | 9.803 ns | 3.496 ns | 346.63 ns | 353.89 ns | 336 B | - | - |
ParseAndTraverse_SafeUnionVector | 1,348.77 ns | 172.590 ns | 61.547 ns | 1,316.41 ns | 1,389.31 ns | 1,311 B | - | - |
ParseAndTraverse_UnsafeUnionVector | 1,379.63 ns | 46.598 ns | 16.617 ns | 1,370.82 ns | 1,402.75 ns | 1,311 B | - | - |
ParseAndTraverse_MixedUnionVector | 1,398.65 ns | 26.543 ns | 6.893 ns | 1,393.00 ns | 1,407.32 ns | 1,324 B | - | - |
The allocated column tells the story. Pooling definitely works, and it's also definitely slower (on .NET 7 at least). The allocations you do see are for reading strings (which are not poolable and do still need allocations).
Version 7.0.0 is published with experimental support for object pools. Feedback welcome. https://github.com/jamescourtney/FlatSharp/wiki/Object-Pooling
Over the years, FlatSharp has gotten better at reducing allocations. Initially, everything was a class: Tables, Structs, Unions, and Vectors. Now, structs are optionally value types and unions are always value types. However, Tables and Vectors are still always reference types.
The tension with Tables and Vectors is that FlatSharp uses virtual property overrides to fulfill many of the FlatBuffers contracts. This gives tremendous flexibility, and is really the reason that FlatSharp can provide 4 different deserialization modes. However, it also means that FlatSharp is stuck with classes for Tables (virtual properties) and Vectors (
IList<T>
implementations).This post is to gather some feedback about what users would like to see in a potential "allocation-reduction" feature.
Current State & Background
FlatSharp's stated goals are usability and speed, in that order. FlatSharp tries to be idiomatic, which is to say that it tries to look like a "normal" .NET serialization library, which is why the current abstractions exist. For better or worse, the idiomatic thing to do in C# is to create new objects and let the GC deal with the cleanup.
IDisposable
can also be used, but the real intent there is to free up native resources.However, many users of FlatBuffers have incredibly high performance cases in mind. In such cases, even small stops for GC can be problematic. So reducing allocations has a double benefit: 1) Reduces the frequency of GC 2) Reduces the time per GC
Object Pooling
Object pooling is often a solution for avoiding allocations. If FlatSharp had an object pool, the general idea is that these managed objects could be "reset" to point at a new buffer or a new spot in an existing buffer. However, there are some problems with object pooling generally, and some FlatSharp specific problems:
FlatSharp has some concerns as well, since pooling and object recycling looks very different depending upon the deserialization mode:
For example, returning a
Lazy
object to a pool is no problem at all, since there are no internal dependencies. The next time that object is needed, it can just be reconstituted from the underlying buffer.Returning a
Progressive
object to a pool is more difficult; InProgressive
mode, there are Parent --> Child references, which means that when a Child is returned to a pool, the parent references also need to be invalidated. However, Parents can also be returned to a pool, so now references need to be invalidated in both directions. Over-invalidating is not ideal, but does not breakProgressive
mode since it still has the capability to consult the underlying buffer to reconstitute the correct nodes.Greedy
is even more challenging, since those objects do not maintain a reference to the underlying buffer, which means that it would be incorrect to return a child to the pool before its parent.One could imagine a "top-down" approach to object pooling:
This code would implicitly work for all serialization approaches:
Lazy
never has aparentLink
, returning to the pool is always immediateProgressive
andGreedy
always haveparentLink
s, return to the pool happens when the root node is disposed, recursively. Dispose would be a No-Op for all but the root node, which would return the full graph to the pool.However, there are some concerns for this:
.Dispose()
?IList<T>
does not implementIDisposable
.Value Type Traversal Options
Expose some value-type semantics that allow traversing a FlatBuffer using value types. There is a lot to unpack here:
Questions
If you have an opinion here, please respond to this issue. I'd like to get a good amount of feedback before implementing a solution. Both of these are interesting. I'd like feedback on: