So, I did at one point prototype and build value-type structs, but benchmarks were actually in favor of the class-based approach, so I abandoned that branch. I'm happy to revive it at some point if there is a demonstrable reason to add all of that complexity.
One thing that immediately jumps out to me is that you're using Greedy deserialization. That will definitely put a lot of GC pressure on your application, especially if your vectors are large. Have you tried using Lazy or PropertyCache? Those should amortize the GC hit more evenly.
I haven't yet tried those other deserialization options, since your docs seemed to indicate that Greedy would be better if you're using all the data. My use case always needs all the data every time.
I'll give it a shot though.
Let me know how it goes. FWIW -- the docs aren't really geared around 20k item vectors. My team at Microsoft uses this and we have similarly sized vectors (using FlatBuffers inside a file), and I observed the same GC hit. Switching to Lazy resolved it completely.
The problem with big vectors and Greedy deserialization is:
The docs say Greedy is fastest because the cost for accessing a single property in Greedy mode is really cheap, but the semantics of Greedy force allocations before you need them. This is usually fine for small buffers. Lazy/PropertyCache are slightly slower, but don't allocate anything before you ask for it. So if you're looping through this list of 20k items, you'll have a few in memory at a time, but they'll all get swept up in Gen0 since the references are ephemeral.
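For reference, the mode is just a schema attribute on the table; a hypothetical example (table and field names made up here):

// Hypothetical schema: switching the generated serializer from greedy to lazy
// is a one-attribute change; application code stays the same.
table SomeTable (fs_serializer:lazy)
{
    Items : [SomeStruct];
}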
I gave the lazy route a shot, and it didn't really help things in my use case.
I'm loading these objects and then sending the whole array of them over the network. So as quickly as the Grpc send function can iterate the array they are being de-serialized and then immediately serialized again. I think this makes the Greedy mode more efficient right now.
I haven't had a chance to dig around that branch you linked, I'll see if I can take a peek tomorrow.
It's got me wondering if there would be a way to reuse the same instance of one of the array objects if it was accessed through an iterator. Maybe have a way to specify an allocator function that's used to feed the iterator; then you would have the option of having a new object each time, or reusing one you're hanging on to.
> I gave the lazy route a shot, and it didn't really help things in my use case.
Thanks for trying!
> I'm loading these objects and then sending the whole array of them over the network. So as quickly as the Grpc send function can iterate the array they are being de-serialized and then immediately serialized again. I think this makes the Greedy mode more efficient right now.
Are you doing any sort of translation to a different schema? I'm wondering why you need to deserialize at all if you're just piping the data around. Also -- wouldn't gRPC streaming work well for this? It doesn't seem like they all need to be in the array at once.
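For what it's worth, FlatBuffers lets you declare a streaming method right in the fbs; a hypothetical sketch (service and type names are invented here, and the streaming attribute is the standard FlatBuffers gRPC one):

// Hypothetical: stream items one at a time instead of batching them into one big array.
rpc_service ItemStorage
{
    GetItems(ItemsRequest) : Item (streaming: "server");
}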
> It's got me wondering if there would be a way to reuse the same instance of one of the array objects if it was accessed through an iterator. Maybe have a way to specify an allocator function that's used to feed the iterator; then you would have the option of having a new object each time, or reusing one you're hanging on to.
A way to "repoint" an object at a different spot in the buffer is an interesting idea that probably merits some deeper consideration. It's a little tricky given that FlatSharp's whole mantra is "subclass property implementations", so your application-level code has no knowledge of the actual class that it's dealing with, only the parent class.
I think what you'd probably want to do is give the ISerializer instance that you're using access to an object pool that it tries to consult before resorting to new. Your code would then be on the hook for returning objects to the pool when they are finished. In the worst case, your code never returns anything to the pool and the behavior is exactly as it is today, with the GC needing to clean everything up.
I'd need to think carefully about whether I want to actually implement this pool in a non-naive way and deal with all of the associated pitfalls of pooling memory, or just chuck it into an interface and let people like you (ie, those who care deeply) control it.
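Purely to make the shape of that idea concrete, something like this (hypothetical, not an actual FlatSharp interface):

// Hypothetical pooling hook that the ISerializer could consult before calling 'new'.
// Nothing here is a real FlatSharp API.
public interface IDeserializedObjectPool
{
    // Attempts to take a recycled instance; returns false so the serializer falls back to 'new'.
    bool TryRent<T>(out T item) where T : class;

    // The application hands objects back here when it is finished with them.
    void Return<T>(T item) where T : class;
}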
In any case, this is the most interesting area of thought for me with FlatSharp in quite some time.
After sleeping on this, I'm sort of leaning the other way.
I think I'd like to make this work by letting the user define pooling properties in the fbs file (or C# if using attributes):
table SomeTable (fs_pool_size:"100") { ... } // keep 100 items
struct SomeStruct (fs_pool_size:"300") { ... }
This would translate into the normal Dispose pattern for C#:
[FlatBufferTable(PoolSize = 100)]
public partial class SomeTable : IDisposable // Use Dispose semantics for returning to the pool
{
    ~SomeTable() { this.Dispose(false); }

    public void Dispose() { this.Dispose(true); GC.SuppressFinalize(this); }

    protected virtual void Dispose(bool disposing) { this.OnDisposing(disposing); }

    // let partial classes also have the option to dispose stuff.
    partial void OnDisposing(bool disposing);
}
The dynamic subclasses would then look like:
public class SomeTableReader : SomeTable
{
    // Declaring pool as static field makes it really fast and avoids a layer of Type indirection.
    private static readonly ConcurrentBag<SomeTableReader> pool = new();

    private volatile int free;

    // Default ctor is private.
    private SomeTableReader() { }

    public static SomeTableReader GetOrCreate(IInputBuffer buffer, int offset)
    {
        if (!pool.TryTake(out var reader))
        {
            reader = new SomeTableReader();
        }

        reader.Initialize(buffer, offset);
        return reader;
    }

    protected override void Dispose(bool disposing)
    {
        // Who should own thread safety? I'm not sure this accomplishes much because customers can still
        // do use-after-dispose, and we don't want to have locks everywhere. This only prevents double-dispose.
        if (Interlocked.CompareExchange(ref this.free, 1, 0) == 0)
        {
            base.Dispose(disposing);
            if (pool.Count < 100) // cap comes from the fs_pool_size attribute
            {
                this.Clear();
                pool.Add(this);
            }
        }
    }

    // reset this object to a clean state, release references so GC can reap them
    private void Clear() { Debug.Assert(this.free == 1); }

    // (re)initialize this object
    private void Initialize(IInputBuffer buffer, int offset) { this.free = 0; }
}
My main questions/concerns here are around thread safety. There are two main issues:

A non-thread-safety issue would be if people were running this in a heterogeneous environment. Would they want different pool sizes to somehow be configurable? A static attribute isn't very helpful there, though perhaps a value of -1 could indicate that the pool should grow until it is right-sized. Of course, if the workload is bursty it will hang onto extra items, though this isn't dissimilar to List<T>.
I perused that branch you posted, and I think it will take me reading through a bit more code to really grok how things are working. I was a bit busy today, so I may have more time tomorrow.
I'm liking your disposable idea for returning items to the pool, I may have some thoughts on it after I digest it more.
A few related ideas for chewing on.
I've spent some time on the IDisposable idea today. I moved away from strictly using IDisposable and ended up extending an interface that I already have by adding a Release method, which has the same semantics as what I talked about up above. I'm a little leery of IDisposable since it implies some sort of native resource under the hood that needs disposal, and most other object pool APIs use Rent/Return semantics.
var parsed = SomeTable.Serializer.Parse(buffer);
...
if (parsed is IFlatBufferDeserializedObject deserialized)
{
deserialized.Release();
}
I have most of this prototyped and it seems at least somewhat promising. However, there are some places where this becomes incompatible with other features (init-only property setters don't play nice with reusing the same object!), so I need to refactor my way out of this one.
I have a very beta build of the pooling behavior (as in it looks like it works, but I need to write actual tests). You can grab it from the CI build here (look under artifacts): https://github.com/jamescourtney/FlatSharp/actions/runs/902280796. Alternatively, you can clone the 'objectPool' branch and build yourself.
To enable pooling, you've got to do two things:
1) Add the fs_pool attribute to your fbs (I'm not settled on this yet):
table SomeTable (fs_pool) { }
2) Actually recycle your objects after you're done with them:
var parsed = SomeTable.Serializer.Parse(buffer);
...
if (parsed is IFlatBufferDeserializedObject deserialized)
{
deserialized.Release();
}
When the fs_pool attribute is specified, returned items are sent to a ConcurrentBag after you release them. Release is a no-op when fs_pool is not specified. There is some tracking for double-release and use-after-release, though it is not sophisticated. You can enable better error messages for these scenarios by setting:
FlatSharpGlobalSettings.CollectPooledObjectStackTraces = true;
I'd love to know if this helps your scenario or not.
Very cool, and thanks for helping on this!
I'll set up a benchmark to test. Might take me a few hours.
So, I ran a few benchmarks myself. ConcurrentBag is apparently pretty slow. I've just pushed a new version that switches to [ThreadStatic] and an old-fashioned Queue (https://github.com/jamescourtney/FlatSharp/actions/runs/904207394).
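For reference, a minimal sketch of the [ThreadStatic] + Queue shape (illustrative names only; the generated code differs):

using System;
using System.Collections.Generic;

// Illustrative sketch of a per-thread pool: no locks, no work-stealing, no contention.
public static class ThreadLocalPool<T> where T : class, new()
{
    [ThreadStatic]
    private static Queue<T> queue;

    public static T Rent()
    {
        var q = queue ??= new Queue<T>();
        return q.Count > 0 ? q.Dequeue() : new T();
    }

    public static void Return(T item)
    {
        var q = queue ??= new Queue<T>();
        if (q.Count < 100) // analogous to the fs_pool_size cap
        {
            q.Enqueue(item);
        }
    }
}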
Here are the rough benchmarks for a full traversal of this schema (20k items)
table SomeTable (fs_serializer:Lazy) {
Points : [Vec3];
}
struct Vec3 (fs_nonVirtual, fs_pool) {
X : float;
Y : float;
Z : float;
}
[Benchmark]
public int ParseAndTraverse()
{
    var t = SomeTable.Serializer.Parse(this.inputBuffer);

    int sum = 0;
    var points = t.Points;
    int count = points.Count;

    for (int i = 0; i < count; ++i)
    {
        var item = points[i];
        sum += (int)(item.X + item.Y + item.Z);
        ((IFlatBufferDeserializedObject)item).Release();
    }

    return sum;
}
Pooled | Mode | Object Pool | Full Traversal Time (us) |
---|---|---|---|
No | Greedy | (none) | 715 |
No | Lazy | (none) | 214 |
Yes | Greedy | ConcurrentBag | 1000 |
Yes | Lazy | ConcurrentBag | 1100 |
Yes | Greedy | ThreadLocal<Queue> | 774 |
Yes | Lazy | ThreadLocal<Queue> | 691 |
Yes | Greedy | [ThreadStatic] + Queue | 630 |
Yes | Lazy | [ThreadStatic] + Queue | 513 |
Yes | Greedy | [ThreadStatic] + Stack | 616 |
Yes | Lazy | [ThreadStatic] + Stack | 522 |
My commentary on these results is:
- ConcurrentBag is really slow. I assume some of this is because it implements some work-stealing from the thread-local storage of other threads, so the total number of items is constrained.
- Stack might perform better than Queue due to a higher likelihood of cache hits, but the results were a wash.

This seems to require a bit more tinkering. Would you mind just giving me a simple little benchmark that reflects your scenario so I can play with it some more on my own?
I've had quite a day, and haven't been able to work much today so far.
Here is the core of my FlatBuffers definition. Just dropping it here so you don't have to wait on it any longer. I'll try to get that benchmark showing a use case going ASAP.
struct Vector4
{
x:float;
y:float;
z:float;
w:float;
}
struct Vector3Int
{
x:int32;
y:int32;
z:int32;
}
struct Vector3
{
x:float;
y:float;
z:float;
}
struct Vector2Int
{
x:int32;
y:int32;
}
struct Vector2
{
x:float;
y:float;
}
struct Color
{
r: ubyte;
g: ubyte;
b: ubyte;
a: ubyte;
}
struct Voxel
{
VoxelType: ubyte;
SubType : ubyte;
Hp:ubyte;
Unused:ubyte;
}
table Mesh (fs_serializer:greedy)
{
vertices: [Vector3];
normals: [Vector3];
uv: [Vector2];
color: [Color];
triangles: [ushort];
}
table VoxelRegion3D (fs_serializer:greedy)
{
location: Vector3Int;
iteration: uint32;
size: ushort;
voxels: [Voxel];
}
But to summarize the use case: any time a value in the voxels[] vector is changed, the mesh needs to be rebuilt. This results in:
I've added a use case benchmark to https://github.com/jamescourtney/FlatSharp/pull/163
Thanks for the detail and the branch! I will take a look this evening.
Couple of comments/questions that might help:
> - Deserialize VoxelRegion3d
> - Modify voxel
> - Generate new Mesh
> - Serialize VoxelRegion3d
> - Save VoxelRegion3d
Version 5.3 of FlatSharp includes support for write-through properties to the underlying buffer. If you were to create a MemoryInputBuffer based on a memory-mapped file, you could potentially combine all of these steps into one. This allows you to do an in-place update to the existing buffer without a full parse/re-serialize, and automatically flush that to disk. FlatSharp doesn't do anything with memory-mapped files on its own, but it should be possible; https://github.com/dotnet/runtime/issues/24805 might be able to help (a rough sketch follows the constraints below). There are a few constraints here:
- You need to use VectorCacheMutable deserialization (Lazy/PropertyCache could maybe be extended to support this by adding mutable versions of those, but Greedy is a nonstarter since it does not maintain a reference to the input buffer).
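To make the memory-mapped idea above concrete, here is a rough, untested sketch. It assumes you wrap the mapped view in a custom MemoryManager<byte> (MappedMemoryManager is not a FlatSharp type; it's the kind of glue discussed in the dotnet/runtime issue linked above) and then feed the resulting Memory<byte> to MemoryInputBuffer:

using System;
using System.Buffers;
using System.IO.MemoryMappedFiles;

// Hypothetical glue type: exposes a memory-mapped view as Memory<byte> so that
// fs_writeThrough mutations land directly in the file.
public sealed unsafe class MappedMemoryManager : MemoryManager<byte>
{
    private readonly MemoryMappedViewAccessor accessor;
    private readonly byte* pointer;
    private readonly int length;

    public MappedMemoryManager(MemoryMappedViewAccessor accessor, int length)
    {
        this.accessor = accessor;
        this.length = length;

        byte* ptr = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
        this.pointer = ptr;
    }

    public override Span<byte> GetSpan() => new Span<byte>(this.pointer, this.length);

    public override MemoryHandle Pin(int elementIndex = 0) => new MemoryHandle(this.pointer + elementIndex);

    public override void Unpin() { }

    protected override void Dispose(bool disposing)
    {
        this.accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        if (disposing)
        {
            this.accessor.Dispose();
        }
    }
}

// Usage sketch (file name is made up; the Parse/MemoryInputBuffer usage here is untested):
// using var mmf = MemoryMappedFile.CreateFromFile("region.bin");
// using var view = mmf.CreateViewAccessor();
// using var manager = new MappedMemoryManager(view, (int)view.Capacity);
// var region = VoxelRegion3D.Serializer.Parse(new MemoryInputBuffer(manager.Memory));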
> - Load already saved mesh
> - Send over Grpc to client
Is this gRPC call a batch mode or streaming mode? If you're using gRPC streaming you may be able to get away from loading everything into a giant array, though you mentioned some GPU processing as well which might be driving this requirement.
Hey -- so I've tinkered with this, and managed to speed things up a bunch using the write through option I mentioned above. Here's the original bench you shared:
Method | ParseOption | Mean | Error | StdDev |
---|---|---|---|---|
SendRegionToClient | Lazy | 1.925 ms | 0.3082 ms | 0.0169 ms |
SendVisibleRegionsToClient | Lazy | 755.228 ms | 169.9860 ms | 9.3175 ms |
SendVisibleMeshesToClient | Lazy | 743.959 ms | 3.6774 ms | 0.2016 ms |
SendMeshToClient | Lazy | 2.906 ms | 2.0721 ms | 0.1136 ms |
ModifyMeshAndSendToClients | Lazy | 29.511 ms | 41.8578 ms | 2.2944 ms |
SendRegionToClient | GreedyMutable | 2.922 ms | 1.6592 ms | 0.0909 ms |
SendVisibleRegionsToClient | GreedyMutable | 1,174.830 ms | 703.3006 ms | 38.5503 ms |
SendVisibleMeshesToClient | GreedyMutable | 1,199.901 ms | 306.4329 ms | 16.7966 ms |
SendMeshToClient | GreedyMutable | 5.925 ms | 5.9447 ms | 0.3258 ms |
ModifyMeshAndSendToClients | GreedyMutable | 27.656 ms | 11.5352 ms | 0.6323 ms |
After changing it to writethrough, these turn into:
Method | Mean | Error | StdDev |
---|---|---|---|
SendRegionToClient | 93.78 us | 6.696 us | 0.367 us |
SendVisibleRegionsToClient | 38,177.58 us | 5,279.602 us | 289.393 us |
SendVisibleMeshesToClient | 35,542.62 us | 47,627.519 us | 2,610.625 us |
SendMeshToClient | 1,127.36 us | 508.945 us | 27.897 us |
ModifyMeshAndSendToClients | 39,161.27 us | 10,160.844 us | 556.950 us |
This looks to be a speedup of a couple of orders of magnitude for a lot of these tests. The downside is that I had to overallocate some of the arrays to accommodate the variable number of items in the mesh vectors. Whether this works for you or not I couldn't say.
// Update fillSize to accommodate a max of (fillSize * 3). Some items may be null.
Mesh mesh = new Mesh
{
    color = new Color[fillSize * 3],
    normals = new Vector3[fillSize * 3],
    triangles = new ushort[fillSize * 3],
    uv = new Vector2[fillSize * 3],
    vertices = new Vector3[fillSize * 3]
};
You can find the code I used here: https://github.com/jamescourtney/FlatSharp/tree/voxelBench/src/Benchmarks/ExperimentalBenchmark
Yours are largely unchanged, and mine are copied and named Modified.
Very cool! I'll check it out now!
I've read through your modified version and am stunned by how much better it is. It will take me a while to comprehend why the changes you made were so impactful! Over allocating the mesh is not a problem. I use lz4 on each byte[] before it gets persisted. I'm not sure if that messes up your ideas with memory mapped files though. I've been storing everything using rocksdb. Maybe some things would be better in memory mapped files though. But the lz4 compression on over allocated buffers and regions of voxels is really great.
Cool! Glad I could help :) I'm going to drop the object pooling approach for now since I've thought of some things about it that I dislike. What might be possible is an additional parse API that reuses the same object graph when possible:
((IFlatBufferDeserializedObject)something).LoadFrom(byte[] buffer)
The short version of why it helps is that the fs_writeThrough attribute makes the mutations directly in the underlying buffer, so you're saving a fortune on copies. The other thing is that VectorCacheMutable lazily initializes the items as they are read. If you actually read through the serializer code that FlatSharp spits out, you'll see something like this (which I've lightly annotated):
// This is the 'x' property of the Vec3 structure.
// The base class is virtual and FlatSharp overrides it to speak FlatBuffer.
public override System.Single x
{
    get
    {
        // Test to see if it's already in memory.
        if ((this.__mask0 & (byte)1) == 0)
        {
            // If not, read it and update the bit mask.
            this.__index0Value = ReadIndex0Value(this.__buffer, this.__offset, default, default);
            this.__mask0 |= (byte)1;
        }

        return this.__index0Value;
    }
    set
    {
        // Set the value of the backing field.
        this.__index0Value = value;

        // Update the mask to indicate that this value is now in memory and doesn't need to be pulled from the buffer.
        this.__mask0 |= (byte)1;

        // fs_writeThrough injects this line, which writes the new value back to the underlying buffer.
        WriteIndex0Value(this.__buffer, __offset, value);
    }
}
There are two key parts of this: the getter only pulls a value from the buffer the first time it is accessed (then caches it in the backing field), and the setter writes the new value straight back into the underlying buffer.
You can ignore what I said about files -- I was assuming you were storing your Flatbuffers directly on disk. I've used RocksDb before. You might want to be careful about using LZ4 yourself unless you've explicitly disabled compression in RocksDb. It's been a few years, but I recall that it has Snappy and/or LZ4 linked in.
By the way -- I did push another update to the voxelBench branch that modestly improves the perf from before, mostly because I dropped that IsNull property from the structs and just added a length property to the Mesh. This avoids the alignment padding issues when you have a trailing byte on a struct that you're storing in a vector, and will save quite a bit of space.
Thanks for helping me understand the changes. And I'm pulling in your updates.
I'm going to read through more of the code and see if I can wrap my head around it.
I managed to come up with a model I liked better for object pooling (calling it Recycling now):
// Traverses the full object graph and recycles poolable objects. Mesh is set to null on completion.
this.meshSerializer.Recycle(ref mesh);
This adds another large speedup over what I did earlier:
Method | WriteThrough + Recycle | WriteThrough | Baseline (lazy allocation) |
---|---|---|---|
SendRegionToClient | 12.65 us | 93 us | 1925 us |
SendVisibleRegionsToClient | 5,361.71 us | 38,177 us | 755,000 us |
SendVisibleMeshesToClient | 5,459.84 us | 35,542 us | 743,000 us |
SendMeshToClient | 313.27 us | 1,127 us | 2,906 us |
ModifyMeshAndSendToClients | 22,879.78 us | 39,161 us | 29,511 us |
Those changes are pushed now as well. Again -- I should stress that the Recycle changes are experimental and very much use-at-your-own-risk for the moment until I can get a full suite of tests built around it.
Very cool!!! Pulling it down.
Hi there -- I've pushed one final change to that branch for you. The main change is that I've yanked out all of the code for object pooling.
The good news is that I've replaced it with something better and simpler (for you, at least). I added a new serialization mode: LazyWriteThrough.
I'd encouraged you to use VectorCacheMutable in the past because it enabled write-through semantics. It could do this for a couple of reasons:
- Each time you access foo.bar[3].baz.bat[2], you get the same object instance back. It did this by preallocating all vectors and filling them with stubs (then gradually filling those stubs in as you accessed the properties).
- Writes went back to the buffer and also got stored on the object itself (which is why it is important that there is only one object per FlatBuffer element).

This was a big win for you because it saved so much deserialize/parse work. However, there was still a ton of array allocation happening to fill these big arrays with stubs. The work I did on object pooling helped a little bit, but there were still problems.
All in all, I was feeling uneasy about the approach, which is usually a sign I need to rethink things.
Switching gears, FlatSharp's Lazy mode avoids array allocations altogether: accessing foo.bar[2].baz.bat[3] twice gives you two different instances that point at the same spot in the buffer. Lazy also disallows all mutations. The great thing about Lazy is that if you're only referencing objects ephemerally, they can all be scooped up in Gen0 before the "expensive" GC kicks in. The lack of huge vectors also keeps GC at bay.
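To illustrate (reusing the SomeTable/Points names from the benchmark earlier in this thread):

// Illustrative only: in Lazy mode each access allocates a fresh, short-lived wrapper
// over the same underlying bytes, so nothing survives past Gen0 unless you hold on to it.
var first = parsed.Points[0];
var second = parsed.Points[0];
Debug.Assert(!object.ReferenceEquals(first, second));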
The new mode LazyWriteThrough combines Lazy with fs_writeThrough properties:
- Mutations are allowed, but only on fs_writeThrough properties. Table properties and those with fs_writeThrough disabled will still throw exceptions.
- Write-through fits naturally with Lazy since they don't cache the values anyway.
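A rough sketch of the usage pattern this enables, reusing the voxel schema from earlier in the thread (this assumes the Voxel fields are declared with fs_writeThrough and the table uses the new LazyWriteThrough serializer; the exact attribute placement may differ):

// Sketch only -- untested.
var region = VoxelRegion3D.Serializer.Parse(buffer);
var voxel = region.voxels[0];

// With Lazy + write-through, this assignment goes straight back into 'buffer';
// no stub vectors are allocated and no re-serialize is needed.
voxel.Hp = 0;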
So finally, benchmarks!
For context, here are the results of the original one you uploaded (I made a small tweak to stop allocating new byte[] for the fake network buffers and use a static one instead, since we're benchmarking FlatSharp and not the CLR allocator):
Method | ParseOption | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated | Completed Work Items | Lock Contentions |
---|---|---|---|---|---|---|---|---|---|---|
SendRegionToClient | Lazy | 1.498 ms | 0.0211 ms | 0.0012 ms | 121.0938 | - | - | 1.95 MB | 0.0039 | - |
SendVisibleRegionsToClient | Lazy | 587.325 ms | 20.9992 ms | 1.1510 ms | 48000.0000 | - | - | 781.33 MB | 2.0000 | - |
SendVisibleMeshesToClient | Lazy | 591.809 ms | 25.5498 ms | 1.4005 ms | 48000.0000 | - | - | 781.33 MB | 2.0000 | - |
SendMeshToClient | Lazy | 1.618 ms | 0.0699 ms | 0.0038 ms | 232.4219 | - | - | 3.71 MB | 0.0039 | - |
ModifyMeshAndSendToClients | Lazy | 29.721 ms | 5.3744 ms | 0.2946 ms | 1093.7500 | 656.2500 | 250.0000 | 19.93 MB | 0.0625 | - |
SendRegionToClient | GreedyMutable | 2.311 ms | 1.5809 ms | 0.0867 ms | 101.5625 | 62.5000 | 23.4375 | 1.95 MB | 0.0078 | - |
SendVisibleRegionsToClient | GreedyMutable | 914.440 ms | 40.3927 ms | 2.2141 ms | 41000.0000 | 25000.0000 | 10000.0000 | 781.35 MB | 2.0000 | - |
SendVisibleMeshesToClient | GreedyMutable | 905.911 ms | 477.1683 ms | 26.1552 ms | 41000.0000 | 25000.0000 | 10000.0000 | 781.35 MB | 2.0000 | - |
SendMeshToClient | GreedyMutable | 4.317 ms | 1.3701 ms | 0.0751 ms | 242.1875 | 148.4375 | 54.6875 | 4.15 MB | 0.0156 | - |
ModifyMeshAndSendToClients | GreedyMutable | 26.734 ms | 1.8844 ms | 0.1033 ms | 781.2500 | 468.7500 | 156.2500 | 16.03 MB | 0.0625 | - |
You can see how much the GC was running and the amount of data being allocated. Notice that though Lazy and Greedy allocated the same amount of data, Lazy was faster since it was collected in concurrent Gen0 collections instead of blocking Gen2 collections.
Switching to VectorCacheMutable with write-through helped:
Method | Mean | Error | StdDev | Completed Work Items | Lock Contentions | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|
SendRegionToClient | 4.147 us | 0.0810 us | 0.0044 us | 0.0000 | - | 0.0076 | - | - | 184 B |
SendVisibleRegionsToClient | 1,671.888 us | 23.1648 us | 1.2697 us | 0.0039 | - | 3.9063 | - | - | 73600 B |
SendVisibleMeshesToClient | 1,726.317 us | 61.3808 us | 3.3645 us | 0.0039 | - | 3.9063 | - | - | 73600 B |
SendMeshToClient | 112.981 us | 3.5482 us | 0.1945 us | 0.0002 | - | - | - | - | 224 B |
ModifyMeshAndSendToClients | 29,270.213 us | 638.2455 us | 34.9844 us | 0.0625 | - | 1468.7500 | 875.0000 | 312.5000 | 24577725 B |
However, ModifyMeshAndSendToClients actually got worse because of the extra costs of the stub objects and allocating vectors each time. The Gen2 numbers bear that out. LazyWriteThrough addresses all of these problems:
Method | Mean | Error | StdDev | Completed Work Items | Lock Contentions | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|
SendRegionToClient | 4.090 us | 0.0466 us | 0.0026 us | 0.0000 | - | 0.0076 | - | - | 152 B |
SendVisibleRegionsToClient | 1,765.334 us | 63.8413 us | 3.4994 us | 0.0039 | - | 1.9531 | - | - | 60800 B |
SendVisibleMeshesToClient | 1,809.066 us | 24.1105 us | 1.3216 us | 0.0039 | - | 1.9531 | - | - | 60800 B |
SendMeshToClient | 111.914 us | 6.4346 us | 0.3527 us | 0.0002 | - | - | - | - | 176 B |
ModifyMeshAndSendToClients | 12,675.831 us | 246.5131 us | 13.5122 us | 0.0313 | - | 1234.3750 | - | - | 20675816 B |
There are no Gen1 or Gen2 collections any longer (Gen0 is busy, but that is cheaper than fancy object pooling logic).
I will try and get FlatSharp version 5.4.0 published at some point this week. There is some documentation and samples that need to happen before that. I hope this helps you.
FlatSharp version 5.4.0 is published with support for Lazy + WriteThrough: https://github.com/jamescourtney/FlatSharp/releases/tag/5.4.0
Let me know if you need anything else.
Wow! Fantastic stuff here @jamescourtney. Really Really great stuff. Thanks and thanks again!
@Astn are you making use of the valueStructs branch to support struct Vector types, or are you just translating during your modify/generate steps?
I have Vector2, Vector3, Vector4 vectors as well as Matrix4x4s that at present need to be translated into the class types to support FlatBuffers interop.
I'd like to have an option to alias them to native/intrinsic CLR types as well as have them represented as value types, perhaps by an attribute or something (e.g. MyNamespace.Vector3f -> System.Numerics.Vector3).
@jamescourtney while FlatSharp would be slower in serialize/deserialize operations in this configuration, the copy and manipulation overhead and glue code would disappear.
> I'd like to have an option to alias them to native/intrinsic CLR types as well as have them represented as value types, perhaps by an attribute or something (e.g. MyNamespace.Vector3f -> System.Numerics.Vector3).
This exists today, but not when using FBS files. The feature is called type facades, and it allows you to define a higher-level type in terms of a lower-level one. Imagine you want FlatSharp to support DateTimeOffset. You can define a facade that maps from DateTimeOffset -> int64. The data is stored as int64 in the buffer, but the serialize/parse code maps that to DateTimeOffset by way of the facade. I've considered making the FlatSharp compiler extensible to support facades and custom type models, but have demurred because:
flatc.

> are you making use of the valueStructs branch to support struct Vector types, or are you just translating during your modify/generate steps?
I don't think that they are, though I could be wrong. That branch is just there for reference purposes and isn't being updated. Value structs did work, but had some significant drawbacks:
- Slower to parse/serialize.
- Required [StructLayout(LayoutKind.Explicit)].
- Added an entire new dimension to the test matrix. This is something I worry about because FlatSharp is a one-person project, and I don't work on it full time. I desperately want to avoid any data corruption issues and am generally conservative when adding new stuff unless there is unambiguous benefit. I've prototyped several features that never made it in for this reason (object pooling, value structs, etc.).
- C# guidance recommends structs be less than 16 bytes and be immutable, which doesn't give much wiggle room for interesting structs. You could argue that this isn't FlatSharp's decision to make, and I wouldn't fight you on it.
- Value structs are incompatible with other FlatSharp concepts (non-greedy deserialization, write-through, etc.).
> @jamescourtney while FlatSharp would be slower in serialize/deserialize operations in this configuration, the copy and manipulation overhead and glue code would disappear.
I get that. Let me sleep on the idea of bringing them back. If you really need structs, you do have the option of using the Google library, which does use structs. Of course, that may come with some other drawbacks.
@TYoungSL I am not using the valueStructs branch. I'm using the latest released version with the new features for Lazy + WriteThrough: https://github.com/jamescourtney/FlatSharp/releases/tag/5.4.0

The way I think about it is that FlatBuffers and FlatSharp are not giving me access to blittable memory for my vectors of structs, but the structs within a vector are effectively blittable with Lazy + writeThrough and still use value semantics when reading and writing.
The limitation here is I can't interact with the flatbuffers vector memory directly and have to go through FlatSharp to interact with it. So you will have to have a second copy of your data if you are going to do any SIMD or GPGPU work with it.
> So you will have to have a second copy of your data if you are going to do any SIMD or GPGPU work with it.
Yeah that's a lot of messy code. Would be easier to just access the vector within the buffer or to have a greedy struct model already in a ready state.
The vector data (indices, triangle lists) I'm working with is in the multi-gigabyte range.
Being able to request spans of value struct types (whether lazily or greedily) from a vector member would be ideal.
> Being able to request spans of value struct types (whether lazily or greedily) from a vector member would be ideal.
Do you mean a literal Span<T>?
> The vector data (indices, triangle lists) I'm working with is in the multi-gigabyte range.
Are these in one FlatBuffer? FlatSharp has a hard upper limit of 1GB or so per buffer. I could bump this to 2GB, but that's the max given the limits of int32 (Span<T> does not have indexers based on int64).
I do have a thought that might work for you. I'm not familiar with GPGPU, so please forgive my ignorance.
Right now, FlatSharp does define an interface called IFlatBufferDeserializedObject. Every deserialized object implements this interface, and it gives you access to a few things about the object, such as the IInputBuffer used to deserialize it.

What if I were to extend this to have two additional fields: AbsoluteOffset and Length? So imagine you had a struct that was logically a System.Numerics.Vector3. What you could do is:
void Process(FlatSharpVector3 vector)
{
    if (vector is IFlatBufferDeserializedObject deserialized)
    {
        int offset = deserialized.Offset;
        int length = deserialized.Length;

        Span<byte> data = deserialized.InputBuffer.GetByteMemory(offset, length).Span;
        System.Numerics.Vector3 numeric = MemoryMarshal.Cast<byte, Vector3>(data)[0];

        // something
    }
}
That would be very nice to be able to get span access to the vectors!
Just some fyi stuff for reference.
@Astn -- That would be really cheap to add. FlatSharp (of course) already knows all that information, but it doesn't expose it. Would that meaningfully improve your life?
Yeah, being able to access the backing buffer as a struct typed array is the ideal result.
The mesh data I'm dealing with is sliced into axis-aligned bounds, but one particularly large multi-mesh structure weighs in at over a hundred gigs. The 1GB/2GB limit may pose a problem in the future that we hadn't considered.
From peeking at @Astn's repo, he's working with terrain and voxel-to-mesh scenarios, so somewhat large scale too.
Being able to interop with ILGPU or .NET SIMD intrinsics as a fallback in these scenarios would be nice. Getting a ReadOnlySpan<byte> or Span<byte> that can be marshalled into Span<Vector3> or whatever would save a lot of glue code.
Also, the indices in the meshes are at largest 32-bit, so theoretically a single document object can be just over 16GB, though low-detail volumes are often even reduced to 16-bit indices.

We would probably subdivide the meshes further if reducing the document size to fit in a flatbuffer becomes necessary.
Unfortunately, there's nothing I can do about Span's int32 indexer limits. I really wish the CLR team would add a nint overload for the indexer and the Slice method.
I'll see about getting #175 and this addressed in the next week or so. Thanks for the discussion here @TYoungSL and @Astn . Hopefully we've arrived at a place where you guys are unblocked and I'm not extending FlatSharp in unnecessary ways.
> @Astn -- That would be really cheap to add. FlatSharp (of course) already knows all that information, but it doesn't expose it. Would that meaningfully improve your life?
Very much so! I have more than a few use cases where the data in a vector needs to be sent directly to the GPU, or to some unmanaged code.
The way to work around the Span limitation is to provide either a sequence of multiple Spans or allow providing an offset to the start pointer/reference; a span getter that accepts an offset and a length.
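Something like this hypothetical shape, for example (not an existing FlatSharp API):

// Hypothetical sketch of the suggested accessor.
public interface ISlicedInputBuffer
{
    // Returns a window of the underlying buffer, letting the caller walk a large
    // vector in chunks that each fit within Span's int32 limits.
    Span<byte> GetSpan(long absoluteOffset, int length);
}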
> The way to work around the Span limitation is to provide either a sequence of multiple Spans or allow providing an offset to the start pointer/reference; a span getter that accepts an offset and a length.
That would work if that was the only limit, but FlatBuffers also starts to run into some internal 32-bit limits as well. Basically, the FlatBuffer format uses int32/uint32 offsets a lot internally (these are relative offsets; however, FlatSharp likes to know absolute offsets). Here's one example of a table that would be un-serializable in FlatBuffers:
table Table
{
Vector1 : [ulong]; // int.maxValue elements at 8 bytes / element => 16 GB
Vector2 : [ulong]; // same
}
Tables store uint32 byte offsets to their non-struct fields, but in this case there is no offset that Vector2 could pick to get the right address. You'd have to change your definition to be a vector of Tables in this case, with each one being a fixed size. In which case, it's just as easy to model it as a series of independent FlatBuffers.
Hopefully this explains why I picked the 1GB limit for FlatSharp.
I've added a new interface, IFlatBufferAddressableStruct:

public interface IFlatBufferAddressableStruct
{
    int Offset { get; }
    int Size { get; }
    int Alignment { get; }
}
Deserialized classes implement this interface when:
Consider a vector of Vec3 structs:

struct Vec3 { x : float; y : float; z : float }
table SomeTable (fs_serializer:"lazy") { Points : [Vec3]; }
var parsed = SomeTable.Serializer.Parse(buffer);

// grab a reference to the first point and the length. This is all we need from FlatSharp's deserializer.
Vec3 vec = parsed.Points[0];
int length = parsed.Points.Count;

if (vec is IFlatBufferAddressableStruct @struct)
{
    int offset = @struct.Offset;
    int size = @struct.Size;
    int alignment = @struct.Alignment;

    System.Numerics.Vector3 vec3 = default;
    for (int i = 0; i < length; ++i)
    {
        // cast the input buffer into the SIMD-capable structure and increment the existing vector.
        vec3 += AsVec3(buffer, offset, size);

        // Advance offset and compensate for alignment differences. Vec3 won't have this problem, but
        // jagged structs might.
        offset += size;
        offset += SerializationHelpers.GetAlignmentError(offset, alignment);
    }
}

static System.Numerics.Vector3 AsVec3(Memory<byte> memory, int offset, int length)
{
    return MemoryMarshal.Cast<byte, System.Numerics.Vector3>(memory.Span.Slice(offset, length))[0];
}
> That would work if that was the only limit, but FlatBuffers also starts to run into some internal 32-bit limits as well. Basically, the FlatBuffer format uses int32/uint32 offsets a lot internally (these are relative offsets; however, FlatSharp likes to know absolute offsets).
Relative int32/uint32 offsets are definitely a problem in FlatBuffers when there is a vector that goes over the limit. The use of absolute offsets is a problem you might be able to address as individual issues come up. I think we can break up individual documents/messages to the point that it's not a concern, though; accessing a significant chunk of the data at a time with a zero-copy span to hand off to a SIMD/GPGPU process is good enough.
There may be times where we'd need a large contiguous buffer, and FlatBuffers may not work for those purposes, but we haven't run into it yet. One copy from contiguous or discontinuous buffers into virtually contiguous GPU space is not problematic; SIMD processes that expect contiguous buffers, where we want zero-copy operations, are something we can address later.
At some point we may need some extension to the official spec, e.g. 64-bit offsets, arbitrary-length packed ints for offsets, and incrementally relative offsets.
64-bit offsets, Varints/LEBs offsets, etc.; https://github.com/google/flatbuffers/projects/10#card-14545298
Forking the lib and creating FlatBuffers64 and FlatSharp64 is eyeroll-worthy, but easy. From https://google.github.io/flatbuffers/flatbuffers_internals.html:
> The most important and generic offset type (see flatbuffers.h) is uoffset_t, which is currently always a uint32_t, and is used to refer to all tables/unions/strings/vectors (these are never stored in-line). 32bit is intentional, since we want to keep the format binary compatible between 32 and 64bit systems, and a 64bit offset would bloat the size for almost all uses. A version of this format with 64bit (or 16bit) offsets is easy to set when needed. Unsigned means they can only point in one direction, which typically is forward (towards a higher memory location). Any backwards offsets will be explicitly marked as such.
Nested FlexBuffers support up to 64-bit sizing, strangely enough. I'm not sure how that would even be representable.
Looks like a reasonable way to shoe-horn 64-bit offset support would be to add an attribute for it per table. Topic for another issue.
5.5.0 is published on nuget.
Very cool @jamescourtney !!
Full docs are linked here, if you need them: https://github.com/jamescourtney/FlatSharp/releases/tag/5.5.0
Let me know how it goes for you, @Astn
I'm not sure if I'm missing something, but my current experience is that when using large arrays of FlatBuffers structs, the generated code uses classes for these structs, which causes a huge amount of GC activity when trying to serialize and deserialize these arrays.
For example:
Produces "struct" code like this:
This generates code where each of the structs [Vector2, Vector3, Vector4] is a C# class object.
These arrays can each be 20,000 items. When they are arrays of structs, that can be a single allocation. When they are arrays of classes, it bogs down the GC pretty hard.
What should I be doing here to work around this? Am I missing something that lets me treat the structs as structs?
Also here is a screen shot from profiling.