jamescourtney / FlatSharp

Fast, idiomatic C# implementation of Flatbuffers
Apache License 2.0

Option to treat structs as structs instead of objects #158

Closed Astn closed 3 years ago

Astn commented 3 years ago

I'm not sure if I'm missing something, but my current experience is with large arrays of FlatBuffers structs: the generated code uses classes for these structs, which causes a huge amount of GC activity when trying to serialize and deserialize these arrays.

For example:


struct Vector4
{
    x:float;
    y:float;
    z:float;
    w:float;
}
struct Vector3 
{
    x:float;
    y:float;
    z:float;
}
struct Vector2
{
    x:float;
    y:float;
}

table Mesh (fs_serializer:greedy)
{
    vertices:   [Vector3];
    normals:    [Vector3];
    uv:         [Vector2];
    uv2:        [Vector4];
    triangles:  [ushort];
}

Produces "struct" code like this:

[FlatBufferStruct]
[System.Runtime.CompilerServices.CompilerGenerated]
public partial class Vector2
    : object
{
    public Vector2()
    {
        checked
        {
            this.OnInitialized(null);
        }
    }

#pragma warning disable CS8618
    protected Vector2(FlatBufferDeserializationContext context)
    {
        checked
        {
        }
    }

#pragma warning restore CS8618
    public Vector2(Vector2 source)
    {
        checked
        {
            this.x = FlatSharp.Compiler.Generated.CloneHelpers_f062897219dd458591c35871e16c5150.Clone(source.x);
            this.y = FlatSharp.Compiler.Generated.CloneHelpers_f062897219dd458591c35871e16c5150.Clone(source.y);
            this.OnInitialized(null);
        }
    }

    partial void OnInitialized(FlatBufferDeserializationContext? context);

    protected void OnFlatSharpDeserialized(FlatBufferDeserializationContext? context) => this.OnInitialized(context);

    [FlatBufferItemAttribute(0)]
    public virtual System.Single x { get; set; }

    [FlatBufferItemAttribute(1)]
    public virtual System.Single y { get; set; }
}

This generates code where each of the structs (Vector2, Vector3, Vector4) is a C# class.

These arrays can each be 20,000 items. When they are arrays of structs, that can be a single allocation; when they are arrays of classes, it bogs down the GC pretty hard.

What should I be doing here to work around this? Am I missing something that lets me treat the structs as structs?

Also, here is a screenshot from profiling. [profiler screenshot]

jamescourtney commented 3 years ago

So, I did at one point prototype and build value-type structs, but benchmarks were actually in favor of the class-based approach, so I abandoned that branch. I'm happy to revive it at some point if there is a demonstrable reason to add all of that complexity.

One thing that immediately jumps out to me is that you're using Greedy deserialization. That will definitely put a lot of GC pressure on your application, especially if your vectors are large. Have you tried using Lazy or PropertyCache? Those should amortize the GC hit more evenly.
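
For reference, a minimal sketch of what switching modes can look like. This assumes FlatSharp's runtime FlatBufferSerializer API and an attribute-annotated (or generated) Mesh class; with fbs-compiled code, the equivalent is changing the fs_serializer attribute (e.g. fs_serializer:lazy). Verify the names against the version you're on.

using FlatSharp;

// Sketch only: parse the same payload lazily instead of greedily.
var lazySerializer = new FlatBufferSerializer(FlatBufferDeserializationOption.Lazy);
Mesh mesh = lazySerializer.Parse<Mesh>(sourceBytes);   // sourceBytes: byte[] containing the FlatBuffer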

jamescourtney commented 3 years ago

https://github.com/jamescourtney/FlatSharp/tree/valueStructs

Astn commented 3 years ago

I haven't yet tried those other deserialization options, as your docs seemed to indicate that if you were using all the data then Greedy would be better. My use case always needs all the data every time.

I'll give it a shot though.

jamescourtney commented 3 years ago

Let me know how it goes. FWIW -- the docs aren't really geared around 20k item vectors. My team at Microsoft uses this and we have similarly sized vectors (using FlatBuffers inside a file), and I observed the same GC hit. Switching to Lazy resolved it completely.

The problem with big vectors and Greedy serialization is:

The docs say Greedy is fastest because the cost for accessing a single property in Greedy mode is really cheap, but the semantics of Greedy force allocations before you need them. This is usually fine for small buffers. Lazy/PropertyCache are slightly slower, but don't allocate anything before you ask for it. So if you're looping through this list of 20k items, you'll have a few in memory at a time, but they'll all get swept up in Gen0 since the references are ephemeral.
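
As a rough illustration of that access pattern, using the Mesh schema from above and assuming it were compiled with fs_serializer:lazy (Accumulate is a placeholder consumer):

// Each element is materialized only when indexed; if the reference isn't stored,
// it dies young and is collected cheaply in Gen0.
var mesh = Mesh.Serializer.Parse(networkBytes);   // networkBytes: byte[] holding the buffer
foreach (var v in mesh.vertices)
{
    Accumulate(v.x, v.y, v.z);   // hypothetical consumer; don't hold onto 'v'
}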

Astn commented 3 years ago

I gave the lazy route a shot, and it didn't really help things in my use case.

I'm loading these objects and then sending the whole array of them over the network, so as quickly as the gRPC send function can iterate the array, they are being deserialized and then immediately serialized again. I think this makes Greedy mode more efficient right now.

I haven't had a chance to dig around that branch you linked, I'll see if I can take a peek tomorrow.

It's got me wondering if there would be a way to reuse the same instance of one of the array objects if it was accessed through an iterator. Maybe have a way to specify an allocator function that's used to feed the iterator; then you would have the option of having a new object each time, or reusing one you're hanging on to.

jamescourtney commented 3 years ago

I gave the lazy route a shot, and it didn't really help things in my use case.

Thanks for trying!

I'm loading these objects and then sending the whole array of them over the network, so as quickly as the gRPC send function can iterate the array, they are being deserialized and then immediately serialized again. I think this makes Greedy mode more efficient right now.

Are you doing any sort of translation to a different schema? I'm wondering why you need to deserialize at all if you're just piping the data around. Also -- wouldn't gRPC streaming work well for this? It doesn't seem like they all need to be in the array at once.

It's got me wondering if there would be a way to reuse the same instance of one of the array objects if it was accessed through an iterator. Maybe have a way to specify an allocator function that's used to feed the iterator; then you would have the option of having a new object each time, or reusing one you're hanging on to.

A way to "repoint" an object at a different spot in the buffer is an interesting idea that probably merits some deeper consideration. It's a little tricky given that FlatSharp's whole mantra is "subclass property implementations", so your application-level code has no knowledge of the actual class that it's dealing with, only the parent class.

I think what you'd probably want to do is give the ISerializer instance that you're using access to an object pool that it tries to consult before resorting to new. Your code would then be on the hook for returning objects to the pool when they are finished. In the worst case, your code never returns anything to the pool and the behavior is exactly as it is today with GC needing to clean it up.
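
As a rough sketch of the shape that could take (purely illustrative, not FlatSharp API):

using System.Collections.Concurrent;

// Illustrative only: the kind of pool an ISerializer could consult before calling 'new'.
public sealed class DeserializationPool<T> where T : class, new()
{
    private readonly ConcurrentBag<T> items = new();

    // The deserializer would call this instead of 'new T()'.
    public T Rent() => this.items.TryTake(out var item) ? item : new T();

    // Application code calls this when it is finished with the object.
    public void Return(T item) => this.items.Add(item);
}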

I'd need to think carefully about whether I want to actually implement this pool in a non-naive way and deal with all of the associated pitfalls of pooling memory, or just chuck it into an interface and let people like you (ie, those who care deeply) control it.

In any case, this is the most interesting area of thought for me with FlatSharp in quite some time.

jamescourtney commented 3 years ago

After sleeping on this, I'm sort of leaning the other way.

I think I'd like to make this work by letting the user define pooling properties in the fbs file (or C# if using attributes):

table SomeTable (fs_pool_size:"100") { ... } // keep 100 items
struct SomeStruct (fs_pool_size:"300") { ... }

This would translate into the normal Dispose pattern for C#:

[FlatBufferTable(PoolSize = 100)]
public partial class SomeTable : IDisposable // Use Dispose semantics for returning to the pool
{
     ~SomeTable() { this.Dispose(false); }
     public void Dispose() { this.Dispose(true); GC.SuppressFinalize(this); }
     protected virtual void Dispose(bool disposing) { this.OnDisposing(disposing); }

     partial void OnDisposing(bool disposing); // let partial classes also have the option to dispose stuff.
}

The dynamic subclasses would then look like:

public class SomeTableReader : SomeTable // dynamic subclass of SomeTable
{
      // Declaring pool as static field makes it really fast and avoids a layer of Type indirection.
      private static readonly ConcurrentBag<SomeTableReader> pool = new();

      private volatile int free;

      // Default ctor is private.
      private SomeTableReader() { }

      public static SomeTableReader GetOrCreate(IInputBuffer buffer, int offset)
      {
            if (!pool.TryTake(out var reader))
            {
                  reader = new SomeTableReader();
            }

            reader.Initialize(buffer, offset);
            return reader;
      }

      protected override void Dispose(bool disposing)
      {
            if (Interlocked.CompareExchange(ref this.free, 1, 0) == 0) // who should own thread safety? I'm not sure this accomplishes much because customers can still do use-after-dispose, and we don't want to have locks everywhere. This only prevents double-dispose.
            {
                base.Dispose(disposing);
                if (pool.Count < 100) // comes from the fs_pool_size attribute
                {
                    this.Clear();
                    pool.Add(this);
                }
            }
      }

      // reset this object to clean state, release references so GC can reap them
      private void Clear() { Debug.Assert(this.free == 1); }

      // (re)initialize this object
      private void Initialize(IInputBuffer buffer, int offset) { this.free = 0; }
}

My main questions/concerns here are around thread safety. There are two main issues:

A non-thread-safety issue would be if people were running this in a heterogeneous environment. Would they want different pool sizes to somehow be configurable? A static attribute isn't very helpful there. Though perhaps a value of -1 could indicate that the pool should grow until it is right-sized. Of course, if the workload is bursty it will hang onto extra items, though this isn't dissimilar to List<T>.

Astn commented 3 years ago

I perused that branch you posted, and I think it will take me reading through a bit more code to really grok how things are working. I was a bit busy today, so I may have more time tomorrow.

I'm liking your disposable idea for returning items to the pool; I may have some thoughts on it after I digest it more.

A few related ideas for chewing on.

jamescourtney commented 3 years ago

I've spent some time on the IDisposable idea today. I moved away from strictly using IDisposable and ended up extending an interface that I already have by adding a Release method, which has the same semantics as what I talked about above. I'm a little leery of IDisposable since it implies some sort of native resource under the hood that needs disposal, and most other object pool APIs use Rent/Return semantics.

var parsed = SomeTable.Serializer.Parse(buffer);
...
if (parsed is IFlatBufferDeserializedObject deserialized)
{
     deserialized.Release();
}

I have most of this prototyped and it seems at least somewhat promising. However, there are some places where this becomes incompatible with other features (init-only property setters don't play nice with reusing the same object!), so I need to refactor my way out of this one.
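
To make the init-only conflict concrete: an init setter can only run while the object is being constructed, so a pooled instance can't be re-pointed at a new buffer through it. A minimal illustration:

public class Point
{
    public float X { get; init; }

    public static void Demo()
    {
        var p = new Point { X = 1f };   // fine: assigned in an object initializer
        // p.X = 2f;                    // error CS8852: init-only property can only be assigned in an initializer or constructor
    }
}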

jamescourtney commented 3 years ago

I have a very beta build of the pooling behavior (as in, it looks like it works, but I need to write actual tests). You can grab it from the CI build here (look under artifacts): https://github.com/jamescourtney/FlatSharp/actions/runs/902280796. Alternatively, you can clone the 'objectPool' branch and build it yourself.

To enable pooling, you've got to do two things:

1) Add the fs_pool attribute to your fbs. I'm not settled on this yet

table SomeTable (fs_pool) { }

2) Actually recycle your objects after you're done with them:

var parsed = SomeTable.Serializer.Parse(buffer);
...
if (parsed is IFlatBufferDeserializedObject deserialized)
{
     deserialized.Release();
}

When the fs_pool attribute is specified, returned items are sent to a ConcurrentBag after you release them. Release is a no-op when fs_pool is not specified. There is some tracking for double-release and use-after-release, though it is not sophisticated. You can enable better error messages for these scenarios by setting:

FlatSharpGlobalSettings.CollectPooledObjectStackTraces = true;

I'd love to know if this helps your scenario or not.

Astn commented 3 years ago

Very cool, and thanks for helping on this!

I'll set up a benchmark to test. Might take me a few hours.

jamescourtney commented 3 years ago

So, I ran a few benchmarks myself. ConcurrentBag is apparently pretty slow. I've just pushed a new version that switches to [ThreadStatic] and an old-fashioned Queue (https://github.com/jamescourtney/FlatSharp/actions/runs/904207394).

Here are the rough benchmarks for a full traversal of this schema (20k items)

table SomeTable (fs_serializer:Lazy) {
    Points : [Vec3];
}

struct Vec3 (fs_nonVirtual, fs_pool) {
    X : float;
    Y : float;
    Z : float;
}

[Benchmark]
public int ParseAndTraverse()
{
    var t = SomeTable.Serializer.Parse(this.inputBuffer);

    int sum = 0;

    var points = t.Points;
    int count = points.Count;
    for (int i = 0; i < count; ++i)
    {
        var item = points[i];
        sum += (int)(item.X + item.Y + item.Z);

        ((IFlatBufferDeserializedObject)item).Release();
    }

    return sum;
}
| Pooled | Mode | Object Pool | Full Traversal Time (us) |
|---|---|---|---|
| No | Greedy | - | 715 |
| No | Lazy | - | 214 |
| Yes | Greedy | ConcurrentBag | 1000 |
| Yes | Lazy | ConcurrentBag | 1100 |
| Yes | Greedy | ThreadLocal<Queue> | 774 |
| Yes | Lazy | ThreadLocal<Queue> | 691 |
| Yes | Greedy | [ThreadStatic] + Queue | 630 |
| Yes | Lazy | [ThreadStatic] + Queue | 513 |
| Yes | Greedy | [ThreadStatic] + Stack | 616 |
| Yes | Lazy | [ThreadStatic] + Stack | 522 |

My commentary on these results is:

This seems to require a bit more tinkering. Would you mind just giving me a simple little benchmark that reflects your scenario so I can play with it some more on my own?

Astn commented 3 years ago

I've had quite a day, and haven't been able to work much today so far.

Here is the core of my FlatBuffers definition. Just dropping it here so you don't have to wait more on it. I'll try to get that benchmark showing a use case going ASAP.


struct Vector4
{
    x:float;
    y:float;
    z:float;
    w:float;
}

struct Vector3Int 
{
    x:int32;
    y:int32;
    z:int32;
}

struct Vector3 
{
    x:float;
    y:float;
    z:float;
}

struct Vector2Int
{
    x:int32;
    y:int32;
}

struct Vector2
{
    x:float;
    y:float;
}

struct Color
{
    r: ubyte;
    g: ubyte;
    b: ubyte;
    a: ubyte;
}

struct Voxel
{
    VoxelType: ubyte;
    SubType : ubyte;
    Hp:ubyte;
    Unused:ubyte;
}

table Mesh (fs_serializer:greedy)
{
    vertices:   [Vector3];
    normals:    [Vector3];
    uv:         [Vector2];
    color:      [Color];
    triangles:  [ushort];
}

table VoxelRegion3D (fs_serializer:greedy)
{
    location: Vector3Int;
    iteration: uint32;
    size: ushort;
    voxels: [Voxel];
}

But to summarize the use case:

Any time a value in the voxels[] is changed, the mesh needs to be rebuilt. This results in

Astn commented 3 years ago

I've added a use case benchmark to https://github.com/jamescourtney/FlatSharp/pull/163

jamescourtney commented 3 years ago

Thanks for the detail and the branch! I will take a look this evening.

Couple of comments/questions that might help:

  • Deserialize VoxelRegion3d
  • Modify voxel
  • Generate new Mesh
  • Serialize VoxelRegion3d
  • Save VoxelRegion3d

Version 5.3 of FlatSharp includes support for write-through properties to the underlying buffer. If you were to create a MemoryInputBuffer based on a memory-mapped file, you could potentially combine all of these steps into one. This allows you to do an in-place update to the existing buffer without a full parse/re-serialize and automatically flush that to disk. FlatSharp doesn't do anything with memory-mapped files on its own, but it should be possible. https://github.com/dotnet/runtime/issues/24805 might be able to help. There are a few constraints here:

  • Load already saved mesh
  • Send over Grpc to client

Is this gRPC call a batch mode or streaming mode? If you're using gRPC streaming you may be able to get away from loading everything into a giant array, though you mentioned some GPU processing as well which might be driving this requirement.
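
For what it's worth, here is a rough sketch of the memory-mapped-file idea above, following the approach discussed in dotnet/runtime#24805: wrap the mapped view as Memory<byte> so it can back something like MemoryInputBuffer and be mutated in place via write-through. Names and lifetime handling are illustrative, not FlatSharp API.

using System;
using System.Buffers;
using System.IO.MemoryMappedFiles;

// Sketch: expose a MemoryMappedViewAccessor as Memory<byte>.
public unsafe sealed class MappedFileMemoryManager : MemoryManager<byte>
{
    private readonly MemoryMappedViewAccessor accessor;
    private readonly byte* pointer;
    private readonly int length;

    public MappedFileMemoryManager(MemoryMappedViewAccessor accessor, int length)
    {
        this.accessor = accessor;
        this.length = length;

        byte* p = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
        this.pointer = p;
    }

    public override Span<byte> GetSpan() => new Span<byte>(this.pointer, this.length);

    public override MemoryHandle Pin(int elementIndex = 0) => new MemoryHandle(this.pointer + elementIndex);

    public override void Unpin() { }

    protected override void Dispose(bool disposing)
    {
        this.accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        if (disposing)
        {
            this.accessor.Dispose();
        }
    }
}

// Usage sketch: new MappedFileMemoryManager(accessor, length).Memory can then back an
// input buffer (e.g. FlatSharp's MemoryInputBuffer) for in-place, write-through edits.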

jamescourtney commented 3 years ago

Hey -- so I've tinkered with this, and managed to speed things up a bunch using the write-through option I mentioned above. Here's the original bench you shared:

| Method | ParseOption | Mean | Error | StdDev |
|---|---|---|---|---|
| SendRegionToClient | Lazy | 1.925 ms | 0.3082 ms | 0.0169 ms |
| SendVisibleRegionsToClient | Lazy | 755.228 ms | 169.9860 ms | 9.3175 ms |
| SendVisibleMeshesToClient | Lazy | 743.959 ms | 3.6774 ms | 0.2016 ms |
| SendMeshToClient | Lazy | 2.906 ms | 2.0721 ms | 0.1136 ms |
| ModifyMeshAndSendToClients | Lazy | 29.511 ms | 41.8578 ms | 2.2944 ms |
| SendRegionToClient | GreedyMutable | 2.922 ms | 1.6592 ms | 0.0909 ms |
| SendVisibleRegionsToClient | GreedyMutable | 1,174.830 ms | 703.3006 ms | 38.5503 ms |
| SendVisibleMeshesToClient | GreedyMutable | 1,199.901 ms | 306.4329 ms | 16.7966 ms |
| SendMeshToClient | GreedyMutable | 5.925 ms | 5.9447 ms | 0.3258 ms |
| ModifyMeshAndSendToClients | GreedyMutable | 27.656 ms | 11.5352 ms | 0.6323 ms |

After changing it to writethrough, these turn into:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| SendRegionToClient | 93.78 us | 6.696 us | 0.367 us |
| SendVisibleRegionsToClient | 38,177.58 us | 5,279.602 us | 289.393 us |
| SendVisibleMeshesToClient | 35,542.62 us | 47,627.519 us | 2,610.625 us |
| SendMeshToClient | 1,127.36 us | 508.945 us | 27.897 us |
| ModifyMeshAndSendToClients | 39,161.27 us | 10,160.844 us | 556.950 us |

This looks to be a speedup of a couple of orders of magnitude for a lot of these tests. The downside is that I had to overallocate some of the arrays to accommodate the variable number of items in the mesh vectors. Whether this works for you or not I couldn't say.

            // Update fillSize to accommodate max of (fillSize * 3). Some items may be null.
            Mesh mesh = new Mesh
            {
                color = new Color[fillSize * 3],
                normals = new Vector3[fillSize * 3],
                triangles = new ushort[fillSize * 3],
                uv = new Vector2[fillSize * 3],
                vertices = new Vector3[fillSize * 3]
            };

You can find the code I used here: https://github.com/jamescourtney/FlatSharp/tree/voxelBench/src/Benchmarks/ExperimentalBenchmark

Yours are largely unchanged, and mine are copied and named Modified.

Astn commented 3 years ago

Very cool! I'll check it out now!

Astn commented 3 years ago

I've read through your modified version and am stunned by how much better it is. It will take me a while to comprehend why the changes you made were so impactful! Over-allocating the mesh is not a problem. I use LZ4 on each byte[] before it gets persisted. I'm not sure if that messes up your ideas with memory-mapped files though. I've been storing everything using RocksDB. Maybe some things would be better in memory-mapped files though. But the LZ4 compression on over-allocated buffers and regions of voxels is really great.

jamescourtney commented 3 years ago

Cool! Glad I could help :) I'm going to drop the object pooling approach for now since I've thought of some things about it that I dislike. What might be possible is an additional parse API that reuses the same object graph when possible:

((IFlatBufferDeserializedObject)something).LoadFrom(byte[] buffer)

The short version of why it helps is that the fs_writeThrough attribute makes the mutations directly in the underlying buffer, so you're saving a fortune on copies. The other thing is that VectorCacheMutable lazily initializes the items as they are read. If you actually read through the serializer code that FlatSharp spits out, you'll see something like this (which I've lightly annotated):

  // This is the 'x' property of the Vec3 structure. 
  // The base class is virtual and FlatSharp overrides it to speak FlatBuffer.
  public override System.Single x
  {
      get
      {
          // Test to see if it's already in memory.
          if ((this.__mask0 & (byte)1) == 0)
          {
              // If not, read it and update the bit mask.
              this.__index0Value = ReadIndex0Value(this.__buffer, this.__offset, default, default);
              this.__mask0 |= (byte)1;
          }
          return this.__index0Value;
      }

      set
      {
           // Set the value of the backing field.
           this.__index0Value = value; 

           // update mask to indicate that this value is now in memory and doesn't need to be pulled from the buffer.
           this.__mask0 |= (byte)1;      

           // fs_writeThrough injects this line, which writes the new value back to the underlying buffer.
           WriteIndex0Value(this.__buffer, __offset, value);
      }
  }

There are two key parts of this:

You can ignore what I said about files -- I was assuming you were storing your Flatbuffers directly on disk. I've used RocksDb before. You might want to be careful about using LZ4 yourself unless you've explicitly disabled compression in RocksDb. It's been a few years, but I recall that it has Snappy and/or LZ4 linked in.
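
To make the write-through effect concrete, here is a minimal usage sketch. It assumes the struct fields carry fs_writeThrough and the serializer uses a mode that supports it (VectorCacheMutable at the time of this discussion); the buffer-loading helper is hypothetical.

byte[] buffer = LoadRegionBytes();                 // hypothetical: raw FlatBuffer bytes
var table = SomeTable.Serializer.Parse(buffer);

var p = table.Points[0];
p.X += 1.0f;   // the generated setter's WriteIndex0Value(...) call updates 'buffer' in place

// 'buffer' already reflects the edit; it can be persisted or sent without re-serializing.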

jamescourtney commented 3 years ago

By the way -- I did push another update to the voxelBench branch that modestly improves the perf from before. Mostly because I dropped that IsNull property from the structs and just added a length property to the Mesh. This avoids the alignment padding issues when you have a trailing byte on a struct that you're storing in a vector, and will save quite a bit of space.

Astn commented 3 years ago

Thanks for helping me understand the changes. And I'm pulling in your updates.

I'm going to read through more of the code and see if I can wrap my head around it.

jamescourtney commented 3 years ago

I managed to come up with a model I liked better for object pooling (calling it Recycling now):

// Traverses the full object graph and recycles poolable objects. Mesh is set to null on completion.
this.meshSerializer.Recycle(ref mesh);

This adds another large speedup over what I did earlier:

| Method | WriteThrough + Recycle | WriteThrough | Baseline (lazy allocation) |
|---|---|---|---|
| SendRegionToClient | 12.65 us | 93 us | 1925 us |
| SendVisibleRegionsToClient | 5,361.71 us | 38,177 us | 755,000 us |
| SendVisibleMeshesToClient | 5,459.84 us | 35,542 us | 743,000 us |
| SendMeshToClient | 313.27 us | 1,127 us | 2,906 us |
| ModifyMeshAndSendToClients | 22,879.78 us | 39,161 us | 29,511 us |

Those changes are pushed now as well. Again -- I should stress that the Recycle changes are experimental and very much use-at-your-own-risk for the moment until I can get a full suite of tests built around it.

Astn commented 3 years ago

Very cool!!! Pulling it down.

jamescourtney commented 3 years ago

Hi there -- I've pushed one final change to that branch for you. The main change is that I've yanked out all of the code for object pooling.

The good news is that I've replaced it with something better and simpler (for you, at least). I added a new serialization mode: LazyWriteThrough.

I'd encouraged you to use VectorCacheMutable in the past because it enabled write-through semantics. It could do this for a couple of reasons:

This was a big win for you because it saved so much deserialize/parse work. However, there was still a ton of array allocation happening to fill these big arrays with stubs. The work I did on object pooling helped a little bit, but there were still problems:

All in all, I was feeling uneasy about the approach, which is usually a sign I need to rethink things.


Switching gears, FlatSharp's Lazy mode avoids array allocations altogether: Accessing foo.bar[2].baz.bat[3] twice gives you two different instances that point at the same spot in the buffer. Lazy also disallows all mutations. The great thing about Lazy is that if you're only referencing objects ephemerally, they can all be scooped up in Gen0 before the "expensive" GC kicks in. The lack of huge vectors also keeps GC at bay.
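
A tiny sketch of what that means in practice, reusing the SomeTable / Vec3 names from earlier:

using System.Diagnostics;

var parsed = SomeTable.Serializer.Parse(buffer);   // Lazy mode
var a = parsed.Points[0];
var b = parsed.Points[0];

Debug.Assert(!ReferenceEquals(a, b));   // two distinct wrapper objects...
Debug.Assert(a.X == b.X);               // ...reading the same bytes in the buffer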

The new LazyWriteThrough mode combines Lazy parsing with fs_writeThrough properties:

So finally, benchmarks!

For context, here are the results of the original one you uploaded (I made a small tweak to stop allocating new byte[] for the fake network buffers and use a static one instead, since we're benchmarking FlatSharp and not the CLR allocator)

| Method | ParseOption | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated | Completed Work Items | Lock Contentions |
|---|---|---|---|---|---|---|---|---|---|---|
| SendRegionToClient | Lazy | 1.498 ms | 0.0211 ms | 0.0012 ms | 121.0938 | - | - | 1.95 MB | 0.0039 | - |
| SendVisibleRegionsToClient | Lazy | 587.325 ms | 20.9992 ms | 1.1510 ms | 48000.0000 | - | - | 781.33 MB | 2.0000 | - |
| SendVisibleMeshesToClient | Lazy | 591.809 ms | 25.5498 ms | 1.4005 ms | 48000.0000 | - | - | 781.33 MB | 2.0000 | - |
| SendMeshToClient | Lazy | 1.618 ms | 0.0699 ms | 0.0038 ms | 232.4219 | - | - | 3.71 MB | 0.0039 | - |
| ModifyMeshAndSendToClients | Lazy | 29.721 ms | 5.3744 ms | 0.2946 ms | 1093.7500 | 656.2500 | 250.0000 | 19.93 MB | 0.0625 | - |
| SendRegionToClient | GreedyMutable | 2.311 ms | 1.5809 ms | 0.0867 ms | 101.5625 | 62.5000 | 23.4375 | 1.95 MB | 0.0078 | - |
| SendVisibleRegionsToClient | GreedyMutable | 914.440 ms | 40.3927 ms | 2.2141 ms | 41000.0000 | 25000.0000 | 10000.0000 | 781.35 MB | 2.0000 | - |
| SendVisibleMeshesToClient | GreedyMutable | 905.911 ms | 477.1683 ms | 26.1552 ms | 41000.0000 | 25000.0000 | 10000.0000 | 781.35 MB | 2.0000 | - |
| SendMeshToClient | GreedyMutable | 4.317 ms | 1.3701 ms | 0.0751 ms | 242.1875 | 148.4375 | 54.6875 | 4.15 MB | 0.0156 | - |
| ModifyMeshAndSendToClients | GreedyMutable | 26.734 ms | 1.8844 ms | 0.1033 ms | 781.2500 | 468.7500 | 156.2500 | 16.03 MB | 0.0625 | - |

You can see how much the GC was running and the amount of data being allocated. Notice that though Lazy and Greedy allocated the same amount of data, Lazy was faster since it was collected in Concurrent Gen0 collections instead of blocking Gen2 collections.

Switching to VectorCacheMutable with writethrough helped:

| Method | Mean | Error | StdDev | Completed Work Items | Lock Contentions | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| SendRegionToClient | 4.147 us | 0.0810 us | 0.0044 us | 0.0000 | - | 0.0076 | - | - | 184 B |
| SendVisibleRegionsToClient | 1,671.888 us | 23.1648 us | 1.2697 us | 0.0039 | - | 3.9063 | - | - | 73600 B |
| SendVisibleMeshesToClient | 1,726.317 us | 61.3808 us | 3.3645 us | 0.0039 | - | 3.9063 | - | - | 73600 B |
| SendMeshToClient | 112.981 us | 3.5482 us | 0.1945 us | 0.0002 | - | - | - | - | 224 B |
| ModifyMeshAndSendToClients | 29,270.213 us | 638.2455 us | 34.9844 us | 0.0625 | - | 1468.7500 | 875.0000 | 312.5000 | 24577725 B |

However, ModifyMeshAndSendToClients actually got worse because of the extra costs of the stub objects and allocating vectors each time. The Gen2 numbers bear that out. LazyWriteThrough addresses all of these problems:

| Method | Mean | Error | StdDev | Completed Work Items | Lock Contentions | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|
| SendRegionToClient | 4.090 us | 0.0466 us | 0.0026 us | 0.0000 | - | 0.0076 | - | - | 152 B |
| SendVisibleRegionsToClient | 1,765.334 us | 63.8413 us | 3.4994 us | 0.0039 | - | 1.9531 | - | - | 60800 B |
| SendVisibleMeshesToClient | 1,809.066 us | 24.1105 us | 1.3216 us | 0.0039 | - | 1.9531 | - | - | 60800 B |
| SendMeshToClient | 111.914 us | 6.4346 us | 0.3527 us | 0.0002 | - | - | - | - | 176 B |
| ModifyMeshAndSendToClients | 12,675.831 us | 246.5131 us | 13.5122 us | 0.0313 | - | 1234.3750 | - | - | 20675816 B |

There are no Gen1 or Gen2 collections any longer (Gen0 is busy, but that is cheaper than fancy object pooling logic).

I will try to get FlatSharp version 5.4.0 published at some point this week. There are some documentation and samples that need to happen before that. I hope this helps you.

jamescourtney commented 3 years ago

FlatSharp version 5.4.0 is published with support for Lazy + WriteThrough: https://github.com/jamescourtney/FlatSharp/releases/tag/5.4.0

Let me know if you need anything else.

Astn commented 3 years ago

Wow! Fantastic stuff here @jamescourtney. Really Really great stuff. Thanks and thanks again!

TYoungSL commented 3 years ago

@Astn are you making use of the valueStructs branch to support struct Vector types, or are you just translating during your modify/generate steps?

I have Vector2, Vector3, and Vector4 vectors, as well as Matrix4x4s, that at present need to be translated into the class types to support FlatBuffers interop.

I'd like to have an option to alias them to native/intrinsic CLR types as well have them represented as value types, perhaps by an attribute or something. (e.g. MyNamespace.Vector3f -> System.Numerics.Vector3)

@jamescourtney while FlatSharp would be slower in serialize/deserialize operation in this configuration, the copy and manipulation overhead and glue code would disappear.

jamescourtney commented 3 years ago

I'd like to have an option to alias them to native/intrinsic CLR types as well have them represented as value types, perhaps by an attribute or something. (e.g. MyNamespace.Vector3f -> System.Numerics.Vector3)

This exists today, but not when using FBS files. The feature is called type facades, and it allows you to define a higher-level type in terms of a lower-level one. Imagine you want FlatSharp to support DateTimeOffset. You can define a Facade that maps from DateTimeOffset -> int64. The data is stored as int64 in the buffer, but the serialize/parse code maps that to DateTimeOffset by way of the facade. I've considered making the FlatSharp compiler extensible to support Facades and custom Type Models, but have demurred because:

are you making use of the valueStructs branch to support struct Vector types, or are you just translating during your modify/generate steps?

I don't think that they are, though I could be wrong. That branch is just there for reference purposes and isn't being updated. Value structs did work, but had some significant drawbacks:

@jamescourtney while FlatSharp would be slower in serialize/deserialize operation in this configuration, the copy and manipulation overhead and glue code would disappear.

I get that. Let me sleep on the idea of bringing them back. If you really need structs, you do have the option of using the Google library, which does use structs. Of course, that may come with some other drawbacks.

Astn commented 3 years ago

@TYoungSL I am not using the valueStructs branch. I'm using the latest released version with the new features for Lazy + Writethrough https://github.com/jamescourtney/FlatSharp/releases/tag/5.4.0

The way I think about it is that FlatBuffers and FlatSharp are not giving me access to blittable memory for my vectors of structs, but the structs within a vector are effectively blittable with Lazy + writethrough and still use value semantics when reading and writing.

The limitation here is I can't interact with the flatbuffers vector memory directly and have to go through FlatSharp to interact with it. So you will have to have a second copy of your data if you are going to do any SIMD or GPGPU work with it.

TYoungSL commented 3 years ago

So you will have to have a second copy of your data if you are going to do any SIMD or GPGPU work with it.

Yeah, that's a lot of messy code. It would be easier to just access the vector within the buffer, or to have a greedy struct model already in a ready state.

The vector data (indices, triangle lists) I'm working with is in the multi-gigabyte range.

Being able to request spans of value struct types (whether lazily or greedily) from a vector member would be ideal.

jamescourtney commented 3 years ago

Being able to request spans of value struct types (whether lazily or greedily) from a vector member would be ideal.

Do you mean literal Span<T>?

The vector data (indices, triangle lists) I'm working with is in the multi-gigabyte range.

Are these in one FlatBuffer? FlatSharp has a hard upper limit of 1GB or so per buffer. I could bump this to 2GB, but that's the max given the limits of int32 (Span<T> does not have indexers based on int64).

I do have a thought that might work for you. I'm not familiar with GPGPU, so please forgive my ignorance.

Right now, FlatSharp does define an interface called IFlatBufferDeserializedObject. Every deserialized object implements this interface, and it gives you access to a few things about the object, such as the IInputBuffer used to deserialize it.

What if I were to extend this to have two additional fields: AbsoluteOffset and Length? So imagine you had a struct that was logically a System.Numerics.Vector3. What you could do is:

void Process(FlatSharpVector3 vector)
{
       if (vector is IFlatBufferDeserializedObject deserialized)
       {
              int offset = deserialized.Offset;
              int length = deserialized.Length;
              Span<byte> data = deserialized.InputBuffer.GetByteMemory(offset, length).Span;
              System.Numerics.Vector3 numeric = MemoryMarshal.Cast<byte, Vector3>(data)[0];
              // something
       }
}
Astn commented 3 years ago

That would be very nice to be able to get span access to the vectors!

Just some fyi stuff for reference.

jamescourtney commented 3 years ago

@Astn -- That would be really cheap to add. FlatSharp (of course) already knows all that information, but it doesn't expose it. Would that meaningfully improve your life?

TYoungSL commented 3 years ago

Yeah, being able to access the backing buffer as a struct typed array is the ideal result.

The mesh data I'm dealing with is sliced into axis-aligned bounds, but one particularly large multi-mesh structure weighs in at over a hundred gigs. The 1GB/2GB limit may pose a problem in the future that we hadn't considered.

From peeking at @Astn's repo, he's working with terrain and voxel-to-mesh scenarios, so somewhat large scale too.

Being able to interop with ILGPU or .NET SIMD intrinsics as fallback in these scenarios would be nice. Getting a ReadOnlySpan<byte> or Span<byte> that can be marshalled into Span<Vector3> or whatever would save a lot of glue code.

TYoungSL commented 3 years ago

Also, the indices in the meshes are at largest 32-bit, so theoretically a single document object can be just over 16 GB, though low-detail volumes are often even reduced to 16-bit indices.

We would probably subdivide the meshes further if reducing the document size to fit in a flatbuffer becomes necessary.

jamescourtney commented 3 years ago

Unfortunately, there's nothing I can do about Span's int32 indexer limits. I really wish the CLR team would add a nint overload for the indexer and the Slice method.

I'll see about getting #175 and this addressed in the next week or so. Thanks for the discussion here @TYoungSL and @Astn . Hopefully we've arrived at a place where you guys are unblocked and I'm not extending FlatSharp in unnecessary ways.

Astn commented 3 years ago

@Astn -- That would be really cheap to add. FlatSharp (of course) already knows all that information, but it doesn't expose it. Would that meaningfully improve your life?

Very much so! I have more than a few use cases where the data in a vector needs to be sent directly to the GPU or to some unmanaged code.

TYoungSL commented 3 years ago

The way to work around the Span limitation is to provide either a sequence of multiple Spans or allow providing an offset to the start pointer/reference; a span getter that accepts an offset and a length.

jamescourtney commented 3 years ago

The way to work around the Span limitation is to provide either a sequence of multiple Spans or allow providing an offset to the start pointer/reference; a span getter that accepts an offset and a length.

That would work if that was the only limit, but FlatBuffers also starts to run into some internal 32-bit limits as well. Basically, the FlatBuffer format uses int32/uint32 offsets a lot internally (these are relative offsets, however FlatSharp likes to know absolute offsets). Here's one example of a table that would be un-serializable in FlatBuffers:

table Table
{
       Vector1 : [ulong];              // int.maxValue elements at 8 bytes / element => 16 GB
       Vector2 : [ulong];              // same
}

Tables store uint32 byte offsets to their non-struct fields, but in this case there is no offset that Vector2 could pick to get the right address. You'd have to change your definition to be a vector of Tables in this case, with each one being a fixed size. In which case, it's just as easy to model it as a series of independent FlatBuffers.

Hopefully this explains why I picked the 1GB limit for FlatSharp.

jamescourtney commented 3 years ago

#178 adds support for IFlatBufferAddressableStruct:

public interface IFlatBufferAddressableStruct
{
     int Offset { get; }
     int Size { get; }
     int Alignment { get; }
}

Deserialized classes implement this interface when:

Usage

Consider a vector of Vec3 structs:

struct Vec3 { x : float; y : float; z : float; }
table SomeTable (fs_serializer:"lazy") { Points : [Vec3]; }

  var parsed = SomeTable.Serializer.Parse(buffer);

  // grab a reference to the first point and the length. This is all we need from FlatSharp's deserializer.
  Vec3 vec = parsed.Points[0];
  int length = parsed.Points.Count;

  if (vec is IFlatBufferAddressableStruct @struct)
  {
      int offset = @struct.Offset;
      int size = @struct.Size;
      int alignment = @struct.Alignment;

      System.Numerics.Vector3 vec3 = default;
      for (int i = 0; i < length; ++i)
      {
          // cast the input buffer into the SIMD-capable structure and increment the existing vector.
          vec3 += AsVec3(buffer, offset, size);

          // Advance offset and compensate for alignment differences. Vec3 won't have this problem, but 
          // jagged structs might.
          offset += size;
          offset += SerializationHelpers.GetAlignmentError(offset, alignment);
      }
  }

  static System.Numerics.Vector3 AsVec3(Memory<byte> memory, int offset, int length)
  {
      return MemoryMarshal.Cast<byte, System.Numerics.Vector3>(memory.Span.Slice(offset, length))[0];
  }
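
A possible follow-on to the per-element loop above: when the struct vector is tightly packed (the struct's size equals its stride, which holds for a 12-byte Vec3), the whole region can be reinterpreted in a single cast. This is a sketch built on the interface above, not an existing FlatSharp helper.

  static Span<System.Numerics.Vector3> AsVec3Span(Memory<byte> memory, IFlatBufferAddressableStruct first, int count)
  {
      // Slice from the first element's offset to the end of the last element, then reinterpret.
      Span<byte> bytes = memory.Span.Slice(first.Offset, first.Size * count);
      return System.Runtime.InteropServices.MemoryMarshal.Cast<byte, System.Numerics.Vector3>(bytes);
  }
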
TYoungSL commented 3 years ago

That would work if that was the only limit, but FlatBuffers also starts to run into some internal 32-bit limits as well. Basically, the FlatBuffer format uses int32/uint32 offsets a lot internally (these are relative offsets, however FlatSharp likes to know absolute offsets).

Relative int32/uint32 offsets are definitely a problem in FlatBuffers when there is a vector that goes over the limit. The use of absolute offsets is a problem you might be able to address as individual issues come up. I think we can break up individual documents/messages to the point that it's not a concern, though; accessing a significant chunk of the data at a time with a zero-copy span to hand off to a SIMD/GPGPU process is good enough.

There may be times when we'd need a large contiguous buffer, and FlatBuffers may not work for those purposes, but we haven't run into it yet. One-copy from contiguous vs. discontinuous buffers into virtually contiguous GPU space is not problematic; SIMD processes that expect contiguous buffers and that we want zero-copy operations for are something we can address later.

At some point we may need some extension to the official spec, e.g. 64-bit offsets, arbitrary-length packed ints for offsets, and incrementally relative offsets.

64-bit offsets, Varints/LEBs offsets, etc.; https://github.com/google/flatbuffers/projects/10#card-14545298

Forking the lib and creating FlatBuffers64 and FlatSharp64 is eyeroll-worthy, but easy. From https://google.github.io/flatbuffers/flatbuffers_internals.html:

The most important and generic offset type (see flatbuffers.h) is uoffset_t, which is currently always a uint32_t, and is used to refer to all tables/unions/strings/vectors (these are never stored in-line). 32bit is intentional, since we want to keep the format binary compatible between 32 and 64bit systems, and a 64bit offset would bloat the size for almost all uses. A version of this format with 64bit (or 16bit) offsets is easy to set when needed. Unsigned means they can only point in one direction, which typically is forward (towards a higher memory location). Any backwards offsets will be explicitly marked as such.

Nested FlexBuffers support up to 64-bit sizing, strangely enough. I'm not sure how that would even be representable.

Looks like a reasonable way to shoe-horn 64-bit offset support would be to add an attribute for it per table. Topic for another issue.

jamescourtney commented 3 years ago

5.5.0 is published on NuGet.

Astn commented 3 years ago

Very cool @jamescourtney !!

jamescourtney commented 3 years ago

Full docs are linked here, if you need them: https://github.com/jamescourtney/FlatSharp/releases/tag/5.5.0

Let me know how it goes for you, @Astn