So, I did at one point prototype and build value-type structs, but benchmarks were actually in favor of the class-based approach, so I abandoned that branch. I'm happy to revive it at some point if there is a demonstrable reason to add all of that complexity.
One thing that immediately jumps out to me is that you're using Greedy deserialization. That will definitely put a lot of GC pressure on your application, especially if your vectors are large. Have you tried using Lazy or PropertyCache? Those should amortize the GC hit more evenly.
I haven't yet tried those other deserialization options, since your docs seemed to indicate that Greedy would be better if you're using all the data. My use case always needs all the data every time.
I'll give it a shot though.
Let me know how it goes. FWIW -- the docs aren't really geared around 20k item vectors. My team at Microsoft uses this and we have similarly sized vectors (using FlatBuffers inside a file), and I observed the same GC hit. Switching to Lazy resolved it completely.
The problem with big vectors and Greedy deserialization is:
The docs say Greedy is fastest because the cost for accessing a single property in Greedy mode is really cheap, but the semantics of Greedy force allocations before you need them. This is usually fine for small buffers. Lazy/PropertyCache are slightly slower, but don't allocate anything before you ask for it. So if you're looping through this list of 20k items, you'll have a few in memory at a time, but they'll all get swept up in Gen0 since the references are ephemeral.
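For reference, the mode is just a schema attribute on the table; a hypothetical example (table and field names made up here):

// Hypothetical schema: switching the generated serializer from greedy to lazy
// is a one-attribute change; application code stays the same.
table SomeTable (fs_serializer:lazy)
{
    Items : [SomeStruct];
}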
I gave the lazy route a shot, and it didn't really help things in my use case.
I'm loading these objects and then sending the whole array of them over the network. So as quickly as the Grpc send function can iterate the array they are being de-serialized and then immediately serialized again. I think this makes the Greedy mode more efficient right now.
I haven't had a chance to dig around that branch you linked, I'll see if I can take a peek tomorrow.
It's got me wondering if there would be a way to reuse the same instance of one of the array objects if it was accessed through an iterator. Maybe have a way to specify an allocator function that's used to feed the iterator; then you would have the option of having a new object each time, or reusing one you're hanging on to.
> I gave the lazy route a shot, and it didn't really help things in my use case.
Thanks for trying!
> I'm loading these objects and then sending the whole array of them over the network. So as quickly as the Grpc send function can iterate the array they are being de-serialized and then immediately serialized again. I think this makes the Greedy mode more efficient right now.
Are you doing any sort of translation to a different schema? I'm wondering why you need to deserialize at all if you're just piping the data around. Also -- wouldn't gRPC streaming work well for this? It doesn't seem like they all need to be in the array at once.
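For what it's worth, FlatBuffers lets you declare a streaming method right in the fbs; a hypothetical sketch (service and type names are invented here, and the streaming attribute is the standard FlatBuffers gRPC one):

// Hypothetical: stream items one at a time instead of batching them into one big array.
rpc_service ItemStorage
{
    GetItems(ItemsRequest) : Item (streaming: "server");
}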
> It's got me wondering if there would be a way to reuse the same instance of one of the array objects if it was accessed through an iterator. Maybe have a way to specify an allocator function that's used to feed the iterator; then you would have the option of having a new object each time, or reusing one you're hanging on to.
A way to "repoint" an object at a different spot in the buffer is an interesting idea that probably merits some deeper consideration. It's a little tricky given that FlatSharp's whole mantra is "subclass property implementations", so your application-level code has no knowledge of the actual class that it's dealing with, only the parent class.
I think what you'd probably want to do is give the ISerializer instance that you're using access to an object pool that it tries to consult before resorting to new. Your code would then be on the hook for returning objects to the pool when they are finished. In the worst case, your code never returns anything to the pool and the behavior is exactly as it is today, with the GC needing to clean everything up.
I'd need to think carefully about whether I want to actually implement this pool in a non-naive way and deal with all of the associated pitfalls of pooling memory, or just chuck it into an interface and let people like you (ie, those who care deeply) control it.
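Purely to make the shape of that idea concrete, something like this (hypothetical, not an actual FlatSharp interface):

// Hypothetical pooling hook that the ISerializer could consult before calling 'new'.
// Nothing here is a real FlatSharp API.
public interface IDeserializedObjectPool
{
    // Attempts to take a recycled instance; returns false so the serializer falls back to 'new'.
    bool TryRent<T>(out T item) where T : class;

    // The application hands objects back here when it is finished with them.
    void Return<T>(T item) where T : class;
}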
In any case, this is the most interesting area of thought for me with FlatSharp in quite some time.
After sleeping on this, I'm sort of leaning the other way.
I think I'd like to make this work by letting the user define pooling properties in the fbs file (or C# if using attributes):
table SomeTable (fs_pool_size:"100") { ... } // keep 100 items
struct SomeStruct (fs_pool_size:"300") { ... }
This would translate into the normal Dispose pattern for C#:
[FlatBufferTable(PoolSize = 100)]
public partial class SomeTable : IDisposable // Use Dispose semantics for returning to the pool
{
    ~SomeTable() { this.Dispose(false); }

    public void Dispose() { this.Dispose(true); GC.SuppressFinalize(this); }

    protected virtual void Dispose(bool disposing) { this.OnDisposing(disposing); }

    // let partial classes also have the option to dispose stuff.
    partial void OnDisposing(bool disposing);
}
The dynamic subclasses would then look like:
public class SomeTableReader : SomeTable
{
    // Declaring pool as static field makes it really fast and avoids a layer of Type indirection.
    private static readonly ConcurrentBag<SomeTableReader> pool = new();

    private volatile int free;

    // Default ctor is private.
    private SomeTableReader() { }

    public static SomeTableReader GetOrCreate(IInputBuffer buffer, int offset)
    {
        if (!pool.TryTake(out var reader))
        {
            reader = new SomeTableReader();
        }

        reader.Initialize(buffer, offset);
        return reader;
    }

    protected override void Dispose(bool disposing)
    {
        // Who should own thread safety? I'm not sure this accomplishes much because customers can still
        // do use-after-dispose, and we don't want to have locks everywhere. This only prevents double-dispose.
        if (Interlocked.CompareExchange(ref this.free, 1, 0) == 0)
        {
            base.Dispose(disposing);
            if (pool.Count < 100) // cap comes from the fs_pool_size attribute
            {
                this.Clear();
                pool.Add(this);
            }
        }
    }

    // reset this object to a clean state, release references so GC can reap them
    private void Clear() { Debug.Assert(this.free == 1); }

    // (re)initialize this object
    private void Initialize(IInputBuffer buffer, int offset) { this.free = 0; }
}
My main questions/concerns here are around thread safety. There are two main issues:

A non-thread-safety issue would be if people were running this in a heterogeneous environment. Would they want different pool sizes to somehow be configurable? A static attribute isn't very helpful there, though perhaps a value of -1 could indicate that the pool should grow until it is right-sized. Of course, if the workload is bursty it will hang onto extra items, though this isn't dissimilar to List<T>.
I perused that branch you posted, and I think it will take me reading through a bit more code to really grok how things are working. I was a bit busy today, so I may have more time tomorrow.
I'm liking your disposable idea for returning items to the pool, I may have some thoughts on it after I digest it more.
A few related ideas for chewing on.
I've spent some time on the IDisposable idea today. I moved away from strictly using IDisposable and ended up extending an interface that I already have by adding a Release method, which has the same semantics as what I talked about up above. I'm a little leery of IDisposable since it implies some sort of native resource under the hood that needs disposal, and most other object pool APIs use Rent/Return semantics.
var parsed = SomeTable.Serializer.Parse(buffer);
...
if (parsed is IFlatBufferDeserializedObject deserialized)
{
deserialized.Release();
}
I have most of this prototyped and it seems at least somewhat promising. However, there are some places where this becomes incompatible with other features (init-only property setters don't play nice with reusing the same object!), so I need to refactor my way out of this one.
I have a very beta build of the pooling behavior (as in it looks like it works, but I need to write actual tests). You can grab it from the CI build here (look under artifacts): https://github.com/jamescourtney/FlatSharp/actions/runs/902280796. Alternatively, you can clone the 'objectPool' branch and build yourself.
To enable pooling, you've got to do two things:
1) Add the fs_pool attribute to your fbs (I'm not settled on this yet):
table SomeTable (fs_pool) { }
2) Actually recycle your objects after you're done with them:
var parsed = SomeTable.Serializer.Parse(buffer);
...
if (parsed is IFlatBufferDeserializedObject deserialized)
{
deserialized.Release();
}
When the fs_pool attribute is specified, returned items are sent to a ConcurrentBag after you release them. Release is a no-op when fs_pool is not specified. There is some tracking for double-release and use-after-release, though it is not sophisticated. You can enable better error messages for these scenarios by setting:
FlatSharpGlobalSettings.CollectPooledObjectStackTraces = true;
I'd love to know if this helps your scenario or not.
Very cool, and thanks for helping on this!
I'll set up a benchmark to test. Might take me a few hours.
So, I ran a few benchmarks myself. ConcurrentBag is apparently pretty slow. I've just pushed a new version that switches to [ThreadStatic] and an old-fashioned Queue (https://github.com/jamescourtney/FlatSharp/actions/runs/904207394).
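For reference, a minimal sketch of the [ThreadStatic] + Queue shape (illustrative names only; the generated code differs):

using System;
using System.Collections.Generic;

// Illustrative sketch of a per-thread pool: no locks, no work-stealing, no contention.
public static class ThreadLocalPool<T> where T : class, new()
{
    [ThreadStatic]
    private static Queue<T> queue;

    public static T Rent()
    {
        var q = queue ??= new Queue<T>();
        return q.Count > 0 ? q.Dequeue() : new T();
    }

    public static void Return(T item)
    {
        var q = queue ??= new Queue<T>();
        if (q.Count < 100) // analogous to the fs_pool_size cap
        {
            q.Enqueue(item);
        }
    }
}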
Here are the rough benchmarks for a full traversal of this schema (20k items)
table SomeTable (fs_serializer:Lazy) {
Points : [Vec3];
}
struct Vec3 (fs_nonVirtual, fs_pool) {
X : float;
Y : float;
Z : float;
}
[Benchmark]
public int ParseAndTraverse()
{
    var t = SomeTable.Serializer.Parse(this.inputBuffer);

    int sum = 0;
    var points = t.Points;
    int count = points.Count;

    for (int i = 0; i < count; ++i)
    {
        var item = points[i];
        sum += (int)(item.X + item.Y + item.Z);
        ((IFlatBufferDeserializedObject)item).Release();
    }

    return sum;
}
Pooled | Mode | Object Pool | Full Traversal Time (us) |
---|---|---|---|
No | Greedy | (none) | 715 |
No | Lazy | (none) | 214 |
Yes | Greedy | ConcurrentBag | 1000 |
Yes | Lazy | ConcurrentBag | 1100 |
Yes | Greedy | ThreadLocal<Queue> | 774 |
Yes | Lazy | ThreadLocal<Queue> | 691 |
Yes | Greedy | [ThreadStatic] + Queue | 630 |
Yes | Lazy | [ThreadStatic] + Queue | 513 |
Yes | Greedy | [ThreadStatic] + Stack | 616 |
Yes | Lazy | [ThreadStatic] + Stack | 522 |
My commentary on these results is:
- ConcurrentBag is really slow. I assume some of this is because it implements some work-stealing from the thread-local storage of other threads, so the total number of items is constrained.
- Stack might perform better than Queue due to a higher likelihood of cache hits, but the results were a wash.

This seems to require a bit more tinkering. Would you mind just giving me a simple little benchmark that reflects your scenario so I can play with it some more on my own?
I've had quite a day, and haven't been able to work much today so far.
Here is the core of my FlatBuffers definition. Just dropping it here so you don't have to wait on it any longer. I'll try to get that benchmark showing a use case going ASAP.
struct Vector4
{
x:float;
y:float;
z:float;
w:float;
}
struct Vector3Int
{
x:int32;
y:int32;
z:int32;
}
struct Vector3
{
x:float;
y:float;
z:float;
}
struct Vector2Int
{
x:int32;
y:int32;
}
struct Vector2
{
x:float;
y:float;
}
struct Color
{
r: ubyte;
g: ubyte;
b: ubyte;
a: ubyte;
}
struct Voxel
{
VoxelType: ubyte;
SubType : ubyte;
Hp:ubyte;
Unused:ubyte;
}
table Mesh (fs_serializer:greedy)
{
vertices: [Vector3];
normals: [Vector3];
uv: [Vector2];
color: [Color];
triangles: [ushort];
}
table VoxelRegion3D (fs_serializer:greedy)
{
location: Vector3Int;
iteration: uint32;
size: ushort;
voxels: [Voxel];
}
But to summarize the use case: any time a value in the voxels[] vector is changed, the mesh needs to be rebuilt. This results in:
I've added a use case benchmark to https://github.com/jamescourtney/FlatSharp/pull/163
Thanks for the detail and the branch! I will take a look this evening.
Couple of comments/questions that might help:
> - Deserialize VoxelRegion3d
> - Modify voxel
> - Generate new Mesh
> - Serialize VoxelRegion3d
> - Save VoxelRegion3d
Version 5.3 of FlatSharp includes support for write-through properties to the underlying buffer. If you were to create a MemoryInputBuffer based on a memory-mapped file, you could potentially combine all of these steps into one. This allows you to do an in-place update to the existing buffer without a full parse/re-serialize, and automatically flush that to disk. FlatSharp doesn't do anything with memory-mapped files on its own, but it should be possible; https://github.com/dotnet/runtime/issues/24805 might be able to help (a rough sketch follows the constraints below). There are a few constraints here:
- You need to use VectorCacheMutable deserialization (Lazy/PropertyCache could maybe be extended to support this by adding mutable versions of those, but Greedy is a nonstarter since it does not maintain a reference to the input buffer).
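To make the memory-mapped idea above concrete, here is a rough, untested sketch. It assumes you wrap the mapped view in a custom MemoryManager<byte> (MappedMemoryManager is not a FlatSharp type; it's the kind of glue discussed in the dotnet/runtime issue linked above) and then feed the resulting Memory<byte> to MemoryInputBuffer:

using System;
using System.Buffers;
using System.IO.MemoryMappedFiles;

// Hypothetical glue type: exposes a memory-mapped view as Memory<byte> so that
// fs_writeThrough mutations land directly in the file.
public sealed unsafe class MappedMemoryManager : MemoryManager<byte>
{
    private readonly MemoryMappedViewAccessor accessor;
    private readonly byte* pointer;
    private readonly int length;

    public MappedMemoryManager(MemoryMappedViewAccessor accessor, int length)
    {
        this.accessor = accessor;
        this.length = length;

        byte* ptr = null;
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
        this.pointer = ptr;
    }

    public override Span<byte> GetSpan() => new Span<byte>(this.pointer, this.length);

    public override MemoryHandle Pin(int elementIndex = 0) => new MemoryHandle(this.pointer + elementIndex);

    public override void Unpin() { }

    protected override void Dispose(bool disposing)
    {
        this.accessor.SafeMemoryMappedViewHandle.ReleasePointer();
        if (disposing)
        {
            this.accessor.Dispose();
        }
    }
}

// Usage sketch (file name is made up; the Parse/MemoryInputBuffer usage here is untested):
// using var mmf = MemoryMappedFile.CreateFromFile("region.bin");
// using var view = mmf.CreateViewAccessor();
// using var manager = new MappedMemoryManager(view, (int)view.Capacity);
// var region = VoxelRegion3D.Serializer.Parse(new MemoryInputBuffer(manager.Memory));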
> - Load already saved mesh
> - Send over Grpc to client
Is this gRPC call a batch mode or streaming mode? If you're using gRPC streaming you may be able to get away from loading everything into a giant array, though you mentioned some GPU processing as well which might be driving this requirement.
Hey -- so I've tinkered with this, and managed to speed things up a bunch using the write through option I mentioned above. Here's the original bench you shared:
Method | ParseOption | Mean | Error | StdDev |
---|---|---|---|---|
SendRegionToClient | Lazy | 1.925 ms | 0.3082 ms | 0.0169 ms |
SendVisibleRegionsToClient | Lazy | 755.228 ms | 169.9860 ms | 9.3175 ms |
SendVisibleMeshesToClient | Lazy | 743.959 ms | 3.6774 ms | 0.2016 ms |
SendMeshToClient | Lazy | 2.906 ms | 2.0721 ms | 0.1136 ms |
ModifyMeshAndSendToClients | Lazy | 29.511 ms | 41.8578 ms | 2.2944 ms |
SendRegionToClient | GreedyMutable | 2.922 ms | 1.6592 ms | 0.0909 ms |
SendVisibleRegionsToClient | GreedyMutable | 1,174.830 ms | 703.3006 ms | 38.5503 ms |
SendVisibleMeshesToClient | GreedyMutable | 1,199.901 ms | 306.4329 ms | 16.7966 ms |
SendMeshToClient | GreedyMutable | 5.925 ms | 5.9447 ms | 0.3258 ms |
ModifyMeshAndSendToClients | GreedyMutable | 27.656 ms | 11.5352 ms | 0.6323 ms |
After changing it to writethrough, these turn into:
Method | Mean | Error | StdDev |
---|---|---|---|
SendRegionToClient | 93.78 us | 6.696 us | 0.367 us |
SendVisibleRegionsToClient | 38,177.58 us | 5,279.602 us | 289.393 us |
SendVisibleMeshesToClient | 35,542.62 us | 47,627.519 us | 2,610.625 us |
SendMeshToClient | 1,127.36 us | 508.945 us | 27.897 us |
ModifyMeshAndSendToClients | 39,161.27 us | 10,160.844 us | 556.950 us |
This looks to be a speedup of a couple of orders of magnitude for a lot of these tests. The downside is that I had to overallocate some of the arrays to accommodate the variable number of items in the mesh vectors. Whether this works for you or not I couldn't say.
// Update fillSize to accommodate a max of (fillSize * 3). Some items may be null.
Mesh mesh = new Mesh
{
    color = new Color[fillSize * 3],
    normals = new Vector3[fillSize * 3],
    triangles = new ushort[fillSize * 3],
    uv = new Vector2[fillSize * 3],
    vertices = new Vector3[fillSize * 3]
};
You can find the code I used here: https://github.com/jamescourtney/FlatSharp/tree/voxelBench/src/Benchmarks/ExperimentalBenchmark
Yours are largely unchanged, and mine are copied and named Modified.
Very cool! I'll check it out now!
I've read through your modified version and am stunned by how much better it is. It will take me a while to comprehend why the changes you made were so impactful! Over allocating the mesh is not a problem. I use lz4 on each byte[] before it gets persisted. I'm not sure if that messes up your ideas with memory mapped files though. I've been storing everything using rocksdb. Maybe some things would be better in memory mapped files though. But the lz4 compression on over allocated buffers and regions of voxels is really great.
Cool! Glad I could help :) I'm going to drop the object pooling approach for now since I've thought of some things about it that I dislike. What might be possible is an additional parse API that reuses the same object graph when possible:
((IFlatBufferDeserializedObject)something).LoadFrom(byte[] buffer)
The short version of why it helps is that the fs_writeThrough attribute makes the mutations directly in the underlying buffer, so you're saving a fortune on copies. The other thing is that VectorCacheMutable lazily initializes the items as they are read. If you actually read through the serializer code that FlatSharp spits out, you'll see something like this (which I've lightly annotated):
// This is the 'x' property of the Vec3 structure.
// The base class is virtual and FlatSharp overrides it to speak FlatBuffer.
public override System.Single x
{
    get
    {
        // Test to see if it's already in memory.
        if ((this.__mask0 & (byte)1) == 0)
        {
            // If not, read it and update the bit mask.
            this.__index0Value = ReadIndex0Value(this.__buffer, this.__offset, default, default);
            this.__mask0 |= (byte)1;
        }

        return this.__index0Value;
    }
    set
    {
        // Set the value of the backing field.
        this.__index0Value = value;

        // Update the mask to indicate that this value is now in memory and doesn't need to be pulled from the buffer.
        this.__mask0 |= (byte)1;

        // fs_writeThrough injects this line, which writes the new value back to the underlying buffer.
        WriteIndex0Value(this.__buffer, __offset, value);
    }
}
There are two key parts of this: the getter only pulls a value from the buffer the first time it is accessed (then caches it in the backing field), and the setter writes the new value straight back into the underlying buffer.
You can ignore what I said about files -- I was assuming you were storing your Flatbuffers directly on disk. I've used RocksDb before. You might want to be careful about using LZ4 yourself unless you've explicitly disabled compression in RocksDb. It's been a few years, but I recall that it has Snappy and/or LZ4 linked in.
By the way -- I did push another update to the voxelBench branch that modestly improves the perf from before, mostly because I dropped that IsNull property from the structs and just added a length property to the Mesh. This avoids the alignment padding issues when you have a trailing byte on a struct that you're storing in a vector, and will save quite a bit of space.
Thanks for helping me understand the changes. And I'm pulling in your updates.
I'm going to read through more of the code and see if I can wrap my head around it.
I managed to come up with a model I liked better for object pooling (calling it Recycling now):
// Traverses the full object graph and recycles poolable objects. Mesh is set to null on completion.
this.meshSerializer.Recycle(ref mesh);
This adds another large speedup over what I did earlier:
Method | WriteThrough + Recycle | WriteThrough | Baseline (lazy allocation) |
---|---|---|---|
SendRegionToClient | 12.65 us | 93 us | 1925 us |
SendVisibleRegionsToClient | 5,361.71 us | 38,177 us | 755,000 us |
SendVisibleMeshesToClient | 5,459.84 us | 35,542 us | 743,000 us |
SendMeshToClient | 313.27 us | 1,127 us | 2,906 us |
ModifyMeshAndSendToClients | 22,879.78 us | 39,161 us | 29,511 us |
Those changes are pushed now as well. Again -- I should stress that the Recycle changes are experimental and very much use-at-your-own-risk for the moment until I can get a full suite of tests built around it.
Very cool!!! Pulling it down.
Hi there -- I've pushed one final change to that branch for you. The main change is that I've yanked out all of the code for object pooling.
The good news is that I've replaced it with something better and simpler (for you, at least). I added a new serialization mode: LazyWriteThrough.
I'd encouraged you to use VectorCacheMutable in the past because it enabled write-through semantics. It could do this for a couple of reasons:
- Each time you access foo.bar[3].baz.bat[2], you get the same object instance back. It did this by preallocating all vectors and filling them with stubs (then gradually filling those stubs in as you accessed the properties).
- Writes went back to the buffer and also got stored on the object itself (which is why it is important that there is only one object per FlatBuffer element).

This was a big win for you because it saved so much deserialize/parse work. However, there was still a ton of array allocation happening to fill these big arrays with stubs. The work I did on object pooling helped a little bit, but there were still problems.
All in all, I was feeling uneasy about the approach, which is usually a sign I need to rethink things.
Switching gears, FlatSharp's Lazy mode avoids array allocations altogether: accessing foo.bar[2].baz.bat[3] twice gives you two different instances that point at the same spot in the buffer. Lazy also disallows all mutations. The great thing about Lazy is that if you're only referencing objects ephemerally, they can all be scooped up in Gen0 before the "expensive" GC kicks in. The lack of huge vectors also keeps GC at bay.
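To illustrate (reusing the SomeTable/Points names from the benchmark earlier in this thread):

// Illustrative only: in Lazy mode each access allocates a fresh, short-lived wrapper
// over the same underlying bytes, so nothing survives past Gen0 unless you hold on to it.
var first = parsed.Points[0];
var second = parsed.Points[0];
Debug.Assert(!object.ReferenceEquals(first, second));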
The new mode LazyWriteThrough combines Lazy with fs_writeThrough properties:
- Mutations are allowed, but only on fs_writeThrough properties. Table properties and those with fs_writeThrough disabled will still throw exceptions.
- Write-through fits naturally with Lazy since they don't cache the values anyway.
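A rough sketch of the usage pattern this enables, reusing the voxel schema from earlier in the thread (this assumes the Voxel fields are declared with fs_writeThrough and the table uses the new LazyWriteThrough serializer; the exact attribute placement may differ):

// Sketch only -- untested.
var region = VoxelRegion3D.Serializer.Parse(buffer);
var voxel = region.voxels[0];

// With Lazy + write-through, this assignment goes straight back into 'buffer';
// no stub vectors are allocated and no re-serialize is needed.
voxel.Hp = 0;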
So finally, benchmarks!
For context, here are the results of the original one you uploaded (I made a small tweak to stop allocating new byte[] for the fake network buffers and use a static one instead, since we're benchmarking FlatSharp and not the CLR allocator):
Method | ParseOption | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated | Completed Work Items | Lock Contentions |
---|---|---|---|---|---|---|---|---|---|---|
SendRegionToClient | Lazy | 1.498 ms | 0.0211 ms | 0.0012 ms | 121.0938 | - | - | 1.95 MB | 0.0039 | - |
SendVisibleRegionsToClient | Lazy | 587.325 ms | 20.9992 ms | 1.1510 ms | 48000.0000 | - | - | 781.33 MB | 2.0000 | - |
SendVisibleMeshesToClient | Lazy | 591.809 ms | 25.5498 ms | 1.4005 ms | 48000.0000 | - | - | 781.33 MB | 2.0000 | - |
SendMeshToClient | Lazy | 1.618 ms | 0.0699 ms | 0.0038 ms | 232.4219 | - | - | 3.71 MB | 0.0039 | - |
ModifyMeshAndSendToClients | Lazy | 29.721 ms | 5.3744 ms | 0.2946 ms | 1093.7500 | 656.2500 | 250.0000 | 19.93 MB | 0.0625 | - |
SendRegionToClient | GreedyMutable | 2.311 ms | 1.5809 ms | 0.0867 ms | 101.5625 | 62.5000 | 23.4375 | 1.95 MB | 0.0078 | - |
SendVisibleRegionsToClient | GreedyMutable | 914.440 ms | 40.3927 ms | 2.2141 ms | 41000.0000 | 25000.0000 | 10000.0000 | 781.35 MB | 2.0000 | - |
SendVisibleMeshesToClient | GreedyMutable | 905.911 ms | 477.1683 ms | 26.1552 ms | 41000.0000 | 25000.0000 | 10000.0000 | 781.35 MB | 2.0000 | - |
SendMeshToClient | GreedyMutable | 4.317 ms | 1.3701 ms | 0.0751 ms | 242.1875 | 148.4375 | 54.6875 | 4.15 MB | 0.0156 | - |
ModifyMeshAndSendToClients | GreedyMutable | 26.734 ms | 1.8844 ms | 0.1033 ms | 781.2500 | 468.7500 | 156.2500 | 16.03 MB | 0.0625 | - |
You can see how much the GC was running and the amount of data being allocated. Notice that though Lazy and Greedy allocated the same amount of data, Lazy was faster since it was collected in concurrent Gen0 collections instead of blocking Gen2 collections.
Switching to VectorCacheMutable with write-through helped:
Method | Mean | Error | StdDev | Completed Work Items | Lock Contentions | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|
SendRegionToClient | 4.147 us | 0.0810 us | 0.0044 us | 0.0000 | - | 0.0076 | - | - | 184 B |
SendVisibleRegionsToClient | 1,671.888 us | 23.1648 us | 1.2697 us | 0.0039 | - | 3.9063 | - | - | 73600 B |
SendVisibleMeshesToClient | 1,726.317 us | 61.3808 us | 3.3645 us | 0.0039 | - | 3.9063 | - | - | 73600 B |
SendMeshToClient | 112.981 us | 3.5482 us | 0.1945 us | 0.0002 | - | - | - | - | 224 B |
ModifyMeshAndSendToClients | 29,270.213 us | 638.2455 us | 34.9844 us | 0.0625 | - | 1468.7500 | 875.0000 | 312.5000 | 24577725 B |
However, ModifyMeshAndSendToClients actually got worse because of the extra costs of the stub objects and allocating vectors each time. The Gen2 numbers bear that out. LazyWriteThrough addresses all of these problems:
Method | Mean | Error | StdDev | Completed Work Items | Lock Contentions | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|
SendRegionToClient | 4.090 us | 0.0466 us | 0.0026 us | 0.0000 | - | 0.0076 | - | - | 152 B |
SendVisibleRegionsToClient | 1,765.334 us | 63.8413 us | 3.4994 us | 0.0039 | - | 1.9531 | - | - | 60800 B |
SendVisibleMeshesToClient | 1,809.066 us | 24.1105 us | 1.3216 us | 0.0039 | - | 1.9531 | - | - | 60800 B |
SendMeshToClient | 111.914 us | 6.4346 us | 0.3527 us | 0.0002 | - | - | - | - | 176 B |
ModifyMeshAndSendToClients | 12,675.831 us | 246.5131 us | 13.5122 us | 0.0313 | - | 1234.3750 | - | - | 20675816 B |
There are no Gen1 or Gen2 collections any longer (Gen0 is busy, but that is cheaper than fancy object pooling logic).
I will try and get FlatSharp version 5.4.0 published at some point this week. There is some documentation and samples that need to happen before that. I hope this helps you.
FlatSharp version 5.4.0 is published with support for Lazy + WriteThrough: https://github.com/jamescourtney/FlatSharp/releases/tag/5.4.0
Let me know if you need anything else.
Wow! Fantastic stuff here @jamescourtney. Really Really great stuff. Thanks and thanks again!
@Astn are you making use of the valueStructs branch to support struct Vector types, or are you just translating during your modify/generate steps?
I have Vector2, Vector3, Vector4 vectors as well as Matrix4x4s that at present need to be translated into the class types to support FlatBuffers interop.
I'd like to have an option to alias them to native/intrinsic CLR types as well as have them represented as value types, perhaps by an attribute or something (e.g. MyNamespace.Vector3f -> System.Numerics.Vector3).
@jamescourtney while FlatSharp would be slower in serialize/deserialize operations in this configuration, the copy and manipulation overhead and glue code would disappear.
> I'd like to have an option to alias them to native/intrinsic CLR types as well as have them represented as value types, perhaps by an attribute or something (e.g. MyNamespace.Vector3f -> System.Numerics.Vector3).
This exists today, but not when using FBS files. The feature is called type facades, and it allows you to define a higher-level type in terms of a lower-level one. Imagine you want FlatSharp to support DateTimeOffset. You can define a facade that maps from DateTimeOffset -> int64. The data is stored as int64 in the buffer, but the serialize/parse code maps that to DateTimeOffset by way of the facade. I've considered making the FlatSharp compiler extensible to support facades and custom type models, but have demurred because:
flatc.

> are you making use of the valueStructs branch to support struct Vector types, or are you just translating during your modify/generate steps?
I don't think that they are, though I could be wrong. That branch is just there for reference purposes and isn't being updated. Value structs did work, but had some significant drawbacks:
- Slower to parse/serialize.
- Required [StructLayout(LayoutKind.Explicit)].
- Added an entire new dimension to the test matrix. This is something I worry about because FlatSharp is a one-person project, and I don't work on it full time. I desperately want to avoid any data corruption issues and am generally conservative when adding new stuff unless there is unambiguous benefit. I've prototyped several features that never made it in for this reason (object pooling, value structs, etc.).
- C# guidance recommends structs be less than 16 bytes and be immutable, which doesn't give much wiggle room for interesting structs. You could argue that this isn't FlatSharp's decision to make, and I wouldn't fight you on it.
- Value structs are incompatible with other FlatSharp concepts (non-greedy deserialization, write-through, etc.).
> @jamescourtney while FlatSharp would be slower in serialize/deserialize operations in this configuration, the copy and manipulation overhead and glue code would disappear.
I get that. Let me sleep on the idea of bringing them back. If you really need structs, you do have the option of using the Google library, which does use structs. Of course, that may come with some other drawbacks.
@TYoungSL I am not using the valueStructs branch. I'm using the latest released version with the new features for Lazy + WriteThrough: https://github.com/jamescourtney/FlatSharp/releases/tag/5.4.0

The way I think about it is that FlatBuffers and FlatSharp are not giving me access to blittable memory for my vectors of structs, but the structs within a vector are effectively blittable with Lazy + writeThrough and still use value semantics when reading and writing.
The limitation here is I can't interact with the flatbuffers vector memory directly and have to go through FlatSharp to interact with it. So you will have to have a second copy of your data if you are going to do any SIMD or GPGPU work with it.
> So you will have to have a second copy of your data if you are going to do any SIMD or GPGPU work with it.
Yeah that's a lot of messy code. Would be easier to just access the vector within the buffer or to have a greedy struct model already in a ready state.
The vector data (indices, triangle lists) I'm working with is in the multi-gigabyte range.
Being able to request spans of value struct types (whether lazily or greedily) from a vector member would be ideal.
> Being able to request spans of value struct types (whether lazily or greedily) from a vector member would be ideal.
Do you mean a literal Span<T>?
> The vector data (indices, triangle lists) I'm working with is in the multi-gigabyte range.
Are these in one FlatBuffer? FlatSharp has a hard upper limit of 1GB or so per buffer. I could bump this to 2GB, but that's the max given the limits of int32 (Span<T> does not have indexers based on int64).
I do have a thought that might work for you. I'm not familiar with GPGPU, so please forgive my ignorance.
Right now, FlatSharp does define an interface called IFlatBufferDeserializedObject. Every deserialized object implements this interface, and it gives you access to a few things about the object, such as the IInputBuffer used to deserialize it.

What if I were to extend this to have two additional fields: AbsoluteOffset and Length? So imagine you had a struct that was logically a System.Numerics.Vector3. What you could do is:
void Process(FlatSharpVector3 vector)
{
    if (vector is IFlatBufferDeserializedObject deserialized)
    {
        int offset = deserialized.Offset;
        int length = deserialized.Length;

        Span<byte> data = deserialized.InputBuffer.GetByteMemory(offset, length).Span;
        System.Numerics.Vector3 numeric = MemoryMarshal.Cast<byte, Vector3>(data)[0];

        // something
    }
}
That would be very nice to be able to get span access to the vectors!
Just some fyi stuff for reference.
@Astn -- That would be really cheap to add. FlatSharp (of course) already knows all that information, but it doesn't expose it. Would that meaningfully improve your life?
Yeah, being able to access the backing buffer as a struct typed array is the ideal result.
The mesh data I'm dealing with is sliced into axis-aligned bounds, but one particularly large multi-mesh structure weighs in at over a hundred gigs. The 1GB/2GB limit may pose a problem in the future that we hadn't considered.
From peeking at @Astn's repo, he's working with terrain and voxel-to-mesh scenarios, so somewhat large scale too.
Being able to interop with ILGPU or .NET SIMD intrinsics as a fallback in these scenarios would be nice. Getting a ReadOnlySpan<byte> or Span<byte> that can be marshalled into Span<Vector3> or whatever would save a lot of glue code.
Also, the indices in the meshes are at largest 32-bit, so theoretically a single document object can be just over 16GB, though low-detail volumes are often even reduced to 16-bit indices.

We would probably subdivide the meshes further if reducing the document size to fit in a flatbuffer becomes necessary.
Unfortunately, there's nothing I can do about Span's int32 indexer limits. I really wish the CLR team would add a nint overload for the indexer and the Slice method.
I'll see about getting #175 and this addressed in the next week or so. Thanks for the discussion here @TYoungSL and @Astn . Hopefully we've arrived at a place where you guys are unblocked and I'm not extending FlatSharp in unnecessary ways.
> @Astn -- That would be really cheap to add. FlatSharp (of course) already knows all that information, but it doesn't expose it. Would that meaningfully improve your life?
Very much so! I have more than a few use cases where the data in a vector needs to be sent directly to the GPU, or to some unmanaged code.
The way to work around the Span limitation is to provide either a sequence of multiple Spans or allow providing an offset to the start pointer/reference; a span getter that accepts an offset and a length.
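Something like this hypothetical shape, for example (not an existing FlatSharp API):

// Hypothetical sketch of the suggested accessor.
public interface ISlicedInputBuffer
{
    // Returns a window of the underlying buffer, letting the caller walk a large
    // vector in chunks that each fit within Span's int32 limits.
    Span<byte> GetSpan(long absoluteOffset, int length);
}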
> The way to work around the Span limitation is to provide either a sequence of multiple Spans or allow providing an offset to the start pointer/reference; a span getter that accepts an offset and a length.
That would work if that was the only limit, but FlatBuffers also starts to run into some internal 32-bit limits as well. Basically, the FlatBuffer format uses int32/uint32 offsets a lot internally (these are relative offsets; however, FlatSharp likes to know absolute offsets). Here's one example of a table that would be un-serializable in FlatBuffers:
table Table
{
Vector1 : [ulong]; // int.maxValue elements at 8 bytes / element => 16 GB
Vector2 : [ulong]; // same
}
Tables store uint32 byte offsets to their non-struct fields, but in this case there is no offset that Vector2 could pick to get the right address. You'd have to change your definition to be a vector of Tables in this case, with each one being a fixed size. In which case, it's just as easy to model it as a series of independent FlatBuffers.
Hopefully this explains why I picked the 1GB limit for FlatSharp.
I've added a new interface, IFlatBufferAddressableStruct:

public interface IFlatBufferAddressableStruct
{
    int Offset { get; }
    int Size { get; }
    int Alignment { get; }
}
Deserialized classes implement this interface when:
Consider a vector of Vec3 structs:

struct Vec3 { x : float; y : float; z : float }
table SomeTable (fs_serializer:"lazy") { Points : [Vec3]; }
var parsed = SomeTable.Serializer.Parse(buffer);

// grab a reference to the first point and the length. This is all we need from FlatSharp's deserializer.
Vec3 vec = parsed.Points[0];
int length = parsed.Points.Count;

if (vec is IFlatBufferAddressableStruct @struct)
{
    int offset = @struct.Offset;
    int size = @struct.Size;
    int alignment = @struct.Alignment;

    System.Numerics.Vector3 vec3 = default;
    for (int i = 0; i < length; ++i)
    {
        // cast the input buffer into the SIMD-capable structure and increment the existing vector.
        vec3 += AsVec3(buffer, offset, size);

        // Advance offset and compensate for alignment differences. Vec3 won't have this problem, but
        // jagged structs might.
        offset += size;
        offset += SerializationHelpers.GetAlignmentError(offset, alignment);
    }
}

static System.Numerics.Vector3 AsVec3(Memory<byte> memory, int offset, int length)
{
    return MemoryMarshal.Cast<byte, System.Numerics.Vector3>(memory.Span.Slice(offset, length))[0];
}
> That would work if that was the only limit, but FlatBuffers also starts to run into some internal 32-bit limits as well. Basically, the FlatBuffer format uses int32/uint32 offsets a lot internally (these are relative offsets; however, FlatSharp likes to know absolute offsets).
Relative int32/uint32 offsets are definitely a problem in FlatBuffers when there is a vector that goes over the limit. The use of absolute offsets is a problem you might be able to address as individual issues come up. I think we can break up individual documents/messages to the point that it's not a concern, though; accessing a significant chunk of the data at a time with a zero-copy span to hand off to a SIMD/GPGPU process is good enough.
There may be times where we'd need a large contiguous buffer, and FlatBuffers may not work for those purposes, but we haven't run into it yet. One copy from contiguous or discontinuous buffers into virtually contiguous GPU space is not problematic; SIMD processes that expect contiguous buffers, where we want zero-copy operations, are something we can address later.
At some point we may need some extension to the official spec, e.g. 64-bit offsets, arbitrary-length packed ints for offsets, and incrementally relative offsets.
64-bit offsets, Varints/LEBs offsets, etc.; https://github.com/google/flatbuffers/projects/10#card-14545298
Forking the lib and creating FlatBuffers64 and FlatSharp64 is eyeroll-worthy, but easy. From https://google.github.io/flatbuffers/flatbuffers_internals.html:
> The most important and generic offset type (see flatbuffers.h) is uoffset_t, which is currently always a uint32_t, and is used to refer to all tables/unions/strings/vectors (these are never stored in-line). 32bit is intentional, since we want to keep the format binary compatible between 32 and 64bit systems, and a 64bit offset would bloat the size for almost all uses. A version of this format with 64bit (or 16bit) offsets is easy to set when needed. Unsigned means they can only point in one direction, which typically is forward (towards a higher memory location). Any backwards offsets will be explicitly marked as such.
Nested FlexBuffers support up to 64-bit sizing, strangely enough. I'm not sure how that would even be representable.
Looks like a reasonable way to shoe-horn 64-bit offset support would be to add an attribute for it per table. Topic for another issue.
5.5.0 is published on nuget.
Very cool @jamescourtney !!
Full docs are linked here, if you need them: https://github.com/jamescourtney/FlatSharp/releases/tag/5.5.0
Let me know how it goes for you, @Astn
I'm not sure if I'm missing something, but my current experience is that when using large arrays of FlatBuffers structs, the generated code uses classes for these structs, which causes a huge amount of GC activity when trying to serialize and deserialize these arrays.
For example:
Produces "struct" code like this:
This generates code where each of the structs [Vector2, Vector3, Vector4] is a C# class object.
These arrays can each be 20,000 items. When they are arrays of structs, that can be a single allocation. When they are arrays of classes, it bogs down the GC pretty hard.
What should I be doing here to work around this? Am I missing something that lets me treat the structs as structs?
Also here is a screen shot from profiling.