jamescourtney / FlatSharp

Fast, idiomatic C# implementation of Flatbuffers

Apache License 2.0

510 stars 51 forks source link

Offset Model (8-bit, 16-bit, 64-bit offset sizes, file identifier sizes) #207

Closed TYoungSL closed 2 years ago

TYoungSL commented 3 years ago

Implements a new component for type models, OffsetModel to allow varying offset and file identifier sizes.

Bumps Microsoft.CodeAnalysis.CSharp.Workspaces and Microsoft.Net.Compilers versions from 3.8.0 to 3.11.0.

VTable offset sizes are not parameterized yet.

Messages can be extremely short when 16 and 8-bit offset sizes are specified.

Messages can (obviously) contain a lot more data when 64-bit offset sizes are specified.

Implemented by compiler options, --offset-size, --file-identifier-size and --strict-file-identifier-size.

This will be tested for parity against flatcc built with non-standard offset sizes. (https://github.com/dvidelabs/flatcc/pull/206)

StirlingLabs have a fork of Google's FlatBuffers called BigBuffers which also aims for parity with flatcc with 64-bit offset sizes.

codecov[bot] commented 3 years ago

Codecov Report

Merging #207 (5646f1a) into master (51e0f13) will decrease coverage by 0.00%. The diff coverage is 96.26%.

@@            Coverage Diff             @@
##           master     #207      +/-   ##
==========================================
- Coverage   94.68%   94.68%   -0.01%     
==========================================
  Files         104      105       +1     
  Lines        7584     7635      +51     
  Branches      677      681       +4     
==========================================
+ Hits         7181     7229      +48     
- Misses        300      303       +3     
  Partials      103      103

Impacted Files	Coverage Δ
src/FlatSharp/OffsetModel.cs	`91.66% <91.66%> (ø)`
src/FlatSharp/TypeModel/TypeModelContainer.cs	`86.20% <91.66%> (+1.20%)`	:arrow_up:
src/FlatSharp.Compiler/FlatSharpCompiler.cs	`95.20% <100.00%> (+0.40%)`	:arrow_up:
...ompiler/TypeDefinitions/TableOrStructDefinition.cs	`94.53% <100.00%> (+0.02%)`	:arrow_up:
src/FlatSharp/FlatBufferSerializer.cs	`91.95% <100.00%> (ø)`
src/FlatSharp/FlatBufferSerializerOptions.cs	`85.71% <100.00%> (+1.71%)`	:arrow_up:
src/FlatSharp/TypeModel/EnumTypeModel.cs	`91.95% <100.00%> (ø)`
src/FlatSharp/TypeModel/ItemMemberModel.cs	`90.54% <100.00%> (+0.12%)`	:arrow_up:
src/FlatSharp/TypeModel/NullableTypeModel.cs	`98.50% <100.00%> (ø)`
src/FlatSharp/TypeModel/RuntimeTypeModel.cs	`84.00% <100.00%> (+3.04%)`	:arrow_up:
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 51e0f13...5646f1a. Read the comment docs.

jamescourtney commented 3 years ago

Hey -- Now that I have v6 done, I'm interested in taking a look at this. Thanks for the suggestions here. There is a lot to think about with this change.

There a few things that stick out to me initially:

All of the places I've implicitly assumed that a uoffset / soffset is 4 bytes, or that a vtable offset is 2 bytes will need to be tracked down. Hopefully it isn't too many, but it's something to be careful about.
Should the setting be per-table or per-compilation? My preference, I think, is for the latter. That looks like what you're proposing in the flatcc PR as well.
Are you planning on making vectors support > 2^32 items? Or is this just a limit on the total size of the buffer? I'm worried about things like IList<T>, which only has an int32 indexer defined.

I've seen your BigSpan<T> repo, and while it seems useful, it's also a lot of unsafe code that hasn't been audited. If you look at the readme for this repo, the very first goal is: "Full safety (no unsafe code or IL generation -- more on that below)." Obviously, the CLR team and .NET runtime folks use a lot of unsafe code upon which FlatSharp depends, but that's table stakes for a C# project. That's not to say I'm opposed to using BigSpan, but it must be optional (ie, FlatSharp.Runtime cannot depend on it). The first thing that comes to my mind is defining some IOutputBuffer struct, but I don't think that will work since ref structs can't implement interfaces (to avoid boxing).

So, I'm totally game for figuring out what the bigger picture here looks like, but this will be a pretty fundamental change and I want to make sure we do due diligence up front.

jamescourtney commented 3 years ago

And finally -- what does this look like if/when flatc starts supporting big offsets in a year and they look different than what we prototype?

You might also be interested in this: https://github.com/dotnet/apireviews/tree/main/2016/11-04-SpanOfT#spant-and-64-bit

TYoungSL commented 3 years ago

BigSpans is an implementation of option 3 in that API review.

It's the .NET Span runtime library code made to work with (u)longs.

There's no good way around not using it (or a reimplementation of it) if you need to address a 64-bit address range.

If flatc implements something different at the binary formatting, whether the offsets stay 32-bit, become 64-bit, or are implemented as variable-sized ints, unless it becomes just many 32-bit chunks, the rearchitecting to support 64-bit data will need be done regardless.

making vectors support > 2^32 items

Yeah, we'll have some of that.

Should the setting be per-table or per-compilation?

Per-compilation is easier. Per-table can come later as an extension or otherwise.

All of the places I've implicitly assumed that a uoffset / soffset is 4 bytes, or that a vtable offset is 2 bytes will need to be tracked down. Hopefully it isn't too many, but it's something to be careful about.

That's what's holding us up with porting flatc in BigBuffers but it seems to be working, we're testing output versus flatcc which somewhat already supported variable sized offsets. There's still plenty of 2 * vtable offset size assumption in the codegen, this PR only addresses offset sizes. The tests need to all be updated to specify a range of offset sizes, as in StructVectorTests.cs.

jamescourtney commented 3 years ago

Thanks for the thoughts. I've been mulling this over today. Based on what I've seen, the flatc folks don't seem terribly eager to head down this path, but it looks like the flatcc devs are open to it.

It seems like there's an opportunity here to do more than just "64 bit flatbuffers". I think if we're going to diverge from the binary format of FlatBuffers, we shouldn't do it halfway. There's quite a bit of mindshare and expertise in this area between you, me, and the folks from flatcc. My suggestion is to take this opportunity to actually fork the FlatBuffer format and use it as the basis for some new, non-backwards-compatible format that looks a bit like FlatBuffers. The bonus is that we're not just bolting on big offsets to the current format, but we can actually take some time and address some shortcomings of it:

Removing alignment requirements -- alignment leads to plenty of awkward things in the current format, not to mention increased size. Modern CPUs don't care too much about unaligned data AFAIK.
Dynamically sized offsets with varints?
Built-in map structure. Sorted vectors are neat, but maps are pretty fundamental, and is something I miss when using FlatBuffers.
2D vectors.

This is, obviously, a big endeavor, but much of the code from FlatSharp, flatcc, flatc will be reusable. I'm interested in collaborating on a whitepaper for some new specification, and seeing if something compelling pops out of that discussion.

jamescourtney commented 3 years ago

Given the foundational changes we're discussing here, I think I want to do this outside of the core FlatSharp repo. FlatSharp is at this point pretty mature and stable, and I'd like it to remain FlatBuffer focused and have no dependencies outside of Roslyn / .NET standard library. I also don't know what the future looks like for FlatSharp; there aren't any immediate "things I want to add" on my list (which has happened before). That's not to say I consider it "finished" or have plans to abandon it. It's more an expression of the fact that FlatSharp is at feature parity with flatc and I don't think there is much performance left to squeeze out of it, given that I've spent the last few years turning that crank (though I would be delighted to be proven wrong!)

Anyway, my thoughts in summary are that...

BigBuffers should be considered a new format. We may be able to get away with using .fbs as the IDL.
We should write a whitepaper and formal specification for it. Binary formats are something I'm a stickler for, because they are really hard to change once you have data in the wild.
We should use BigBuffers (can we find a better name?) as an excuse to address some of the shortcomings in FlatBuffers
We'll fork Flat* and apply the changes there.

Rjvs commented 3 years ago

This all sounds reasonable. Happy to change the name of BigBuffers to describe the project better.

Rjvs commented 3 years ago

You might be interested in our changelist at https://github.com/StirlingLabs/BigBuffers/blob/main/docs/source/DifferencesFromFlat.md

@TYoungSL made alignment optional without harming compatibility against FlatBuffers but it would be better to have optional alignment of tables in the spec. We actually need alignment in many of our tables for SIMD but some users wouldn't have that requirement.