google / flatbuffers

FlatBuffers: Memory Efficient Serialization Library
https://flatbuffers.dev/
Apache License 2.0

Exposing flatbuffer compilation internals #6428

Closed CasperN closed 1 year ago

CasperN commented 3 years ago

The maintainers have had some conversations around a significant refactoring of flatc to expose more of its internals somehow.

However, we're not sure there are enough use cases to justify such an effort, or even to guide its technical direction, so I'm crowdsourcing ideas. Where does FlatBuffers tooling fall short, and can these shortcomings be addressed by exposing more of the information that currently exists in flatc? Also, how do people feel about the priority of this effort relative to other large projects, such as implementing the ideas in #5875 or #6053?

Some use cases:

Some methods

Some positive side effects

Some preferences / potential requirements

@dbaileychess @mikkelfj @aardappel @krojew @paulovap @adsharma

(I intend to keep this top comment up to date with discussion below)

CasperN commented 3 years ago

fwiw, I'm totally biased towards M3. protobuf (https://developers.google.com/protocol-buffers/docs/reference/other) and Cap'n Proto (https://capnproto.org/otherlang.html) do something like that: there's a CodeGenerator(Request|Response) exchange between the compiler and the generator.
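For readers unfamiliar with the protobuf plugin model, the exchange can be sketched roughly like this. This is a toy illustration only: the type names, the `objects` key, and the plugin function are all hypothetical, not an actual flatc or protobuf API.

```python
import json
from dataclasses import dataclass, field

@dataclass
class CodeGeneratorRequest:
    """What the compiler would hand to a generator plugin (hypothetical)."""
    schema: dict              # parsed, type-checked schema (e.g. decoded bfbs)
    parameters: str = ""      # plugin options passed through from the CLI

@dataclass
class GeneratedFile:
    name: str
    content: str

@dataclass
class CodeGeneratorResponse:
    files: list = field(default_factory=list)
    error: str = ""

def run_plugin(request: CodeGeneratorRequest) -> CodeGeneratorResponse:
    """A toy generator: emit one stub file per table in the schema."""
    resp = CodeGeneratorResponse()
    for obj in request.schema.get("objects", []):
        resp.files.append(GeneratedFile(
            name=obj["name"] + "_generated.py",
            content=f"# accessors for table {obj['name']}\n"))
    return resp

request = CodeGeneratorRequest(
    schema=json.loads('{"objects": [{"name": "Monster"}]}'))
response = run_plugin(request)
```

The key property of this design is that the compiler and generator only share a serialized contract, not C++ internals.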

adsharma commented 3 years ago

A combination of M3 and M2 might have a few extra benefits in addition to what you outline.


krojew commented 3 years ago

Exposing intermediate representation looks like a good idea - it's essentially what modern compilers do.

mikkelfj commented 3 years ago

I'm leaning towards M3 from a flatcc perspective. I believe M2 is useful, but it will not happen for flatcc, for portability reasons - scripting engines are not as portable as native code. The exception is if the scripting engine runs in an external process on top of M2; that would also allow easier interop between flatc and flatcc. Even with out-of-process scripting, flatcc will maintain an internal code generator for C for portability, but it might be rewritten to rely on M3. Except there is a chicken-and-egg problem for accessing the bfbs interface, so I am not fully convinced here, even if it is attractive.

To clarify. For flatcc and any language other than C, M3 is certainly of interest, and out of process M2 also, via M3.

CasperN commented 3 years ago

Except there is a chicken and egg problem for accessing the bfbs interface, so I am not fully convinced here even if attractive.

I mean, that's only true if you want to bootstrap the code generator for language X in that language, and I don't think that's such an important use case that we'd design around it. Even still, we can make the schema JSON-compatible and optionally output the IR in that form.

adsharma commented 3 years ago

For the flattools/Python case, I'm not sure a JSON representation is super useful. Parsing the source is easier (there is a ply-based Python parser that works well).

But a side effect of such a refactoring is that a standalone serde independent of the parser is generated and that sounds interesting.


mikkelfj commented 3 years ago

Parsing the source is not easier because parsing is only 10% of what a Flatbuffer compiler does before generating code.

adsharma commented 3 years ago

Sure - there is more work to do beyond pure parsing. What I was getting at is the somewhat sorry world of JSON Schema and validating that the input JSON is well formed. I much prefer parsing an fbs file to parsing JSON and then validating that it conforms to some JSON Schema.

CasperN commented 3 years ago

Sure - there is more work to do beyond pure parsing. What I was getting at is the somewhat sorry world of JSON Schema and validating that the input JSON is well formed. I much prefer parsing an fbs file to parsing JSON and then validating that it conforms to some JSON Schema.

I think this thread is a little off topic. One can assume that flatc would only output valid bfbs/JSON/$favorite_serde_format, and that the resulting schema has been type checked, name resolved, fits a predefined schema, and is good for code generation in all respects.

  • Integrate with projects like pyserde

How would this work? It seems like pyserde defines data types in python which may conflict with the flatbuffer schema. Which would be the source of truth?

adsharma commented 3 years ago


  • Integrate with projects like pyserde

How would this work? It seems like pyserde defines data types in python which may conflict with the flatbuffer schema. Which would be the source of truth?

fbs file will be the source of truth. flatc.py can generate a dataclass from the fbs file, which could be decorated with decorators provided by pyserde.
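To make the idea concrete: the output of such a hypothetical `flatc.py` might be an ordinary dataclass, which pyserde (or any similar library) could then decorate. Everything below is illustrative; the schema-to-dataclass mapping is an assumption, not a real flatc feature.

```python
from dataclasses import dataclass

# What a hypothetical `flatc.py` might emit for the schema:
#   table Monster { name: string; hp: int = 100; }
# In a real pyserde integration, this class would additionally be
# decorated with pyserde's @serde decorator to get (de)serializers.

@dataclass
class Monster:
    name: str
    hp: int = 100   # schema default carried over as a dataclass default

m = Monster(name="orc")
```

The fbs file remains the single source of truth; the generated dataclass is a derived artifact that can be regenerated at any time.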

aardappel commented 3 years ago

I personally would prefer M2, but only if we committed to cleaning up / converting all the C++ generators we have into said scripting language. If the C++ generators are going to continue being a thing, then M1/M3 are better.

M1 is kinda useless, since it still greatly restricts the languages a generator can be written in. It also requires us to deal with cross-platform dynamic linking, and with making sure there is a dll/so for each platform.. not much of an improvement over just statically linking it into flatc, imho.

M3 is definitely useful. One thing I wouldn't like is if we ended up with reflection.fbs and generator.fbs and they looked pretty similar. We have already enhanced reflection.fbs a few times in the past for the sake of external generators. I'd be in favor of extending reflection.fbs further, with a command-line flag that controls the level of information in it, as opposed to starting from scratch for the purpose of generators.

An extension of M2/M3 is if we took care of invoking scripts from flatc. Imagine you say flatc -gen foo.py schema.fbs, and flatc would first write out the .bfbs, find out if Python is installed, and run the script for you. This saves the user having to write a script that runs both flatc and Python, and possibly doing so for multiple platforms. Then again, I'm not sure we want to be in the business of running commands.

mikkelfj commented 3 years ago

Then again, not sure if we want to be in the business of running commands.

I imagine you pipe bfbs or json output from flatc into a script. And possible add a wrapper script to simplify things.
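A generator consuming piped schema output could be very small. The sketch below assumes a JSON-serialized schema with an `objects` array (an assumption about the format, not what flatc actually emits today), and reads from a file-like object so it works equally with stdin or a test buffer.

```python
import io
import json

def generate(schema_in, code_out) -> None:
    """Read a JSON-serialized schema from schema_in, write stub code to code_out.

    In real use this would be generate(sys.stdin, sys.stdout), invoked as
    something like:  flatc <json-schema flags> schema.fbs | python mygen.py
    (the exact flatc flags are left open here).
    """
    schema = json.load(schema_in)
    for obj in schema.get("objects", []):
        code_out.write(f"-- stub for {obj['name']}\n")

# Exercise the generator with an in-memory "pipe":
out = io.StringIO()
generate(io.StringIO('{"objects": [{"name": "Weapon"}]}'), out)
```

Because the generator only touches file objects, the same code works piped in a shell or embedded in a larger tool.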

mustiikhalil commented 3 years ago

One of the things that would actually be worth it is allowing flatc to integrate with already-existing generators, gRPC being an example. This is one of the points I talked about with the folks over at swift-grpc, since it would allow us to keep our code base consistent and not break each time there is a new version of gRPC.

fwiw, I'm totally biased towards M3. protobuf and Cap'n Proto do something like that: there's a CodeGenerator(Request|Response) exchange between the compiler and the generator.

which is what you suggested in M3 I believe

CasperN commented 3 years ago

One of the things that would actually be worth it, is allowing the flatc integrate with other already existing generators (GRPC) as an example

I think you mean M3 will allow the swift-grpc codebase to own the flatbuffers-swift-grpc generator, which will be invoked by flatc. Is that correct?

mustiikhalil commented 3 years ago

Yeah, basically. I am not sure about the intricate details, but I would assume that's how it's done in protobuf.

CasperN commented 3 years ago

I think that use case, or more generally "invoking pre-existing code generators", is a good argument in favor of M3 over M2. The latter requires tooling to be written in the scripting language, which probably prohibits this kind of integration. (Though, I guess M2 and M3 aren't totally mutually exclusive; we can have an IR and also use it with an integrated scripting language.)

aardappel commented 3 years ago

And possible add a wrapper script to simplify things.

Note that we are a cross-platform project. Maintaining and supplying, say, bash/cmd/powershell scripts is a pain, or requiring Windows users to have Python installed just to run flatc. Hence why making flatc do it directly has some value. But yes, I'd prefer to not be in the business of scripting commands at all.

CasperN commented 3 years ago

I agree, let's not get into the business of cross-platform invocations. I think we should make the decision to prioritize M3 over M2 - any objections? I think it's a good idea because M3 is easier than M2, doesn't actually prevent M2, and is needed for U7.

If we go with M3, I can see this as a viable path:

  1. Figure out how to extend reflection.fbs to support code generators
  2. Refactor the code generators so they take a const Schema & (or whatever comes from step 1) instead of a const Parser&
    • This is a necessary "proof of concept" to verify step 1.
    • we could go so far as to make the code generators separate binaries that are shipped and invoked by flatc but that seems unnecessary.
  3. ???
  4. Start investing in new tooling?

And the next step for this conversation is to identify how reflection.fbs falls short for existing code generators. Off the top of my head...

adsharma commented 3 years ago

requiring Windows users to have Python installed just to run flatc

Agree that this is a significant pain point for any scripting language.

I've made some progress writing a transpiler that generates Rust from a small subset of Python. That way we could ship self-contained binaries.

The transpiler still has a long way to go, but something to think about


mikkelfj commented 3 years ago

@CasperN Agree on M3. I'd like to see the output sent to stdout and fbs accepted on stdin, at least as an option. This allows for piping, and it makes flatc and flatcc mostly interchangeable, save for the actual CLI syntax.

Making code generators separate binaries? The benefit would be that flatc, flatcc, or any other schema compiler would be able to take advantage of the code generators. This goes both ways: maybe an existing code generator is abandonware, or is written in a language that does not work on some user platforms; then maybe there is another implementation written for another schema compiler.

That said, this is the chicken and egg problem I mentioned before: if the code generator uses bfbs, it must necessarily be able to read flatbuffers. Parsing JSON is clumsy. It is not a problem for code generators targeting other languages as long as there is some other generator to create the bfbs interface.

Maybe it is just necessary to bite the bullet and hand generate a bfbs interface for supported code generator languages.

As to cross platform scripting: it is not difficult to write a small wrapper script as bat and bash files because they don't really need to be maintained and argument parsing is done by the shells. Calling natively is a lot of work in the general case but maybe a simple system call in C is enough. There are also binary path issues, but that is true regardless of method.

aardappel commented 3 years ago

Yes, lets move forward with M3.

Like I said, I'd like there to be an enum (set by code or a command-line flag) that controls the richness of the bfbs information, where the first value is "just enough to make reflection work", then added attributes, then added comments, all the way to "give me everything". We then document clearly which fields are set/non-default at what level.
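The graduated levels could look something like the sketch below. The enum name, level names, and filter helper are all illustrative only; nothing here is an actual flatc flag or API.

```python
from enum import IntEnum

class BfbsDetail(IntEnum):
    """How much information flatc would write into the .bfbs (illustrative)."""
    REFLECTION = 0   # just enough to make runtime reflection work
    ATTRIBUTES = 1   # ...plus user-defined attributes
    COMMENTS = 2     # ...plus doc comments
    EVERYTHING = 3   # give me everything

def include_field(field_level: BfbsDetail, requested: BfbsDetail) -> bool:
    """A field documented as 'set at level L' is emitted iff L <= requested."""
    return field_level <= requested
```

Because the levels are ordered, "which fields are set at what level" can be documented once per field, and both writers and readers share the same comparison rule.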

Converting all current generators to work based on bfbs-generated code is going to generate a lot of churn, and to some extent is going to create a chicken-and-egg situation, since we require the C++ generator to be compiled to create reflection_generated.h. I agree we need a rich test case though, to ensure we are storing all information correctly. If we don't have a new test case for this direct bfbs usage yet, maybe we should start with one or a few languages.

Would be interesting if we could at some point relegate Parser to an internal data structure only :)

mikkelfj commented 3 years ago

@aardappel What is the benefit of having enum levels? It causes a lot of switches in the generating code, and the reader can ignore surplus information, and size hardly matters.

Maybe the benefit is in how much of the reader interface to implement? Or maybe it can simplify life for some limited schema parsers?

mikkelfj commented 3 years ago

@CasperN since I don't have a lot of time for moving on M3 atm., it would be helpful if you could help out with at least trying to understand the needs of flatcc wrt. a new bfbs format.

Some notable examples:

On a related note: flatcc does not support mutual type recursion across definitions in separate fbs files, although it does within the same file. I know this is important for flatc, as it is used in some Google-internal systems. The reason it is not supported in flatcc is partially just that I didn't spend time on it, and partially because of complications with name shadowing and file uniqueness. However, I think it should be possible to support in some limited cases: a file can include an already-included file and have access to those symbols even if the file is not physically included again - only the name shadowing will be processed differently for each file scope. The last part, I believe, is already handled by flatcc. It has a visibility map, so any symbol lookup is checked both for existence and for visibility.

mikkelfj commented 3 years ago

As to chicken and egg: at least for building bfbs buffers, flatcc has a low-level typeless builder interface on top of which it is relatively easy to build buffers. All builder-generated code performs calls into this, and so does the generated JSON parser:

https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/flatcc_builder.h

aardappel commented 3 years ago

@mikkelfj

size hardly matters

That is debatable.. for use cases where people want to use run-time reflection, having 5x the memory usage (and cache misses) for lots of metadata they don't use is not great.

aardappel commented 3 years ago

That said, I have been de-emphasizing the reflection use case of this data. The API for it is frankly horribly clumsy, and I'd prefer it if no one used it. The "mini reflection" functionality in C++ is nicer for that use case. But there are already existing users of bfbs-based reflection, so we continue to offer it.

mikkelfj commented 3 years ago

Maybe we should just keep bfbs more or less as is, and create a separate schema for code generation. bfbs has issues with json though.

aardappel commented 3 years ago

@mikkelfj if you read my replies above, I am explicitly arguing that a separate schema would be undesirable.

CasperN commented 3 years ago

Is the recommendation for languages to come up with their own reflection API, like mini-reflection, and not use reflection.fbs, which will be simultaneously a legacy reflection API and also the new code generation API? How important is backwards compatibility in the reflection API? Can we deprecate Object and remove JSON-exploding cycles?

CasperN commented 3 years ago

@mikkelfj

I think your concerns are summarized as

  1. Symbol / included file shadow analysis
  2. Unique file detection
  3. Mutual type recursion support

I think these are not concerns of the IR, but rather of the compiler front ends that generate the IR. I agree that it would be nicer if flatc and flatcc were more consistent about this though.

That made me realize that we have to include the filepaths in the bfbs to support the current default style of generating files. The IR must provide at least a relative path from where the front end was invoked to the included file. I think absolute paths may leak too much information about the caller's directory structure, but at the very least we will leak the platform the bfbs was generated on... (since Windows has backslash filepaths).
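One way to avoid the backslash leak would be to normalize paths to POSIX form before writing them into the bfbs. A sketch of that normalization (the function name is hypothetical; `pathlib` is standard Python):

```python
from pathlib import PureWindowsPath

def normalize_included_path(path: str) -> str:
    """Store include paths with forward slashes regardless of host OS.

    PureWindowsPath accepts both separators, so this handles paths
    produced on either platform without touching the filesystem.
    """
    return PureWindowsPath(path).as_posix()
```

With this, a bfbs generated on Windows and one generated on Linux would record identical relative paths for the same schema tree.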

colindjk commented 3 years ago

@CasperN I plan on looking into it more, but for rust specifically, couldn't code generation be done via macros?

The #[proc_macro] itself is very flexible and I think it may allow a pure Rust implementation. This would also make usage of the library much easier: new devs would simply add the crate just like they would any other, and pass a struct-like config into a macro.

adsharma commented 3 years ago

+1 for using macros in rust and decorators in python and tighter integration with serde/pyserde.


CasperN commented 3 years ago

I plan on looking into it more, but for rust specifically, couldn't code generation be done via macros?

That seems quite doable, though if we deprecate idl_gen_rust, we may or may not be able to read the bfbs without running into the chicken/egg problem. That's the only risk that requires some thought. I guess we could depend on another IR format like JSON, as proposed, or even FlexBuffers. This would be a decent solution to the somewhat unwieldy coupling between flatc and the flatbuffers library (#6149)

I see the same risks and benefits to python and decorators

liuliu commented 3 years ago

Exposing a stable internal structure would be great for many tools. At my previous company (Snap), we had a flatbuffers-based code generator that depended on flatc internals directly. That turned out to be a big hurdle when upgrading to new flatbuffers versions, because flatbuffers doesn't expose a stable C++ interface.

My current open-source project depends on flatbuffers internal representation, and I have to make a small C++ shim to expose structures I need through JSON such that I can write the rest of the codegen in Swift: https://github.com/liuliu/dflat/blob/unstable/src/parser/dflats.cpp

Exposing a backward-compatible representation like M3 suggested would help a lot of tools out there that the flatbuffers core doesn't know to upgrade smoother.

One other thing I would suggest is to add the ability to retain "unknown" attributes throughout the representation. These can be used by various extensions to generate extension-specific meanings (in my case, "primary", "index" or "unique", used to specify database constraints). This currently requires working inside the Parser object directly in flatc, which is not ideal.
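Retaining unknown attributes could be as simple as carrying a free-form attribute map on every definition in the IR. The sketch below is hypothetical structure, not the actual reflection.fbs layout; the set of "known" attribute names is illustrative.

```python
from dataclasses import dataclass, field

# Attributes the core compiler itself interprets (illustrative subset).
KNOWN_ATTRIBUTES = {"deprecated", "required", "id"}

@dataclass
class FieldDef:
    name: str
    # All attributes are kept verbatim in the IR; extensions pick out
    # the ones they understand ("primary", "index", "unique", ...).
    attributes: dict = field(default_factory=dict)

    def unknown_attributes(self) -> dict:
        """Attributes the core compiler does not interpret itself."""
        return {k: v for k, v in self.attributes.items()
                if k not in KNOWN_ATTRIBUTES}

f = FieldDef("user_id", {"id": "0", "primary": None, "index": None})
```

An extension like a database codegen would then read `unknown_attributes()` without ever touching the Parser internals.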

adsharma commented 3 years ago

Did you use flatbuffers as an IDL or did you actually generate swift code to parse flatbuffer encoded bytes?

For the former case, there is some support here: https://github.com/adsharma/flattools/tree/master/lang/swift

Like Wouter noted, the problem with scripting languages is that they're not as nice from a deployment point of view as small statically linked binaries natively packaged for popular platforms.


liuliu commented 3 years ago

@adsharma in our use case, we do both. We use flatbuffers' utilities to encode/decode bytes from the database, and also use it as an IDL to describe the data (which field is the primary key, which field is indexed) and generate some additional accessors/mutators on top of what flatbuffers' Swift binding already provides. That's why, so far, our integration has focused on emitting additional information when running flatc and using that to generate more code in addition to flatbuffers' own encoder/decoder.

adsharma commented 3 years ago

@liuliu that's the same use case flattools addresses. The idea is to do template-driven codegen. It has been used to generate JSON/YAML/SQL DDL among many output formats.

https://github.com/adsharma/flattools/blob/master/templates/fbs_template_yaml.yaml.j2

is open source. The other templates are not.

The templates assume a stable in-memory format like what's being discussed in this thread (__fbs_meta__ and friends in the code). There is some output language specific metadata (reserved words in the target language, type maps etc) that get added to the environment before invoking the template engine.

CasperN commented 3 years ago

@adsharma @liuliu

In what ways does the reflection API fall short, such that you had to build your own schema reader / inspect flatc internals?

liuliu commented 3 years ago

@adsharma thanks, will give it a deeper look at some point!

@CasperN I need to read more about the new reflection API to have an informed opinion. Previously, the reflection API was just not available (we've been using flatbuffers for 3-4 years now). For my case, we use the FlatBuffers IDL as the main IDL, but to add more flavors, we need more attributes declared and passed down to codegen.

adsharma commented 3 years ago

Yeah - ditto. The code is from Jan 2017. Not sure if reflection API existed then.

Secondly, I find a pure Python parser easier to enhance and modify than calling into an API. So then the question becomes: does the compatibility arise out of a shared grammar or a shared in-memory parse tree? I find them more or less equivalent, although you could argue that an in-memory parse tree is better for compatibility across code generators written in multiple languages.


CasperN commented 3 years ago

does the compatibility arise out of a shared grammar or shared in-memory parsed tree?

I don't think it's the grammar or parse tree that's important, but rather the type system that ends up being specified. Name resolution, importing, type checking, parsing, etc. are common features that aren't part of the grammar and should be shared. Though, maybe that's what you meant by "parsed tree".

CasperN commented 3 years ago

Ok so I poked around https://github.com/liuliu/dflat/blob/unstable/src/parser/dflats.cpp and it looks like it extracts Parser internals, like our code generators do, then dumps them into JSON. If we successfully refactor at least one code generator into using the IR, dflats will work.

I hope the same is true for flattools, but based on the discussion at the end of #6014, I suspect there will be subtle semantic differences despite sharing a grammar.

aardappel commented 3 years ago

@CasperN

Is the recommendation for languages to come up with their own reflection API, like mini-reflection, and not use reflection.fbs, which will be simultaneously a legacy reflection API and also the new code generation API? How important is backwards compatibility in the reflection API? Can we deprecate Object and remove JSON-exploding cycles?

Difficult question. It seems that only C++ has reflection functionality at all, in the form of reflection.fbs and the mini-reflection. For C++, the mini-reflection seems simpler, easier to use, and doesn't require a file, but it is also more limited. reflection.fbs is in theory nicer because, you know, it uses FlatBuffers itself :)

I do know we (used to) have internal users of reflection.fbs, so major breakage should maybe be avoided, but small breakage (renaming/deprecating fields) is maybe acceptable.

I think if languages want to implement reflection, they should feel free to choose which direction they take it.

CasperN commented 3 years ago

Ok here's my working list of "what to do"

I think reflection.fbs is actually in pretty good shape for being our IR; the few changes I can think of do not seem too difficult (though I'm sure more will come up when moving a code generator onto it). If so, then maybe working on documentation, ergonomics, and discoverability of the reflection API for these use cases will be most important, i.e. we just need a "so you want to make a code generator" page.

aardappel commented 3 years ago

"so you want to make a code generator"

Haha, yes, we need that too :)

CasperN commented 3 years ago

Eventually, after we do the last TODO list, I think we should aim for the following architecture (M3++ if you will):

Long term, I think we need to break up idl_parser.cpp, which imo has gotten far too complex. This refactoring effort will get the code generators off of Parser, isolating it, while also improving modularity. There should be a subsequent effort to simplify the Parser class, though maybe isolation behind the reflection API is sufficient.

dbaileychess commented 2 years ago

I think I could take a stab at converting the Lua generator over to this IR approach.

Just so I know what would be involved. I would do something like:

1) When flatc --lua ${SRCS} is invoked, have it actually do flatc -b --schema ${SRCS} and output the .bfbs to a temporary file.
2) Then invoke the new Lua generator and pass it the filepath of the temporary .bfbs file.
3) This new generator will read the .bfbs in and deserialize it back into a reflection::Schema object.
4) The generator will loop over the fields of the reflection::Schema and generate the corresponding Lua files.
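The flow above can be sketched as a small driver. All names here are hypothetical stand-ins for the flatc steps (the real implementation would live inside flatc in C++); the callables are injected so the orchestration itself is testable.

```python
import os
import tempfile

def drive_generator(compile_to_bfbs, deserialize_schema, generate_lua, srcs):
    """Orchestrate: srcs -> temp .bfbs -> Schema object -> generated Lua files.

    The three callables stand in for the steps described above:
      compile_to_bfbs    ~ flatc -b --schema ${SRCS}
      deserialize_schema ~ read .bfbs back into a reflection::Schema
      generate_lua       ~ walk the schema and emit Lua files
    """
    fd, bfbs_path = tempfile.mkstemp(suffix=".bfbs")
    os.close(fd)
    try:
        compile_to_bfbs(srcs, bfbs_path)
        schema = deserialize_schema(bfbs_path)
        return generate_lua(schema)
    finally:
        os.remove(bfbs_path)   # temp .bfbs is an implementation detail

# Toy stand-ins to show the shape of the contract:
files = drive_generator(
    compile_to_bfbs=lambda srcs, path: open(path, "wb").close(),
    deserialize_schema=lambda path: {"objects": [{"name": "Monster"}]},
    generate_lua=lambda s: [o["name"] + ".lua" for o in s["objects"]],
    srcs=["monster.fbs"])
```

Keeping the temp-file handling in one place also makes it easy to later swap in the in-memory shortcut discussed below in the thread.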

aardappel commented 2 years ago

@dbaileychess is the new version of the Lua generator still based on the current C++ code? Because then I would say it would be nicer to have a shortcut path to pass the .bfbs in memory, which means you don't need to deal with temp files on multiple platforms, etc. I know we'll need those temp files eventually, but no need to complicate the first step. Or is part of the exercise to make the Lua generator its own exe? Maybe in a v2?

dbaileychess commented 2 years ago

I was sort of aiming for a middle ground between the two: 1) have a clean separation of the API, where the new generator just receives a filepath and/or memory, i.e. no flatc internals, and 2) avoid having a separate binary to deal with.

dbaileychess commented 2 years ago

So with the Lua generator converted over to using bfbs, the framework in place, and major gaps filled in reflection.fbs, I think we can proceed with converting more of the generators.