[RFC] First class codegen support

This is a brain dump of an idea I've had for a long time about how we can make codegen a first-class citizen that's easy to use and orchestrate in ReScript. I'm posting it here in the Rewatch repo because exploring it involves changes to the build system, and Rewatch looks like a great place to try that kind of change.

Summary

I'm proposing first-class support for code generation in the ReScript build system and compiler. This would make it easy to embed other languages directly in your code: SQL, EdgeQL, GraphQL, Markdown, CSS, anything really. Generators can be written in any language, and the build system takes care of everything from deciding when to trigger each generator most efficiently to managing the files each generator produces (regenerate, delete, etc).

Here's a quick pseudo example of how this idea could work for embedding other languages, implementing a type safe SQL code generator:

// UserResolvers.res
// module Query is replaced with a reference to the file generated by the sql generator
module Query = %gen.sql(`
  select * from users where id = $1
`)

// The file generated by the sql generator has a function called `query`, that takes an argument `id`
let getUserById = (~id) => Query.query(id)

Let's break down at a high level how this pseudo example could work.

1. The build system scans UserResolvers.res before compiling it and sees that it contains %gen.sql. It looks for a generator registered under the name sql.
2. It finds our sql generator and calls it with some data, including the file name, the string inside %gen.sql(), and a few other things that can help with codegen. The generator in this example uses information from a connected SQL database to type the query fed to it, and generates a simple function to execute the query. Since the generator is responsible for emitting an actual .res file rather than rewriting an AST, it can be written in any language, as long as we can call it and feed it data via stdin.
3. The generator runs and outputs UserResolvers__sql.res. The build system knows this and now tracks UserResolvers__sql.res as a dependency, meaning it knows when to clean up the generated file, and so on.
4. A built-in PPX in the compiler turns the module Query = %gen.sql part into module Query = UserResolvers__sql. This is a very simple heuristics-based swap from the embedded code definition to the module its generator generates, powered by rules for naming the files emitted by generators.
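To make the naming rule and the swap concrete, here is a minimal TypeScript sketch of steps 2 to 4 as pure functions. Everything in it is an assumption made for illustration: the `GeneratorRequest` field names, the JSON encoding for stdin, and the text-level replace standing in for the built-in PPX. None of it is a settled design.

```typescript
// Hypothetical shape of the data sent to a generator on stdin.
// The field names are assumptions, not part of the proposal.
interface GeneratorRequest {
  generator: string; // e.g. "sql"
  sourceFile: string; // e.g. "src/UserResolvers.res"
  payload: string; // the text between the backticks in %gen.sql(`...`)
}

// Naming rule: "src/UserResolvers.res" + "sql" -> "UserResolvers__sql".
function generatedModuleName(sourceFile: string, generator: string): string {
  const base = sourceFile.split("/").pop() ?? sourceFile;
  const moduleName = base.replace(/\.resi?$/, "");
  return `${moduleName}__${generator}`;
}

// Serialize the request as JSON, ready to be written to the generator's stdin.
function encodeRequest(request: GeneratorRequest): string {
  return JSON.stringify(request);
}

// Text-level stand-in for the built-in PPX: replace the first matching
// embed with a reference to the generated module.
function swapEmbed(source: string, request: GeneratorRequest): string {
  const embed = `%gen.${request.generator}(\`${request.payload}\`)`;
  return source.replace(embed, generatedModuleName(request.sourceFile, request.generator));
}

const request: GeneratorRequest = {
  generator: "sql",
  sourceFile: "src/UserResolvers.res",
  payload: "\n  select * from users where id = $1\n",
};
const source = "module Query = %gen.sql(`\n  select * from users where id = $1\n`)";
console.log(swapEmbed(source, request)); // module Query = UserResolvers__sql
```

A real implementation would do the swap on the AST rather than on text, and would need a naming rule that can tell apart several embeds in one file, as raised later in this thread.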

Generation will be easy to cache, since regenerating the files is separate from running the compiler. This means the build system and the generator decide in tandem when to regenerate code, which in turn means you pay the cost of code generation only when the source that feeds the generator changes.
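One way to decide when to regenerate is to record a hash of the embedded source in the generated file and compare it on the next build, in the spirit of the `@sourceHash` comment in the CSS modules example further down. The sketch below assumes that header format. It uses a small FNV-1a hash purely as a stand-in for whatever content hash the build system would really use.

```typescript
// FNV-1a (32-bit) over UTF-16 code units, as 8 hex digits.
// A stand-in for a real content hash; not part of the proposal.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Read the hash recorded in a generated file's header, if there is one.
function recordedHash(generatedFile: string): string | undefined {
  const match = /@sourceHash\("([0-9a-f]+)"\)/.exec(generatedFile);
  return match === null ? undefined : match[1];
}

// Regenerate when there is no generated file yet, or when the recorded
// hash no longer matches the hash of the current payload.
function needsRegeneration(payload: string, generatedFile: string | undefined): boolean {
  if (generatedFile === undefined) return true;
  return recordedHash(generatedFile) !== fnv1a(payload);
}

const payload = "select * from users where id = $1";
const generated = `// @sourceHash("${fnv1a(payload)}")\nlet query = id => id`;
console.log(needsRegeneration(payload, undefined)); // true: nothing generated yet
console.log(needsRegeneration(payload, generated)); // false: payload unchanged
```

Hashing the raw text means a whitespace-only edit still triggers regeneration. Hashing a normalized form of the payload would avoid that at the cost of some parsing.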

There's of course a lot of subtlety and detail in making this work well, be performant, and so on. But the gist is the above. I'll go into more detail with examples below.

Goals

The idea behind this is that codegen is a fairly simple tool that's effective in many use cases, but is too inaccessible right now. To do codegen today, you need to either write a PPX or, for standalone codegen, have:

- Your own watcher that watches whatever source files you generate from
- Your own dependency management of the files you generate
- Separate build commands/processes for your code generators

With the approach to codegen outlined above, you'll instead need:

- A code generator written in whatever language you want
- Some simple configuration

...and that's it. The ReScript compiler and build system handle the rest.
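As a rough idea of how small that configuration could be, here is a TypeScript sketch of a generator registry and the lookup the build system would do for a given %gen.<name>. The config shape, the field names and the `./generators/sql.js` path are all hypothetical. The proposal does not define a config format.

```typescript
// Hypothetical config shape: one entry per registered generator.
interface GeneratorConfig {
  name: string; // matches the <name> in %gen.<name>
  command: string[]; // how to start the generator process
}

interface ProjectConfig {
  generators: GeneratorConfig[];
}

// Look up the generator registered under a name, failing loudly if none is.
function findGenerator(config: ProjectConfig, name: string): GeneratorConfig {
  const found = config.generators.find((g) => g.name === name);
  if (found === undefined) {
    throw new Error(`No generator registered for %gen.${name}`);
  }
  return found;
}

const config: ProjectConfig = {
  generators: [
    // Made-up local generator, started with Bun.
    { name: "sql", command: ["bun", "./generators/sql.js"] },
  ],
};
console.log(findGenerator(config, "sql").command.join(" ")); // bun ./generators/sql.js
```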

Concerns

Performance

Performance is king. We need to be very mindful to keep builds as fast as possible. This includes intelligent caching, but also providing good starter projects for building performant generators.

We can of course ask users to write generators in performant languages like Rust and OCaml. But, one strength of this proposal is that you should be able to write generators in JS and ReScript directly. This has several benefits:

- Using ReScript to write ReScript tooling is nice because ReScript is obviously a nice language
- The JS ecosystem is huge and has tooling and packages for almost everything
- All of the regular reasons JS is nice to write: not having to build and distribute binaries for each target platform, etc

In order to make the JS route as performant as possible, we can for example recommend using https://bun.sh/, a JS runtime with fast startup, and include tips on how to keep Bun startup performance fast.

As for the design of the generators themselves, they can hopefully be designed in a way so that they can:

- Run async in dev mode, so they don't slow down the regular compiler
- Run in parallel
- Be heavily cacheable

Tooling (LSP, syntax highlighting, etc)

Embedding languages in other languages is a pretty common practice. For example, we already have both graphql-ppx and RescriptRelay embedding GraphQL in ReScript. So for tooling, it's a matter of adjusting whatever tooling already exists to be able to understand embedded code in ReScript.

Error reporting

In an ideal world, code generators can emit build errors that the build system picks up and, by extension, reports to the user via the editor tooling. Having codegen errors picked up and treated like any other compiler error would be the best possible outcome.
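For generator errors to show up in the right place in the editor, an error position inside the embedded string has to be mapped back to a line and column in the .res file. Here is a small TypeScript sketch of that mapping. The `GeneratorError` shape (a message plus an offset into the payload) is an assumption, not a defined protocol.

```typescript
interface Position {
  line: number; // 1-based
  column: number; // 1-based
}

// Convert a character offset in a text into a 1-based line/column position.
function offsetToPosition(text: string, offset: number): Position {
  let line = 1;
  let column = 1;
  for (let i = 0; i < offset && i < text.length; i++) {
    if (text[i] === "\n") {
      line++;
      column = 1;
    } else {
      column++;
    }
  }
  return { line, column };
}

// Hypothetical error shape reported by a generator.
interface GeneratorError {
  message: string;
  payloadOffset: number; // offset into the embedded string
}

// Translate a payload-relative error into a "line:column: message" string
// relative to the source file. payloadStart is where the payload begins.
function locateError(source: string, payloadStart: number, error: GeneratorError): string {
  const { line, column } = offsetToPosition(source, payloadStart + error.payloadOffset);
  return `${line}:${column}: ${error.message}`;
}

const source = "module Query = %gen.sql(`\n  selec * from users\n`)";
const payloadStart = source.indexOf("`") + 1;
const error: GeneratorError = { message: 'syntax error at "selec"', payloadOffset: 3 };
console.log(locateError(source, payloadStart, error)); // 2:3: syntax error at "selec"
```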

Future and ideas

Here are some loose ideas and thoughts:

- We can have a dedicated editor code action to rerun a code generator whenever needed. Good for generators where you want full control of when they're rerun.
- Generators could be driven both by embedded languages (as in the %gen.sql example above) and by fully separate files (.gql, .sql, etc).
- Generators could be both installable (npm packages) and hand-rolled locally (point to a local file that is the code generator). In the package case, we could find a way for each package to provide its own configuration.
- We can provide "optimized" general tooling for writing code generators in ReScript (and OCaml?).
- We could support AST-based generation, that is, allow regular ReScript code in %gen.<generator> and pass a representation of that AST to generators.

Use case examples

Not sure we actually want to encourage all of these, but just to show capabilities.

Embedding EdgeDB

I did an experiment a while back for embedding EdgeDB inside of ReScript: https://twitter.com/___zth___/status/1666907067192320000

That experiment would fit great with this approach:

1. A generator for EdgeDB is written in JS and registered for %gen.edgedb.
2. That generator calls out to the general EdgeDB tooling to produce the types needed.
3. That's it. The build system handles the rest.

Embedding GraphQL

The same goes for GraphQL. For those who don't want to use a PPX-based solution, it'd be easy to build a generator (something similar to https://the-guild.dev/graphql/codegen perhaps) that just emits ReScript types and helpers.

Type providers: OpenAPI clients

F# has a concept of "type providers": https://learn.microsoft.com/en-us/dotnet/fsharp/tutorials/type-providers/ We could do something similar with this approach.

Imagine you have a URL to an OpenAPI specification. We'll take GitHub's as an example: https://raw.githubusercontent.com/github/rest-api-description/main/descriptions/ghes-3.9/ghes-3.9.json

Now, imagine there's a generator for turning an OpenAPI spec into a ReScript client, ready to use. We could write a generator to hook up that OpenAPI generator:

module GitHubAPIClient = %gen.openapi("https://raw.githubusercontent.com/github/rest-api-description/main/descriptions/ghes-3.9/ghes-3.9.json")

// Pseudo
GitHubAPIClient.getUserById(~id="githubUserId")

Roll your own simple CSS modules

You could use this to roll your own simple CSS modules.

Imagine a code generator registered for gen.cssModules.

// SomeModule.res
module Styles = %gen.cssModules(`
  .primary {
    color: black;
  }
`)

let button = <Button className=Styles.primary />

The code generator is called with the CSS string above, plus relevant metadata. It reads the CSS using standard CSS tooling and, just like CSS modules, hashes each class name based on the name of the file it's defined in plus the local class name. It then outputs two files:

/* __generated__/SomeModule__cssModules.css */

/* This file is automatically generated. Do not edit manually. */

.dzs16n {
  color: black;
}

// __generated__/SomeModule__cssModules.res

// This file is automatically generated. Do not edit manually.
// @sourceHash("<file-hash-here>")

@inline let primary = "dzs16n"
%%raw(`import "./SomeModule__cssModules.css"`)

And here's the original file after it's been transformed by the compiler's built-in codegen PPX:

// SomeModule.res
module Styles = SomeModule__cssModules

let button = <Button className=Styles.primary />

There, we've reinvented a small version of CSS modules, but fully integrated into the ReScript compiler.
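The hashing step of such a generator could look roughly like the TypeScript sketch below. It is deliberately naive and rests on several assumptions: a regex stands in for the "standard CSS tooling" (so it treats every `.name` as a class selector), FNV-1a stands in for the real hash (so it will not reproduce `dzs16n`), and class names must already be valid ReScript identifiers.

```typescript
// FNV-1a (32-bit) over UTF-16 code units, rendered in base 36.
// A stand-in for whatever hash a real generator would use.
function shortHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(36);
}

interface CssModuleOutput {
  css: string; // contents of <Module>__cssModules.css
  res: string; // contents of <Module>__cssModules.res
}

function generateCssModule(fileName: string, css: string): CssModuleOutput {
  const moduleName = fileName.replace(/\.res$/, "");
  const hashed = new Map<string, string>();
  // Rewrite each ".local" to ".c<hash>", where the hash depends on the file
  // name and the local class name. The "c" prefix avoids a leading digit.
  const outCss = css.replace(/\.([A-Za-z_][A-Za-z0-9_-]*)/g, (_match: string, local: string) => {
    const name = hashed.get(local) ?? "c" + shortHash(`${fileName}:${local}`);
    hashed.set(local, name);
    return "." + name;
  });
  // One binding per class, then a top-level raw import of the CSS file.
  const bindings = Array.from(hashed, ([local, name]) => `@inline let ${local} = "${name}"`);
  const res = bindings.concat([`%%raw(\`import "./${moduleName}__cssModules.css"\`)`]).join("\n");
  return { css: outCss, res };
}

const output = generateCssModule("SomeModule.res", ".primary {\n  color: black;\n}");
console.log(output.css);
console.log(output.res);
```

Because the hash includes the file name, a `.primary` class in another file gets a different generated name, which is what keeps the styles local.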

Next step: a PoC

There's a lot to explore and talk about if there's interest in this route. A good next step would be to pick one simple generator and build a PoC of how integrating it into the build system could look. @jfrolich we talked about this briefly.

If you're interested in exploring this further, we could write up a simple spec of what needs to happen where. What do you say?

I'm all for codegen.

Just to be sure I understand: bsb today supports generators on files, so this would basically allow generators on inline strings?

> The generator runs and outputs UserResolvers__sql.res

Maybe it should output UserResolvers__sql__Query.res to distinguish between multiple calls inside the same file.

> Just to be sure I understand: bsb today supports generators on files, so this would basically allow generators on inline strings?

Exactly. Generation is triggered by the source code itself, not only by static build configuration in the consumer project. Plus, each generator is installable and controls which files it generates. The compiler also replaces the embedded source with a reference to the generated module.

The main idea is to make this feel as seamless as possible. You shouldn't have to think about generation etc, it should feel like a language feature for embedded code.

> Maybe it should output UserResolvers__sql__Query.res to distinguish between multiple calls inside the same file.

Yes, this is something we'll need to solve in a good way. I have some prior art in the EdgeDB experimentation I did.

Thanks for the super detailed RFC @zth. Very keen to see this become more integrated in the ecosystem 🎉. I need to ponder this a bit more. Some initial thoughts:

1. For this to be performant we should do as little parsing as possible. But we would still have to scan every file / every line, so there is quite some overhead. Currently we don't read any of the files, and just orchestrate the build. As the ecosystem is moving away from PPXs, this may be a valid thing to do, but I do think that PPXs (especially when not run sequentially) could be more performant.
2. There are some interesting problems when we have things that depend on one another. For instance, let's say we can query a schema from some database to generate types, and afterwards we have a decco-like generator that adds JSON encoders / decoders. Either there has to be some combination of the two in one of the packages (so the 'schema generator' implements an optional 'decco' generator), or we'd have to have some special syntax that generates multiple things and does so recursively (%decco(%sql("select ... ;"))). Perhaps not even recursively, but by setting an explicit order of operations through the config or something. This is something I think we need to have some restrictions on / a clear design for from the get-go.
3. You mention the generators could simply generate ReScript files. I think that this will make things less performant in the end, especially with multiple generators per file. ~As each transformation will require another full IO read / write (this means that we'd have at least 2 more reads per generated file)~ -- we could probably handle IO ourselves, and just pipe through stdin / stdout, but it would still have to parse the entire file itself. Perhaps we can figure out a function signature that we can use. For instance, for sql("some-query") we call it with the query and filename, and simply replace the sql ourselves. This will require some parsing on our end, especially when we have nested operations like above, but it will allow us to chain operations, make changes in memory, and only write once. We could still cache by AST nodes (which would be nicer, because a change in whitespace or formatting would otherwise still trigger a regeneration).
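To illustrate the "function signature" and "explicit order of operations" ideas, here is a TypeScript sketch in which each generator is a function from payload and filename to replacement code, and the build system chains them in memory in a configured order. The `InlineGenerator` type and both toy generators are made up for this example. They do not query a database or implement decco.

```typescript
// Hypothetical signature: a generator takes a payload plus the file name
// and returns the code that should replace it.
type InlineGenerator = (payload: string, fileName: string) => string;

// Run generators in an explicit order, feeding each one the previous
// output. Nothing is written to disk until the caller decides to.
function runPipeline(
  order: string[],
  registry: Record<string, InlineGenerator>,
  payload: string,
  fileName: string
): string {
  return order.reduce((current, name) => {
    const generator = registry[name];
    if (generator === undefined) {
      throw new Error(`No generator registered under "${name}"`);
    }
    return generator(current, fileName);
  }, payload);
}

// Toy generators, for illustration only.
const registry: Record<string, InlineGenerator> = {
  sql: (query) => `/* ${query} */\ntype row = {id: int}`,
  decco: (code) => `${code}\n/* encoders and decoders for row would go here */`,
};

const result = runPipeline(["sql", "decco"], registry, "select id from users", "Users.res");
console.log(result);
```

With this shape the order lives in configuration rather than in nested syntax, so %decco(%sql(...)) would not need to be parsed recursively.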

Anywho - just some thoughts. Again. Exciting stuff 🙌

Thank you for your reply @rolandpeelen !

A quick reply, mostly in response to points 1 and 2: The simplest and likely most performant way is to scan each file as text, with a simple regexp or similar for %gen.<generatorName>(<payload>). We'd need to account for multiple generators in multiple places in a single file, etc. So there's a bit of complexity to figure out, but I'm fairly positive we can get away with just scanning the text. That's generally performant, but it does of course introduce overhead.
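A text scan along those lines could look like the TypeScript sketch below. It is a simplification under stated assumptions: it only recognizes backtick-delimited payloads, does not handle escaped backticks, and will also match embeds that appear inside comments or string literals. The `Embed` shape is invented for the example.

```typescript
interface Embed {
  generator: string; // the <generatorName> part
  payload: string; // the text between the backticks
  start: number; // offset of "%gen." in the source
}

// Find every %gen.<name>(`...`) in a source file in a single pass.
function findEmbeds(source: string): Embed[] {
  const pattern = /%gen\.([A-Za-z][A-Za-z0-9_]*)\(`([^`]*)`\)/g;
  const embeds: Embed[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const [, generator = "", payload = ""] = match;
    embeds.push({ generator, payload, start: match.index });
  }
  return embeds;
}

const file = [
  "module Query = %gen.sql(`select * from users where id = $1`)",
  "module Styles = %gen.cssModules(`.primary { color: black; }`)",
].join("\n");

for (const embed of findEmbeds(file)) {
  console.log(embed.generator, embed.start, embed.payload);
}
```

A cheap `source.indexOf("%gen.")` check before running the regexp would let files without any embeds be skipped almost for free.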

As a benchmark, the Relay compiler reads all target files in the source project (.res in ReScript, .js in JavaScript, etc) as text and looks for GraphQL tags to extract, and the Relay compiler is very performant.

It's also worth separating dev (watch) mode from running a single build. In a single build this would obviously need to be sequential, but in dev mode it could all run fully in parallel with the regular compiler process, meaning the file contents wouldn't need to be scanned in the main compiler flow. That in turn means it would stay out of the way of the regular compilation process.

The effect would be generators causing recompiles whenever they write or change files, but that should be a pretty minor thing. Especially if you combine it with committing generated files.

What makes me confident we can make this work is that I've already used this type of workflow for several years with the RescriptRelay compiler, and it works really well.

Happy to discuss more! Performance is really the key here.

I guess it could be useful to also compare it to the current PPX approach, although they're different in what they're trying to solve.

No PPXes: no preprocessing, so it's fast. For each PPX, there's blocking preprocessing regardless of whether the file contains anything for that PPX: a deep copy of the AST is handed to each PPX, the build waits for the PPX to finish (they must run in sequence), and so on.

No generators: no preprocessing, so it's fast. For each generator, there's the (potentially slight) performance penalty of scanning the text (not a deep copy of the AST) for generators. All generators are found in one pass.

Some thoughts after a chat with @jfrolich

We can use the PPX not only to replace, but also to extract the embedded content and append it to a file. We can then use that file for generation between parsing and compiling. That way we won't have to scan all the files manually, only files with some suffix that we've marked somewhere.

The extraction process could also include some parse data, like the type that it's working on (for instance when trying to generate a decoder or something like that), or some arguments. Basically a very simple AST.

@zth -- Shall we move this content / condense it into a wiki?

> @zth -- Shall we move this content / condense it into a wiki?

Sounds good!
