rescript-lang / rewatch

Rewatch is an alternative build system for the Rescript Compiler.

[RFC] First class codegen support #62

Open zth opened 1 year ago

zth commented 1 year ago

This is a brain dump of an idea I've had for a long time around how we can make codegen a first class citizen that's easy to use and orchestrate in ReScript. I'm posting this here in the Rewatch repo because exploring it involves changes to the build system, and Rewatch looks like a great place to try that type of change.

Summary

Proposing first class support for code generation in the ReScript build system and compiler. This can enable easily embedding other languages directly in your code. SQL, EdgeQL, GraphQL, markdown, CSS - anything really. Generators can be written in any language, and the build system will take care of everything from when to trigger the generators most efficiently, to managing the generated files from each generator (regenerate, delete, etc).

Here's a quick pseudo example of how this idea could work for embedding other languages, implementing a type safe SQL code generator:

// UserResolvers.res
// module Query is replaced with a reference to the file generated by the sql generator
module Query = %gen.sql(`
  select * from users where id = $1
`)

// The file generated by the sql generator has a function called `query`, that takes an argument `id`
let getUserById = (~id) => Query.query(id)

Let's break down at a high level how this pseudo example could work.

  1. The build system scans UserResolvers.res before it compiles it, and sees that it has %gen.sql. It looks for a generator registered under the sql name.
  2. It finds our sql generator and calls it with some data including the file name, the string inside of %gen.sql(), and a few other things that can help with codegen. The generator in this example will leverage information from a connected SQL database to type the query fed to it, and generate a simple function to execute the query. Since the generator is responsible for emitting an actual .res file rather than rewriting an AST, it can be written in any language, as long as we can call it and feed it data via stdin.
  3. The generator runs and outputs UserResolvers__sql.res. The build system knows this and now handles UserResolvers__sql.res as a dependency, meaning it knows when to clean up the generated file, and so on.
  4. A built-in PPX in the compiler turns the module Query = %gen.sql part into module Query = UserResolvers__sql. This is a very simple heuristics-based swap from the embedded code definition to the module its generator generates, powered by rules around how to name files emitted by generators.
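
To make steps 3 and 4 a bit more concrete, here is a minimal TypeScript sketch of the naming rule and the swap. It is only an approximation: the proposal does the swap with a PPX inside the compiler, while this sketch works on plain text and assumes a single backtick-delimited payload. `generatedModuleName` and `rewriteEmbed` are hypothetical names, not part of ReScript or Rewatch.

```typescript
// Hypothetical helper: derives the generated module name from the source
// file and the generator name, e.g. UserResolvers.res + sql -> UserResolvers__sql.
function generatedModuleName(sourceFile: string, generatorName: string): string {
  const base = sourceFile.split("/").pop() ?? sourceFile;
  const moduleName = base.replace(/\.resi?$/, "");
  return `${moduleName}__${generatorName}`;
}

// Hypothetical helper: swaps the first %gen.<generatorName>(`...`) in the
// source text for a reference to the generated module.
// Assumes generatorName is a plain identifier (no regex metacharacters).
function rewriteEmbed(source: string, sourceFile: string, generatorName: string): string {
  const embed = new RegExp("%gen\\." + generatorName + "\\(`[^`]*`\\)");
  const target = generatedModuleName(sourceFile, generatorName);
  // A replacer function avoids any special handling of "$" in the replacement.
  return source.replace(embed, () => target);
}
```

With the example above, `rewriteEmbed` applied to UserResolvers.res with the sql generator would leave `module Query = UserResolvers__sql` in place of the embedded query.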

Generation will be easily cacheable, since regeneration of the files is separate from the compiler running. This means that the build system and the generator decide in tandem when to regenerate code. In turn, you pay the cost of code generation only when the source code that drives the generation changes.
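
One way the "regenerate only when needed" check could look, as a rough TypeScript sketch. It assumes the generated file records a hash of the embedded source in a `@sourceHash("...")` header comment, like the generated file in the CSS modules example further down. The hash function (FNV-1a over UTF-16 code units) is just a dependency-free stand-in, and `hashSource`, `recordedHash` and `needsRegeneration` are made-up names.

```typescript
// Stand-in hash: 32-bit FNV-1a over UTF-16 code units, rendered as 8 hex digits.
function hashSource(payload: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < payload.length; i++) {
    hash ^= payload.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Reads the hash recorded in a generated file's header comment, if any.
function recordedHash(generatedContents: string): string | undefined {
  const match = /@sourceHash\("([0-9a-f]+)"\)/.exec(generatedContents);
  return match ? match[1] : undefined;
}

// Regenerate when there is no generated file yet, or its recorded hash
// no longer matches the embedded source.
function needsRegeneration(payload: string, generatedContents: string | undefined): boolean {
  if (generatedContents === undefined) return true;
  return recordedHash(generatedContents) !== hashSource(payload);
}
```

The build system could run this check per embed and skip calling the generator entirely when it returns false.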

There's of course a lot of subtlety and detail to how to make this work well, be performant, and so on. But the gist is the above. I'll detail with more examples later.

Goals

The idea behind this is that codegen is a fairly simple tool that's efficient in many use cases, but is too inaccessible right now. In order to do codegen today, you need to either write a PPX, or for separate codegen have:

With the approach to codegen outlined above, you'll instead need:

...and that's it. The ReScript compiler and build system handles the rest.

Concerns

Performance

Performance is king. We need to be very mindful to keep builds as fast as possible. This includes intelligent caching etc, but also setting up good starter projects for building performant generators.

We can of course ask users to write generators in performant languages like Rust and OCaml. But, one strength of this proposal is that you should be able to write generators in JS and ReScript directly. This has several benefits:

In order to make the JS route as performant as possible, we can for example recommend using https://bun.sh/, a JS runtime with fast startup, and include tips on how to keep Bun startup performance fast.

As for the design of the generators themselves, they can hopefully be designed in a way so that they can:

Tooling (LSP, syntax highlighting, etc)

Embedding languages in other languages is a pretty common practice. For example, we already have both graphql-ppx and RescriptRelay embedding GraphQL in ReScript. So for tooling, it's a matter of adjusting whatever tooling already exists to be able to understand embedded code in ReScript.

Error reporting

In an ideal world, code generators can emit build errors that the build system picks up, and by extension reports to the user via the editor tooling. This would be the absolute best solution, if codegen errors are picked up and treated like any compiler error.
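
As a rough illustration of what picking up a generator error could involve, here is a TypeScript sketch that maps an error position inside the embedded payload back to a line and column in the host .res file. Everything here is an assumption for the sake of the example: the `GeneratorError` shape, the `toFilePosition` helper, and the idea that the build system knows where the payload starts in the file.

```typescript
// Hypothetical error shape a generator could report; not an existing format.
interface GeneratorError {
  message: string;
  payloadOffset: number; // zero-based character offset into the embedded payload
}

// One-based line and column in the host file.
interface FilePosition {
  line: number;
  column: number;
}

// Translates a payload-relative offset into a position in the host file,
// given the index at which the payload starts in the file contents.
function toFilePosition(
  fileContents: string,
  payloadStart: number,
  error: GeneratorError
): FilePosition {
  const absolute = payloadStart + error.payloadOffset;
  const before = fileContents.slice(0, absolute);
  const line = before.split("\n").length;
  // lastIndexOf returns -1 when there is no newline, which still gives a one-based column.
  const column = absolute - before.lastIndexOf("\n");
  return { line, column };
}
```

With positions expressed in terms of the host file, editor tooling could show a generator error at the right place in the embedded code, the same way it shows a compiler error.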

Future and ideas

Here are some loose ideas and thoughts:

Use case examples

Not sure we actually want to encourage all of these, but just to show capabilities.

Embedding EdgeDB

I did an experiment a while back for embedding EdgeDB inside of ReScript: https://twitter.com/___zth___/status/1666907067192320000

That experiment would fit great with this approach:

Embedding GraphQL

The same goes for GraphQL. For those who don't want to use a PPX-based solution, it'd be easy to build a generator (something similar to https://the-guild.dev/graphql/codegen perhaps) that just emits ReScript types and helpers.

Type providers: OpenAPI clients

F# has a concept of "type providers": https://learn.microsoft.com/en-us/dotnet/fsharp/tutorials/type-providers/ We could do something similar with this approach.

Imagine you have a URL to an OpenAPI specification. We'll take GitHub's as an example: https://raw.githubusercontent.com/github/rest-api-description/main/descriptions/ghes-3.9/ghes-3.9.json

Now, imagine there's a generator for turning an OpenAPI spec into a ReScript client, ready to use. We could write a generator to hook up that OpenAPI generator:

module GitHubAPIClient = %gen.openapi("https://raw.githubusercontent.com/github/rest-api-description/main/descriptions/ghes-3.9/ghes-3.9.json")

// Pseudo
GitHubAPIClient.getUserById(~id="githubUserId")

Roll your own simple CSS modules

You could use this to roll your own simple CSS modules.

Imagine a code generator registered for gen.cssModules.

// SomeModule.res
module Styles = %gen.cssModules(`
  .primary {
    color: black;
  }
`)

let button = <Button className=Styles.primary />

The code generator is called with the CSS string above, and relevant metadata. It reads the CSS using standard CSS tooling, and just like CSS modules it hashes each class name based on the file name it's defined in, plus the local class name. It then outputs two files:

/* __generated__/SomeModule__cssModules.css */

/* This file is automatically generated. Do not edit manually. */

.dzs16n {
  color: black;
}

// __generated__/SomeModule__cssModules.res

// This file is automatically generated. Do not edit manually.
// @sourceHash("<file-hash-here>")

@inline let primary = "dzs16n"
%%raw(`import "./SomeModule__cssModules.css"`)

And, the original file after it's transformed by the internal compiler PPX for the code gen:

// SomeModule.res
module Styles = SomeModule__cssModules

let button = <Button className=Styles.primary />

There, we've reinvented a small version of CSS modules, but fully integrated into the ReScript compiler.
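
For a feel of how small such a generator could be, here is a TypeScript sketch of the class name hashing and the two emitted files. It skips the CSS parsing step and takes an already parsed map of class name to declarations. The hash (djb2, xor variant) is a stand-in rather than what real CSS modules tooling uses, and `shortHash`, `scopedClassName` and `emitCssModule` are made-up names.

```typescript
// Stand-in hash (djb2, xor variant) over UTF-16 code units, rendered in base 36.
function shortHash(text: string): string {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = Math.imul(hash, 33) ^ text.charCodeAt(i);
  }
  return (hash >>> 0).toString(36);
}

// Scopes a local class name to the file it is defined in.
function scopedClassName(fileName: string, localName: string): string {
  // Prefix with a letter so the result never starts with a digit.
  return "c" + shortHash(`${fileName}:${localName}`);
}

// Builds the contents of the generated .css and .res files.
// `rules` maps a local class name to its declarations (parsing is out of scope here).
function emitCssModule(fileName: string, rules: Record<string, string>) {
  const moduleName = fileName.replace(/\.res$/, "") + "__cssModules";
  let css = "";
  let res = "";
  for (const [localName, declarations] of Object.entries(rules)) {
    const scoped = scopedClassName(fileName, localName);
    css += `.${scoped} {\n  ${declarations}\n}\n`;
    res += `@inline let ${localName} = "${scoped}"\n`;
  }
  // Top-level raw import so the bundler picks up the generated stylesheet.
  res += `%%raw(\`import "./${moduleName}.css"\`)\n`;
  return { moduleName, css, res };
}
```

Calling `emitCssModule("SomeModule.res", { primary: "color: black;" })` would produce file contents shaped like the two generated files above, with a different hashed class name.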

Next step: a PoC

There's a lot to explore and talk about if there's interest in this route. A good next step would be to pick one simple generator, and PoC how it could look integrating it into the build system. @jfrolich we talked about this briefly.

If you're interested in exploring this further, we could set up a simple spec of what needs to happen where. What do you say?

tsnobip commented 1 year ago

I'm all for codegen.

Just to be sure I understand: bsb today supports generators on files, so basically this would allow generators on inline strings?

> The generator runs and outputs UserResolvers__sql.res

Maybe it should output UserResolvers__sql__Query.res to distinguish between multiple calls inside the same file.

zth commented 1 year ago

> Just to be sure I understand: bsb today supports generators on files, so basically this would allow generators on inline strings?

Exactly. What triggers generation is the source code itself, not only static build configuration in the consumer project. Plus, each generator is installable and controls what files it generates. The compiler also replaces the embedded source with a reference to the generated module.

The main idea is to make this feel as seamless as possible. You shouldn't have to think about generation etc, it should feel like a language feature for embedded code.

> Maybe it should output UserResolvers__sql__Query.res to distinguish between multiple calls inside the same file.

Yes, this is something we'll need to solve in a good way. I have some prior art in the EdgeDB experimentation I did.

rolandpeelen commented 1 year ago

Thanks for the super detailed RFC @zth. Very keen to see this be more integrated in the ecosystem 🎉 . I need to ponder this a bit more. Some initial thoughts:

Anywho - just some thoughts. Again. Exciting stuff 🙌

zth commented 1 year ago

Thank you for your reply @rolandpeelen!

A quick reply, mostly in response to points 1 and 2: The simplest and likely most performant way is to scan each file as text, with a simple regexp or similar for %gen.<generatorName>(<payload>). We'd need to account for multiple generators in multiple places in a single file, etc. So there's a bit of complexity to figure out, but I'm fairly positive we can get away with just scanning the text. That's generally performant, but it does of course introduce overhead.
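
A minimal TypeScript sketch of what that text scan could look like, finding every embed in a file in one pass. It is deliberately naive: it assumes payloads are a single backtick or double-quoted string with no escaped delimiters, and it will also match embeds that sit inside comments. The `Embed` shape and `scanEmbeds` name are illustrative only.

```typescript
// One embed found in a source file. Field names are illustrative.
interface Embed {
  generatorName: string;
  payload: string;
  offset: number; // index of "%gen." in the file contents
}

// Finds every %gen.<generatorName>(`<payload>`) or %gen.<generatorName>("<payload>").
function scanEmbeds(source: string): Embed[] {
  // Created per call so the global regex's lastIndex state is never shared.
  const pattern = /%gen\.([A-Za-z_][A-Za-z0-9_]*)\(\s*(?:`([^`]*)`|"([^"]*)")\s*\)/g;
  const embeds: Embed[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    embeds.push({
      generatorName: match[1],
      // Exactly one of the two payload groups participates in a match.
      payload: match[2] ?? match[3] ?? "",
      offset: match.index,
    });
  }
  return embeds;
}
```

Each result carries enough information to look up the registered generator by name and hand it the payload, and the offset could later be used to map generator errors back to the source file.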

As a benchmark, the Relay compiler reads all target files in the source project (.res in ReScript, .js in JavaScript, etc) as text and looks for GraphQL tags to extract, and the Relay compiler is very performant.

It's also worth separating dev (watch) mode from running a single build. In a single build this would obviously need to be sequential, but in dev mode it could all run fully in parallel with the regular compiler process, meaning that file contents wouldn't need to be scanned in the main compiler flow. That in turn means it would stay out of the way of the regular compilation process.

The effect would be generators causing recompiles whenever they write or change files, but that should be a pretty minor thing. Especially if you combine it with committing generated files.

What makes me confident we can make this work is I've already used this type of workflow for several years with the RescriptRelay compiler, and it works really well.

Happy to discuss more! Performance is really the key here.

zth commented 1 year ago

I guess it could be useful to also compare it to the current PPX approach, although they're different in what they're trying to solve.

No PPXes - no preprocessing - fast. For each PPX, there's blocking preprocessing regardless of whether the file has anything for the PPX or not. There's a deep copy of the AST to each PPX, a wait for the PPX to finish (they must run in sequence), and so on.

No generators - no preprocessing - fast. For each generator, there's the (potentially slight) performance penalty of scanning the text (not deep copy of the AST) for generators. All generators are found in one pass.

rolandpeelen commented 1 year ago

Some thoughts after a chat with @jfrolich

We can use the PPX to not only replace, but also extract, and append to a file. Then subsequently use that file for the generation in between parsing / compiling. Then we won't have to scan all the files manually, just files with some post-fix that we've marked somewhere.

The extraction process could also include some parse data, like the type that it's working on (for instance when trying to generate a decoder or something like that), or some arguments. Basically a very simple AST.

rolandpeelen commented 1 month ago

@zth -- Shall we move this content / condense it into a wiki?

zth commented 1 month ago

> @zth -- Shall we move this content / condense it into a wiki?

Sounds good!