Embed languages spec

zth commented 5 months ago

This is a WIP discussion for implementing generator support in the style of https://github.com/zth/rescript-embed-lang natively in rewatch and the compiler itself.

Relevant compiler PR: https://github.com/rescript-lang/rescript-compiler/pull/6823. That PR does the following in the compiler:

Make bsc output an .embeds file together with the .ast file, if the file processed has embeds. It'll also print 1 to stdout if it found embeds. More about .embeds and its format later.
Run a PPX that replaces the embed tags with links to the generated module for that content. More on that later too.

Generators and embeds are used a bit interchangeably in the text below. A generator is the program that generates code from some source input. An embed is that source input, embedded in the ReScript source itself.

Configuring generators in the consuming project

We need a way to configure what generators to use, so the build system knows what to run for each embed. This should be done in rescript.json for consistency.

Suggestion: Like PPXes, point to a path

In this alternative, you point to a path. That path should be a configuration file that the build system can read once to figure out which generator this is, which tags it handles, and how to run it. Example:

rescript.json in the consuming project.

{
  "embeds": {
    "generators": ["pgtyped-rescript/embed"]
  }
}

Example embed.json in the pgtyped-rescript package:

{
  "tags": ["sql", "sql.one", "sql.many", "sql.expectOne"],
  "command": "bun generator.js"
}

We'll go more into how to build generators later, but the build system would expect to be able to send some configuration as arg to that command and have it generate from that config.

Note that the command could be any type of binary. It's bun here, but it could be node, or a Rust/OCaml/whatever binary. Doesn't matter. It's up to the user to have what's needed installed on their system to be able to run the generation.

This leaves us room to add more configuration if wanted, as well as give good DX with minimal manual work.

So, to recap what the build system would do:

Read embeds in rescript.json
Resolve each generator entry the same way it resolves the path to a PPX today
Append .json if it's not already in the file path
Read the configuration in the embed json file

It now knows what generator this is, how to run it, and what tags to run it for.
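The recap above can be sketched roughly as follows. This is only an illustration under assumptions, not how rewatch does it: the function name `load_generator_configs` is hypothetical, and resolving entries under `node_modules` is a simplification of however PPX paths are actually resolved. It also only handles plain string entries.

```python
import json
from pathlib import Path


def load_generator_configs(project_root: Path) -> list[dict]:
    """Read rescript.json and load the config file of every listed generator."""
    rescript_json = json.loads((project_root / "rescript.json").read_text())
    entries = rescript_json.get("embeds", {}).get("generators", [])

    configs = []
    for entry in entries:
        # ASSUMPTION: entries resolve under node_modules. Real PPX-style
        # resolution may differ (monorepos, hoisting, and so on).
        path = project_root / "node_modules" / entry
        # Append .json if it's not already in the file path.
        if path.suffix != ".json":
            path = path.with_name(path.name + ".json")
        config = json.loads(path.read_text())
        configs.append(
            {
                "tags": config["tags"],
                "command": config["command"],
                "config_path": path,
            }
        )
    return configs
```

After this runs, each entry carries the tags to match and the command to run, which is all the later steps need.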

Configuring where to emit the generated content

I think we should force the user to configure a central place where to emit generated files, like ./src/__generated__. This will simplify a lot, and scale well up to the point where there are so many files in the same folder that you start to get perf issues, at which point we can solve that in a number of ways.

A proposed config could look like this:

{
  "embeds": {
    "generators": ["pgtyped-rescript/embed"],
    "artifactFolder": "./src/__generated__"
  }
}

We need to check that that folder is inside of a configured ReScript source folder etc, but that should be fine.
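A minimal sketch of that check, assuming the source folders have already been read out of rescript.json as plain relative paths and that every source folder includes its subdirectories (the real sources config is richer than that). The helper name `is_inside_source_folder` is made up for illustration.

```python
from pathlib import Path


def is_inside_source_folder(
    artifact_folder: str, source_folders: list[str], project_root: str = "."
) -> bool:
    """Return True if artifact_folder is a source folder or nested inside one."""
    root = Path(project_root).resolve()
    artifact = (root / artifact_folder).resolve()
    for source in source_folders:
        source_path = (root / source).resolve()
        # Compare whole path components, so "src-other" is not inside "src".
        if artifact == source_path or source_path in artifact.parents:
            return True
    return False
```

For example, `is_inside_source_folder("./src/__generated__", ["src"])` is true, while `"./generated"` with the same source folders is rejected.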

Questions and things to figure out

What if things clash, as in several embeds operate on the same tag names?
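Whatever the answer ends up being, detecting the clash up front is cheap. A possible sketch, where `find_tag_clashes` and the second generator name are hypothetical:

```python
def find_tag_clashes(generators: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each tag claimed by more than one generator to the generators claiming it."""
    owners: dict[str, list[str]] = {}
    for name, tags in generators.items():
        for tag in tags:
            owners.setdefault(tag, []).append(name)
    return {tag: names for tag, names in owners.items() if len(names) > 1}


clashes = find_tag_clashes(
    {
        "pgtyped-rescript/embed": ["sql", "sql.one"],
        "other-sql/embed": ["sql.one", "sql.raw"],  # hypothetical second generator
    }
)
```

The build system could then either fail with a clear error listing the clashing tags, or apply some precedence rule. Which one is the open question.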

Overview of potential setup in build system

Here's an overview of how the build system could handle running generators.

This is how it looks at a high level:

Finding embeds

You can embed other languages or any string content into tags inside of ReScript. Example:

let findOne = %sql.one(`select * from users where id = :id!`)

let findMany = %sql.many(`select * from users`)

If there's a generator configured for sql.one, bsc will spit out a .embeds file next to .ast when it's asked to produce the .ast file. It looks roughly like this (format very much subject to change, we'll make it whatever makes most sense and is easiest/most efficient to read from the build system):

<<- item begin ->>
sql.one
select * from users where id = :id!
1:23-1:60

<<- item begin ->>
sql.many
select * from users
3:88-3:109

If bsc found embeds and printed an .embeds file, it'll output 1 to stdout.
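To give a feel for how little the build system needs to do to read that file, here's a rough parser for the format exactly as sketched above. Since the format is subject to change, treat this as illustration only. It assumes the first line of each item is the tag, the last line is the location, and everything in between is the content. The `Embed` type and `parse_embeds` name are made up.

```python
import re
from dataclasses import dataclass

ITEM_MARKER = "<<- item begin ->>"
LOC_PATTERN = re.compile(r"^(\d+):(\d+)-(\d+):(\d+)$")


@dataclass
class Embed:
    tag: str
    content: str
    start: tuple[int, int]  # (line, col)
    end: tuple[int, int]  # (line, col)


def parse_embeds(text: str) -> list[Embed]:
    """Parse the contents of an .embeds file into a list of embeds."""
    embeds = []
    # Everything before the first marker is ignored.
    for chunk in text.split(ITEM_MARKER)[1:]:
        lines = chunk.strip("\n").split("\n")
        tag = lines[0]
        loc = LOC_PATTERN.match(lines[-1])
        if loc is None:
            raise ValueError(f"malformed location line: {lines[-1]!r}")
        l1, c1, l2, c2 = map(int, loc.groups())
        # Content may span several lines, so join everything in between.
        content = "\n".join(lines[1:-1])
        embeds.append(Embed(tag, content, (l1, c1), (l2, c2)))
    return embeds
```

One weakness worth noting: embedded content that itself contains the item marker would break this, which is one reason the final format may want to look different.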

Running generators

Now, if we found embeds we'll want to run the appropriate generator for that file, if the embedded content has changed.

Generators are expected to be idempotent. We're building a pretty aggressive cache mechanism into this. This is important and will make the DX much better, including not having to run any generators in CI etc unless you really want to. Control that by simply committing or not committing the generated files.

So, we load the .embeds file, go through each of the embeds, and check whether they've already been generated. If they've been generated, we check if the generated content was generated from the same input, via a comment with a hash of the source content at the top of the generated file. If the generated file wasn't generated from the same source, or if it hasn't been generated yet, we run the appropriate generator and write the generated file.

Here are a number of hands-on examples:

First time a generation runs

// SomeFile.res
let findOne = %sql.one(`select * from users where id = :id!`)

let findMany = %sql.many(`select * from users`)

1. bsc extracts 2 embeds from SomeFile.res and prints 1 to stdout to signify that.
2. The build system reads the SomeFile.embeds file generated by bsc, and figures out that 2 files are to be generated: src/__generated__/SomeFile__sql_one__M1.res and src/__generated__/SomeFile__sql_many__M1.res. Notice the file format <sourceModuleName>__<tagName.replace(".", "_")>__M<indexOfTagInFile>. If multiple embeds of the same tag exist in the same file (multiple %sql.one for example), the M part is incremented, like src/__generated__/SomeFile__sql_one__M2.res for the next embed.
3. The build system checks if the generated files exist already. They don't, so...
4. ...the build system triggers the appropriate generator for each embedded content. Maybe by passing stringified JSON as the sole argument to the generator: /command/to/run/generator '{"tag":"sql.one","content":"select * from users where id = :id!","loc":{"start":{"line":1,"col":23},"end":{"line":1,"col":60}}}'. This can all be done in parallel, since the generators should be idempotent (at least to start with).
5. The generator runs, and returns either the generated content, or errors. More about errors below.
6. The build system writes the generated content, including a source hash for the input it was generated from at the top of each generated file. Here's how a file could look: src/__generated__/SomeFile__sql_one__M1.res
```
// @sourceHash 83mksdf8782m4884i34
type response = {...}
// More generated content in here
```
7. New files were added, so we need to add these new files to the build system build state, and trigger ast generation of them. Notice that embeds in files generated by other embeds are not allowed. That way we avoid potentially slow and recursive embeds.
8. The build system cleans up any lingering embeds that are now irrelevant, if they exist. Maybe by just querying the file system for src/__generated__/SomeFile__sql_one__*.res and src/__generated__/SomeFile__sql_many__*.res and then removing any of them that aren't in use any more. This also needs to be updated in the build state.
9. Finally, when things have settled and the build system is ready, we move on to the compilation phase, as usual.
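The file naming scheme and the JSON argument from the steps above could be sketched like this. The function names are hypothetical, and the payload shape is just the "maybe" format suggested above, not something that's been settled.

```python
import json
from collections import defaultdict


def generated_file_names(
    source_module: str, tags: list[str], artifact_folder: str = "src/__generated__"
) -> list[str]:
    """Name one generated file per embed, given the embed tags in source order."""
    seen = defaultdict(int)
    names = []
    for tag in tags:
        seen[tag] += 1  # the M index counts embeds of the same tag, starting at 1
        names.append(
            f"{artifact_folder}/{source_module}__{tag.replace('.', '_')}__M{seen[tag]}.res"
        )
    return names


def generator_payload(
    tag: str, content: str, start: tuple[int, int], end: tuple[int, int]
) -> str:
    """Build the stringified JSON passed as the sole argument to the generator."""
    return json.dumps(
        {
            "tag": tag,
            "content": content,
            "loc": {
                "start": {"line": start[0], "col": start[1]},
                "end": {"line": end[0], "col": end[1]},
            },
        },
        separators=(",", ":"),  # compact output, no spaces
    )
```

With the SomeFile.res example, `generated_file_names("SomeFile", ["sql.one", "sql.many"])` gives the two paths from the second step, and a second `sql.one` in the same file would get `__M2`. The payload string is what would be appended to the generator's command when spawning it.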

When generated content hasn't changed

The same setup as the first example, up until point 3, where instead:

3. Generated files exist for both embeds: src/__generated__/SomeFile__sql_one__M1.res and src/__generated__/SomeFile__sql_many__M1.res
4. The build system reads the first line of each of those files, and extracts the @sourceHash
5. It then compares the hash from each file with a hash of the corresponding content extracted from the .embeds file.
6. All hashes match, so no generation needs to run, and the build state can be considered valid. Continue to regular compilation.
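The hash check itself is small. A sketch, assuming SHA-256 (no hash function has been picked here) and assuming the hash comment is always the very first line of the generated file. All names are illustrative.

```python
import hashlib
from typing import Optional

HASH_PREFIX = "// @sourceHash "


def source_hash(content: str) -> str:
    # ASSUMPTION: SHA-256 hex digest. Any stable hash would do.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def with_source_hash(embed_content: str, generated_code: str) -> str:
    """Prefix generated code with the hash of the embed it was generated from."""
    return f"{HASH_PREFIX}{source_hash(embed_content)}\n{generated_code}"


def needs_regeneration(embed_content: str, generated_file_text: Optional[str]) -> bool:
    """generated_file_text is None when the generated file doesn't exist yet."""
    if generated_file_text is None:
        return True
    first_line = generated_file_text.split("\n", 1)[0]
    if not first_line.startswith(HASH_PREFIX):
        # No hash comment: treat the file as stale.
        return True
    return first_line[len(HASH_PREFIX):].strip() != source_hash(embed_content)
```

Only the first line of each generated file needs to be read, so the check stays cheap even with many generated files.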

When generated content has changed

The same setup as above, but from point 5:

The hashes do not match. Run the generation again, as noted by point 4 in the first example.

Cleaning up

We'll need to continuously ensure that we clean up:

.embeds files when there aren't any embeds anymore (as noticed by bsc not writing 1 to stdout)
Generated files when their parent source tag doesn't exist anymore
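For the generated files, something along these lines could work, following the "query the file system per tag" idea. The function name is hypothetical, and it assumes the build system knows all configured tags and which generated file names are still in use for the source module.

```python
from pathlib import Path


def remove_stale_generated_files(
    artifact_folder: Path,
    source_module: str,
    configured_tags: list[str],
    in_use: set[str],
) -> list[str]:
    """Delete generated files of source_module that no embed refers to anymore."""
    removed = []
    for tag in configured_tags:
        pattern = f"{source_module}__{tag.replace('.', '_')}__M*.res"
        for path in sorted(artifact_folder.glob(pattern)):
            if path.name not in in_use:
                path.unlink()
                removed.append(path.name)
    return removed
```

The returned names are what would also need to be dropped from the build state.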

When errors in generation happen

We can flesh this out more, but ideally, when errors in generation happen, we can propagate those to the build system and have the build system both fail and write them to .compiler.log so that they end up in the editor tooling.

The one thing to take care of here is to translate the error locations so that the generator can return errors relative to the content it received, whereas the error itself is presented by the build system and in the editor tooling offset to the correct location in the source file.
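The translation could be as simple as the sketch below. It assumes 1-based lines and 0-based columns, and that the build system knows the position of the first character of the embedded content in the source file. None of that is decided yet, and `to_source_location` is a made-up name.

```python
def to_source_location(
    content_start: tuple[int, int], relative: tuple[int, int]
) -> tuple[int, int]:
    """Translate a (line, col) relative to the embed content into source file coordinates."""
    start_line, start_col = content_start
    rel_line, rel_col = relative
    line = start_line + rel_line - 1
    # Only the first line of the content is shifted horizontally. Later lines
    # of a multi-line embed start at column 0 of the source file.
    col = start_col + rel_col if rel_line == 1 else rel_col
    return (line, col)
```

So an error at line 1, column 14 of content that starts at 1:24 in the source would be reported at 1:38, while an error on the second line of a multi-line embed keeps its column and only gets its line shifted.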

Regenerating content?

The idea is that you can simply remove the generated file, at which point it'll be regenerated the next time the build system processes the file with the source content.

Questions and thoughts

Should generators be idempotent? This makes things a lot easier, and faster, but what about the scenario where for example a GraphQL schema changes, and we want to regenerate because of that? We probably need to figure out a few more strategies.

zth commented 5 months ago

One idea for the case where there are additional inputs that should control whether something is regenerated or not (like with GraphQL where ideally both the actual GraphQL text input, and the source schema should control whether things are regenerated) - let people define additional input(s) that the build system can take into account when writing the hash:

{
  "embeds": {
    "generators": ["pgtyped-rescript/embed", {"embed": "rescript-graphql-generator/embed", "additionalInputs": "./schema.graphql"}],
    "artifactFolder": "./src/__generated__"
  }
}

The build system can then track and hash that file as well, and use the hash of that file in addition to the source hash when comparing whether things need to be regenerated or not.
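Roughly, the combined hash could look like this. It's a sketch assuming SHA-256 and assuming additionalInputs has been normalized to a list of file paths. `combined_hash` is a hypothetical name.

```python
import hashlib
from pathlib import Path


def combined_hash(embed_content: str, additional_inputs: list[Path]) -> str:
    """Hash the embed content together with the contents of any additional input files."""
    digest = hashlib.sha256()
    digest.update(embed_content.encode("utf-8"))
    # Sort so the result doesn't depend on the order the inputs are listed in.
    for path in sorted(additional_inputs):
        digest.update(b"\0")  # separator so adjacent contents can't run together
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

A change to either the embedded GraphQL text or to schema.graphql then produces a different hash, which triggers regeneration the same way a changed embed does.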

rolandpeelen commented 1 month ago

@zth -- I've enabled wikis for the project so we can move these sorts of 'permanent' issues (that we want to keep around for documentation) to there. Would you like me to move it over? I think you can do that as well, as you're an author 👌

rescript-lang / rewatch

Embed languages spec #127
