jdonszelmann opened 9 months ago
Here's my take on the compiler pipeline. After considering potential features like metaprogramming, I've come up with a simple, fairly parallelizable structure for the compiler, relying on inter-item dependencies.
The core of the compiler is a work-stealing multi-threaded scheduler. It is assigned tasks, which may depend upon one another, and it is responsible for allocating them to threads and evaluating them as soon as possible. The results of various tasks may be cached, even across compilations.
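As a rough single-threaded sketch of the caching half of this idea (the real scheduler would be work-stealing and multi-threaded; the task names and the `u64` "result" are stand-ins):

```rust
use std::collections::HashMap;

// Hypothetical task id; real ids would name items, modules, crates, etc.
type TaskId = &'static str;

struct Scheduler {
    deps: HashMap<TaskId, Vec<TaskId>>,
    cache: HashMap<TaskId, u64>, // cached results could persist across compilations
}

impl Scheduler {
    // Evaluate a task after its dependencies, memoizing the result so
    // repeated requests (or later compilations) skip the work.
    fn eval(&mut self, id: TaskId) -> u64 {
        if let Some(&r) = self.cache.get(id) {
            return r; // cache hit
        }
        let deps = self.deps.get(id).cloned().unwrap_or_default();
        let dep_sum: u64 = deps.into_iter().map(|d| self.eval(d)).sum();
        let result = dep_sum + id.len() as u64; // stand-in for real work
        self.cache.insert(id, result);
        result
    }
}

fn main() {
    let mut s = Scheduler {
        deps: HashMap::from([("crate", vec!["module"]), ("module", vec![])]),
        cache: HashMap::new(),
    };
    println!("crate task result: {}", s.eval("crate"));
}
```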
The following tasks are defined:
- Crate compilation. The manifest (a `Cargo.toml` equivalent) is loaded.
- Module compilation.
- Function compilation.
- Constant compilation.
- Type compilation.
- Function IR resolution.
- Module type resolution.
- Function type resolution.
- Constant type resolution.
- Type resolution.
- Module API resolution.
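Sketched as a plain enum (the variant names just transcribe the list above; how much data each variant carries is still open):

```rust
// One variant per task kind from the list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Task {
    CrateCompilation,
    ModuleCompilation,
    FunctionCompilation,
    ConstantCompilation,
    TypeCompilation,
    FunctionIrResolution,
    ModuleTypeResolution,
    FunctionTypeResolution,
    ConstantTypeResolution,
    TypeResolution,
    ModuleApiResolution,
}

fn main() {
    // The scheduler would key its graph and cache on values like these.
    println!("{:?}", Task::ModuleApiResolution);
}
```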
The scheduler effectively maintains a directed acyclic graph of tasks. A cycle between tasks indicates a circular dependency that would prevent compilation; this is a recoverable error as other tasks can still be evaluated.
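Detecting such a cycle is a standard DFS over the dependency graph; a minimal sketch (string task ids are illustrative) that reports a cycle without aborting anything else:

```rust
use std::collections::HashMap;

// Detect a dependency cycle reachable from `start` via DFS.
// A `true` result would be surfaced as a recoverable circular-dependency
// error; unrelated tasks can still be evaluated.
fn has_cycle<'a>(deps: &HashMap<&'a str, Vec<&'a str>>, start: &'a str) -> bool {
    fn visit<'a>(
        n: &'a str,
        deps: &HashMap<&'a str, Vec<&'a str>>,
        visiting: &mut Vec<&'a str>, // grey: on the current DFS path
        done: &mut Vec<&'a str>,     // black: fully explored
    ) -> bool {
        if done.contains(&n) {
            return false;
        }
        if visiting.contains(&n) {
            return true; // back edge: cycle found
        }
        visiting.push(n);
        for &d in deps.get(n).map(|v| v.as_slice()).unwrap_or(&[]) {
            if visit(d, deps, visiting, done) {
                return true;
            }
        }
        visiting.retain(|x| *x != n);
        done.push(n);
        false
    }
    visit(start, deps, &mut Vec::new(), &mut Vec::new())
}

fn main() {
    let cyclic = HashMap::from([("a", vec!["b"]), ("b", vec!["a"])]);
    println!("cycle: {}", has_cycle(&cyclic, "a"));
}
```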
Not all the dependencies of a task can be resolved immediately; name resolution has to be accomplished first. But all that is required is delegation to another module: resolving `a::b::c`, for instance, requires resolving the API of `a` using local information, then launching a task for resolving `b::c` within it. Doing this for every reference in the item will yield the complete list of dependencies.
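The delegation step can be sketched like this, with a toy module API as a map (the `Entry` type and the numeric "values" are hypothetical; in the compiler the recursive call would instead spawn a new resolution task):

```rust
use std::collections::HashMap;

// A module's API maps names either to an item or to a child module.
enum Entry<'a> {
    Value(u32),
    Module(&'a HashMap<&'a str, Entry<'a>>),
}

// Resolve the first path segment locally, then delegate the rest of the
// path to the module it names, mirroring `a::b::c` -> API of `a` -> `b::c`.
fn resolve(api: &HashMap<&str, Entry>, path: &[&str]) -> Option<u32> {
    let (first, rest) = path.split_first()?;
    match (api.get(first)?, rest) {
        (Entry::Value(v), []) => Some(*v), // final segment reached
        (Entry::Module(m), rest) if !rest.is_empty() => resolve(m, rest), // delegate
        _ => None, // e.g. a value with trailing path segments
    }
}

fn main() {
    let b_api = HashMap::from([("c", Entry::Value(7))]);
    let a_api = HashMap::from([("b", Entry::Module(&b_api))]);
    println!("{:?}", resolve(&a_api, &["b", "c"]));
}
```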
To process tasks in parallel, the scheduler simply has to select tasks that are "far apart" from one another (i.e. only their furthest descendants are shared); this minimizes inter-thread communication. It may be ideal for each compiler thread to have its own graph of tasks, which is occasionally synchronized with, or stolen from by, the others.
Performance analysis will rely heavily on inspecting the scheduler; it must offer a way to extract things like flame graphs.
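The minimum the scheduler would need to record is a span per task; folding those spans by task path is exactly what flame-graph tooling consumes. A sketch (the `Span` shape and task names are made up):

```rust
use std::time::Instant;

// One timing record per evaluated task; a real scheduler would also
// record thread id and parent task to reconstruct the flame graph.
struct Span {
    task: &'static str,
    nanos: u128,
}

// Run `f`, appending its wall-clock duration to `log`.
fn timed<T>(task: &'static str, log: &mut Vec<Span>, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    log.push(Span { task, nanos: start.elapsed().as_nanos() });
    out
}

fn main() {
    let mut log = Vec::new();
    let sum = timed("function_compilation", &mut log, || (1u32..=100).sum::<u32>());
    println!("{} = {}, took {} ns", log[0].task, sum, log[0].nanos);
}
```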
After discussion with @tertsdiepraam, we've decided to move the majority of the name resolution to module API resolution, rather than item resolution. We can then use stack graphs to merge and cache name resolution between files. This requires a bit of care, because things like associated types and struct fields cannot necessarily be resolved immediately (due to metaprogramming); some degree of lazy resolution is required. But this can be tracked in the stack graphs structure without additional work. I think we need to investigate exactly how much data the stack graphs structure can hold for us.
@tertsdiepraam and I discussed this for a while, looking at the pipelines of rustc and roc, among others, and think this is the best way forward. We can parse in two ways: with a very complicated parser that keeps comments around, into a CST, or quickly into an AST. From the CST we can go back to the source, which is nice for autoformatting. The CST can trivially be turned into an AST by removing the comments etc.
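A minimal sketch of the CST-vs-AST distinction, assuming the CST stores trivia (comments and whitespace) on each token (the `CstToken` shape is illustrative, not a design):

```rust
// A CST token keeps the trivia preceding it, so the exact source text
// can be reproduced; the AST view simply drops that trivia.
struct CstToken {
    leading_trivia: String, // comments/whitespace before this token
    text: String,
}

// Lossless: concatenating trivia + text recovers the original source.
fn to_source(cst: &[CstToken]) -> String {
    cst.iter().map(|t| format!("{}{}", t.leading_trivia, t.text)).collect()
}

// Lossy: the AST-level token stream, with comments and layout removed.
fn to_ast_tokens(cst: &[CstToken]) -> Vec<&str> {
    cst.iter().map(|t| t.text.as_str()).collect()
}

fn main() {
    let cst = vec![
        CstToken { leading_trivia: "// doc\n".into(), text: "fn".into() },
        CstToken { leading_trivia: " ".into(), text: "main".into() },
    ];
    println!("{:?}", to_source(&cst));
}
```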
Quickly, we go from an AST to a DST, the desugared syntax tree. At this stage, constructs like `if let` (if we have it) can become `match`, `for` loops become `while` loops, etc., making typechecking easier; that's what rustc also does. Then comes typechecking into a typed syntax tree, which is flattened into an intermediate representation (IR). On that we can run passes to optimise and do things like dead code elimination. The IR is just basic blocks and jumps, to make the eventual translation to LLVM easier. At that point we can also write an interpreter for the IR, making testing easy.
The fuzzing story is as follows. We can pretty-print the AST. This does remove comments etc., but we can still do a roundtrip check: replace any sequence of whitespace in both the input string and the `input --> parsed --> prettyprinted` string with a single space and check that the two are equal.
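The whitespace-insensitive comparison at the heart of that oracle is small; a sketch (assuming, for now, a comment-free input so only whitespace differs):

```rust
// Collapse every run of whitespace to a single space, so layout
// differences between input and pretty-printed output are ignored.
fn normalize(s: &str) -> String {
    s.split_whitespace().collect::<Vec<_>>().join(" ")
}

// The fuzzing oracle: input and `input --> parsed --> prettyprinted`
// must agree modulo whitespace.
fn roundtrip_ok(input: &str, prettyprinted: &str) -> bool {
    normalize(input) == normalize(prettyprinted)
}

fn main() {
    println!("{}", roundtrip_ok("fn  main()\n{}", "fn main() {}"));
}
```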