implement first iteration of the lowest-level language

zerbina commented 3 months ago

Summary

add a specification, pass, test runner, and tests for the lowest-level language that immediately comes before VM bytecode
add a generic PackedTree implementation for use as the passes' IR

Details

Language Design

the language provides simple abstractions over some VM concepts
all global entity descriptions (types, globals, procs) are stored in separate sections, and they're referenced by index (for fast lookup)
local continuations abstract over raw jumps
the type system makes a distinction between signed and unsigned integers
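
The index-based referencing of global entities can be pictured roughly as follows. This is a minimal sketch in Python, not the actual implementation: the `Module` class, its field names, the `add` helper, and the tuple-shaped type descriptions are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    """Hypothetical module layout: one list ("section") per entity kind."""
    types: list = field(default_factory=list)
    global_vars: list = field(default_factory=list)
    procs: list = field(default_factory=list)

def add(section: list, desc) -> int:
    """Append a description and return its index, which acts as the reference."""
    section.append(desc)
    return len(section) - 1

m = Module()
int_ty = add(m.types, ("int", 8))    # signed 8-byte integer (made-up encoding)
uint_ty = add(m.types, ("uint", 8))  # unsigned counterpart, a distinct type
counter = add(m.global_vars, {"type": uint_ty, "value": 0})
main = add(m.procs, {"returns": int_ty, "body": []})

# Resolving a reference is a plain list lookup.
print(m.types[m.global_vars[counter]["type"]])  # ('uint', 8)
```

Because entities only ever refer to each other by position, resolving a reference never requires a name lookup.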

In its current form, the language is likely too high-level, but it's a good start to base further language development on.

Pass Design

The pass currently operates on whole modules at a time (though there's no multi-module support at the moment). Beyond some minor assertions, syntax and proper typing are not checked by the pass -- the idea is that during normal compilation, passes trust that their input is sane and sound, with validation implemented separately.
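
The split between a trusting pass and a separate validator could look like the sketch below. It is illustrative only: the nested-tuple IR, the node kinds `IntVal` and `Add`, and the `push`/`add` instructions are assumptions and do not come from the actual language or VM.

```python
def validate(expr):
    """Separate validation step: returns a list of error messages."""
    if not isinstance(expr, tuple) or not expr:
        return [f"not a node: {expr!r}"]
    kind = expr[0]
    errors = []
    if kind == "IntVal":
        if len(expr) != 2 or not isinstance(expr[1], int):
            errors.append(f"malformed IntVal: {expr!r}")
    elif kind == "Add":
        if len(expr) != 3:
            errors.append(f"Add expects two operands: {expr!r}")
        else:
            for operand in expr[1:]:
                errors += validate(operand)
    else:
        errors.append(f"unknown node kind: {kind!r}")
    return errors

def lower(expr, out=None):
    """The pass proper: trusts its input, apart from one sanity assertion."""
    out = [] if out is None else out
    kind = expr[0]
    if kind == "IntVal":
        out.append(("push", expr[1]))
    else:
        assert kind == "Add", kind
        lower(expr[1], out)
        lower(expr[2], out)
        out.append(("add",))
    return out

tree = ("Add", ("IntVal", 1), ("IntVal", 2))
print(validate(tree))  # []
print(lower(tree))     # [('push', 1), ('push', 2), ('add',)]
```

Keeping `validate` out of `lower` means the checking cost is only paid where it is wanted, for example in tests or debug builds.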

Storage

A simple generic PackedTree type (based on NimSkull's MirTree) is used as the IR for the pass. The node kind enum to use is provided by the spec module, alongside S-expression serialization/deserialization.

Nodes only store the kind and a type-erased value. Extra information, such as types, needs to be represented by separate nodes. This keeps nodes small (currently 8 bytes) and serialization easy.
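
To make the storage idea concrete, here is a rough Python sketch of a packed tree with 8-byte nodes and S-expression output. The node layout (4-byte kind plus 4-byte value), the `NodeKind` members, and the convention that a subtree node's value holds its child count are assumptions for illustration. They are not taken from the actual PackedTree or MirTree.

```python
import struct
from enum import IntEnum

class NodeKind(IntEnum):  # hypothetical kinds; the real enum comes from the spec module
    IMM = 0   # leaf: value is an immediate integer
    TYPE = 1  # leaf: value is an index into the type section
    ADD = 2   # subtree: value is the number of child nodes

NODE = struct.Struct("<II")  # 4-byte kind + 4-byte type-erased value = 8 bytes
LEAVES = {NodeKind.IMM, NodeKind.TYPE}

class PackedTree:
    """Nodes are stored back to back, in pre-order, in one flat buffer."""
    def __init__(self):
        self.buf = bytearray()

    def add(self, kind, value=0):
        self.buf += NODE.pack(kind, value)

    def __len__(self):
        return len(self.buf) // NODE.size

    def __getitem__(self, i):
        kind, value = NODE.unpack_from(self.buf, i * NODE.size)
        return NodeKind(kind), value

def to_sexp(tree, pos=0):
    """Render the subtree starting at `pos`; returns (text, next position)."""
    kind, value = tree[pos]
    if kind in LEAVES:
        return f"({kind.name} {value})", pos + 1
    parts = [kind.name]
    pos += 1
    for _ in range(value):
        text, pos = to_sexp(tree, pos)
        parts.append(text)
    return "(" + " ".join(parts) + ")", pos

t = PackedTree()
t.add(NodeKind.ADD, 3)   # three child nodes follow
t.add(NodeKind.TYPE, 0)  # the type is a node of its own, not an extra field
t.add(NodeKind.IMM, 1)
t.add(NodeKind.IMM, 2)
print(NODE.size)      # 8
print(to_sexp(t)[0])  # (ADD (TYPE 0) (IMM 1) (IMM 2))
```

Since every node has the same fixed size and carries no pointers, the buffer can be walked linearly, which is what keeps serialization simple.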

To-Do

[x] add the remaining tests

zerbina commented 3 months ago

For comparison, a MirNode currently has a size of 16 bytes. 8 bytes is a good size, because it means that a node fits into a single register (on a 64-bit architecture).

Next Step

The next step will be implementing a tool that comprehends the grammar from the Markdown files (I'm already using an early implementation thereof locally). My plan is that it is responsible for:

making sure the grammar descriptions are syntactically valid
making sure the grammar descriptions are semantically valid (e.g., no ambiguities, guaranteed termination, all used names exist)
generating the spec module based on the (merged) grammars
generating the syntax validation code for an intermediate language

That'll provide a solid base for quick and easy iteration.
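
Two of the validity checks listed above (all used names exist, every rule terminates) can be sketched as follows. The grammar representation is an assumption: a dictionary mapping rule names to alternatives, each a list of symbols, with terminals given as a separate set. The actual tool reads the grammar from Markdown files, which this sketch does not attempt.

```python
def undefined_names(rules, terminals):
    """Symbols that are used but are neither a rule name nor a terminal."""
    used = {sym for alts in rules.values() for alt in alts for sym in alt}
    return sorted(used - set(rules) - terminals)

def non_terminating(rules, terminals):
    """Rules that can never derive a finite string (fixed-point computation)."""
    productive = set()
    changed = True
    while changed:
        changed = False
        for name, alts in rules.items():
            if name in productive:
                continue
            # A rule is productive if some alternative consists solely of
            # terminals and rules already known to be productive.
            if any(all(s in terminals or s in productive for s in alt)
                   for alt in alts):
                productive.add(name)
                changed = True
    return sorted(set(rules) - productive)

terminals = {"(", ")", "IntVal", "Add", "Return"}
rules = {
    "expr": [["IntVal"], ["(", "Add", "expr", "expr", ")"]],
    "stmt": [["(", "Return", "exp", ")"]],  # typo: "exp" is not defined
    "chain": [["(", "Add", "chain", ")"]],  # recursion without a base case
}
print(undefined_names(rules, terminals))  # ['exp']
print(non_terminating(rules, terminals))  # ['chain', 'stmt']
```

Here `stmt` is reported as non-terminating only because of the undefined name, so a real tool would likely report undefined names first. Ambiguity detection is considerably harder and is left out.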

Thoughts on Testing

There's the general question of how testing passes should work, as in:

should the pass output be compared against an expected version?
should the pass output be compiled down to bytecode and then run?

Right now, the second option is chosen, but I think it should be a mixture of both. Whether a language should be compiled down into bytecode and then run should be configurable via a command line option, so that during local development, only a pass' output is checked, whereas in CI, it's also fully compiled and run.

Fully compiling the output of a pass during testing could work by serializing the output to disk and then treating it as a test file for the runner of the lower-level language, repeating the process until reaching the VM tester. This has the benefit of not having to implement the whole pipeline in every runner, but the serialization + file IO + deserialization overhead might be too much.
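
A rough sketch of the chained approach, with the "full" mode as a switch, is shown below. Everything in it is hypothetical: `lower_to_bytecode` and `run_vm` are toy stand-ins for a pass and the VM tester, the `stage.test` file name is invented, and the `full` parameter stands in for the proposed command line option.

```python
import tempfile
from pathlib import Path

def lower_to_bytecode(text):
    """Toy stand-in for a pass: 'add 1 2' becomes a small stack program."""
    op, a, b = text.split()
    return f"push {a}\npush {b}\n{op}"

def run_vm(text):
    """Toy stand-in for the VM tester: executes the stack program."""
    stack = []
    for line in text.splitlines():
        parts = line.split()
        if parts[0] == "push":
            stack.append(int(parts[1]))
        elif parts[0] == "add":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
    return str(stack[-1])

def run_test(runners, source, expected, full=False):
    """`runners` is ordered from the tested language down to the VM tester."""
    output = runners[0](source)
    if output != expected:
        return ("mismatch", output)
    if not full:
        return ("ok", output)  # local development: only the pass output is checked
    with tempfile.TemporaryDirectory() as tmp:
        for runner in runners[1:]:
            path = Path(tmp) / "stage.test"
            path.write_text(output)            # serialize the output to disk
            output = runner(path.read_text())  # next runner treats it as a test file
    return ("ok", output)

chain = [lower_to_bytecode, run_vm]
expected = "push 1\npush 2\nadd"
print(run_test(chain, "add 1 2", expected))
print(run_test(chain, "add 1 2", expected, full=True))  # ('ok', '3')
```

Each runner only needs to know about its own language. The cost is one write and one read per stage, which is the overhead in question.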

To keep the cost of changing languages at a reasonable level, I think tests should stick to only covering each language feature in isolation, at least for now.

zerbina commented 3 months ago

@saem: Regarding the naming, given that there can (and will) be multiple target languages, I believe having L0 refer to the source language would be better (though maybe a bit confusing, since lowering then corresponds to the number going up).

However, since we're developing the languages bottom-to-top, having L0 mean "target language" is easier for now, I'd say; otherwise a renaming is necessary whenever adding a new higher-level IL. For the in-development top-level candidate, we could use Lx (or similar) as the name.

saem commented 3 months ago

I think since we're going to be stacking languages on top, and the earlier (bottom) languages are going to be around longer, it's fair to assume they'll be more stable over time. I think starting the numbering where the bottom is L0 is likely to be the easiest (fewest renames and most stable references over time).

Although, if at some point we need a secondary numbering scheme, that does go from source -> target, then we could maybe do S0 for source-zero, and then increment the number as it gets lower. This would be the opposite of the Lx scheme, but it would allow us to reason about depth from source, if and when required.
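
The relation between the two numbering schemes is simple arithmetic, sketched here under the assumption of a single linear stack of languages (the function name `s_from_l` is made up for illustration):

```python
def s_from_l(l_index: int, language_count: int) -> int:
    """Distance from the source language, given the bottom-up L index."""
    return language_count - 1 - l_index

# With three languages, L2 is the source and L0 the lowest language.
for l in range(3):
    print(f"L{l} = S{s_from_l(l, 3)}")

# Stacking a fourth language on top leaves every L name untouched,
# while every S number shifts by one.
print(s_from_l(0, 3), s_from_l(0, 4))  # 2 3
```

This shows why the bottom-up scheme gives the more stable names while languages are still being added at the top.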

zerbina commented 3 months ago

@saem: I've addressed the review comments, extended the test coverage, made some small language changes, and fixed some bugs. There are still tests missing (arithmetic and comparison operations have no coverage yet), but I think it's okay to add them in a follow-up, so that further work depending on lang0 can commence already.

All tests now also make sure the produced bytecode matches the expected one. Beyond making the workings of pass0 easier to understand, this also ensures that the produced bytecode doesn't silently change (when making changes to pass0).
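
An expected-output check of this kind can be sketched as below. This is not the actual test runner: the `check_expected` helper and the textual `push`/`ret` bytecode are assumptions for illustration.

```python
import difflib

def check_expected(actual: str, expected: str) -> str:
    """Return an empty string on a match, otherwise a unified diff."""
    if actual == expected:
        return ""
    diff = difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        "expected", "actual", lineterm="")
    return "\n".join(diff)

expected = "push 1\nret"
print(check_expected("push 1\nret", expected) == "")  # True
print(check_expected("push 2\nret", expected))        # prints a diff
```

Reporting a diff rather than a bare failure makes it obvious which instruction changed when pass0 is modified.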

nim-works / phy