Proposal: An LLVM Frontend for a Small Programming Language

ayakayorihiro commented 2 years ago

Who

@5hubh4m, @ayakayorihiro, and @anshumanmohan

What

We will implement an LLVM frontend for a small but useful programming language. The choices we are considering are:

awk, which is a scripting language used to process text, and is a standard utility feature in many operating systems; and
the DSL within jq, a popular JSON processing tool.

We wanted a language that is simple enough that tasks like parsing do not become the majority of work, while still useful enough that substantial programs can (and have been) written in it. To this end, we may implement a convenient subset of a language that fulfils our criteria. We are primarily interested in awk, and are listing jq as a fallback.

How

LLVM has excellent facilities for writing language frontends including this incredible tutorial. We will write a frontend for our target and then use LLVM's backend to generate optimized machine code.

How We Will Measure Success

An important but unquantifiable goal of this project is didactic; we would like to teach ourselves LLVM via immersion.

A more quantifiable set of goals is:

Correctness. We will include a test suite to increase confidence in our implementation.
Features. We will compare our implementation with other existing compiler implementations of our language.

For awk that would be gawk, the GNU implementation, mawk, an interpreter for awk with a focus on speed, lawk, an LLVM implementation of awk and frawk, another LLVM implementation that has advanced features like automatic parallelization.

For jq it might just be the official implementation.
Performance. We will compare our work against the above projects from a runtime perspective on reasonable benchmarks.

sampsyo commented 2 years ago

All sounds good! TBH I think you should probably not bother comparing performance against current existing fancy/limited/experimental compilers… I'm not sure what you would learn from that, really. Comparing against "mainstream" implementations (i.e., GNU gawk or jq itself) is likely to be much more instructive.

Do you have a sense for how quickly you can decide what language you'll targeting? Like, within a couple of days, hopefully?

5hubh4m commented 2 years ago

We has a discussion and decided that we will target awk. There are several advantages to it over jq, most importantly the fact that awk's grammar has been published in yacc format which would make parsing convenient.

sampsyo commented 2 years ago

Sounds good to me! Curious to see how it goes.

5hubh4m commented 2 years ago

Progress Report

We have a parser and a lexer for the awk language based mostly around the specification. The given grammar has a lot of redundancy to address precedence and dangling else's. So we resorted to simplifying the grammar itself and address those issues using Menhir's functionalities.

Since awk is completely dynamic, we are thinking of going the route of CPython where we represent every value using a pointer to a union, which can be a float, a string, or a hashmap (awk "arrays" are just hashmaps from strings to values), and all built-in operations will be implemented in a runtime in C/C++ which do appropriate type-checking/conversion and the generated LLVM will simply call them as needed. We realized we also need a garbage collector in the runtime - we hope to use a ready-made garbage collector like boehm-gc.

We had a brief discussion about whether it was possible to implement static types in the language but given the language specification, it seems almost impossible to do without changing the language drastically.

sampsyo commented 2 years ago

All sounds good (including sticking with the dynamically-typed language). Since you mentioned that built-in operations will rely on your hand-written C/C++ functions, I want to suggest an approach that you may discard as too elaborate, but which might clarify the design of your compiler:

Design an instruction-based IR (a "bytecode," but it could be JSON instead of actual bytes). Think something like Bril but customized to the Awk semantics (and without types). Use your OCaml/Menhir parser to generate this IR.
Write an interpreter for this format in C++. This interpreter will make calls to your C/C++ functions that do the special Awk stuff. Test the hell out of this.
Now write a compiler that uses the same C/C++ runtime functions. The first version could work by just translating every IR instruction into a function call. You can rely on LLVM's own inliner to get rid of the function-call overhead.
Finally, opportunistically add any optimizations to the above baseline compiler based on observations of what's slow.

anshumanmohan commented 2 years ago

We decided to go with a stripped-down version of your suggestion.

Parse awk into OCaml using Menhir.
Emit LLVM IR (with extern calls to built-in operations). Use OCaml's LLVM module.
Hand-write C++ runtime that provides built-in operations.
Link the above. Optimize in LLVM. Test.

We've run into some trouble with step 2; we're finding it hard to even cook up the boxed types we need in LLVM IR. We're now pivoting away to a new strategy:

Generate an AST in C++
Follow LLVM's Kaleidoscope example to generate LLVM IR
Same as before
Same as before

Our thinking is that the Kaleidoscope guide looks more mature and easier to follow. In exchange for doing a little more work in AST-land, we're hoping to make the LLVM part easier.

sampsyo commented 2 years ago

I think this makes sense! Is the point that IR generation via the OCaml bindings is harder than it looks, whereas there are plenty of resources (Kaleidoscope among them) showing you how to do it directly in C++?

I suppose the only question is: what does "generate an AST in C++" mean? Does it mean you have thrown away your Menhir parser and are writing/reusing a parser that exists in C++ land? Or are you somehow transporting your OCaml parse tree into C++?

All of this is fine, but it occurs to me to mention that an intermediate instruction-based language could actually save you time… if you already have a Menhir parser going, then you could write OCaml code to do the heavy lifting of "linearizing" the code into instructions, and then the C++ part would just be responsible for translating each instruction into LLVM instructions.

anshumanmohan commented 2 years ago

Yup, the thinking is that we'll have more help if we do it directly in C++.

We are still exploring our options re: a C++ AST. As you can imagine, we don't want to throw away our Menhir parser. The hope is to employ PPX rewrites to automatically get to JSON, then slurp up the JSON in C++. Then we'd almost be in Kaleidoscope territory. We explored this some over the weekend but got stuck and decided to come back with fresh eyes.

If this fails, we'll find some way to port our Menhir work to Bison.

sampsyo commented 2 years ago

Got it. I guess my suggestion then boils down to this: don't bother finding a way to serialize the AST itself as JSON. Instead, serialize an intermediate format that is more "linear" to simplify the work you do on the C++ side. But you do you!

anshumanmohan commented 2 years ago

Got it, and this is very helpful. Thank you!

anshumanmohan commented 2 years ago

A brief update: the C++ route was not proving any easier, so we've reverted to the first plan I described here. We're making decent progress. It helps that there is a Kaleidoscope guide for OCaml as well.

anshumanmohan commented 2 years ago

Hi @sampsyo, sadly we aren't yet done. We've been at it fairly doggedly for the last week, but have run into a few gotchas along the way and are realizing we may have bitten off more than we could chew. So sorry that this request is coming after the deadline, but we were wondering if we could have the weekend to turn this in.

We put together a slightly more detailed progress report to give you a sense of where we are. Our WIP code is here.

DONE

Runtime 90% completed.
Lexing, parsing, and type-checking.

We can generate code for user-defined functions, and correctly handle BEGIN and END blocks.

$> cd brawn
$> for i in test/benchmarks/*.brawn; do echo $i; dune exec brawn $i $i.ll; done

CHALLENGES (that we've overcome)

Figuring out how to get LLVM to use functions from another module.
LLVM crashes with SIGILL if you try to build an invalid load or store instruction. This alone took us a day to figure out.

TODO

FINISH COMPILER
- Initialize globals and literal constants with default values (currently they are initialized to null).
- Generate function for pattern matching and actions.
- Finish the main loop in runtime.
- Link generated module with runtime.
EVAL & TEST
FINISH BLOG POST (70% with current progress)

sampsyo commented 2 years ago

OK, sure! Let's set your new deadline on Monday.

Thanks for the progress report. I have thoughts on both of these challenges:

Figuring out how to get LLVM to use functions from another module.

To confirm, is the motivation for this the same as the runtime library section of the tutorial? That is, you want to generate LLVM code that contains call instructions to functions that you do not want to generate.

If so, I hope you consider the approach from the tutorial itself… that is, writing those in a completely separate .c file, calling them by doing getOrInsertFunction in the module you're generating, and then linking together the two .o files (i.e., not attempting to link together at the LLVM level).

Or maybe you meant something else entirely…

LLVM crashes with SIGILL if you try to build an invalid load or store instruction. This alone took us a day to figure out.

In case you don't have them already, debug builds of LLVM help a lot with this kind of thing. Unfortunately, brew install llvm gives you a release build (of course). I don't know any way of getting a debug build without building LLVM yourself from scratch, but seriously, it is worth it… you get tracebacks and assertion failures with nice error messages instead of mysterious SIGILLs and SIGSEGVs.

sampsyo / cs6120