Closed by ayakayorihiro 1 year ago
All sounds good! TBH I think you should probably not bother comparing performance against current existing fancy/limited/experimental compilers… I'm not sure what you would learn from that, really. Comparing against "mainstream" implementations (i.e., GNU gawk or jq itself) is likely to be much more instructive.
Do you have a sense for how quickly you can decide which language you'll be targeting? Like, within a couple of days, hopefully?
We had a discussion and decided that we will target awk. There are several advantages to it over jq, most importantly the fact that awk's grammar has been published in yacc format, which makes parsing convenient.
Sounds good to me! Curious to see how it goes.
We have a parser and a lexer for the awk language, based mostly on the specification. The given grammar has a lot of redundancy to address precedence and dangling elses, so we simplified the grammar itself and addressed those issues using Menhir's built-in precedence and associativity declarations.
Since awk is completely dynamic, we are thinking of going the route of CPython, where we represent every value as a pointer to a tagged union that can hold a float, a string, or a hashmap (awk "arrays" are just hashmaps from strings to values). All built-in operations will be implemented in a C/C++ runtime that performs the appropriate type checking and conversion, and the generated LLVM will simply call into it as needed. We realized we also need a garbage collector in the runtime; we hope to use a ready-made one like boehm-gc.
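For concreteness, here is a minimal sketch of what such a boxed representation and one runtime operation might look like. Everything here is illustrative: the names (`Value`, `brawn_add`, etc.) are invented, not taken from the actual codebase, and a tagged struct is used instead of a literal C union to sidestep non-trivial-destructor issues with `std::string` and maps.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical boxed awk value: a tag plus the possible payloads.
struct Value;
using Array = std::unordered_map<std::string, Value*>;  // awk "arrays"

enum class Tag { Number, String, Array };

struct Value {
    Tag tag;
    double num;       // meaningful when tag == Number
    std::string str;  // meaningful when tag == String
    Array arr;        // meaningful when tag == Array
};

// Runtime coercion, per awk's implicit string-to-number conversion.
double to_number(const Value* v) {
    switch (v->tag) {
        case Tag::Number: return v->num;
        case Tag::String:
            try { return std::stod(v->str); } catch (...) { return 0.0; }
        default: return 0.0;  // awk is forgiving about misused values
    }
}

// A runtime entry point that generated LLVM code would simply call.
// The `new` here leaks on purpose in this sketch; in the real runtime
// a collector like boehm-gc would reclaim these allocations.
Value* brawn_add(const Value* a, const Value* b) {
    return new Value{Tag::Number, to_number(a) + to_number(b), "", {}};
}
```

With this layout, the compiler never needs to reason about types at all: every operand is a `Value*`, and each built-in lowers to one runtime call.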
We had a brief discussion about whether it was possible to implement static types in the language but given the language specification, it seems almost impossible to do without changing the language drastically.
All sounds good (including sticking with the dynamically-typed language). Since you mentioned that built-in operations will rely on your hand-written C/C++ functions, I want to suggest an approach that you may discard as too elaborate, but which might clarify the design of your compiler:
We decided to go with a stripped-down version of your suggestion.
We've run into some trouble with step 2; we're finding it hard to even cook up the boxed types we need in LLVM IR. We're now pivoting away to a new strategy:
Our thinking is that the Kaleidoscope guide looks more mature and easier to follow. In exchange for doing a little more work in AST-land, we're hoping to make the LLVM part easier.
I think this makes sense! Is the point that IR generation via the OCaml bindings is harder than it looks, whereas there are plenty of resources (Kaleidoscope among them) showing you how to do it directly in C++?
I suppose the only question is: what does "generate an AST in C++" mean? Does it mean you have thrown away your Menhir parser and are writing/reusing a parser that exists in C++ land? Or are you somehow transporting your OCaml parse tree into C++?
All of this is fine, but it occurs to me to mention that an intermediate instruction-based language could actually save you time… if you already have a Menhir parser going, then you could write OCaml code to do the heavy lifting of "linearizing" the code into instructions, and then the C++ part would just be responsible for translating each instruction into LLVM instructions.
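To make the "linearizing" idea concrete, here is a toy sketch of flattening an expression tree into an instruction list, so the consumer only ever translates one flat instruction at a time. The thread's suggestion is to do this pass in OCaml on the Menhir parse tree; C++ is used here only to keep the sketch self-contained, and all names (`Expr`, `Instr`, `linearize`) are invented.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy expression AST: either a numeric literal ("num") or a binary operator.
struct Expr {
    std::string op;    // "num" for a literal, otherwise a binary operator
    double value = 0;  // used when op == "num"
    std::unique_ptr<Expr> lhs, rhs;
};

// One flat instruction, e.g. {dest: "t2", op: "+", args: {"t0", "t1"}}.
struct Instr {
    std::string dest, op;
    std::vector<std::string> args;
    double value = 0;  // used when op == "const"
};

// Post-order walk: emit children first, then an instruction that names a
// fresh temporary for this node. Returns the temporary holding the result.
std::string linearize(const Expr& e, std::vector<Instr>& out) {
    if (e.op == "num") {
        std::string dest = "t" + std::to_string(out.size());
        out.push_back({dest, "const", {}, e.value});
        return dest;
    }
    std::string l = linearize(*e.lhs, out);
    std::string r = linearize(*e.rhs, out);
    std::string dest = "t" + std::to_string(out.size());
    out.push_back({dest, e.op, {l, r}, 0});
    return dest;
}
```

For `(1 + 2) * 3` this yields five instructions (three `const`s, a `+`, a `*`), each of which maps almost one-for-one onto an LLVM instruction.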
Yup, the thinking is that we'll have more help if we do it directly in C++.
We are still exploring our options re: a C++ AST. As you can imagine, we don't want to throw away our Menhir parser. The hope is to employ PPX rewriters to automatically serialize the AST to JSON, then slurp up the JSON in C++. Then we'd almost be in Kaleidoscope territory. We explored this some over the weekend but got stuck and decided to come back with fresh eyes.
If this fails, we'll find some way to port our Menhir work to Bison.
Got it. I guess my suggestion then boils down to this: don't bother finding a way to serialize the AST itself as JSON. Instead, serialize an intermediate format that is more "linear" to simplify the work you do on the C++ side. But you do you!
Got it, and this is very helpful. Thank you!
A brief update: the C++ route was not proving any easier, so we've reverted to the first plan I described here. We're making decent progress. It helps that there is a Kaleidoscope guide for OCaml as well.
Hi @sampsyo, sadly we aren't yet done. We've been at it fairly doggedly for the last week, but have run into a few gotchas along the way and are realizing we may have bitten off more than we could chew. So sorry that this request is coming after the deadline, but we were wondering if we could have the weekend to turn this in.
We put together a slightly more detailed progress report to give you a sense of where we are. Our WIP code is here.
DONE
$> cd brawn
$> for i in test/benchmarks/*.brawn; do echo $i; dune exec brawn $i $i.ll; done
CHALLENGES (that we've overcome)
TODO
OK, sure! Let's set your new deadline on Monday.
Thanks for the progress report. I have thoughts on both of these challenges:
Figuring out how to get LLVM to use functions from another module.
To confirm, is the motivation for this the same as in the runtime library section of the tutorial? That is, you want to generate LLVM code that contains call instructions to functions that you do not want to generate.
If so, I hope you consider the approach from the tutorial itself… that is, writing those functions in a completely separate .c file, declaring and calling them via getOrInsertFunction in the module you're generating, and then linking the two .o files together (i.e., not attempting to link at the LLVM level).
Or maybe you meant something else entirely…
LLVM crashes with SIGILL if you try to build an invalid load or store instruction. This alone took us a day to figure out.
In case you don't have them already, debug builds of LLVM help a lot with this kind of thing. Unfortunately, brew install llvm gives you a release build (of course). I don't know any way of getting a debug build without building LLVM yourself from scratch, but seriously, it is worth it… you get tracebacks and assertion failures with nice error messages instead of mysterious SIGILLs and SIGSEGVs.
Who
@5hubh4m, @ayakayorihiro, and @anshumanmohan
What
We will implement an LLVM frontend for a small but useful programming language. The choices we are considering are:
awk, which is a scripting language used to process text and is a standard utility in many operating systems; and jq, a popular JSON processing tool.

We wanted a language that is simple enough that tasks like parsing do not become the majority of the work, while still useful enough that substantial programs can (and have been) written in it. To this end, we may implement a convenient subset of a language that fulfils our criteria. We are primarily interested in awk, and are listing jq as a fallback.

How
LLVM has excellent facilities for writing language frontends, including this incredible tutorial. We will write a frontend for our target language and then use LLVM's backend to generate optimized machine code.
How We Will Measure Success
An important but unquantifiable goal of this project is didactic; we would like to teach ourselves LLVM via immersion.
A more quantifiable set of goals is:
Correctness. We will include a test suite to increase confidence in our implementation.
Features. We will compare our implementation with other existing compiler implementations of our language.
For awk, that would be gawk, the GNU implementation; mawk, an interpreter for awk with a focus on speed; lawk, an LLVM implementation of awk; and frawk, another LLVM implementation with advanced features like automatic parallelization. For jq, it might just be the official implementation.

Performance. We will compare our work against the above projects from a runtime perspective on reasonable benchmarks.