Project Proposal: Compiling Scheme to talx86 with CPS

andreyyao commented 2 years ago

What will you do? I will write a compiler from a subset of Scheme to talx86, a typed assembly language. The translation to intermediate representation will be in continuation passing style. There are several motivations for this choice. First of all, I think the idea of a typed assembly is fascinating as it allows the assembler to perform certain correctness verification. The choice of using Scheme is just that it is a minimal language with few core language constructs. Also there is a reference compiler by the talx86 people from Scheme.

How will you do it? I plan to use either Rust or Java to implement the compiler. I will first decide and then formally document on the set of language features from Scheme that I support. Then I will design an intermediate representation. After that I will write a lexer and parser from Scheme to the IR. The IR should be carry type information as well. Since Scheme is not statically typed this likely requires some type inference. s-expressions should make parsing not too difficult. Then I will convert things to CPS. After that I might run several optimizations. Finally if I will emit to talx86.

How will you empirically measure success? First of all, correctness of the overall compilation can be verified by emitting to traditional x86 instead of talx86, which should be consist of simply deleting the type annotations. For the IR stage it should be quite feasible to emit CPS Scheme code and then check for program equivalence. Apparently Scheme has continuations as first class objections? Not sure if that will end up being helpful. If I do decide to add in some optimizations for CPS, I will compare the before and after of optimizations on certain Scheme benchmarks.

Team members: Just me for now.

sampsyo commented 2 years ago

Sounds fairly substantial! You didn't say it explicitly, but I think you will be following the recipe from Morrisett et al. 1999, right? Or is the CPS route you're imagining different from that? If so, do you know exactly how to do type-preserving CPS to make this work?

You also said this:

The IR should be carry type information as well. Since Scheme is not statically typed this likely requires some type inference.

I strongly recommend that you do not try to implement nontrivial type inference. Just invent a variant of scheme that is explicitly typed. Focusing on type inference would be a distraction from the matter at hand, which is the type-preserving compilation. I recommend making the language as tacky and custom as possible in order to make the compilation easier.

Before you get started, please also endeavor to answer these questions about your empirical evaluation:

How will you check program equivalence?
What is the list of benchmarks will you use? Where will you get the code, or will you write the code yourself?
Aside from equivalence, will you measure any manner of performance/efficiency metric? If so, what?

andreyyao commented 2 years ago

Yes, I planned to be following the recipe from the Morrisett paper! I agree that doing type inference is probably too difficult. However I have been pulling my hair out trying to find a minimal explicitly typed (preferably functional) language... There is typed Scheme, but I can only find the language reference for it but not any existing tools. There is also typed Racket which seems much better supported, but the hard part will be choosing the appropriate subset of the language to support such that there is still enough existing benchmarks to use... Racket also seems a lot more complicated than Scheme, and typed Racket still allows implicit typing with type inference, which is a headache in my case.

As a response to your comment on Zulip about talx86, I might be better off inventing a typed RISCV or something? For now I think I will put optimizations aside and focus on type-preserving compilation. What do you think?

sampsyo commented 2 years ago

Yeah, I think your idea to basically invent your own typed language will work fine. Starting with typed Racket and doing a very (very) restricted subset of it will work fine. Can you, for example, think of a subset of typed Racket that is basically equivalent to System F with integers? That seems like the thing to do. Really, I think you'll need a very minimal language, at least to get started.

Compiling to something talx86-like seems possibly OK—I just would not make your project depend on any actual infrastructure from the TAL project (the type checker, etc.) in order to simplify the sheer hacking part of the project so you can focus on the "meat" of the work. So maybe you build your own type checker for talx86 code to demonstrate the success of type-preserving translation. Or sure, doing your own RISC-V thing would be OK too—just make sure that you don't sink too much time into making that work at the expense of the compilation itself.

andreyyao commented 2 years ago

Thanks for the feedback! Here's my revised proposal:

What will you do?

I will write a type-preserving compiler from a small subset of Typed Racket to RISC-V commented with type annotations. I will implement CPS transformation on the IR. I will also develop a simple type checker for the typed RISC-V. I will mostly refer to this paper for details about semantics as well as algorithm for CPS, etc.

How will you do it?

I plan to use Rust to implement the compiler. I will first decide and then document the exact subset of Typed Racket that I plan to support, ideally using BNF. For now it will be close to the version of system F from the Morrisett et al. paper, possibly with Boolean values as well. Then I will refer to the paper to devise a specification for annotating RISC-V assembly code with types, all using comments so as to ensure 100% compatibility with existing RISC-V and interpreter, etc.

Then I will define the AST and IR structure of the compiler. After that, I will writing a parser from the "System F" subset of Typed Racket. For this I plan to use nom, a parser combinator. The AST will be emitted back to Typed Racket source code to check for program equivalence during testing. Hopefully error reporting will come for free too in this case. Once I have the AST working I will implement emission to IR and CPS while preserving types.

Finally, I will write a parser and type checker for RISC-V with types.

How will you empirically measure success?

First of all, correctness of the overall compilation can be verified by program equivalence of source Racket code vs output RISC-V code. Both languages have existing support so umm yeah.

For testing I will try to find benchmarks in Typed Racket. If can't find enough, I will take benchmarks in plain Racket and annotate the types by hand.

Another requirement for success if that no valid Typed Racket programs should result in Illegal assembly...

Questions

Should I email the Morrisett paper (talx86) people about this? Also, should I go with x86 anyways because I would be able to assemble the output machine code and run it natively?

Team members: Just me for now.

sampsyo commented 2 years ago

Great! Sounds good overall. Some quick notes:

Instead of using nom, one possible advantage of using a Lisp as your input language is that it's just S-expressions. So you may be able to use something simpler like sexp to parse instead of writing your own parser.
Could be fun to email the TAL people about it! But maybe they'd be more interested in reading the outcome (i.e., once you've finished your blog post)?
I honestly do not have a strong opinion about whether it would be better to emit RISC-V or x86. The "pros" on the RISC-V side are that generating the code could be simpler and you could invent your own annotation language instead of painstakingly trying to conform to the exact form at of the talx86 infrastructure. The "pros" on the x86 side are that you could run the program natively (but you'd still need to do linking and stuff to get an executable, and an off-the-shelf RISC-V simulator like Spike might be just as easy) and that you could use talx86 instead of inventing your own stuff for the assembly annotation. Maybe just forge ahead with one ISA and give yourself permission to switch if it gets too complicated? It would be good, in that case, to start as early as possible so you can "de-risk" this critical part of the project before doing the bulk of the work…

andreyyao commented 2 years ago

In terms of parsing I think I will use a lexer called Logos and then write a parser from scratch like you said.
I was just worried about copyright issues and stuff, but I guess that won't really be a problem if I make sure to credit them explicitly.
I didn't know about Spike! Sounds like RISCV is the way to go then.

Edit 2: Racket's type system is not strict enough for my purpose. For example it allows expressions like (if b 0 "1") and (not 42)... For this reason I want to switch to SML instead of Racket. Both have type inference, so I have to manually add type annotations for whichever one I choose anyways...

sampsyo commented 2 years ago

Just to be specific, I guess what I'm suggesting is that you only support a "sublanguage" that is equivalent to System F with a couple of base types (like int and bool or whatever). No need to support any large subset of any language, such as the dynamism available in Racket.

In my view, the System-F-esque subset of Typed Racket or SML would be almost exactly the same for all the actual compiler effort in the project… just with different concrete syntax. So you might as well choose the easiest concrete syntax you can imagine so you can get to the "good part" (i.e., not parsing). IMO a Lispy syntax would be very convenient because you don't need an actual parser, but if you think it wouldn't be that hard to parse a tiny subset of SML with your favorite parser generator, go for it. But to re-emphasize: don't bother attempting to support a nontrivial subset of the language, at least for v1.0 of your compiler! Just focus on getting that minimal "System F subset" (with explicit type annotations) working, and add features only if you have time.

andreyyao commented 2 years ago

I understand what you're saying! I will definitely try to be minimalist at the beginning.

Racket's features are just too complicated and it's hard for me to separate out a system-F like subset of features that can still cover a good number of benchmarks. SML has formally defined semantics as well as syntax. More importantly I understand the ML family well which I think is helpful.

Racket is a cool language, but writing a prototype compiler for even a subset of it seems hard... I have already started parsing, but if it turns out to be too difficult I will definitely switch back to Racket like you suggested.

sampsyo commented 2 years ago

Sounds good!

andreyyao commented 2 years ago

Here is the Github repo for the project.

tl;dr: I finished the lexer and parser. Now onto type checking, which should be not too difficult.

Question: SML seems to allow the user to redefine any non-keyword variable(for types or values), which means stuff like type int = bool * bool is perfectly legal. It's also legal to do val not = fn x => x + 1, which redefines the boolean negation. Should I allow this so as to stick to the SML semantics, or should I just say "you can't do that" for now?

In terms of parsing, I guess should I do 1 or 2?

Treat pre-defined types and functions as keywords.
Treat them as any ordinary identifiers, and initialize the type checking context or whatever with the definitions. Or should I link with the standard library...

sampsyo commented 2 years ago

Awesome; nice work!! A couple of answers:

Yep, I'd recommend saying "you can't do that." This is one of many instances where it will make sense to restrict the language as far as possible so you have a simple language to focus on. (Remembering that the language itself is not the hard part here; the type-preserving compilation is the hard bit, and therefore that's what should get the majority of your cycles.)
I recommend idea 2 above. It should lead to a simpler parser (without needing to enumerate keywords in the parser as well) and will make things more flexible as you add more features.

andreyyao commented 2 years ago

Great, thanks! I decided to disallow shadowing identifiers from the SML basis library.

I want to implement idea 2, but how is it usually done? Does the compiler writer need to implement direct compiler translations for all the "base" functions?

sampsyo commented 2 years ago

Sure—basically, the idea is that your typechecker and interpreter have some manner of contexts/environments that user code can populate. For example, maybe there is a hash table for types that maps strings to objects representing types. Definitions like type foo = bar add a mapping for the string foo during type-checking. The idea would be for you to start up the type-checker with a non-empty environment that includes a mapping for int before the user's program gets checked.

sampsyo / cs6120