Closed andreyyao closed 1 year ago
Sounds fairly substantial! You didn't say it explicitly, but I think you will be following the recipe from Morrisett et al. 1999, right? Or is the CPS route you're imagining different from that? If so, do you know exactly how to do type-preserving CPS to make this work?
You also said this:
The IR should be carry type information as well. Since Scheme is not statically typed this likely requires some type inference.
I strongly recommend that you do not try to implement nontrivial type inference. Just invent a variant of scheme that is explicitly typed. Focusing on type inference would be a distraction from the matter at hand, which is the type-preserving compilation. I recommend making the language as tacky and custom as possible in order to make the compilation easier.
Before you get started, please also endeavor to answer these questions about your empirical evaluation:
Yes, I planned to be following the recipe from the Morrisett paper! I agree that doing type inference is probably too difficult. However I have been pulling my hair out trying to find a minimal explicitly typed (preferably functional) language... There is typed Scheme, but I can only find the language reference for it but not any existing tools. There is also typed Racket which seems much better supported, but the hard part will be choosing the appropriate subset of the language to support such that there is still enough existing benchmarks to use... Racket also seems a lot more complicated than Scheme, and typed Racket still allows implicit typing with type inference, which is a headache in my case.
As a response to your comment on Zulip about talx86, I might be better off inventing a typed RISCV or something? For now I think I will put optimizations aside and focus on type-preserving compilation. What do you think?
Yeah, I think your idea to basically invent your own typed language will work fine. Starting with typed Racket and doing a very (very) restricted subset of it will work fine. Can you, for example, think of a subset of typed Racket that is basically equivalent to System F with integers? That seems like the thing to do. Really, I think you'll need a very minimal language, at least to get started.
Compiling to something talx86-like seems possibly OK—I just would not make your project depend on any actual infrastructure from the TAL project (the type checker, etc.) in order to simplify the sheer hacking part of the project so you can focus on the "meat" of the work. So maybe you build your own type checker for talx86 code to demonstrate the success of type-preserving translation. Or sure, doing your own RISC-V thing would be OK too—just make sure that you don't sink too much time into making that work at the expense of the compilation itself.
Thanks for the feedback! Here's my revised proposal:
I will write a type-preserving compiler from a small subset of Typed Racket to RISC-V commented with type annotations. I will implement CPS transformation on the IR. I will also develop a simple type checker for the typed RISC-V. I will mostly refer to this paper for details about semantics as well as algorithm for CPS, etc.
I plan to use Rust to implement the compiler. I will first decide and then document the exact subset of Typed Racket that I plan to support, ideally using BNF. For now it will be close to the version of system F from the Morrisett et al. paper, possibly with Boolean values as well. Then I will refer to the paper to devise a specification for annotating RISC-V assembly code with types, all using comments so as to ensure 100% compatibility with existing RISC-V and interpreter, etc.
Then I will define the AST and IR structure of the compiler. After that, I will writing a parser from the "System F" subset of Typed Racket. For this I plan to use nom, a parser combinator. The AST will be emitted back to Typed Racket source code to check for program equivalence during testing. Hopefully error reporting will come for free too in this case. Once I have the AST working I will implement emission to IR and CPS while preserving types.
Finally, I will write a parser and type checker for RISC-V with types.
First of all, correctness of the overall compilation can be verified by program equivalence of source Racket code vs output RISC-V code. Both languages have existing support so umm yeah.
For testing I will try to find benchmarks in Typed Racket. If can't find enough, I will take benchmarks in plain Racket and annotate the types by hand.
Another requirement for success if that no valid Typed Racket programs should result in Illegal assembly...
Should I email the Morrisett paper (talx86) people about this? Also, should I go with x86 anyways because I would be able to assemble the output machine code and run it natively?
Team members: Just me for now.
Great! Sounds good overall. Some quick notes:
Edit 2: Racket's type system is not strict enough for my purpose. For example it allows expressions like (if b 0 "1")
and (not 42)
... For this reason I want to switch to SML instead of Racket. Both have type inference, so I have to manually add type annotations for whichever one I choose anyways...
Just to be specific, I guess what I'm suggesting is that you only support a "sublanguage" that is equivalent to System F with a couple of base types (like int and bool or whatever). No need to support any large subset of any language, such as the dynamism available in Racket.
In my view, the System-F-esque subset of Typed Racket or SML would be almost exactly the same for all the actual compiler effort in the project… just with different concrete syntax. So you might as well choose the easiest concrete syntax you can imagine so you can get to the "good part" (i.e., not parsing). IMO a Lispy syntax would be very convenient because you don't need an actual parser, but if you think it wouldn't be that hard to parse a tiny subset of SML with your favorite parser generator, go for it. But to re-emphasize: don't bother attempting to support a nontrivial subset of the language, at least for v1.0 of your compiler! Just focus on getting that minimal "System F subset" (with explicit type annotations) working, and add features only if you have time.
I understand what you're saying! I will definitely try to be minimalist at the beginning.
Racket's features are just too complicated and it's hard for me to separate out a system-F like subset of features that can still cover a good number of benchmarks. SML has formally defined semantics as well as syntax. More importantly I understand the ML family well which I think is helpful.
Racket is a cool language, but writing a prototype compiler for even a subset of it seems hard... I have already started parsing, but if it turns out to be too difficult I will definitely switch back to Racket like you suggested.
Sounds good!
Here is the Github repo for the project.
tl;dr: I finished the lexer and parser. Now onto type checking, which should be not too difficult.
Question: SML seems to allow the user to redefine any non-keyword variable(for types or values), which means stuff like type int = bool * bool
is perfectly legal. It's also legal to do val not = fn x => x + 1
, which redefines the boolean negation. Should I allow this so as to stick to the SML semantics, or should I just say "you can't do that" for now?
In terms of parsing, I guess should I do 1 or 2?
Awesome; nice work!! A couple of answers:
Great, thanks! I decided to disallow shadowing identifiers from the SML basis library.
I want to implement idea 2, but how is it usually done? Does the compiler writer need to implement direct compiler translations for all the "base" functions?
Sure—basically, the idea is that your typechecker and interpreter have some manner of contexts/environments that user code can populate. For example, maybe there is a hash table for types that maps strings to objects representing types. Definitions like type foo = bar
add a mapping for the string foo
during type-checking. The idea would be for you to start up the type-checker with a non-empty environment that includes a mapping for int
before the user's program gets checked.
What will you do? I will write a compiler from a subset of Scheme to talx86, a typed assembly language. The translation to intermediate representation will be in continuation passing style. There are several motivations for this choice. First of all, I think the idea of a typed assembly is fascinating as it allows the assembler to perform certain correctness verification. The choice of using Scheme is just that it is a minimal language with few core language constructs. Also there is a reference compiler by the talx86 people from Scheme.
How will you do it? I plan to use either Rust or Java to implement the compiler. I will first decide and then formally document on the set of language features from Scheme that I support. Then I will design an intermediate representation. After that I will write a lexer and parser from Scheme to the IR. The IR should be carry type information as well. Since Scheme is not statically typed this likely requires some type inference. s-expressions should make parsing not too difficult. Then I will convert things to CPS. After that I might run several optimizations. Finally if I will emit to talx86.
How will you empirically measure success? First of all, correctness of the overall compilation can be verified by emitting to traditional x86 instead of talx86, which should be consist of simply deleting the type annotations. For the IR stage it should be quite feasible to emit CPS Scheme code and then check for program equivalence. Apparently Scheme has continuations as first class objections? Not sure if that will end up being helpful. If I do decide to add in some optimizations for CPS, I will compare the before and after of optimizations on certain Scheme benchmarks.
Team members: Just me for now.