SnirkImmington commented 7 years ago

One of the technical debts of the Rust compiler is that its code does not take full advantage of Rust's ownership rules. This is partially through hacking over time but mostly due to being bootstrapped so early in Rust's development (it used @mut pointers way back when). There are a couple of ways they do ownership (Rc<T>, P<T>) with varying degrees of unsafety. The truth is simple: the Rust compiler is not written in "safe Rust". Obviously, this is very difficult (why create protosnirk if everything can be expressed in safe Rust) but I think it's possible to do better, most of the time. Obviously, hackiness eventually needs to be introduced as the language requires (not everything can be a perfect set of passes) but here are the basics:

String interning

In a patch or two, add string interning. Do it without lazy_static.
name fields either become Rc<String> or StringId
Issue with Rc<String> is reference count updates, having an index and asking for updates makes the cost more explicit
rustc has a gensym on its string pool, this might be worth exploring
StringPool should be passed around a lot but its mutability rules should be simple
Rust takes thread-local storage into account - it's probably not worth trying to sync pooling between threads

Tokens

Token's text becomes interned.

Tokenizing

Tokenizer's next() takes an &mut StringPool

AST

At this point, the AST depends on its StringPool
The parser can totally handle initializing the ScopeIndex of the AST without an extra pass, this could even be fixed in typeck
~~ScopeIndex should probably use a SmallVec or fixed-size type - it also needs a better name like ScopedId~~ Done in typeck branch
We could also give each AST node a unique index

In the next stages of compiling, we need to run checks in passes. Each check should add some extra set of data to the program. Many compilers do this through lowering passes, and produce new semantic trees or graphs of their represented program. However, I think we can take a relational approach: do things in more passes but have easier lifetime analysis. Produce multiple mappings instead of lowering data down.

Item collection

We need a first pass to correlate function names together so they can be called before being declared.

Visit all items, add them to a symbol table (specified in name check)

Name check

The first pass (currently parse::verify::smybol_checker) creates a mapping ScopeIndex -> Symbol that simply verifies that each item and variable declared actually exists. This can happen before an "imports" pass.

Ensure that variables are all declared and not duplicated
Symbol: should have a unique identifier (SymId), a ScopedId of where declared, and possibly metadata about its declaration: is the variable mutable, etc.
The name checker should borrow the string pool and be consumed into the SymbolMap.

~~Mutation checks are built into this right now (by changing Cell<bool> fields in Symbol). Moving it into its own check might simplify things but would result in more code and another pass.~~ This has been changed in the typeck branch.

Type check

Type inference in the typeck branch uses the same ScopedIds as variable identification. Types are resolved in a graph to unique nodes which are mapped to concrete types.

Compile

Compiling is the fun part where each of these tables is looked up from each other

Generate LLVM type from each type/cache these in a TypeId -> LLVMType
Make sure predefined functions are declared
Take in the ScopeIndex -> Symbol, SymId -> TypeId and use them to compile the module, building up a ScopeId -> LLVMValueRef
Compiler should also take in the StringPool to produce nice names

Errors

The errors will probably contain string ids
The error handlers will need to use the StringPool

SnirkImmington commented 6 years ago

I've updated this old issue given my progress in the typeck branch, but I think I may have to close this and reopen smaller/more organized items. String interning may be an important feature to add later when performance improvements are needed.

Timidger commented 6 years ago

Another solution is to bootstrap as late as possible.

Is there some solution to bootstrapping (other than being a good means of dogfooding the language) that I'm unaware of?

Go didn't bootstrap until 1.4 IIRC.

SnirkImmington commented 6 years ago

I had assumed the compiler would be more efficient with string interning (ASTs can have a lot of repeated identifiers), but looking around at other compilers I don't see a lot of string interning. It'll certainly never compare to time and space requirements of running LLVM.

Most of the things in this issue that are actually useful comments are things I'm doing in the typeck branch, and everything else is too long-term and vague to bother about. I may feel the need to rewrite the AST implementation in the future (possibly when user-defined structures and generics are taking shape), and I think a lowering pass will be required for lifetime handling and optimizations, but those aren't worth creating GitHub issues over right now.

I'm not concerned about bootstrapping - it can wait. It certainly couldn't happen anytime soon.

snirk-lang / protosnirk

Ownership patterns in the compiler #25