The first pass at implementing the compiler was good enough to get us started, but the approach isn't very robust. That makes it hard to grow the codebase without running into a lot of issues.
In particular, the following are hard right now:
implementing other types
everything was done with arrays of bytes in mind, so UTF-8 strings, numbers, custom data types, etc. are difficult to implement
error handling
the nom parsing framework doesn't generate the best error messages on its own, and we don't pass position information very well between stages of compilation
this prevents us from providing valuable error messages.
code generation
code generation is a very naive process
we are basically appending large amounts of Instruction enums to a Vec and then converting that to a string
we need something higher level to avoid using too much memory and to be able to generate smarter code more robustly (without so many silly bugs)
optimization
Complex optimizations are almost impossible because of the problems outlined above
Without context, it's difficult to make more interesting optimization decisions
As you can see, there are problems in nearly every stage of compilation.
This issue will outline a plan to address each of these issues and improve the overall quality of the code.
Plan
New Compilation Stages
Source - the brain source code to be compiled
Parsing --> AST
Position information is embedded into every node of the AST, denoting the exact index in the source (and possibly the filename) where that node was found (a sketch of this follows below)
This is the closest representation to the original syntax
// Lets us enforce that variables are initialized before they are used
enum Item {
    Defined {type_def},
    Initialized {type_def},
}
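To make the position tracking concrete, here is a minimal sketch of how a span could be attached to nodes; the node and field names are placeholders rather than the real AST:

struct Span {
    start: usize, // byte index into the source where the node begins
    end: usize,   // byte index just past the end of the node
    // a filename (or file id) could live here too if we decide to track it per node
}

enum Expr {
    // every node carries the span it was parsed from, so any later stage can
    // report errors that point back at the original source
    Number { value: i64, span: Span },
    Name { name: String, span: Span },
}

Whatever the exact shape ends up being, the point is that every node can answer "where did I come from?", which is what the improved error messages depend on.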
Desugared AST with all names resolved and complete type information, produced together with a table of the names in scope: (AST, HashMap<Name, Item>)
Looks up names and determines the type of each
Any type inference is resolved here
Any size inference is resolved here
We check for type compatibility and make sure everything makes sense
Types are associated with some Rust struct which knows how to perform operations
The AST is immutable from this point on -- we don't want to make any changes
Complains if names defined in a non-global scope are used where they are not accessible (a lookup sketch follows below)
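As a rough illustration of that lookup, here is a sketch built around the name table; Name, TypeDef, and Scope are names assumed purely for illustration, and Item mirrors the pseudocode above:

use std::collections::HashMap;

type Name = String;
struct TypeDef; // placeholder for whatever fully describes a type

enum Item {
    Defined { type_def: TypeDef },     // declared, but not yet given a value
    Initialized { type_def: TypeDef }, // declared and definitely initialized
}

struct Scope {
    items: HashMap<Name, Item>,
}

impl Scope {
    /// Fails when a name is used that was never defined in an accessible scope
    fn lookup(&self, name: &str) -> Result<&Item, String> {
        self.items
            .get(name)
            .ok_or_else(|| format!("`{}` is not defined in this scope", name))
    }
}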
Operation IR --> Vec<Operation>
AST is transformed into a tree of operations
We define our own set of primitive operations which (for the most part) we can later translate into brainfuck
We use these operations to transparently handle memory without explicitly building the static memory layout here
These operations are easier to work with and more sophisticated than the default brainfuck instructions
A lot of pitfalls are avoided, like forgetting to close loops, dealing with relative movement, and managing memory allocation by hand (see the example after the enum below)
Rust/pseudocode Operation enum:
enum Operation {
    /// Allocates the given size in bytes on the tape so that it is not used by any other code accidentally
    /// The id is generated automatically and represents the position, which will eventually be determined when the memory layout is generated
    /// Nothing is guaranteed about the position other than that `size` consecutive cells including the position will be available for use without conflicts
    /// This is also used to ensure memory is automatically dropped at the end of its scope (stack frame)
    Allocate {id, size},
    /// While most allocations can be dropped at the end of their scope, temporary cells
    /// should be dropped as soon as possible so that they are available in the memory
    /// layout again as soon as possible
    /// This way we can avoid a lot of unnecessary move operations over cells
    /// that aren't being used anymore
    /// While this could be an optimization as well, temporary cells in particular are
    /// *known* to have this property since we generate temporary cells in the compiler itself
    /// The temporary cells are guaranteed to last for the duration of the given body
    /// They are then freed immediately afterwards
    TempAllocate {id, size, body: Vec<Operation>},
    /// Frees the given memory id and all cells associated with it
    /// Typically not used unless an explicit free is necessary before the end of the scope
    /// Freeing means both marking those cells available and zeroing their values
    Free {id},
    /// Move to the position associated with a particular memory id
    MoveTo {id},
    /// Perform a movement that returns to the reference position (usually cell 0)
    ReturnToReference,
    /// Increment the value of the current cell by a certain amount (relative to whatever the current amount in the cell is)
    Increment {amount},
    /// Decrement the value of the current cell by a certain amount (relative to whatever the current amount in the cell is)
    Decrement {amount},
    /// Read bytes into `size` consecutive cells
    /// Note: this generates both read instructions and move right instructions
    Read {size},
    /// Write the bytes in `size` consecutive cells
    /// Note: this generates both write instructions and move right instructions
    Write {size},
    /// Set the value of `size` consecutive cells, including the current cell, to zero
    /// Note: this generates instructions to zero the value and move to the right
    Zero {size},
    /// Loop with the given operations as the loop body
    /// A loose restriction on loops is that they should return to the reference position whenever that makes sense
    /// In well-behaved code, this can usually be done by inserting a ReturnToReference operation at the end of the loop body
    Loop {body: Vec<Operation>},
    /// Copy the cells at the position of source_id to the position of target_id using a single temporary
    /// cell at the position of temp_id
    /// Only copies up to the size of source_id
    /// source_id and target_id must have the same size
    Copy {source_id, target_id, temp_id},
    /// Relocate the value at the position of source_id to the position of target_id,
    /// leaving zeros where the source data was originally
    /// source_id and target_id must have the same size
    Relocate {source_id, target_id},
}
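To show what this buys us, here is a minimal sketch of the operations we might emit for a program that reads a byte, adds one to it, and writes it back out (this is not the compiler's real API; the pared-down enum and its concrete field types are assumptions):

#[derive(Debug)]
enum Op {
    Allocate { id: usize, size: usize },
    MoveTo { id: usize },
    Increment { amount: u8 },
    Read { size: usize },
    Write { size: usize },
}

fn main() {
    let ops = vec![
        Op::Allocate { id: 0, size: 1 }, // reserve one cell for the byte
        Op::MoveTo { id: 0 },            // position the tape head on that cell
        Op::Read { size: 1 },            // read one byte; this also moves right
        Op::MoveTo { id: 0 },            // come back to the cell we just filled
        Op::Increment { amount: 1 },     // add one to the value
        Op::Write { size: 1 },           // write the cell back out
        // no explicit Free: the allocation is dropped at the end of its scope
    ];
    println!("{:?}", ops);
}

There is no manual pointer arithmetic or loop bookkeeping anywhere in this; the actual cell positions are deferred until the memory layout is generated.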
Each optimizer pass walks the operation tree and returns a new tree with its optimizations applied
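For example, a pass that merges runs of consecutive increments could look roughly like this (the pared-down operation type is assumed purely for illustration):

enum Op {
    Increment { amount: u8 },
    Other, // stand-in for every other kind of operation
}

fn fold_increments(ops: Vec<Op>) -> Vec<Op> {
    let mut out: Vec<Op> = Vec::new();
    for op in ops {
        // if the previous operation was also an increment, extend it in place
        if let (Some(Op::Increment { amount }), Op::Increment { amount: more }) =
            (out.last_mut(), &op)
        {
            *amount = amount.wrapping_add(*more);
            continue;
        }
        out.push(op);
    }
    out
}

Because every pass takes one tree and returns another, passes can be chained in any order and tested in isolation.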
Compilation --> Vec<Instruction>
Operations are transformed into Instruction instances representing the actual brainfuck instructions
This is also where the static memory layout is finally determined: a HashMap<id, Cell {position, size}> maps each allocated id to an actual position and size on the tape
Compilation can accurately keep track of position information and reliably generate code because it has the entire operation tree to interpret
Drops all values declared in loop bodies
Operations that can be composed of other operations should be converted first during generation
In other words, generation is a pipeline: operations are first converted into lower-level operations, and those lower-level operations are then converted directly into instructions (sketched below)
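A simplified sketch of that pipeline (the names are assumptions, and for brevity this emits brainfuck text directly where the real pipeline would go through Instruction values first): Zero is first rewritten as a loop of decrements, and only the primitive operations get a code generation rule.

enum Op {
    Increment { amount: u8 },
    Decrement { amount: u8 },
    Loop { body: Vec<Op> },
    Zero, // composite: zero the current cell
}

/// Step 1: rewrite composite operations in terms of primitives
fn lower(op: Op) -> Op {
    match op {
        // zeroing a cell is just "loop: decrement until the cell hits zero"
        Op::Zero => Op::Loop { body: vec![Op::Decrement { amount: 1 }] },
        Op::Loop { body } => Op::Loop { body: body.into_iter().map(lower).collect() },
        other => other,
    }
}

/// Step 2: primitives map directly onto brainfuck
fn codegen(op: &Op) -> String {
    match op {
        Op::Increment { amount } => "+".repeat(*amount as usize),
        Op::Decrement { amount } => "-".repeat(*amount as usize),
        Op::Loop { body } => format!("[{}]", body.iter().map(codegen).collect::<String>()),
        Op::Zero => unreachable!("Zero is lowered away before codegen"),
    }
}

// codegen(&lower(Op::Zero)) == "[-]"

Keeping the lowering step separate means each primitive operation needs exactly one code generation rule, no matter how many higher-level operations expand into it.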
Brainfuck Optimizations (to be added later)
Further brainfuck optimizations specific to streams of brainfuck instructions are performed
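One example of the kind of peephole pass this stage could run (an assumed example, not an existing pass): cancel adjacent instructions that undo each other.

fn cancel_pairs(src: &str) -> String {
    let mut out: Vec<char> = Vec::new();
    for c in src.chars() {
        // does this instruction undo the previous one?
        let cancels = matches!(
            (out.last(), c),
            (Some('>'), '<') | (Some('<'), '>') | (Some('+'), '-') | (Some('-'), '+')
        );
        if cancels {
            out.pop(); // drop both instructions
        } else {
            out.push(c);
        }
    }
    out.into_iter().collect()
}

For instance, cancel_pairs("+-><+") leaves just "+".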
Brainfuck
The Instruction struct is transformed into a string representation which can then be written into a file
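A sketch of that final step; the Instruction variant names here are assumptions, not the actual definitions:

#[derive(Clone, Copy)]
enum Instruction {
    Left,                   // <
    Right,                  // >
    Increment,              // +
    Decrement,              // -
    Read,                   // ,
    Write,                  // .
    JumpForwardIfZero,      // [
    JumpBackwardUnlessZero, // ]
}

/// Render a stream of instructions as brainfuck source text
fn to_source(instructions: &[Instruction]) -> String {
    instructions
        .iter()
        .map(|instr| match instr {
            Instruction::Left => '<',
            Instruction::Right => '>',
            Instruction::Increment => '+',
            Instruction::Decrement => '-',
            Instruction::Read => ',',
            Instruction::Write => '.',
            Instruction::JumpForwardIfZero => '[',
            Instruction::JumpBackwardUnlessZero => ']',
        })
        .collect()
}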
Representing Types
We need to do several things here:
Associate type names like "u8" or "[u8; 7]" with internal structs which represent those types
Be able to lookup method names like "write" in internal structs and type signatures
Be able to apply functions to internal type structs
Approach (an example implementation of the trait is sketched after the definitions below):
/// Operators are things like Add, Subtract, Concatenate, Call, Write, Read, etc.
#[derive(Clone, Copy)]
enum Operator {
    /// Read values into the type using Read operations
    Read,
    /// Write the type using Write operations
    Write,
    /// Decrement the given type by 1
    /// TODO: This is essentially a placeholder for use before we get proper numeric support
    /// TODO: Eventually this will be replaced by a Subtract operator or something
    Decrement,
    /// Call a specific method name (e.g. len) on the type
    /// The method name to call is the first argument in the args Vec
    /// Return ApplyError::UnsupportedOperator if the given method name is not supported
    /// Return ApplyError::UnsupportedArgumentTypes if the types of the given arguments are not supported
    CallMethod,
}
trait BType {
    /// Returns the name that can be used in brain code to refer to this type
    fn type_name(&self) -> &'static str;
    /// Apply the given function or operator
    fn apply(&self, f: Operator, args: Vec<&dyn BType>) -> Result<Vec<Operation>, ApplyError>;
}
enum ApplyError {
    /// When the given operator isn't supported by this type
    /// Example: Use this when CallMethod is passed with "len" but the type doesn't support len()
    UnsupportedOperator,
    /// When the types of the given arguments are not supported
    UnsupportedArgumentTypes,
}
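To make the trait concrete, here is a sketch of how a primitive single-byte type might implement it; everything below is simplified and assumed (Operation is reduced to the two variants needed and argument checking is omitted):

enum Operation { Read { size: usize }, Write { size: usize } }
enum Operator { Read, Write, Decrement, CallMethod }
enum ApplyError { UnsupportedOperator, UnsupportedArgumentTypes }

trait BType {
    fn type_name(&self) -> &'static str;
    fn apply(&self, f: Operator, args: Vec<&dyn BType>) -> Result<Vec<Operation>, ApplyError>;
}

/// The primitive single-byte type: it occupies exactly one cell on the tape
struct U8;

impl BType for U8 {
    fn type_name(&self) -> &'static str { "u8" }

    fn apply(&self, f: Operator, _args: Vec<&dyn BType>) -> Result<Vec<Operation>, ApplyError> {
        match f {
            // reading or writing a u8 touches exactly one cell
            Operator::Read => Ok(vec![Operation::Read { size: 1 }]),
            Operator::Write => Ok(vec![Operation::Write { size: 1 }]),
            // anything this type doesn't know how to do is rejected
            _ => Err(ApplyError::UnsupportedOperator),
        }
    }
}

Anything a type doesn't know how to do falls through to UnsupportedOperator, and because apply returns Operations rather than raw brainfuck, type-specific code slots straight into the rest of the pipeline.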
This is still roughly representative of what the compiler does. Some of the stages have been moved or changed, but the overall structure and process are still the same.