Internal Overhaul

It would be nice to (yet again) do a rewrite of the assembler internals, specifically parser-related files, to better separate concerns and create a more modular architecture. Something with multiple stages, refining internal structures as you go down the line. For example:

| Stage | Output |
| --- | --- |
| Input | Source code string |
| Tokenizer | A list of lexical tokens |
| Top-Level Syntax Parser | A pure AST, with a node for each instruction invocation and directive, without any semantic analysis or instruction matching |
| Compile Branch Remover | Another AST, with `#if` nodes and their contents removed in case their conditions are not satisfied |
| Reference Collector | List of the names of all referenceable items: `ruledef`s, `bankdef`s, functions, labels, constants |
| Global Item Resolver | Resolved and validated items, such as `ruledef`s with nested `subruledef` references |
| Instruction Matcher | Instruction invocations marked with a list of all the matching rules |
| Instruction Resolver | Full binary encoding, by going through instructions in sequence, evaluating arguments and rule bodies to select the best match |
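
To make the first two stages a bit more concrete, here is a rough Python sketch of a tokenizer feeding a top-level parser that does no semantic analysis. The `Token` and `Node` types, the token regex, and the toy syntax are all made up for illustration and are not how customasm is actually structured.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    kind: str  # "directive", "ident", "number", "punct" or "newline"
    text: str

@dataclass
class Node:
    kind: str     # "directive", "label" or "instruction"
    tokens: list  # the raw tokens of the line, left uninterpreted

# Hypothetical toy grammar; whitespace other than newlines is skipped.
TOKEN_RE = re.compile(
    r"(?P<directive>#\w+)|(?P<ident>[A-Za-z_]\w*)|(?P<number>\d+)"
    r"|(?P<punct>[^\s\w])|(?P<newline>\n)"
)

def tokenize(src):
    """Stage 1: source string -> flat list of lexical tokens."""
    return [Token(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(src)]

def parse(tokens):
    """Stage 2: tokens -> one syntactic node per line, no rule matching."""
    nodes, line = [], []
    for tok in tokens + [Token("newline", "\n")]:
        if tok.kind != "newline":
            line.append(tok)
            continue
        if line:
            if line[0].kind == "directive":
                kind = "directive"
            elif len(line) == 2 and line[1].text == ":":
                kind = "label"
            else:
                kind = "instruction"
            nodes.append(Node(kind, line))
            line = []
    return nodes

source = "#bits 8\nloop:\n  add r1, 5\n  jmp loop\n"
ast = parse(tokenize(source))
print([node.kind for node in ast])
```

Each later stage would take the previous stage's output as its only input, which is what keeps the concerns separated.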

Of particular interest is the last stage, the Instruction Resolver, which is the only stage that needs to run iteratively in a loop (e.g. to resolve late label addresses). Since none of the previous stages have to run inside the loop, efficiency improves. With a more structured approach like this, the assembler could also cache the binary encodings of instructions that don't depend on externally-changing values, making the loop cheaper still. It could even defer writing the final binary until the very last iteration, once it has decided the final values for late labels and the best match for each instruction. (In the current implementation, all of these processes are executed repeatedly and wastefully.)
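
As a rough illustration of that loop, the Python sketch below iterates until label addresses stop changing, caches the encoding of instructions that take no label argument, and only assembles the output bytes once the addresses have settled. The two-instruction "ISA" (`nop`, and a `jmp` with a short and a long form), the `encode` helper, and the program representation are all hypothetical.

```python
def encode(op, arg, labels):
    """Hypothetical encoder: picks the shortest form that fits the target."""
    if op == "nop":
        return b"\x00"
    if op == "jmp":
        target = labels.get(arg, 0)  # unknown labels are guessed as 0
        if target < 0x100:
            return bytes([0x10, target])                    # short form
        return bytes([0x11, target >> 8, target & 0xFF])    # long form
    raise ValueError(f"unknown instruction: {op}")

def resolve(program, max_iters=10):
    labels = {}  # current guesses for label addresses
    cache = {}   # index -> encoding, for label-independent instructions
    for _ in range(max_iters):
        addr, new_labels, encoded = 0, {}, []
        for i, (op, arg) in enumerate(program):
            if op == "label":
                new_labels[arg] = addr
                continue
            if arg is None:
                # Does not depend on changing values: encode only once.
                if i not in cache:
                    cache[i] = encode(op, arg, labels)
                code = cache[i]
            else:
                code = encode(op, arg, labels)
            encoded.append(code)
            addr += len(code)
        if new_labels == labels:
            # Addresses are stable: only now build the final binary.
            return b"".join(encoded), labels
        labels = new_labels
    raise RuntimeError("label addresses did not converge")

program = [("jmp", "end")] + [("nop", None)] * 300 + [("label", "end")]
binary, labels = resolve(program)
print(len(binary), labels)
```

In this example the jump starts out in its short form, grows to the long form once `end` turns out to be past address 255, and the loop stops when a further pass changes nothing. The iteration cap is there because nothing in this sketch guarantees convergence in general.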

`#include` might become a problem in this scheme, but I think its behavior could at least be shoehorned into the Syntax Parser stage.

Also, in this scheme, `#if` blocks could only depend on compilation arguments, and not on label or constant values, since they're processed early in the chain and don't have access to sequence-dependent values. Of course, there could also be another type of `#if` block for dynamic removal of code (but the two wouldn't be interchangeable).
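
A minimal sketch of what such an early branch-removal pass might look like, in Python. It assumes a simplified AST where an `#if` condition is just the name of a boolean compilation argument, and it assumes that a satisfied `#if` is replaced by its contents. Both are simplifications for illustration, and the `Instr` and `If` node types are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    text: str

@dataclass
class If:
    flag: str  # name of a compilation argument (simplified condition)
    body: list = field(default_factory=list)

def remove_branches(nodes, args):
    """Return a new node list containing no `If` nodes.

    Only compilation arguments are consulted, so this can run once,
    before any label or constant has a value.
    """
    out = []
    for node in nodes:
        if isinstance(node, If):
            if args.get(node.flag, False):
                # Condition satisfied: keep the contents, drop the wrapper.
                out.extend(remove_branches(node.body, args))
            # Otherwise the node and its contents are removed entirely.
        else:
            out.append(node)
    return out

ast = [
    Instr("nop"),
    If("DEBUG", [Instr("trace"), If("VERBOSE", [Instr("dump")])]),
    Instr("halt"),
]
print(remove_branches(ast, {"DEBUG": True}))
```

Because the pass never looks at addresses, its result is the same on every iteration of the resolver loop, which is why it can sit outside of it.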

`asm` blocks within expressions are still difficult to deal with, but they could perhaps be hoisted out of their expressions (leaving only an identifiable handle behind) and treated like regular instructions for the purposes of the Instruction Matcher and Resolver stages.
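
The hoisting idea could look roughly like the Python sketch below. Expressions are modelled as nested tuples, and the node names (`"asm"`, `"asm_ref"`, `"binop"`, `"num"`) are invented for the example. An `asm` node is swapped for a small handle, and its source is appended to a side list that later stages could treat like ordinary instructions.

```python
def hoist_asm(expr, hoisted):
    """Replace every ("asm", source) node in `expr` with ("asm_ref", index).

    The hoisted sources are appended to `hoisted`; the index stored in the
    handle is the position of that source in the list.
    """
    kind = expr[0]
    if kind == "asm":
        hoisted.append(expr[1])
        return ("asm_ref", len(hoisted) - 1)
    if kind == "binop":
        _, op, lhs, rhs = expr
        return ("binop", op, hoist_asm(lhs, hoisted), hoist_asm(rhs, hoisted))
    return expr  # leaves such as ("num", 18) contain no asm blocks

expr = ("binop", "+", ("num", 18), ("asm", "ld r1, 5"))
hoisted = []
rewritten = hoist_asm(expr, hoisted)
print(rewritten)
print(hoisted)
```

When the expression is finally evaluated, the handle would be looked up to fetch whatever encoding the Instruction Resolver produced for the hoisted block.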

My main interest in an overhaul like this is just to better organize the assembler's internal code and to allow new features to be added more easily and cleanly. Speed and efficiency aren't much of a concern, since I believe the current implementation is already fast enough for most purposes, but they would be very welcome benefits, of course. The time-memory trade-off usually kicks in here, but maybe the new scheme could even allow the assembler to be more economical in its memory usage, by harnessing the extra knowledge gained at each of the new assembling stages.

hlorenzi / customasm

Internal Overhaul #160