Closed samiam95124 closed 2 months ago
I believe this will happen in the 0.3 version.
This was changed to AMD64, 64 bit code generation.
The framing change was implemented. This initiative moves forward after v0.2 is committed.
v0.2 complete, this is all that is going on for a while.
The pmach module is about 3400 lines, and pgen is about 2400 at present. I think it is going to be around 5000 when the module is fleshed out. The stages are:
In pint, the symbol table of the original Pascal-P was a fixed size table, with the ability to forward reference by backpatching the built code. In pgen, we let the assembler do the work. Symbols are output to assembly source when they are defined, either as equates if the value is specified, or as program addresses if not. The symbols are divided into near and far, as in pint[1]. The near table is the original pint table. The far table lists all of the references to externals. The near table is fixed size, and enumerated by pcom. Near table entries can be defined or not, and thus can have forward references. The far table is never defined, and thus has no values. The near table had symbol names added in pgen. This wasn't strictly necessary, because the names could have been made up at point of use, but it makes it simple to find the symbol name.
The GAS assembler allows the use of the '.', '_' and '$' characters in symbols. '_' is valid Pascaline. '.' is used to demarcate module to local symbols and can be used to any depth, like module.procedure.variable, but right now is just module.symbol. '$' is used to demarcate type spaghetti for overloads, and is only output on procedures and functions. Unfortunately, since Pascaline does not know in advance what routines are part of an overload group, we have to output all routines in their spaghetti form, although we might come up with something clever like aliasing the first routine encountered.
[1] Yes, named after the 8086 convention, which I hated with a passion, was forced to work on, and now just think it's funny. Feh.
There are a couple of areas in the code where providing the value of a label before it is used helps. For example, in the sfr instruction, knowing that the function result is zero means pgen can just remove that instruction (of course that could also be done in pcom). In the msr instruction, there are several optimizations that can be done if the values are known in advance. The locals allocation can be skipped or emitted with fewer instructions depending on size. The type of frame construction can be simplified, etc.
pcom was not particularly built like that, and I am probably the biggest offender of this. Labels often had their values set after the fact, because pint is inherently a 2 pass assembler, and it didn't matter. In fact, unresolved references there were chained up in lists in the binary, in a method that must have been the talk of ETH (no, it wasn't me).
As a result, there will be a couple of movements of the set point of label definitions just to take advantage of early definitions. Although pgen outputs to an assembler, which technically makes it 4 passes (pcom, pgen, then 2 passes of assembly), with respect to changing the geometry of generated instructions pgen is effectively a single pass module.
Nice theory (forward references), but not true: we don't actually know the parameters of an mst until the body is complete.
Pascal-P6 relies on three algorithms to simplify its operation.
These happen to be the steps taken to generate code. The simplicity of the system is important for several reasons:
In fact, for Pascal-P6, having a simple and easy to use generation system is more important than having an ideal system, in terms of optimized code generation, for the above reasons.
The first part of the algorithm is the simplest. Formation of expression trees is done by a simple gather and stack principle. Each of the operators gathers its operands and stacks the result. Register allocation is the hardest part. Tree structured allocations are said to be less efficient than linear mechanisms, such as SSA with peephole optimization. However, the tree structured evaluation of expressions means that each tree node can be considered as an isolated operation from a register perspective. The operator considers the registers it needs against the registers available, and either makes do with the registers it has, or flags that it needs registers already in use, either because it lacks free registers, or because it needs a particular register that is in use (so called "demand" registers). This last case is common even on orthogonal machines, because of calling conventions.
Because of the tree structure, register overflow conditions can be handled by a simplified method. Any node that overflows is handled by pushing the overflowed register before processing the node, and restoring it after. This algorithm can be improved (say by lagging) at a later time.
The method used is referred to as "directed allocation". Each node specifically directs what register or registers it will use, both for results and any temps. Then it is up to the subnodes to accommodate that. What this does is makes it unambiguous where the results of a subnode will be. The subnode either arranges the result to be in the indicated register, or moves it there when it finishes. This has a tendency to resolve in a top down manner. If node A expects results in register X, it tells node B it expects register X, and so forth.
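As a rough illustration, directed allocation can be sketched like this. This is a minimal Python model with made-up names (Node, assign_regs, FREE) — pgen itself is Pascal, and its real handlers are far more involved:

```python
# Illustrative model of "directed allocation" over an expression tree.
# Each node is told, top down, which register its result must land in.

FREE = {"rax", "rbx", "r11", "r12", "r13", "r14", "r15"}

class Node:
    def __init__(self, op, left=None, right=None):
        self.op, self.left, self.right = op, left, right
        self.r1 = None  # result register chosen for this node

def assign_regs(node, free, demand=None):
    """Obey the caller's demand register if present, otherwise take
    any free register, then direct the forks in turn."""
    node.r1 = demand if demand else sorted(free)[0]
    free = free - {node.r1}
    # The left fork is directed to deliver its result in this node's
    # register; the right fork gets any remaining free register.
    if node.left:
        assign_regs(node.left, free, demand=node.r1)
    if node.right:
        assign_regs(node.right, free)
    return node

tree = Node("add", Node("lodi"), Node("ldci"))
assign_regs(tree, FREE)
```

The point of the model is how the demand propagates: the caller fixes where it wants the result, and each subnode either produces it there directly or would have to move it there afterward.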
Finally, the generation of code is made easy by the fact that all register allocations are done prior. The generation step is just filling out code with the given registers. Further, a macro generation facility is used, wrtins(), which takes a string containing ASCII code strips to generate and fills in macros in the instruction to be generated.
The Pascal-P6 register allocation is built around single trees. However, it's often necessary to "bridge" multiple expression trees together, because multiple trees occur at calls to subroutines. For this reason, a global exists, frereg, that gives the set of free registers. This is set to all registers at the start, and as each register is allocated, it is removed from the free set. Each tree technically "unwinds" or frees its used registers from the set, so frereg is self clearing, but it is likely cleared at the start of each tree as well.
As a series of trees is processed, frereg is not reset, but rather holds the registers in use. This is then carried from tree to tree, and effectively bridges them.
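A minimal sketch of that bridging behavior (illustrative Python with assumed names alloc/unwind — not pgen's actual code):

```python
# Model of frereg bridging several parameter trees at a call site:
# registers allocated by one tree stay out of the free set while the
# next tree is processed, so the trees cannot collide.

frereg = {"rdi", "rsi", "rdx", "rax", "rbx"}

def alloc(reg):
    frereg.discard(reg)      # allocation removes the register

def unwind(*regs):
    frereg.update(regs)      # a finished tree frees its registers

alloc("rdi")                 # tree 1 leaves its result in rdi
alloc("rsi")                 # tree 2 must avoid rdi, takes rsi
bridged = "rdi" not in frereg and "rsi" not in frereg
unwind("rdi", "rsi")         # after the call, both are released
```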
program test(output);
begin
writeln('Hello, world')
end.
#
# File generated by P6 Pascal AMD64/gcc 64 bit code generator vs. 0.2.x
#
# Header file locations
inputoff = 0
outputoff = 2
erroroff = 4
listoff = 6
commandoff = 8
# Logical file numbers for header files
inputfn = 1
outputfn = 2
errorfn = 3
listfn = 4
commandfn = 5
.text
#
# Code section
#
# 0: 1: !
# 0: 2: ! Pascal intermediate file Generated by P6 Pascal compiler vs. 0.2
# 0: 3: !
# 0: 4: o prtlab
# 0: 5: ! program test(output);
# 1: 6: :1
# 1: 7: !
# 1: 8: ! Program test
# 1: 9: !
.globl main
.type main, @function
main:
# Set up default files
movb $inputfn,globals_start+inputoff(%rip)
movb $outputfn,globals_start+outputoff(%rip)
movb $errorfn,globals_start+erroroff(%rip)
movb $listfn,globals_start+listoff(%rip)
movb $commandfn,globals_start+commandoff(%rip)
# 1: 11: 12: cup l test.3
call test.3
# 1: 12: 22: ret
ret
# 1: 13: !
# 1: 14: ! begin
# 1: 15: !
# 1: 16: l test.3
test.3:
# 2: 17: :2
# 3: 18: :3
# 3: 19: :3
# 3: 20: 11: mst 0 l test.6 l test.7
pushq $0
pushq $0
pushq $0
enter $0,$0
movq %rbp,%rax
addq $test.6,%rax
cmpq %rax,%rsp
je .+6
pushq $0
jmp .-7
# 3: 21: ! writeln('Hello, world')
# 4: 22: :4
# 5: 23: :5
# 5: 25: !
# 5: 26: 56: lca 12 "Hello, world"
# 5: 27: 5: lao 2
# 5: 28: 118: swp 8
# 5: 29: 123: ldci 12
# 5: 30: 118: swp 8
# 5: 31: 123: ldci 12
# 5: 32: 15: csp wrs
# 5: 32: 6: -> wrs
# 5: lao
leaq globals_start+2(%rip),%rdi
# 123: ldci
movq $12,%rsi
# 56: lca
leaq string1(%rip),%rdx
# 123: ldci
movq $12,%rcx
pushq %rdi
call psystem_wrs
# 5: 33: ! end.
# 5: 34: 15: csp wln
# 5: 34: 5: -> wln
popq %rdi
pushq %rdi
call psystem_wln
# 5: 35: 117: dmp 8
popq %rdi
# 5: lao
# 6: 36: :6
# 7: 37: :7
# 7: 38: !
# 7: 39: 14: retp 0
leave
addq $24,%rsp
popq %rax
addq $0,%rsp
pushq %rax
ret
# 7: 40: l test.6=-8
test.6 = -8
# 7: 41: l test.7=-32
test.7 = -32
#
# Constants section
#
string1:
.string "Hello, world"
.bss
#
# Globals section
#
globals_start:
.zero 88
As you can see, even a simple example like hello world can invoke a lot of the lower level mechanisms. The next level is to perform a procedure/function call with parameters. Then it will be possible to proceed to the ISO 7185 acceptance test.
So the way pgen works right now, and this is documented in the manual, is that parameter/function result passing is a hybrid between stack based passing and register based passing. I welcome comments on this.
The idea is that register based and stack based are not completely at odds. We develop the parameters into the registers they will be assigned to for the calling convention, RDI, RSI, RDX, RCX, R8, R9 and RAX for returns. Normally, these parameters would not go to stack, and parameters would not stack unless they overflowed, i.e., there are more than 6 parameters. The difference is that in Pascal-P6:
The reason why it matters to Pascal-P6 is that it's backward compatible with pcom, pint, etc. They expect the parameters on stack and address the parameters and function result as such. Thus it would take a lot of changes in the way pcom/pint work to accommodate pgen[1].
When a Pascal procedure/function is called, it can get the parameters from either registers or the stack. At present, pgen cannot take advantage of registered parameters, but rather gets them from the stack, but when register caching is implemented, the parameters will form a "preload" of the caching system, and thus pgen will be able to take advantage of it.
The main drawback is that the calling convention is not 100% compatible with GCC/AMD64. It is compatible up until either greater than 6 parameters or a non-scalar return value. However, I would argue that 90% or more of the functions that Pascal would need to call fit this convention[2], and the number of times you need C to call Pascal with a non-scalar result is practically nil.
Finally, the most common use in C of functions with long argument lists are varargs functions, which have no parallel in Pascal.
[1] Plus there is the little matter of C putting things on the stack from right to left.
[2] Including all of Petit-ami, the Pascaline support library.
And the result:
samiam@samiam-h-pc-2:~/projects/pascal/pascal-p6$ pe test
Compiling test...
P6 Pascal AMD64/gcc 64 bit code generator vs. 0.2.x
Generating program
Program generation complete
/usr/bin/ld: /tmp/ccWJmrAk.o: in function `resetfn':
/home/samiam/projects/pascal/pascal-p6/source/psystem.c:1681: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
samiam@samiam-h-pc-2:~/projects/pascal/pascal-p6$ ./test
Hello, world
samiam@samiam-h-pc-2:~/projects/pascal/pascal-p6$
Compare this with the first "hello, world" before the framing fix, on 2022-10-26. That was almost 7 months ago. Seems I always underestimate work.
--exec
Compiles to a runnable image. This is like --package, but uses pgen.
--nopascaline
Prevents running the Pascaline regression tests. This is necessary for the v0.3 pgen, which is ISO 7185 only. Also useful for testing other compilers (GPC, FPC).
--native
Uses the base compiler to construct a binary and uses that to run tests. This is used to directly test the host compiler.
That's a good question. When it comes to pgen, the results are basically the same. The difference is that --exec tests what the next version will do, and --native tests what the current compiler does. The analogy is sawing off the branch you are sitting on. You want to extensively test the next version of the compiler before you put it in place as the active compiler.
The progress in pgen gives some decision points for the future.
Decision point: pgen runs iso7185pat.pas
After this we can either go to:
This decision point is enabled by the fact that Pascal-P6 is written in ISO 7185 Pascal. Given the track record, I think it can be assumed that the self compile will be very difficult.
Next decision point:
The interface to Petit-ami needs basic container string parameters, thus it cannot be completed without basic containers. The good news is that this is a fairly minimal upgrade to ISO 7185 runnability. The basic improvements path is to bring pgen up to the same level as pint/pmach/cmach.
Next decision point:
No comment.
Actually, this list pretty much echoes the roadmap in the v0.2 status.
This document covers the first article version of pgen, before advanced optimizations are added. The final version of this will be inserted into the Pascal-P6 document.
pgen is built on the pint assembler portion. This means that pgen is pint, but without machine simulation. It generates assembly files for gcc/gas to be directly executed.
pgen constructs expression trees, but not full graphs of the intermediate language. This means that intermediates are classified into three kinds:
Terminals, or operations that may consume zero or more expression trees, and don't generate results used elsewhere.
Expression operators, which have one or more branches that are also expressions.
Leaves, which have no subexpressions.
All of these intermediate types are represented in the "assemble" level, and (as in pint) all of the intermediates are represented in a case statement.
The job of an entry in assemble can be broken into three parts:
Read the intermediate instruction and parameters. This includes, or may include, p, one or more q parameters, labels, etc.
Construct expression trees as required.
Generate terminal code.
2 and 3 may or may not exist in assemble. However, it is considered "good form" to structure each case handler in that order.
The first part, reading the intermediate, starts at the top of the case statement, where assemble prints a preamble for the intermediate instruction. The intermediate is copied to the output as assembler comments. In addition, pgen dumps expression trees as assembler comments as well. The idea is that the intermediate gives a single reference for how the code is constructed, in terms of comments in the assembler file.
The intermediate instruction is read as numbers and possibly a label. The fields are written out as they are read, and any label found with the labelsearch routine is output. Reading ends when the line is terminated. At that point, the p parameter and some number of q parameters are loaded. These are either used directly in the handler if it is a terminal, or else the getexp() routine places them into an expression tree entry.
Expression trees are constructed by a typical sequence of:
getexp(ep); popstk(ep^.r); popstk(ep^.l); pshstk(ep);
For a binary operator. So first an expression node is created with getexp(), and the parameters are written (implicitly) to the entry, then some number of stacked expression entries are placed in forks of the expression entry, then the new entry is pushed. This is a pretty standard methodology to construct a tree using a stack. The stack is just a series of linked expression nodes, and once the expression tree is constructed, the stack is no longer used.
Each expression node has a left and right fork, and a number of "extra" entries as required for complex nodes. If the operator is unary, it only uses the left fork. Binaries use left and right forks. Intermediates that are leaves don't use any forks.
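The gather and stack construction above can be modeled in a few lines. The names below mirror pgen's getexp/popstk/pshstk, but this is an illustrative Python sketch, not pgen's Pascal code:

```python
# Stack-based construction of an expression tree.
stack = []  # the expression stack; a linked list in pgen, a list here

class Exp:
    def __init__(self, op):
        self.op, self.l, self.r = op, None, None

def pshstk(ep): stack.append(ep)
def popstk(): return stack.pop()

# Leaves are stacked as they are read from the intermediate:
pshstk(Exp("lodi"))
pshstk(Exp("ldci"))

# A binary operator gathers its operands and stacks the result:
ep = Exp("adi")   # getexp(): create the operator node
ep.r = popstk()   # right fork is popped first...
ep.l = popstk()   # ...then the left fork
pshstk(ep)        # the finished subtree goes back on the stack
```

Once the tree is complete, the stack holds only the root, which matches the text: the stack is just scaffolding and is no longer used afterward.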
Once the expression tree is built, and the terminal handler has all needed trees, the code generation can proceed. The first part is to perform register assignment on expression trees. The register assignment sets one or two registers in the root entry for each expression tree, r1 and r2. Most operations yield only one register and one result, but two registers may be required for things like "fat" pointers (base address and length/template pointer).
The next phase is each expression tree is processed with:
dmptrel(); genexp();
dmptrel() dumps the expression tree, formatted, into the assembly file as comments. This is a full tree dump including registers. The genexp() call then generates the code to process the expression. These two operations are interleaved until all of the expression trees needed are generated.
At this point, all of the operands generated by expressions are resident in their assigned registers. The terminal handler then generates the final code for the terminal intermediate.
Finally, the expression trees are torn down and recycled using detre(). pgen uses a closed recycle. Once an expression entry is created, it stays allocated, going back to the free list after use.
The wrtins() macro processor
To simplify the code generation, a macro line processor is used. The idea is that the string will look like the final generated assembly language line that is emitted:
wrtins20('movq $0,%rax ', ep^.p, 0, rgnull, rgnull);
The wrtins() routine processes several macros embedded in the line:
$0, $1 - Selects either the first or second constant number.
%1, %2 - Selects either the first or second register.
+0, +1 - Selects either the first or second constant number.
$s - Selects a symbol name as constant.
These are formats used in the AMD64 instruction set for GCC/GAS.
The macro processor automatically tabs the input instruction to form the output. The first word is indented by 8 spaces, and each subsequent word is tabbed to the next 8 column stop.
The macro processor only recognizes exact formats and numbers, and thus (as shown above) literal registers and numbers can be used.
To prevent wasting space with long lines of blanks, there are several macro processor calls according to string length:
wrtins10(), wrtins20(), wrtins30(), wrtins40()
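As a hedged sketch, an expander for the $0/$1 and %1/%2 macros might look like this in Python. Only a subset of the macros is shown, and the exact tabbing rule and function shape are assumptions, not pgen's actual behavior:

```python
import re

def wrtins(line, q0=0, q1=0, r1="", r2=""):
    """Fill in $0/$1 (constants) and %1/%2 (registers), then tab the
    opcode into an 8 column field as described in the text."""
    subs = {"$0": "$" + str(q0), "$1": "$" + str(q1),
            "%1": "%" + r1, "%2": "%" + r2}
    # A single regex pass: replacement text is never rescanned, so a
    # substituted constant like $12 cannot trigger the $1 macro.
    out = re.sub(r"\$[01]|%[12]", lambda m: subs[m.group(0)],
                 line.rstrip())
    op, _, rest = out.partition(" ")
    return "        " + op.ljust(8) + rest.strip()

print(wrtins("movq $0,%1", q0=12, r1="rsi"))
```

Note that literal register names like %rax pass through untouched, since only the exact macro forms are recognized, matching the remark above about literal registers and numbers.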
Register allocation
pgen uses a dirt simple register allocation method. In this first article, spills are handled by pushing and popping registers to and from the stack. There are various optimizations to reduce this later.
The basic register allocation call is:
assreg(ep, frereg, r1, r2);
Where:
ep - The expression tree to allocate.
frereg - A set of registers that are available for use.
r1 - The first register that will be used to deliver a result.
r2 - The second register for the result.
assreg() is applied recursively all the way down the expression tree, in left first order. It uses a so-called "demand allocation" model. If the caller to assreg() needs to have the result in a given register or registers, it specifies that in r1 and r2. Otherwise it passes rgnull. The subexpression always obeys a demand if present, either arranging the result to end up in that register, or making a copy of the result and moving it to that location after processing the intermediate.
There are two sets of registers. The rdi, rsi, rdx, rcx, r8 and r9 registers are reserved for the calling convention. The rax, rbx, rgr11-rgr15, and rgxmm0-rgxmm15 registers are used to hold temporaries, or for any other use.
The frereg set gives the registers that are free at any given node for use. Each node can mark a register or registers in use, and then pass that set down. The flow for this is for each node to mark its working results as in use, then indicate via frereg what registers are in use for subexpressions. The frereg set starts each terminal with rax,rbx,rgr11-rgr15, and rgxmm0-rgxmm15 in it, meaning these registers are available for use.
There are three methods used to allocate registers. The first is to request a specific register or registers via assreg(). The second is to request a general register by specifying rgnull via assreg(). The type of register allocated is up to the node that fulfills the allocation request, i.e., floating point or integer. The third way is to get a general register via one of the allocation calls.
For other situations, the following calls exist:
resreg() Reserves a specific register. This call doesn't allocate for a specific use, so it is a general purpose reservation system. If a register will be destroyed by an operation, then it will be reserved this way.
getreg() Allocates a general register. This is used when any register will do.
getfreg() Allocates a floating point register. Allocates from the rgxmm0 to rgxmm15 set.
Allocation for terminal intermediates
Terminal intermediates work differently than expression operators or leaves. Terminal intermediates "own" the register set, and thus do not need calls like resreg(). They simply use registers as they see fit. Each expression tree is processed, and the intermediate processor uses the register allocations of each tree to generate code. The trees either have a register or registers that were specifically set by the intermediate handler, or a general register. Whenever possible, intermediate handlers will specify general registers instead of specific registers to enable better code generation. There are two reasons this might not be possible:
The operation may require specific registers. For example a set operation might need the rdi or rsi registers on an AMD64 model to move or process the set.
The code is preparing a system or user call, in which case the calling convention dictates that operands be placed in a certain register.
Allocation for expression nodes or leaves
Each node handler for assreg() first arranges that its forks have the required or general register or registers. This is "result directed", in that each operator wants to steer the results to the registers requested by the caller, if there are such. The register or registers requested are then placed in that node's r1 and r2 result registers. If the caller didn't specify a particular result register, then the assreg() handler either specifies one or gets a free register via one of the calls getreg() or getfreg(), for integer and float registers, respectively.
If the result register for the operation being processed does not match the requested result register, the handler is responsible for allocating a temporary register, copying the result to it and then on to the requested register, allocating the temporary register as required. In most cases, the arrangement of the result into the proper register is not complex.
As mentioned above, the call resreg() is available within assreg() to mark a register as used or destroyed. If the register is not free, and is not one of the r1 or r2 demand registers, it will be put into the "push" mask for the node. What the push mask does is indicate to getexp() that the register must be pushed at the start of encoding for the node, and restored after encoding for the node. This protects the contents of higher nodes.
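A tiny model of that rule (illustrative Python; pgen's real resreg() has a different signature and works on its own data structures):

```python
# Reserving a register: if it is free, just take it; if it is busy and
# not a demand register, schedule a push before the node's code and a
# pop after it, so the contents of higher nodes survive.

def resreg(reg, free, demand, pushmask):
    if reg in free:
        free.discard(reg)     # free: simply allocate it
    elif reg not in demand:
        pushmask.add(reg)     # busy: save/restore around the node
    # demand registers are written by definition, so no push is needed
    return pushmask

free, demand, pushmask = {"rbx"}, {"rax"}, set()
resreg("rdx", free, demand, pushmask)   # rdx busy -> goes in push mask
resreg("rax", free, demand, pushmask)   # demand register -> no push
resreg("rbx", free, demand, pushmask)   # free -> allocated outright
```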
When assreg() receives a demand register, this is removed from the free set to prevent it from being allocated. If the register is reserved, it won't be pushed to save it if it is a demand register, since it will be written by definition.
Another category of register reservation is the temporary registers, t1 and t2. These are registers allocated when a temporary is needed, but no specific register is required.
Register allocation as a separate pass
The register allocation in pgen is designed to be decoupled from the generation of code. That is, code generation does not rely on the method to allocate registers, so a newer better register allocation could be inserted while still keeping the main code generation. This allows for improvements (say register coloring) without refactoring the most difficult part of code generation, the actual end code generation.
To do this, expression entries are used to communicate what registers are to be used in code generation. This is done both explicitly and implicitly. Explicit register specification is done for the r1, r2 and t1 and t2 register specifications. Implicit register specifications are done when the operator uses a specific register, such as during idiv or imul for the AMD64 series, which requires rax and rdx registers. These registers are not specified, because they are understood to be used during the operation.
While constructing pgen, there are a lot (a lot a lot!) of optimizations I know about that aren't going into the first version. Why? Because they are optimizations. I.e., the functionality does not change with the optimization, it just runs faster/is smaller code[1].
I didn't, and am not going to, file tickets for each of these things because.... I don't want to. It would just create a distracting mess right now. So with that introduction, here is the list.
[1] In fact it's a common feature of compilers that in the early stages of optimization, faster usually goes hand in hand with smaller. [2] In the old days, I would have said this is complex, but I have done some pretty nice tree comparisons since those days.
I was going over some documentation last night and came to a realization, that the register allocation convention I am putting into Pascal-P6 is shortly going to run into a wall in the Windows 64 bit model.
Pascal-P6 is presently arranged around a 6 register calling convention for x64. It's a hybrid system. Pascal-P6 is not arranged around the x64 calling convention, for a good reason: it's a C based calling convention, which is based on pushing the parameters on the stack from right to left. C does this to make a single, and IMHO rather stupid, pet trick work: the varargs convention. It means that C programs can look at the first argument on the stack, and if the function is varargs and that argument is a format string, they can determine what the following parameters are from the format string. This is one of the things that makes me angry when people talk about calling conventions and APIs as being language independent. They are not. Virtually all of the current calling conventions and APIs are highly biased towards C.
The problem with a right to left parameter convention is that the encounter order is unnatural. You don't parse from right to left, you parse from left to right. Doing it right to left implies that you store all of the parameter calculations and output them when the parameter list is done. pcom, pint, pmach and cmach do not do this. pgen can do this, but it's really just a side effect of how it works.
The nice thing about register based calling conventions is that I was able to work out a hybrid that works even though Pascal-P6 is still using a left to right stack based convention. As long as it can "preload" the parameters to registers before stacking them, and put the function result back on the stack after the call, it works for both pint and pgen, and also works for the x64 register based calling convention AS LONG AS THE NUMBER OF PARAMETERS IS LESS THAN OR EQUAL TO 6. It's less than or equal to 6 because that is the number of registers in the register based x64 calling convention. And yes, I checked, and ARM64 and RISC-V 64 have at least that many registers in their calling conventions.
Six is a nice number, because it works for all of pgen's system calls, and (I believe) works for all of the Petit-Ami (Pascal-P6's OS support library) calls as well. What happens if you can't use the native calling convention? IP Pascal was designed to pass 6 parameters in registers even on a 32 bit machine. It uses what are called "thunks" to call outside calling conventions. Why? Because IP Pascal was a high optimization compiler in the 32 bit days when everyone else was using only a two register calling convention. It works, and IP Pascal even has an automated thunk generator. The drawback is that IP Pascal has to have a large infrastructure just to support its external interfaces. This is something I wanted to avoid with Pascal-P6.
Since virtually all OSes today use 64 bit encoding, and embedded 32 bit processors are basically all going to ARM these days, I figured there was no problem using the hybrid convention. Unfortunately, it turns out Windows decided to go their own way, and naturally, also decided to cripple the convention as well. They support only 4 registers, and require a stack pad from the caller to spill them into, because, you know, their register convention wasn't difficult enough.
So I looked. There is at least one of the Pascal-P6 support library routines that goes over 4 parameters, and there are likely more in Petit-Ami. Thus, not now, but soon, Pascal-P6's register convention is going to have to change.
pgen builds in a lot of flexibility to handle parameter lists, because it sucks up the entire parameter list as a series of expression trees. This wasn't because of any parameter passing convention, it's just a side effect of the way it works. So pgen can reorder parameter lists, or even compensate for some parameters being in registers and some on the stack. So why is it a problem?
Because that parameter convention has to be accounted for in pcom. Meaning that pcom now has to be tailored for a particular OS. Thus we have to have option flags per operating system, like --windows, --linux, or something tricky like --register --depth=6, etc. It also means, and this is to me unforgivable, that intermediate decks from pcom aren't going to be able to run just anywhere. A pgen deck is not going to run in pint, etc.
Sorry, I just love that movie. My favorite story from Sergio Leone was that during the making of the movie, they passed an old tree that was twisted and without leaves. Sergio thought it was perfect for the set, and so the next day they returned with a truck and workers in uniforms who pretended they were city workers, and told the owners they were removing the tree for safety reasons. Movie making on a budget.
Anyways, the good part of this fix is that, if implemented, it would take the restrictions off the calling convention, since after the change, Pascal-P6 would be completely compatible with the C based calling convention.
The bad and the ugly, besides the incompatibility with pint as mentioned above, is that pcom and Pascal-P6 start to build up a series of not only processor specific dependencies, but also operating system specific dependencies.
User routine calls need to be separated out into procedures and functions. The calls to procedures and functions are ambiguous with the cup instruction, which is used for both. The sfr could be used to differentiate, since it is 0 for procedures, but see previous discussion about why that is not defined at the sfr instruction time.
The simplest answer seems to be to introduce a new instruction, cuf or call user function, to complement cup or call user procedure. A similar course of action needs to be done for each of cip or cuv (call indirect procedure and call vector procedure). This tells pgen clearly that it is a non-terminal.
pint does not need to use up instruction space to represent these new instructions, since they can simply be aliased to the current instructions. The reason pint does not care about function/procedure is that the execution of the code provides that context.
The other issue that came up was what entry to use to represent the result type in pgen. I believe the cuf instruction can represent that, since pgen does not know or care about types -- the generated instructions do. Thus the r1 and r2 registers for a cuf entry represent the result types.
Calls in Pascal-P6 break down into procedures, functions, and system calls. A procedure call is terminal, a function call is not. The reorganization establishes a data structure for procedure and function calls.
The head entry for procedure and function calls are the cup and cuf entries. The construction handler removes the parameters to the procedure or function from the expression list to a list of parameters using the .pl link of the cup or cuf entry. It knows how many parameters to remove from the expression stack via the parlvl variable, which is set by the sfr instruction to the current stack count. When expression entries are added, subtracting the parlvl marker from the current stack level gives how many parameters to pull from the stack. These are then placed in the parameter list in order.
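Sketching that accounting in Python (the list model and names are illustrative; pgen keeps the parameters on a linked .pl list instead):

```python
# Pulling a call's parameters off the expression stack using the
# stack count recorded when sfr was seen (parlvl).

stack = ["other", "par1", "par2", "par3"]   # expression stack, top at end

parlvl = 1                       # stack depth recorded at sfr time
nparam = len(stack) - parlvl     # entries pushed since sfr = parameters

# Remove them from the stack into the call entry's parameter list,
# restoring left-to-right order (they pop off right-to-left):
pl = [stack.pop() for _ in range(nparam)][::-1]
```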
It's important to understand that the inclusion of expression entries in the parameter list is in addition to any other list such entries are on. They may be operators or leaves. They could have left and right entries. They could be in the expression list or another common list (although those two are mutually exclusive).
Each of the parameters can be registered or not. Not being registered indicates they are common stacked elements. The cup or cuf entry can carry a register, which in the cuf case indicates it has a registered function result.
As per other operands, cup or cuf entries have two phases:
For register assignment, each of the 1 to 6 primary parameters, from right to left, is given a register. The other (overflow) parameters are not given registers. Each parameter that is registered is pushed to stack. This way, all parameters exist on stack in left to right order, but copies of the first 1 to 6 parameters, in order, are left in registers.
The result, if registerable, is assigned to rax. It is not specifically calculated, but rather assumed to be the result of the function.
Working on rationalizing the locals allocation, which has been wrong and broken since the framing fix. It does not affect operation because the locals accounting errs high (too much locals space), which is not fatal.
To debug this I need #39, add word sizes to dump command, and probably dump with a sidebar in ASCII as well.
I am also leaning towards implementing the per-CPU calling convention change as described previously. I.e., let's have the showdown now while I am hip deep in framing issues.
One thing that would be useful is a command like:
sd (stack dump) 0000000000000000 0 ...
Print the contents of the stack in hex and decimal, default about 10 levels deep.
I decided that I wanted the improvements in the debugger to help look at locals allocation. This is tickets:
and perhaps:
Note the new dump formatting needs documenting.
Need to put these in documentation and debug regression test.
Implemented general stack dump command.
Moving on to doc and test.
Doc and test done. Somewhat of a sidetrack, but valuable. Working on checking the various framing mechanics of pgen. After that I plan to implement calling convention standard calling.
Refactored the mst instruction. It's still the same basic format:
mst level locals lstack
However, the locals and lstack operands (the locals allocation and the locals stack extent) were offsets. I changed them to values, meaning that locals in the code is now the number of bytes in the allocation, not an offset from mp, and lstack, which gives the maximum amount of locals stack required, is also in bytes.
This change does not require any more work in the code, but makes the machine code more readable.
The locals allocation was fixed: it was allocating the locals space with the mark added in, meaning it allocated more than required.
Framing checks are coming to an end. Need implementations for the ipj, cip and rip instructions (intraprocedure jump, call indirect procedure/function, and return). This will allow running the ISO 7185 tests.
So I am in a phase of the work where register allocation issues come to the top. The register allocator in pgen is designed to be very simple, but does not necessarily produce optimum code. It's a compromise.
First, the allocation is tree structured. The "trees" are expression trees. Building trees is very simple, and is based on stacking unary or binary operator entries. This immediately classifies the intermediate operators in Pascal-P into three classes:
If we just processed the intermediate to target machine code linearly, that is, one intermediate operator at a time, the way the Pascal-P intermediate was designed to be executed[1], then we would use the classic calls:
getreg() Get a register.
putreg() Put a register.
plcreg() Get a specific register.
Actually, you should only need getreg() and putreg() on an orthogonal machine, that is, a machine where all the registers can be used equally with all instruction types. However, that is theory. The reality is that calling conventions create specific uses for specific registers, and many machines have other requirements, like a specific register to take division results, etc. However, orthogonal machines are a good idea if you intend to efficiently compile for them, and this was true long before RISC[2]. Thus, whereas getreg() and putreg() get and put any register in the machine, plcreg() gets a specific register in the machine.
This scheme exists in a lot of compilers, including my first Pascal compiler designed back in 1980. The trick is that the complexity is within the calls themselves. getreg() and plcreg() go through whatever mechanics they need to in order to arrange a register for use.
So pgen collects the intermediate into trees, then runs a register allocation on those trees, then generates code for the trees. It does that while reading the intermediate. It collects trees, processes them, then does the next tree, etc, until the intermediate is completely read and compiled into machine code. The generate pass knows absolutely nothing about the register generation pass. It accepts a tree where each operator entry has a series of entries for the registers needed to generate assembly code, and it does not care how they got there. Thus if later we want to change or completely replace the register allocation algorithm, that won't require changes to the generate pass.
The register allocation can be divided into terminal and other entries. The difference is that terminal operators OWN the machine. They can scribble on any registers they want. There is no restriction on their use. They do use getreg() style calls, but it is just to let the expression tree processor know what registers to use or avoid using. The getreg() calls become:
getreg() Get a register.
resreg() Reserve a specific register.
resreg() takes the place of plcreg(). The reason for the name change is that resreg() just reserves the register for use. It's not placing the register or anything like that. It has no code generation action.
putreg() does not exist here because terminals, which again own the machine, don't care what happens to registers after they use them[8]. Expression tree operators don't need putreg() because their register reuse is handled automatically, as we will see.
The other difference between terminals and other operators is that terminals both allocate registers and generate code in one step. There are not two passes for terminals.
Register allocation starts, always, with a terminal, which (by definition) contains the expression tree that will be encoded. There is a set of registers indicated by frereg, which starts with all the registers that can be used to process results. In the case of AMD64, those registers don't include the calling registers, but this is just for convenience, not necessity. It results in fewer operands being moved out of the way of calling convention registers[9].
So each terminal starts with a handful of registers, uses them as it sees fit, then generates code. Terminals don't even need to track register use, but they usually do, so that the free register set can be passed to the tree operator register allocation pass.
The operator register allocation for trees has a single entry, assreg(tree, frereg, reg1, reg2). It takes the tree to allocate registers for, the current set of free registers, and zero, one or two registers that are "demanded" as a result[7]. All tree expression operators have a result, which can be in one or two registers. The reason it's up to two registers is "fat" pointers (pointers with templates) or double precision results like linteger. assreg() can either take a specific register for the result or rgnull, in which case the caller of assreg() is saying "I don't care what register you return the result in, use what you want". Specific demands are used to put results into registers because they are caller parameter registers or because a specific register is needed for the next operation[6]. Note that the number of registers and the type of register (integer or float) is according to the type of operator being processed.
assreg() accepts the set of free-to-use registers, and passes that down recursively as it calls itself for each subtree. It's a copy, so if assreg() is called for a subtree and that subtree allocates a register, the caller's assreg() does not care, as will be shown. It does pass register allocations down; that is, if it uses (say) r1 and r2, then those are removed from the free register mask that is passed to the subnode.
For each node that assreg() processes, it gets the current set of available registers from its caller, allocates the registers it needs then passes the resulting allocation down in the tree by calling assreg() on all of its subnodes. Thus assreg() knows what registers it can or cannot use. What happens if it can't do its work with the set of registers it receives? This could happen for two reasons:
So what does it do? Here we are back to getreg() and resreg(). If getreg() runs out of registers, or if resreg() finds the register needed is already in use, the register is added to the "push mask" for the operator entry currently being processed. When the tree is being encoded, all of the registers in the push mask are pushed onto the stack before the code for the operator is generated, then all the registers in the push mask are restored from stack after the code for the operator is generated. When a register is added to the push mask it is freed up for use AT THAT NODE, because the contents are saved and restored automatically, UNLESS it is one of the result registers. Why not a result register? Because the purpose of the operator is to put a result in that register, so by definition it cannot be in use by the caller[3][4].
That's it. The system is very simple, even if it does not always look like it. As with many things, it's simple if you understand it, not so much if you don't.
[1] Yes, it was. Operators like dup (duplicate stack) and dmp (dump from stack) make it clear.
[2] I could write a book about how and why non-orthogonal machines exist.
[3] You might have identified the catch-22: why would an operator allocate a result register when it is going to get overwritten? The reason is that an operator may call assreg() on other trees.
[4] The algorithm can generate multiple redundant push/pops. It may get replaced by a direct spill system, where values are stored in local temporaries.
[5] I prefer the term "terminal operator" to root because it is more descriptive. A terminal operator is the end of the generation of code for an expression tree. It's over, it's done, it's "terminal".
[6] The last time I implemented an algorithm similar to this, it didn't require results in any specific register, but instead moved the result to the required register. The result was a lot of unnecessary moves. The demand algorithm tends to propagate the required result register down to the bottom of the tree, resulting in fewer moves.
[7] I call it a "demand" model because the operator being processed must put the result in that register, even if it has to perform the operation in other register(s) and move it there at the end.
[8] This may change with further optimizations.
[9] That's actually probably not true. The calling conventions for newer processors with lots of registers (16-32) usually set aside the call parameter registers for good reason. In the old days, with 8 or even fewer registers, not using all of them for calculations could not be afforded. With today's machines, calculations rarely spill, and rarely need all of the registers.
The cuf and cup handlers are kind of a mess right now. Part of this stems from having calls in both terminal handlers and expression handlers. It will take a few days to work out. The good news is I am getting the register allocation cleaned up, as you can probably tell from the above post.
Quick status: the call handlers were refactored, but the code is not yet tested. There are vestiges of the old code being called by handlers such as cip (call indirect procedure), but these will be refactored soon.
The basics of the change are that all calls of all types (and there are a lot of them) will be done by genexp(). The reason is that trying to handle things like csp both as a function and as a procedure was creating the need for two different sets of procedures (more on why terminal handlers need to be different in a moment). The code is smaller, and looks and works better, if genexp() does all the work.
The current path is (IMHO) to do a complete implement/test cycle of all call types, and then implement the AMD64 calling convention option. I believe this will remove downstream work by not having to refactor code later.
So why have terminal code at all? Terminal mode code (see above descriptions if you need help with this) essentially frees the terminal handlers to create code as they see fit with few restrictions. This includes freedom from the need to split into register assignment and code generation phases. Generation via assreg() and genexp() means that the description of the node to be coded has to be "packaged" into the expression entry. That is, all the parameters, registers, symbols, etc., that the node uses have to be present in the expression entry. This is more work and more obscure code (yes, I know it is already pretty obscure).
Thus I see terminal mode code sticking around.
There are a lot of opportunities to copy symbols from the intermediate to the output. They are pretty much no-ops to the code[1], so they can be copied from the input to the output as they appear. They break into two cases:
Block labels can appear as program location symbols (myloc:). The rest appear as equates. This pretty much matches their use in GDB. You say x/... (examine) and give an assembly address, and that is about all you can do without types. Location equates also end up as addresses, but the assembler determines their location.
With block labels, the block begin location pretty much marks the location of the entry code for that block, be it a program, procedure, function or other kind of block. For programs and other module type blocks, the answer is less clear. There IS entry code, but the code is a bit of spaghetti that executes the module stacking system (described in the manual). This can be left "as is" for now. If we tried to make the label actually indicate the block start, it would be ambiguous in any case: does it refer to the startup or the shutdown block?
Thus the net algorithm is simple. Each block or common is output at its point of encounter. Blocks get the location form (label:) and common symbols get equates.
In the long run (maybe a really long run), we can think about converting types into GDB format. The information is all there, it's just not in GDB format. I have also seen complaints from the GPC people that the format is not particularly Pascal friendly, so there is that.
[1] Yes they are. All of the common equates are just numbers to Pascal-P, and program locations are enumerated by the local labels system. Only external labels have actual meaning.
Symbols in the assembly file come in three groups:
block symbols are programs, modules, procedures, functions, etc. They look like:
test.myfunc
I.e., myfunc nested within program test. This goes to any level. '.' is a valid character in GAS, so this is actually one big symbol.
A module (program, module, etc.) actually points to the entry code. In the present system this does some setup work, like setting up the standard file handles, then it does the IP Pascal[1] classic stacking sequence of:
call startup
call next
call shutdown
The startup and shutdown are the startup and shutdown blocks, as in blocks of code. Programs don't have exit blocks, you say? Yes, actually they do, it's just empty. The call next is to call the next module in series, because that is how Pascal-P module stacking works. Thus, don't expect module symbols to lead directly into the main code. It's a little tricky, but that's assembly level coding for ya.
Procedures and functions can carry overload names. For convenience, the first overload encountered gets a name without the type spaghetti. So these names are equivalent:
test.myproc
test.myproc$p
Each of the overloads present are given an "encounter number", meaning if it is the first, second, etc, version of the procedure or function in the file. pgen does not know or care how many overloads there are, so every one of them gets a number. Thus:
test.myproc
test.myproc$p
test.myproc$1
Are all the same procedure. Numbered procedure/function overloads will never be in conflict with other names because they always have $p or $f at the start of their overload signatures.
So for an overloaded procedure:
program test;
procedure a; begin end;
overload procedure a(i: integer); begin end;
begin end.
The symbols are:
test.a
test.a$p
test.a$p_i
test.a$1
test.a$2
Where $1 refers to the procedure a without parameters, and $2 refers to the procedure a with a single parameter.
Note that encounter order is arbitrary. If the source order of overloads gets changed, then so will the assembly order. Thus you cannot count on $1 always referring to the parameterless procedure a. Equally, test.myproc is the first overload in encounter order, so you can't count on that always being the parameterless procedure a, either. The numbering of overloads is just a convenience for users.
Note also that these symbols are just like regular symbols to GAS/GDB. '$' is just another character.
Global symbols are direct addresses of global variables at fixed addresses. They are coined by module name like:
test.myvar
Obviously you will never see globals nested any more than one level deep.
Local symbols are offsets. They must be understood as offsets from the mp pointer, which on AMD64 is resident in rbp, the framing register. Parameters are also considered local. Parameters carry positive offsets, and local variables have negative offsets. They are useless without considering their use as an offset from the mp[2].
Locals are also coined:
test.myfunc.thisvar
Is a local to the function myfunc, which is in the module test. Note that if the function is overloaded, then the names of its locals carry that name as well:
test.myfunc$p.thisvar
As much as possible, I try to drop them at the same place they were in the intermediate. However, that does not always work. For example, the procedure and function declarations would not work if they were dropped like that, because in the case of nested procedures and functions, they would be referencing the included routines, not their actual blocks. Thus they get relocated.
Where are the various aliases of blocks mentioned above? They all get dumped. Thus a typical routine start is:
test.myproc:
test.myproc$p:
test.myproc$1:
Why? These are all aliases of a common procedure with no parameters. We're not trying to make the assembly file neat, but rather to make debugging convenient. Thus all of these aliases are available at run/debug time.
[1] Yes, that's where it came from, IP Pascal. I'm well aware that GCC uses a different scheme to do startups using pragmas, which I had to use in Petit-Ami. Guess what? IP Pascal's scheme is simpler and better, so Pascal-P uses it. Actually, more correctly, it uses a mix of both.
[2] GDB probably has mechanisms to automate this as part of source level debugging, which I may implement at some point, as one of my programmer friends used to say, "in my copious spare time".
Just some comments on the "assembly symbols" plan:
A PS here is that I think the idea of "prettying up" the assembly code has to be resisted. Assembly language is not supposed to be pretty. Pretty obscure, perhaps.
Overview: The final adaption to the AMD64 calling convention is based on three things:
Note none of this affects csp/system calls, since those are register only in any case, AT LEAST UNTIL WE REACH WIN64, because they didn't give us enough registers to make certain calls.
The problem with register based calling conventions is that if you aren't really tricky, you have to store the parameters passed in registers at some point so you won't lose them during register assignment. Enter the Microsoft plan. The way this works is that you add a block of locals to the routine sufficient to store the registers. Then, each registered parameter is stored in the local block on entry. You can count on the idea that each parameter will be one word, otherwise it wouldn't be registerable. Thus each store is just a word, and the number of words can be determined at generation time.
For the first pass, without load optimization, the algorithm is to assign addresses to the parameters according to their stored location. Each on-stack parameter gets an address in the parameter stack area. Each registered parameter gets an address in the local store block. Thus to the code inside the routine, it makes no difference how the register allocation is done.
As you may have noticed, I like decorated assembly files. If you don't want that, it's pretty simple:
> cat test.s | sed '/^#/d'
And that strips the comments out of the file.
Why not an option for this? Well, most users won't care, it would take a lot of code to do, and I don't personally have a use for it. And it is easy to remove from the final file by other means, as above.
test.x or similar labels don't work, since it thinks you are trying to take a field of structure test. You have to do it like:
p (long long)'test.x'
I.e., put it in single quotes.
You need to cast everything to let gdb know what the types are.
The above symbol output was done. GAS can output a lot to the assembly file, including:
In short, GAS can take a lot and pass it to GDB. I am going to kick this into the way future.
To complete the ep part of mst, we either needed to allocate a register or use a stack location. However, the mark contains a copy of the current ep. Thus we need the following actions:
This needs to be done in conjunction with a review of why and how we are using the ep and previous ep in the code.
PS. On review this is just used to restore the ep on returns. This is not required if we don't have an ep register. Thus the previous ep becomes a dummy value and is not used. It has to stay for now since it is part of the allocation, but could be optimized out at a later time.
PSPS.
The use of that variable is minimal in the code. Most of the references to it are maintaining it as a copy. Thus it is a candidate for elimination as a register. It stays as a mark element.
So eliminating the previous ep from the mark brings the mark local storage to 2 locations, the current ep and the stack bottom. Why is the mark local storage still there?
Examining the code further, I don't think we need the ep at all. The ep is only used once, at mst time, to check stack overflows. mst does not need to keep that; it can find the ep by stack offset, use it to perform the check, then discard it.
This takes the mark down to one location, the stack bottom, which is used to reset the stack after a long jump.
Alright, on further examination, there was an error put into the code (by me) between Pascal-P4 and Pascal-P5. The dynamic space allocator checks heap overflow against the sp (stack pointer) register. This is wrong, it should check against ep. In other words, it should check against the maximum possible extent of the stack, which is ep, not the current stack pointer. In fact, it was this way in Pascal-P4.
Thus, the ep is a valuable piece of information and should stay. There are (after this proposed change) in fact three items of data in the mark: the stack bottom, the ep (stack extent pointer) and the ra or return address. The offset equation for the ra is not defined because its only use is to return to the caller, and in that case we are collapsing the frame and don't need to access it via the mp.
pgen has almost produced a complete translation of iso7185pat.pas. It's getting exciting. It's been about 30 years since I last brought a compiler online.
There's a couple of refactors above that are going to get delayed, including restructuring the calling convention, changing the way EP works, further optimizations, etc. This probably extends to 0.3. It will increase the net work required after 0.3, but I am anxious to get a working version of the compiler going.
Sooooo, transparent operators began innocently enough. The operator chk takes a single operand, checks it for bounds, and leaves it on the stack. Another way of saying that is that chk is a single in, single out operator. The trouble is that the system grew, and I am the one that grew it. Now we have the ckv and ivt instructions. The issue is they take two operands in and give the same two operands back; you could say they are dual in, dual out operators.
The issue is that to form graph structures, trees need to have single out operators. Any number in is ok, but multiple out operators are not ok. Trees don't have more than one branch converging to more than one branch, that is the nature of a tree. There are operators that deliver more than one component, like pointers to containers, which consist of an address/length pair, but that is treated as a single result of that operator.
Even that would be ok, but chk and ckv/ivt are considered "transparent" operators. That is, they take in operands but leave those same operands unchanged on the stack. Thus they can be and are inserted optionally into the intermediate stream according to if the check they provide is enabled.
This is easy on a stack architecture: two operands in on the stack, two out, no problem. On a graphed code generation architecture, it's a problem. In pgen, the stack architecture gets evaluated only once, when it is converted to a tree on input.
Ideally, the solution to this problem should be the same as the name of the operator, that is "transparent". It should be possible to add or subtract the operator without change to the functionality.
Transparent operators work, or at least are no longer blocking the translation.
Stack bottom checks are problematic. All terminals have or had them, and there are clearly terminals that are issued on non-zero stacks. ujp is an example.
In the bigger picture, the stack bottom checks likely all have to be removed. They were good for early checks on the code, since they caught cases of stack imbalance. However, there are cases like the "with" statement that expect to cache results on the stack.
The plan is to analyze the current failure and then remove the checks.
iso7185 is about 1/4 translated at this point. This is slow going.
The stack bottom check failure resulted from the case statement. It keeps the selector on the stack, and that was messing up the bottom check. It would not have encoded properly in any case, since it was very much stack dependent. Unlike some other cases of this, I decided to clean this one up. The answer was to store and load the selector from a local variable, as the for loop does.
I verified that the with statement keeps data on the stack, but I think I have a different way of handling that which makes use of the with statement start and end instructions.
Thus the stack bottom checks stay in for now.
At 1892 out of 3380 lines in iso7185pat.pas. Slow going, but making progress. Next after that is fixing bad assembly statements, then functional verification.
Starting off the next project, which is a 80386 (32 bit) code generation facility for P6. This is very preliminary, and the code and the documentation are subject to change. If you would like to help, that would be encouraged. I have committed a preliminary code module, pgen_gcc_i80386.pas. The documentation has also got a new section that discusses the project, so if you want to know more, read that.