These variables count off levels on the stack. They are presently used to count parameters to procedures and functions. The issue is that this cannot support nested functions/procedures without modification.
I think when I coded this I was trying to make it so that the intermediate would never have to be modified to make pgen work. I am less attached to that goal now.
The most straightforward solution is to pass the number of parameters as an instruction field for cup, cuf, cip and cif.
7 sample_programs/hello.pas PASS
25 sample_programs/roman.pas PASS
42 sample_programs/qsort.pas PASS
101 sample_programs/match.pas PASS
108 sample_programs/prime.pas PASS
339 sample_programs/fbench.pas FAIL
856 sample_programs/drystone.pas FAIL
973 sample_programs/startrek.pas FAIL
1352 sample_programs/basics.pas PASS
1939 sample_programs/pascals.pas FAIL
7 sample_programs/hello.pas PASS
25 sample_programs/roman.pas PASS
42 sample_programs/qsort.pas PASS
101 sample_programs/match.pas PASS
108 sample_programs/prime.pas PASS
339 sample_programs/fbench.pas PASS
856 sample_programs/drystone.pas FAIL
973 sample_programs/startrek.pas PASS
1352 sample_programs/basics.pas PASS
1939 sample_programs/pascals.pas FAIL
Two more to go.
I have been thinking about annotation for quite a while. Annotation, in the case of Pascal-P6, means we output definitions, in English, of what the intermediate instructions do. It saves having to go look them up, and hopefully makes reading and working with the intermediates easier.
Here's a clip:
! {Return a random number between two bounds}
! const a = 16807;
b f random@f_i_i
s low p 48 i
s hi p 40 i
:86
:87
! m = 2147483647;
:88
! var gamma: integer;
:89
! BEGIN
s gamma l -24 i
! gamma := a*(rndseq mod (m div a))-(m mod a)*(rndseq div (m div a));
l startrek.24
:90
:90
mst 1 l startrek.25 l startrek.26 ! Mark(frame) stack
:91
ldci 16807 ! Load constant(t)
ldoi 88 ! Load global value(t)
ldci 2147483647 ! Load constant(t)
ldci 16807 ! Load constant(t)
dvi ! Divide integers
mod ! Modulo integers
mpi ! Multiply integers
ldci 2147483647 ! Load constant(t)
ldci 16807 ! Load constant(t)
mod ! Modulo integers
ldoi 88 ! Load global value(t)
ldci 2147483647 ! Load constant(t)
ldci 16807 ! Load constant(t)
dvi ! Divide integers
dvi ! Divide integers
This seemed like a fairly inexpensive way to do it. I put it in pcom, which means that each of the downstream processors can simply pass it through.
Annotation can get out of hand. I could annotate all of the generated assembly code, but that would:
The way it is, it actually reduces the amount of work for pgen creators, since you don't have to look up the meanings of intermediates in the book all of the time.
The other factor is it expands the output files. I note again that all of this can be stripped easily (see above). However, I think it is reasonable and manageable.
You will notice the (t) markers in the annotations above. pcom has many instructions that break out by type: integer, real, boolean, etc. It would have been a lot of work for pgen to annotate each typed variant separately. In any case, the type endings are very regular (see the manual), so it makes a lot of sense to have only one annotation per instruction and then read the type ending where present.
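For example, two instructions that show up later in this thread carry the same "Return from procedure/function(t)" annotation, and the ending tells you the type: i is the integer form and x is the byte form.
reti 8 ! Return from procedure/function(t)
retx 8 ! Return from procedure/function(t)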
One of the anomalies with the code in pint, and by extension pmach and pgen, which are derived from the same code, is the odd mechanism for lookahead used by the code. It centers around the routine "getnxt", which loads the next character of the input into ch. It's a simple form of lookahead. However, it does not interact with the Pascal built-in read routines. For example:
while (ch = ' ') and not eoln(prd) do getnext;
As a skip-spaces formulation, this will fail. Why? Because the next read, say read(a) where a is an integer, will return garbage. The skip will have eaten the first character of the number as far as the Pascal I/O subsystem is concerned; it knows nothing about your loading the next character into ch.
But Pascal is famous for having built-in lookahead on files, right? WTF? The meme is f^ where f is a file. In fact, that little feature caused a lot of trouble in early Pascal before it got solved, because it causes problems with interactive files (the console, fer example!). It got solved by a mechanism called "lazy I/O", which is a whole 'nuther story I won't get into here.
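For comparison, here is a minimal sketch of that same skip written with the buffer variable, which keeps the Pascal I/O subsystem in step (prd being the same input file as above):
while (prd^ = ' ') and not eoln(prd) do get(prd);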
Ummm, so why doesn't the code use file buffer variables? Well, the simplest explanation is that the code is derived from Pascal-P4, and Pascal-P4, as I am sure everyone is tired of me saying, is a subset of Pascal, not complete Pascal. File buffer variables are one of those things that got subsetted out. Pascal-P4 was "vaguely" designed to self compile. It doesn't really, but that, again, is a 'nuther story. Self compile got fixed in Pascal-P5 by some dude named Scott Moore (vs. 1.0 of Scott Franco).
So that is why pgen has the kludgy lookahead. Fixing that would require rewriting that code.
7 sample_programs/hello.pas PASS
25 sample_programs/roman.pas PASS
42 sample_programs/qsort.pas PASS
101 sample_programs/match.pas PASS
108 sample_programs/prime.pas PASS
339 sample_programs/fbench.pas PASS
856 sample_programs/drystone.pas PASS
973 sample_programs/startrek.pas PASS
1352 sample_programs/basics.pas PASS
1939 sample_programs/pascals.pas FAIL
One more to go.
For anyone keeping score here, the test series is covered in bin/regress. The next steps are p2, p4, p5 and p6 (self compile), then the self run series, i.e., "not a small amount". The good thing (or bad thing, depending on your point of view) is that pgen gets a proper workout before a true self compilation.
How do I feel about it? That does not really work. I am my own worst pessimist. I am always expecting the whole thing to not work, the whole concept to be flawed, and the program to need to be thrown away. I am always amazed when anything I do works. I think the only objective analysis I can do is to think of my work as being done by someone else: "gee, that guy has 30 years of compiler programming, he seems to know what he is doing", etc. The other way is comparative analysis. The register allocation is similar to that used in IP Pascal, which was a very solid compiler, so there is no reason it would not work here[1].
Another concrete result of this work is I have gained tons of experience debugging this thing with GDB. I did do a writeup of this, but realized I don't have the time to invest in it properly right now. Frankly I mainly want to get this self compiling before my next assignment.
[1] A similar situation was register allocation on the Z80 compiler back in the early 1980s. I tried to prove that it worked correctly but was unable to. It did, however, work, and I never caught it doing an incorrect encode.
7 sample_programs/hello.pas PASS
25 sample_programs/roman.pas PASS
42 sample_programs/qsort.pas PASS
101 sample_programs/match.pas PASS
108 sample_programs/prime.pas PASS
339 sample_programs/fbench.pas PASS
856 sample_programs/drystone.pas PASS
973 sample_programs/startrek.pas PASS
1352 sample_programs/basics.pas PASS
1939 sample_programs/pascals.pas PASS
Sample programs passes complete. Next is P2.
p2 and p4 fail. So that's next.
P2 in pcom was found to be throwing an error on a valid tag assignment. Will debug.
There were some issues created by recent changes in the regression, but those were fixed and short regression runs again. Long regression tests will be rerun.
Debug is greatly aided by gdb, especially using gdbtui mode. For any segfault, using bt (backtrace) gives you valid file line numbers in the .s file. You can break at a given line in the .s file, or use routine names. For gpc sourced files, you get line numbers going back to the original Pascal source even after the macro processor, due to GPC accepting macro line sets.
Typically it is very easy to find your way around the assembly files due to the rich information found there. For any generated instruction, you can look backward to find the intermediate and source lines that generated it. To find a given source line, search for ":line".
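For example, the marker for source line 2689 of basic.pas shows up in basic.s as the following (this exact line appears in the walkthrough further down):
# 2689: 13242: :2689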
One thing I have used is to compare the operation of target binaries with the equivalent generated by packaged mode apps (--package on most scripts). This gives a binary that has equivalent behavior to pgen binaries. This is enhanced by the fact that a lot of code is in common between cmach.c and pgen. For example, this A-B comparison mode was used to debug pascal-s.
In the long run the .s file can be extensively annotated to provide symbols like locals to gdb. This is future work.
Frankly, no. Debugging with pint debug mode is the way to go; it gives you superior views into the target code. That was always the plan going forward. However, it's obvious that there is a lot you can do with gdb, and so I expect those improvements to happen over time.
This fix was done, but it is mostly useless. Why? Because the main code only throws 8 errors at this time. Most of the errors come from the psystem module itself, so most errors will just list the psystem module and the line it occurred on, i.e., pretty useless information.
The fix for this would be to pass that information to psystem as parameters, but that would push several routines over the 6 parameter limit (see calling convention issues above). Thus, "it ain't a-gonna-happen" anytime soon.
The other problem is that case value errors cannot know the module and line, because they are processed external to the offending module (see above). There is a fix for this, but it would involve generating a line number equivalence table in pgen, i.e., not trivial. So later... later.
The result is really the same as before, if you get an error, and it is not obvious where it happened, warm up gdb and get a backtrace, then be ready to translate the information found there.
So one interesting fact about the original Pascal-P5 (and lower) compiler is that it discards all symbols when it creates its intermediates. It simply generates all variable references as offsets, and marks program locations with numeric labels. This was modified a bit with modularity in Pascal-P6. I thought a bit about how to do this with intermodule references, but came to the conclusion that any numeric solution was unworkable: if the user changed the order of anything, the numbers would recalculate, and things would break. Thus Pascal-P6 uses a method of "near vs. far" labels. Near labels are the old Pascal-P1 through P5 intramodule labels, and far labels go between modules. Near labels are numeric, and far labels are symbolic, using the name of the module.
This changed a bit with the debug mode in Pascal-P6 pint. All of the symbols are now shipped to the back end via the intermediate as (effectively) "annotations". Each symbol entry passed contains not only the name of the symbol, but also its type. The reason I call them "annotations" is that they are not required to run the program. They could be stripped off and the program would still function. Far labels are the one exception to that rule.
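For example, the "b" and "s" lines in the clip above are exactly these symbol annotations; each one carries the symbol name and its type (random@f_i_i, low, hi and gamma here):
b f random@f_i_i
s low p 48 i
s hi p 40 i
s gamma l -24 i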
So the "rich assembly" initiative idea is to enhance the output format of .s or assembly files. One part of that was to annotate or comment the code. The first part of this was to annotate the intermediate, and pgen carries that information through to the output .s file. The second part was to annotate the machine code itself. The final part is to replace the numeric labels of variables and program jump/call targets with meaningful symbols as possible. For example, here is a routine from "startrek.pas"
FUNCTION random (low, hi : INTEGER) : INTEGER;
{Return a random number between two bounds}
const a = 16807;
m = 2147483647;
var gamma: integer;
BEGIN
gamma := a*(rndseq mod (m div a))-(m mod a)*(rndseq div (m div a));
if gamma > 0 then rndseq := gamma else rndseq := gamma+m;
random := rndseq div (m div (hi-low+1))+low
END {of random};
expp$_tmpspc = 0
startrek.19 = 40
startrek.20 = 24
startrek.random.low = 48
startrek.random$f_i_i.low = 48
startrek.random.hi = 40
startrek.random$f_i_i.hi = 40
startrek.random.gamma = -24
startrek.random$f_i_i.gamma = -24
startrek.24:
startrek.random:
startrek.random$f_i_i:
pushq $0 # place current ep
pushq $0 # place bottom of stack
pushq $0 # place previous ep
enterq $0,$2 # enter frame
movq %rsp,%rax # copy sp
subq $startrek.25+random$_tmpspc,%rax # find sp-locals
1:
cmpq %rax,%rsp # check have reached stack
je 2f # skip if so
pushq $0 # push 0 word for locals
jmp 1b # loop
2:
movq %rsp,16(%rbp) # set bottom of stack
andq $0xfffffffffffffff0,%rsp # align stack
pushq %rbx # save protected registers and keep aligned
pushq %r12
pushq %r13
pushq %r14
pushq %r15
pushq %r15 # second push aligns
movq $16807,%r12 # load quad constant
movq startrek.rndseq(%rip),%rax # load global quad
pushq %rax # save used register
movq $2147483647,%rax # load quad constant
movq $16807,%r13 # load quad constant
xorq %rdx,%rdx # clear upper dividend
subq $0,%rax # find sign of dividend
jns 1f # skip positive
decq %rdx # set sign of upper dividend
1:
idivq %r13 # divide integer
movq %rax,%r13 # move result into place
popq %rax # restore used quad register
xorq %rdx,%rdx # clear upper dividend
subq $0,%rax # find sign of dividend
jns 1f # skip positive
decq %rdx # set sign of upper dividend
1:
idivq %r13
imulq %rdx,%r12 # multiply integers
movq $2147483647,%rax # load quad constant
movq $16807,%r14 # load quad constant
xorq %rdx,%rdx # clear upper dividend
subq $0,%rax # find sign of dividend
jns 1f # skip positive
decq %rdx # set sign of upper dividend
1:
idivq %r14
movq %rdx,%r13 # place result
movq startrek.rndseq(%rip),%rax # load global quad
pushq %rax # save used register
movq $2147483647,%rax # load quad constant
movq $16807,%r14 # load quad constant
xorq %rdx,%rdx # clear upper dividend
subq $0,%rax # find sign of dividend
jns 1f # skip positive
decq %rdx # set sign of upper dividend
1:
idivq %r14 # divide integer
movq %rax,%r14 # move result into place
popq %rax # restore used quad register
xorq %rdx,%rdx # clear upper dividend
subq $0,%rax # find sign of dividend
jns 1f # skip positive
decq %rdx # set sign of upper dividend
1:
idivq %r14 # divide integer
imulq %rax,%r13 # multiply integers
sub %r13,%r12 # subtract integers
movq -16(%rbp),%rbx # get display pointer
movq %r12,startrek.random.gamma(%rbx) # store qword
movq -16(%rbp),%rbx # get display pointer
movq startrek.random.gamma(%rbx),%rbx # fetch local qword
movq $0,%r12 # load quad constant
cmp %r12,%rbx # compare
setg %bl # set greater
movsx %bl,%rbx # sign extend boolean
orb %bl,%bl # move boolean to flags
jz startrek.28 # go if false
movq -16(%rbp),%rbx # get display pointer
movq startrek.random.gamma(%rbx),%rbx # fetch local qword
movq %rbx,startrek.rndseq(%rip) # store quad to global
jmp startrek.29
startrek.28:
movq -16(%rbp),%rbx # get display pointer
movq startrek.random.gamma(%rbx),%rbx # fetch local qword
movq $2147483647,%r12 # load quad constant
add %r12,%rbx # add integers
movq %rbx,startrek.rndseq(%rip) # store quad to global
startrek.29:
movq startrek.rndseq(%rip),%rax # load global quad
pushq %rax # save used register
movq $2147483647,%rax # load quad constant
movq -16(%rbp),%r13 # get display pointer
movq startrek.random.hi(%r13),%r13 # fetch local qword
movq -16(%rbp),%r14 # get display pointer
movq startrek.random.low(%r14),%r14 # fetch local qword
sub %r14,%r13 # subtract integers
movq $1,%r14 # load quad constant
add %r14,%r13 # add integers
xorq %rdx,%rdx # clear upper dividend
subq $0,%rax # find sign of dividend
jns 1f # skip positive
decq %rdx # set sign of upper dividend
1:
idivq %r13 # divide integer
movq %rax,%r13 # move result into place
popq %rax # restore used quad register
xorq %rdx,%rdx # clear upper dividend
subq $0,%rax # find sign of dividend
jns 1f # skip positive
decq %rdx # set sign of upper dividend
1:
idivq %r13 # divide integer
movq %rax,%r12 # move result into place
movq -16(%rbp),%r13 # get display pointer
movq startrek.random.low(%r13),%r13 # fetch local qword
add %r13,%r12 # add integers
movq -16(%rbp),%rbx # get display pointer
movq %r12,56(%rbx) # store qword
popq %r15 # undo alignment push
popq %r15 # restore protected registers
popq %r14
popq %r13
popq %r12
popq %rbx
leave # undo frame
addq $24,%rsp # remove frame data
popq %rcx # get return address
addq $16,%rsp # remove caller parameters
popq %rax # get qword result
pushq %rcx # replace return address
ret # ret
random$_tmpspc = 0
I picked that one because it is a relatively short routine. random has a mix of numeric jump labels and variable symbols, both local and global (rndseq is global). Those jump labels are short jumps that are anonymous and will never have symbols associated with them.
The first fact is that all of the symbols (not numerics) are fully qualified. Thus it is "startrek.random.gamma", meaning gamma is a local to random, which is a function in the module startrek. There are assemblers that can understand nested scopes, but frankly I have no idea if gas can use such things, and I would not use them if it did. The reason why is that I would prefer pgen not be too dependent on a particular assembler and its special constructs.
The second fact is that not all numeric labels get resolved to symbols. That statement:
movq %r12,56(%rbx) # store qword
Above is used to store the function result. As far as the code is concerned, it is an anonymous location. In fact, there are lots of anonymous locations in the code. Besides function results, there are anonymous variables kept at compile time, such as the base address of a "with" reference, the parameters of a for loop, etc. To this pgen adds various temp variables allocated on the stack. Thus, not everything has a name.
A third fact is that jump labels get their module name. "startrek.29:" above is the 29th jump label in the module "startrek". It's not "startrek.random.29:" because it is not ambiguous in the module; the generated labels are unique within the module, not within the routine.
Yes, but as mentioned they can be stripped of the line comments. I committed a script, strips (strip .s files), to do that. Frankly, it would expand the code a lot to add an option to strip those. Does the extra text slow down the assembly process? Well, yes, I think it does. I don't actually consider it a problem, and I am not sure the extra speed would justify stripping the file in advance (although it would be an interesting experiment).
The bottom line is that rich assembly cuts down on the "runs", as in running to the manual each time to look up what a particular construct does.
The last commit added the ability to replace compiler labels with symbolic labels for procedure/function calls. Meaning that (for example):
# 365: 2705: 12: cup l startrek.215 1 ! Call user procedure
gets replaced with:
call startrek.printgalaxy # call user procedure
(from sample_programs/startrek)
The first form is the compiler label from the intermediate. It's actually a numeric label, 215, made to look like a symbol. That must be, because using bare numbers for labels in assembly is either illegal or has a completely different meaning, like a local label. Both are unwanted. Compiler labels are just an enumeration of all the labels needed in one compile session (one file).
It's a good question. The idea was that the intermediate gets preserved "as is" when passed through to the .s file as comments. In any case, nobody is debating that the intermediate could be more readable. It's more that remaking it would be practically like remaking the whole compiler.
So the label "startrek.printgalaxy" is a qualident, or "qualified identifier". It tells you that the routine "printgalaxy" is a procedure or function within the module "startrek". What would happen if printgalaxy were part of an overload group?
Well, you could use the typed name for the routine, which is "startrek.printgalaxy$p_i_i_i_b". No, I am not kidding. It means printgalaxy is a procedure with three integer parameters followed by a boolean parameter. Using that name makes any number of overloads unambiguous. However, trying to use such names can get tedious. How about, for example, the routine "distance" from the same program:
startrek.distance$f_x_0_7_i_x_0_7_i_r_x_4_x_0_7_i_y_0_x_0_7_i_
Yes, that is valid type spaghetti. I call it a "crushed" type specification, because the original is:
distance@f_x(0,7)i_x(0,7)i_r(x:4:x(0,7)i,y:0:x(0,7)i)
Yes, I can feel your eyes glazing over. It's a valid type specification for the function. It does contain characters that are not valid in GAS, so they got replaced in the .s file, and hence the symbol type is "crushed".
This is not even one of the longer examples in the .s file. I don't really want to use those to identify routines in the .s file. So we have the type spaghetti version and the compiler label version of the routine name. Both are obscure. The type spaghetti version would be a bear to type out, say during debugging. So a better way is needed.
So with the ability to have overloads, symbols have to be unique. So what pgen does is:
startrek.distance
startrek.distance$2
startrek.distance$3
startrek.distance$4
...
It numbers the different routines in order of encounter in the intermediate. We drop the number following the symbol if the encounter number is 1; since it is the only one of the overload group so treated, it's still unique. This system has the advantage that if the routine is not overloaded, you can just use its name. No typing information required.
Now the issue with this system is that we have no idea what the types of these overloaded routines are. You can look them up in the .s file by searching for their typed (spaghetti) form, or in the source .pas file by encounter order. The advantage of the system is that you can use a short name for the routine instead of the long typed name. And as far as the compiler and GAS are concerned, there is no issue selecting which overload is meant; they are unique. Finally, the system matches pint debug mode, so it all ties together.
If you look at your .s file, you'll note that all of the different forms of procedure/function symbols appear there: the compiler label, the fully typed symbol, and the shorthand form. In other words, you can pick and choose which form you want to use. Usually this only matters if you are debugging the code, say with gdb.
Ok, so here is a live debug log. After the last round of fixes, the regression fails at regress line 49:
basic.lst
Basic interpreter vs. 0.1 Copyright (C) 1994 S. A. Moore
Ready
*** Runtime error: basic:2689:Value out of range
Looking at basic.pas:
2687: i1 := li; { restore label position }
2688: getlab(lab, false); { get the variable label }
2689: vi := fndvar(lab); { find that in the table }
2690: if vi <> 0 then { variable exists }
The "vi := fndvar(lab);" is a complex call, but by the type of error, I can already tell that it occurred on assignment to vi, because the function call would not have a range check in it.
Looking at basic.s, the offending source line can be found by searching for :2689 (the source line marker for 2689):
# 2688: 13241: ! vi := fndvar(lab); { find that in the table }
# 2689: 13242: :2689
# 2689: 13243: 245: sfr l basic.1037 ! Set function result
# 2689: 13244: 4: lda 2 -368 ! Load local address
# 2689: 13245: l basic.1037=8
basic.1037 = 8
# 2689: 13246: 246: cuf l basic.739 1 0 ! Call user function
# 2689: 13247: :2689
# 2689: 13248: 199: chkx 0 255 ! Bounds check(t)
# 2689: 13249: 195: strx 2 -376 ! Store local(t)
# assigning: 199: chkx 0 rs: [] ~rf: [rbx]
# assigning: 246: cuf 1 dr1: r12 rs: [] ~rf: [rbx, r12, r13]
# assigning: 4: lda -368 dr1: rdi rs: [] ~rf: [rbx, rdi, r12, r13]
# assigning~: 4: lda -368 r1: rdi rs: [] ~rf: [rbx, rdi, r12, r13]
# assigning~: 246: cuf 1 r1: r12 rs: [] ~rf: [rbx, rdi, r12, r13]
# assigning~: 199: chkx 0 r1: r12 t1: r13 rs: [] ~rf: [rbx, r12, r13]
# expr:
# 199: chkx 0 r1: r12 t1: r13 rs: []
# Left:
# 246: cuf 1 r1: r12 rs: []
# call start:
# 245: sfr 0 rs: []
# Parameter:
# 4: lda -368 r1: rdi rs: []
# generating: 246: cuf 1 r1: r12 rs: []
# generating: 245: sfr 0 rs: []
subq $basic.1037,%rsp # allocate function result on stack
# 4: lda -368 r1: rdi rs: []
# generating: 4: lda -368 r1: rdi rs: []
movq -16(%rbp),%rdi # get display pointer
lea basic.keycom.lab(%rdi),%rdi # index local
pushq %rdi # save parameter
call basic.fndvar # call user procedure
movq %rax,%r12 # place result
# generating: 199: chkx 0 r1: r12 t1: r13 rs: []
movq $0,%r13 # load low bound
cmpq %r13,%r12 # compare
jge 1f # skip if greater or equal
leaq modnam(%rip),%rdi # load module name
movq $2689,%rsi # load line number
movq $ValueOutOfRange,%rdx #
call psystem_errore # process error
1:
movq $255,%r13 # load high bound
cmpq %r13,%r12 # compare
jle 1f # skip if less or equal
leaq modnam(%rip),%rdi # load module name
movq $2689,%rsi # load line number
movq $ValueOutOfRange,%rdx # load error code
call psystem_errore # process error
1:
# generating: 195: strx
movq -16(%rbp),%rbx # get display pointer
movb %r12b,basic.keycom.vi(%rbx) # store byte
So that is the entire encode for line 2689. The routine is basic.fndvar (the qualified form above). The result comes back in rax, but gets moved to r12, and then is compared for bounds between 0 and 255. Looking at the definition of vi:
maxvar = 255; { maximum number of variables in program }
...
varinx = 0..maxvar; { index for variables table }
...
vi: varinx; { index for variables table }
So the range of 0..255 makes sense. Let's see what fndvar() is returning. You can't really debug a script like testprog, so I strip out what testprog is doing, namely running basic with redirected input:
samiam@samiam-h-pc-2:~/projects/pascal/pascal-p6$ basic/basic < basic/prog/lunar.bas
Basic interpreter vs. 0.1 Copyright (C) 1994 S. A. Moore
Ready
*** Runtime error: basic:2689:Value out of range
samiam@samiam-h-pc-2:~/projects/pascal/pascal-p6$
So that checks out, it reproduces the error. Now let's do it in gdb. Note that there is a trick with the run command for redirect:
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from basic/basic...
(gdb) b 55228
Breakpoint 1 at 0x1b5e7: file basic/basic.s, line 55228.
(gdb) run < basic/prog/lunar.bas
Starting program: /home/samiam/projects/pascal/pascal-p6/basic/basic < basic/prog/lunar.bas
Basic interpreter vs. 0.1 Copyright (C) 1994 S. A. Moore
Ready
Breakpoint 1, basic () at basic/basic.s:55228
│ 55226 pushq %rdi # save parameter │
│ 55227 call basic.fndvar # call user procedure │
│B+>55228 movq %rax,%r12 # place result │
│ 55229 # generating: 199: chkx 0 r1: r12 t1: r13 rs: [] │
│ 55230 1: movq $0,%r13 # load low bound
(gdb) i r
rax 0x55555556f500 93824992343296 Breakpoint 1, basic () at basic/basic.s:55228
rbx 0x1 1
rcx 0x55555556f5e7 93824992343527
rdx 0x5555555f4c40 93824992889920
rsi 0x7fffffffdc07 140737488346119
rdi 0x7fffffffdbe0 140737488346080
rbp 0x7fffffffdd50 0x7fffffffdd50
rsp 0x7fffffffdb70 0x7fffffffdb70
r8 0x7fffffffdc14 140737488346132
r9 0x7fffffffdbf0 140737488346096
r10 0x5555555b4bc6 93824992627654
r11 0x7ffff7e54be0 140737352387552
r12 0x9 9
r13 0x6c 108
r14 0x50 80
r15 0x6c 108
rip 0x55555556f5e7 0x55555556f5e7 <basic+149>
eflags 0x202 [ IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb)
Note that I cleaned up the printout a bit, gdbtui mode is not really designed for cut and paste.
Note that 55228 is the line number of basic.s, not basic.pas.
The result above, in the rax register, is 0x55555556f500. That isn't a valid return value, and in fact looks like a global address (address in the .bss space).
So fndvar() is a short function, and returns varinx type. Since it didn't fail within fndvar(), my best theory is that fndvar() worked, but the mechanism to return the result didn't work.
Thus looking for:
basic.pas:2172: fndvar := fi { return variable index }
# 2171: 10311: ! fndvar := fi { return variable index }
# 2172: 10312: :2172
# 2172: 10313: !
# 2172: 10314: ! end;
# 2172: 10315: 193: lodx 2 -24 ! Load local value(t)
# 2172: 10316: 199: chkx 0 255 ! Bounds check(t)
# 2172: 10317: 195: strx 2 48 ! Store local(t)
# assigning: 199: chkx 0 rs: [] ~rf: [rbx]
# assigning: 193: lodx -24 dr1: r12 rs: [] ~rf: [rbx, r12, r13]
# assigning~: 193: lodx -24 r1: r12 rs: [] ~rf: [rbx, r12, r13]
# assigning~: 199: chkx 0 r1: r12 t1: r13 rs: [] ~rf: [rbx, r12, r13]
# expr:
# 199: chkx 0 r1: r12 t1: r13 rs: []
# Left:
# 193: lodx -24 r1: r12 rs: []
# generating: 193: lodx -24 r1: r12 rs: []
movq -16(%rbp),%r12 # get display pointer
movzx basic.fndvar.fi(%r12),%r12 # fetch local byte
# generating: 199: chkx 0 r1: r12 t1: r13 rs: []
movq $0,%r13 # load low bound
cmpq %r13,%r12 # compare
jge 1f # skip if greater or equal
leaq modnam(%rip),%rdi # load module name
movq $2172,%rsi # load line number
movq $ValueOutOfRange,%rdx #
call psystem_errore # process error
1:
movq $255,%r13 # load high bound
cmpq %r13,%r12 # compare
jle 1f # skip if less or equal
leaq modnam(%rip),%rdi # load module name
movq $2172,%rsi # load line number
movq $ValueOutOfRange,%rdx # load error code
call psystem_errore # process error
1:
# generating: 195: strx
movq -16(%rbp),%rbx # get display pointer
movb %r12b,48(%rbx) # store byte
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from basic/basic...
(gdb) b 41603
Breakpoint 1 at 0x164ca: file basic/basic.s, line 41603.
(gdb) run < basic/prog/lunar.bas
Starting program: /home/samiam/projects/pascal/pascal-p6/basic/basic < basic/prog/lunar.bas
Basic interpreter vs. 0.1 Copyright (C) 1994 S. A. Moore
Ready
Breakpoint 1, basic () at basic/basic.s:41603
(gdb) i r
rax 0x7f0 2032 Breakpoint 1, basic () at basic/basic.s:41603
rbx 0x7fffffffdb38 140737488345912
rcx 0x5 5
rdx 0x0 0
rsi 0x7fffffffdc07 140737488346119
rdi 0x7fffffffdbe0 140737488346080
rbp 0x7fffffffdb38 0x7fffffffdb38
rsp 0x7fffffffdae0 0x7fffffffdae0
r8 0x7fffffffdc14 140737488346132
r9 0x7fffffffdbf0 140737488346096
r10 0x5555555b4bc6 93824992627654
r11 0x7ffff7e54be0 140737352387552
r12 0x0 0
r13 0xff 255
r14 0x50 80
r15 0x6c 108
rip 0x55555556a4ca 0x55555556a4ca <basic+90>
eflags 0x293 [ CF AF SF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
(gdb)
So the result is in r12b (the low byte of r12), which is 0.
So one issue already is that it is storing a byte, as it should, since the return type is varinx, which is a byte value. Looking at the exit code for the fndvar routine:
# 2174: 10322: 128: reti 8 ! Return from procedure/function(t)
# generating: 128: reti
popq %r15 # undo alignment push
popq %r15 # restore protected registers
popq %r14
popq %r13
popq %r12
popq %rbx
leave # undo frame
addq $24,%rsp # remove frame data
popq %rcx # get return address
addq $8,%rsp # remove caller parameters
popq %rax # get qword result
pushq %rcx # replace return address
ret # ret
It's using a reti, which is "return integer", which picks up a whole 64-bit return. Since the result is a byte, it would pick up the whole 64 bits, of which only the low 8 bits are valid.
So this looks like it came from the original intermediate; let's verify:
lodx 2 -24 ! Load local value(t)
chkx 0 255 ! Bounds check(t)
strx 2 48 ! Store local(t)
:2173
:2174
!
vbe ! Variable reference block end
reti 8 ! Return from procedure/function(t)
l basic.740=24
l basic.741=16
And the answer is yes it did. Bummer. Looking at the original call setup (from above):
subq $basic.1037,%rsp # allocate function result on stack
sfr is "set function result", or allocate it on stack.
Note that it simply moves the stack pointer down, does not attempt to clear storage. Now looking at the pint code for that:
245 (*sfr*): begin getq;
{ allocate function result as zeros }
for j := 1 to q div intsize do pshint(0);
{ set function result undefined }
for j := 1 to q do putdef(sp+j-1, false)
end;
It clears the result, but sets it undefined. So this is a real compiler bug, and this explains why it didn't fail in pint.
Another thing to look at is if the function result address and the address stored to above are the same. I annotated that in the .s:
movb %r12b,48(%rbx) # store byte 0 to @0x7fffffffdb38
Ok, let's match it to the original sfr:
(gdb) b 55221
Breakpoint 1 at 0x1b5d2: file basic/basic.s, line 55221.
(gdb) run < basic/prog/lunar.bas
Starting program: /home/samiam/projects/pascal/pascal-p6/basic/basic < basic/prog/lunar.bas
Basic interpreter vs. 0.1 Copyright (C) 1994 S. A. Moore
Ready
Breakpoint 1, basic () at basic/basic.s:55221
(gdb) si
basic () at basic/basic.s:55224
(gdb) i r
rax 0x55555556f5d2 93824992343506
rbx 0x1 1
rcx 0x5 5
rdx 0x5555555f4c40 93824992889920
rsi 0x1 1
rdi 0x25 37
rbp 0x7fffffffdd50 0x7fffffffdd50
rsp 0x7fffffffdb68 0x7fffffffdb68
r8 0x7fffffffdc14 140737488346132
r9 0x7fffffffdbf0 140737488346096
r10 0x5555555b4bc6 93824992627654
r11 0x7ffff7e54be0 140737352387552
r12 0x9 9
r13 0x6c 108
r14 0x50 80
r15 0x6c 108
rip 0x55555556f5d6 0x55555556f5d6 <basic+132>
eflags 0x212 [ AF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0
So rsp is 0x7fffffffdb68 and the byte store is to 0x7fffffffdb38, so that does not appear to match either.
OK, so who saw my mistake there? Raise your hand.
Back to the original store instruction:
movb %r12b,48(%rbx) # store byte 0 to @0x7fffffffdb38
That is an offset instruction relative to rbx. Add 48 decimal to 0x7fffffffdb38 and you get 0x7fffffffdb68, and that is the correct address.
So apparently the reti vs. retx (return x or byte value) is the issue here.
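To tie that back to the gdb session above: the out-of-range value that tripped the check was the 0x55555556f500 seen in rax after the call. Its low byte is 0x00, which is the byte fndvar actually stored (it returned 0, variable not found); the upper 56 bits are whatever was already sitting in the uncleared result slot, and a reti scoops up the whole qword.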
Alright, now we are in the compiler pcom.pas. The code that generates returns on functions is:
if fprocp^.idtype = nil then gen2(42(*ret*),ord('p'),fprocp^.locpar)
else if fprocp^.idtype^.form in [records, arrays] then
gen2t(42(*ret*),fprocp^.locpar,fprocp^.idtype^.size,basetype(fprocp^.idtype))
else gen1t(42(*ret*),fprocp^.locpar,basetype(fprocp^.idtype));
In "body", line 9355.
So it says the first case is a procedure, because the result type is nil. The second case is a structured type, and the last case is scalar types. It calls gen1t(), which uses the last parameter as the function result type. But we take the base type of that with basetype()! WTF. Looking at basetype(), it only has one job, and that is to peel any and all subranges off the type given. So what is a byte type? It's a subrange. So it is converting retx to reti.
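A minimal sketch of what the fix amounts to in that last branch (the committed change may differ in detail, but the point is to hand gen1t the subrange type itself rather than its base type):
else gen1t(42(*ret*),fprocp^.locpar,fprocp^.idtype);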
Remove the basetype() call, recompile, now we get:
! end;
lodx 2 -24 ! Load local value(t)
chkx 0 255 ! Bounds check(t)
strx 2 48 ! Store local(t)
:2173
:2174
!
vbe ! Variable reference block end
retx 8 ! Return from procedure/function(t)
Now the return is correct, retx. Running testprog, does it now run?
Hell no, but it did pass that issue. Now it has another issue.
Since that is a change to the original compiler, the change needs a full regression, then I will proceed to the next issue.
Ah, well. I hope that walkthrough gave some insight into how I debug the compiler.
7 sample_programs/hello.pas PASS
25 sample_programs/roman.pas PASS
42 sample_programs/qsort.pas PASS
101 sample_programs/match.pas PASS
108 sample_programs/prime.pas PASS
339 sample_programs/fbench.pas PASS
856 sample_programs/drystone.pas PASS
973 sample_programs/startrek.pas PASS
1352 sample_programs/basics.pas PASS
1939 sample_programs/pascals.pas PASS
basic/basic.pas PASS
p2 PASS
p4 Fail
p5 Fail
p6 Fail
7 sample_programs/hello.pas PASS
25 sample_programs/roman.pas PASS
42 sample_programs/qsort.pas PASS
101 sample_programs/match.pas PASS
108 sample_programs/prime.pas PASS
339 sample_programs/fbench.pas PASS
856 sample_programs/drystone.pas PASS
973 sample_programs/startrek.pas PASS
1352 sample_programs/basics.pas PASS
1939 sample_programs/pascals.pas PASS
basic/basic.pas PASS
p2 PASS
p4 PASS
p5 Fail
p6 Fail
It was inevitable that this blog, and it is increasingly a blog, would have some not-so-pretty elements in it. And I encountered one last night with the compilation of Pascal-P5. Here is the dirt:
Pascal-P5
procedure constant(fsys: setofsys; var fsp: stp; var fvalu: valu);
var lsp: stp; lcp: ctp; sign: (none,pos,neg);
lvp: csp; i: 2..strglgth;
begin lsp := nil; fvalu.intval := true; fvalu.ival := 0;
if not(sy in constbegsys) then
begin error(50); skip(fsys+constbegsys) end;
...
The whoops is that "fvalu.intval := true" line there. It sets the tagfield of a valu structure. And it generates an error:
samiam@samiam-h-pc-2:~/projects/pascal/pascal-p6$ p5/source/pcom.mpp
P5 Pascal compiler vs. 1.4
Pascal-P5 complies with the requirements of level 0 of ISO/IEC 7185.
1 -56 (*$l-
*** Runtime error: psystem:4359:Change to VAR referenced variant
What happened? Well, it is all right there. The parameter fvalu gets passed by var (variable reference). Because there is an outstanding reference to the entire parameter fvalu, changing the variant in fvalu triggers a fault. The basis for this is the ISO 7185 error:
D.2 6.5.3.3 Field-designators
It is an error unless a variant is active for the entirety of each reference and access to each component of the variant.
As with a lot of things in the standard, this is up for debate.
Question: Why didn't this fail in Pascal-P4?
Turns out there is a story there (there usually is). The Pascal-P4 version is:
procedure constant(fsys: setofsys; var fsp: stp; var fvalu: valu);
var lsp: stp; lcp: ctp; sign: (none,pos,neg);
lvp: csp; i: 2..strglgth;
begin lsp := nil; fvalu.ival := 0;
if not(sy in constbegsys) then
begin error(50); skip(fsys+constbegsys) end;
Ah, so the tagfield set is missing. Looking at the definition of valu:
valu = record case {intval:} boolean of (*intval never set nor tested*)
true: (ival: integer);
false: (valp: csp)
end;
That comment ("intval never set nor tested") is mine. valu is an undiscriminated variant. Well, that is an "undesirable" in my book, and in most original Pascal programmers' books, so it got changed in Pascal-P5 (by me). And so it trips the error.
So what has this got to do with large sinking ships? Well, the problem is somewhat of an iceberg. What is constant() to do with fvalu? Or any routine for that matter? It got the parameter passed by var, meaning it owns it in its entirety. It set it up for an integer value, which means setting the tag type, then setting the value (ival), which is in a variant. How else would you do that?
Again referencing the ISO 7185 error, I chose a particular interpretation of that error. It's an expensive check. The compiler keeps a list of outstanding var referenced blocks and checks tag accesses against that. The reason the error exists is that you could pass a var reference to a record containing a variant, then establish a reference to the variant, say ival, then change the tag and set another variant, and voila! You have an end-run around the type protection system. Another way of interpreting it would have been to check only against the variant, not the whole passed record.
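A minimal sketch of the kind of end-run the check is aimed at, using the valu type from above (the procedure itself is hypothetical):
procedure endrun(var v: valu);
var p: csp;
begin
   v.intval := true;   { select the integer variant }
   v.ival := 42;       { write the storage as an integer... }
   v.intval := false;  { ...flip the tag under the outstanding var reference... }
   p := v.valp         { ...and read the same storage back as a pointer }
end;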
It's an interesting problem, and it deserves more attention than I have time for right now. But here are the basic facts:
And so #3 is the present solution.
7 sample_programs/hello.pas PASS
25 sample_programs/roman.pas PASS
42 sample_programs/qsort.pas PASS
101 sample_programs/match.pas PASS
108 sample_programs/prime.pas PASS
339 sample_programs/fbench.pas PASS
856 sample_programs/drystone.pas PASS
973 sample_programs/startrek.pas PASS
1352 sample_programs/basics.pas PASS
1939 sample_programs/pascals.pas PASS
basic/basic.pas PASS
p2 PASS
p4 PASS
p5 PASS
p6 Fail
What's next? After Pascal-P6, the next logical target would be to roll out Pascal-P6 as its own host. I'm not real sure about that. An ISO 7185 compiler without Pascaline is not that useful to me, so I am thinking of going ahead with the Pascaline functions in pgen instead.
There are self compile scripts in the regression, cpcoms and cpints, but frankly those are more about self hosting the interpreter than self compilation. The Pascal-P6 compile (testp6) is the bar to clear for that.
What do you think? Would you prefer to have self hosting now? Let me know.
I'll put a week into the self host option. It's ready to self host. This should solve the issue for people who can't get gpc to run. I plan to leave it in prd/prr file mode, which is a gpc requirement, for compatibility with gpc for a while. The purpose of that is that I can rerun the config script at any moment to get back to a gpc compile if there are problems.
It will also make things better for me, since pgen binaries do more complete runtime checks than gpc.
This, then, will form the basis for the 0.3 compiler.
7 sample_programs/hello.pas PASS
25 sample_programs/roman.pas PASS
42 sample_programs/qsort.pas PASS
101 sample_programs/match.pas PASS
108 sample_programs/prime.pas PASS
339 sample_programs/fbench.pas PASS
856 sample_programs/drystone.pas PASS
973 sample_programs/startrek.pas PASS
1352 sample_programs/basics.pas PASS
1939 sample_programs/pascals.pas PASS
basic/basic.pas PASS
p2 PASS
p4 PASS
p5 PASS
p6 PASS
Next is: pgen (itself)
7 sample_programs/hello.pas PASS
25 sample_programs/roman.pas PASS
42 sample_programs/qsort.pas PASS
101 sample_programs/match.pas PASS
108 sample_programs/prime.pas PASS
339 sample_programs/fbench.pas PASS
856 sample_programs/drystone.pas PASS
973 sample_programs/startrek.pas PASS
1352 sample_programs/basics.pas PASS
1939 sample_programs/pascals.pas PASS
basic/basic.pas PASS
p2 PASS
p4 PASS
p5 PASS
p6 PASS
pgen PASS
We appear to be done!
Next, I am trying some loop compiles (the system compiles itself). Then I need to add Pascal-P6 as a host for itself.
Replaced all of the binaries with their pgen generated binaries, pcom, pint and pgen:
************ Regression Summary *************
Sun 31 Mar 2024 12:23:11 AM PDT
Line counts should be 0 for pass
pint run ***************************************
0 sample_programs/hello.dif
0 sample_programs/roman.dif
0 sample_programs/match.dif
0 sample_programs/qsort.dif
0 sample_programs/fbench.dif
0 sample_programs/drystone.dif
0 sample_programs/startrek.dif
0 sample_programs/basics.dif
0 basic/basic.dif
1158 standard_tests/iso7185pat.dif
45 sample_programs/pascals.dif
Pascal-P2 run
0 p2/roman.dif
Pascal-P4 run
981 p4/standardp.dif
PRT run
450 standard_tests/iso7185prt.dif
Pascaline run
1129 pascaline_tests/pascaline.dif
Debug test run
5816 pascaline_tests/debug_test.dif
Sun 31 Mar 2024 12:24:12 AM PDT
Obviously some work to do, but this is pretty good.
So the build instructions and files will go into ./hosts/iso7185. This was originally created for the package mode, but it didn't work and I didn't have the time to fix it. The significance of the name is that it compiles for ISO 7185 mode, although a few extensions, like assign() and a few other things, work. This wasn't really intentional. A lot of the code came from cmach, which had that support.
The next goal is to make it so that executing ./configure will switch you between gpc and self compile mode. That way, I can flip back and forth to debug things.
Since Pascal-P6 is moving to having at least two different hosting modes available, I thought I would go over what that means and how it is used.
First of all, the default for configure is gpc as a host, and that is going to remain so for a while (perhaps quite a while). Thus:
./configure
Is automatically GPC on the current host and bit length.
Officially, the GPC target is:
./configure --gpc
The new, self compile target is:
./configure --iso7185
This actually uses the new pgen program to accomplish the self compile.
The build system for Pascal-P6 is modeled after the GNU standard form, thus a scratch build is:
./configure
make
In the root directory. However, the reality is that, with the two host targets, there are actually 4 different possible setups with this arrangement. The reason for this is that the configure script places prebuilt binaries into the ./bin directory. For example, with GPC, you get pcom, pint and pgen binaries in the bin directory for Linux WHETHER OR NOT YOU HAVE GPC INSTALLED AT ALL. The difference is whether you make or not. Make will wipe out those binaries on a successful compile. Thus we have:
Install prebuilt GPC only:
./configure --gpc
Install prebuilt iso7185 only:
./configure --iso7185
Install GPC and recompile:
./configure --gpc
make
Install self compile and recompile:
./configure --iso7185
make
So if you don't plan to modify the compiler, you can install using configure and just not do the make.
Ok, now this gets interesting with --iso7185 mode. The gpc mode cannot compile itself, but the --iso7185 mode can. Thus after using ./configure --iso7185, you can make as many times as you want, and the system will be the same. It rebuilds itself.
So what if you use the ISO 7185 mode, modify the compiler, and it all goes wrong? (a daily occurrence here). Well, in this case a
./configure --iso7185
Is essentially a "reset" operation. It brings you back to the original, prebuilt binaries configuration.
The ./configure --iso7185 is now checked in and you can play with it with the above plan. I recommend using
make -B
When changing the configuration. The make program does not understand when you change files under it.
So I could have stayed in the "old world" of GPC for a while until I got the new world bugs sorted out. That is, keep using the GPC environment, generate test binaries with pgen, and debug those. But if you thought that, you don't know me. Banzai! Forward to the new world.
So, first, we had regression, or at least most of it, working with pgen already. Why does it fail at all now? Well, before, that was pgen, compiled by GPC, generating binaries that are then tested. Now it is pgen, generated by pgen, generating binaries that are then tested. I tried to think of exactly what that meant, but it was giving me a headache, so I gave it up.
So first of all, pgen, and thus all of its compiled binaries, works in GPC compatible mode. This means it takes input from the file "prd" and outputs to "prr". The GPC compiled code is perfectly capable of parsing command line files, but that would have meant using GPC features. The reason for the prd/prr thing is that it compiles under strict ISO 7185 mode. So there you are.
The import of that is that pgen, and all of its binaries, are currently compatible with that method. It's not 100%, and pgen could generate binaries that use file parameters even though it does not itself. There is that headache again. It's just simpler to make them all compatible. Someday, when everything works and I am bored, I'll switch.
So what does it mean to be in an environment that you compiled with the same tooling you are currently standing on? It means you can have a very special kind of crash, one that I have experienced many times. If you break the compiler code, then you may not be able to compile the system you are using. Meaning you have lost control of your environment. You are bricked.
To get out of this, you can go back to a previous version, but what do you compile that with? For normal compiler situations, you essentially leave a trail of breadcrumbs, that is, you save and date every single version of the compiler binaries you create in hopes that you can find one that works.
With Pascal-P6 you have a few more convenient options. First, as stated above, you can always just reset your environment by executing ./configure again, and either go back to the starting binaries or even switch back to the GPC binaries. All this does is reset to the binaries in the ./hosts directory, so you can also just copy one of them to the ./bin directory like:
cp hosts/gpc/linux_X86/pint bin
Or even better, just execute one of them in place:
hosts/gpc/linux_X86/pint
Again, they are all compatible so it just works.
One nice thing about the new environment is that the pgen make is now part of the mainstream process. So I can build new versions of the compiler with a simple make, whereas before it was a special script or manual process. I considered having the make file be able to generate both a pgen and a gpc compile, but that, too, would have given me a headache, so I skipped it.
In any case, debugging with the self compile is better than you, or I, would have thought.
************ Regression Summary *************
Mon 01 Apr 2024 05:37:58 PM PDT
Line counts should be 0 for pass
pint run ***************************************
0 sample_programs/hello.dif
0 sample_programs/roman.dif
0 sample_programs/match.dif
0 sample_programs/qsort.dif
0 sample_programs/fbench.dif
0 sample_programs/drystone.dif
0 sample_programs/startrek.dif
0 sample_programs/basics.dif
0 basic/basic.dif
0 standard_tests/iso7185pat.dif
0 sample_programs/pascals.dif
Pascal-P2 run
0 p2/roman.dif
Pascal-P4 run
230 p4/standardp.dif
PRT run
450 standard_tests/iso7185prt.dif
Pascaline run
1129 pascaline_tests/pascaline.dif
Debug test run
5816 pascaline_tests/debug_test.dif
Mon 01 Apr 2024 05:39:09 PM PDT
pascals runs now, the issue was that pgen still had the leftover memory limits because the code was copied from pint.pas. Thus it wasn't a pgen coder issue, which is encouraging.
When we get to a full regression passing in the self compile, I will generate a version 0.3.
Fixed p4.
************ Regression Summary *************
Mon 01 Apr 2024 07:36:45 PM PDT
Line counts should be 0 for pass
pint run ***************************************
0 sample_programs/hello.dif
0 sample_programs/roman.dif
0 sample_programs/match.dif
0 sample_programs/qsort.dif
0 sample_programs/fbench.dif
0 sample_programs/drystone.dif
0 sample_programs/startrek.dif
0 sample_programs/basics.dif
0 basic/basic.dif
0 standard_tests/iso7185pat.dif
0 sample_programs/pascals.dif
Pascal-P2 run
0 p2/roman.dif
Pascal-P4 run
0 p4/standardp.dif
PRT run
450 standard_tests/iso7185prt.dif
Pascaline run
1129 pascaline_tests/pascaline.dif
Debug test run
5816 pascaline_tests/debug_test.dif
The iso7185prt failures were found to be because iso7185 mode does not parse options from the command line. For good or bad reasons, the PRT test relies on that.
So this is a "will not fix". It will get integrated when we advance to Pascaline features implementation.
Likewise, although pcom and pint are capable of processing the full Pascaline language, and thus could run it, the multi-module abilities of Pascaline require pcom to be able to open multiple nested files. Thus neither the Pascaline test nor the debug test can be run.
We are at the end of what we can do with this version. This is going to go through a documentation fixup phase and go for a version 0.3.
Looking through extend_pascaline.inc, most of the extension code there has little or nothing to do with Pascaline language features; it's things like assign(), close(), position(), etc., system functions that could easily appear in any ISO 7185 compiler. The other part is the use of the command file, which is already supported in pgen.
Thus I think it makes sense to go ahead and use this file in the iso7185 host version, and this will allow most or all of the above regressions to work.
The code was refactored to use Pascaline extensions. The next problem is that these need testing, and those are pascaline tests, yet we can't run the full Pascaline suite. Thus we have to arrange a partial test.
The regression results are unchanged, but now we can debug what is happening since the basis for it should work.
So during work on the iso7185 host, I already experienced a few "compiler crashes", that is, the make generated invalid compiler binaries, and thus the system was stuck in an unrecoverable state. To continue, I thought of two approaches to the problem. The first was to fall back to the gpc host, so that the base binaries would not be used to compile themselves. The second I thought of was what I'll call "isolation testing".
Basically, it means I make the pcom and pgen products in place in their home directories, then directly reference them in testing. Thus the binaries in the ./bin directory are left alone, and the system cannot experience a circular crash.
To do this, I could have created makefiles for each product directory. However, there is an easier way. Executing the line:
make -Bn bin/pcom
or
make -Bn bin/pgen
(aka a "dry make") will dump all of the commands make would execute to make the product, but without actually doing it. The neat thing about this is that you copy the commands it outputs into a script file, then modify it to to the isolated build.
One thing I did to facilitate this was to make sure that make outputs full paths for files, instead of relative paths. This is actually standard practice in good make files. The result is a command script that can be run anywhere, because it has no relative paths, and the script can be moved to other (sub)directories and run.
One thing I wanted to do was to get rid of the gpc prd/prr name convention. That is, gpc, by default, just opens header files under their name in the header. Thus prd and prr. Pascal-P6 has the ability to automatically connect them to names specified on the command line.
That code is actually pretty well developed and tested. That's not where the problem lies. The issue is that fixing it creates a logical loop in the build for the compiler. The code has to be changed for the new behavior, but compiled by the previous compiler. So, for example, pcom, the worst case, must generate code to parse command parameters while itself still using the old named file system. So recode to generate the new method, then change pcom and compile for the new system, right?
Well, none of that was working. It's exactly like standing on the branch you are sawing off. I took a couple of stabs at it using isolation testing, and had no success. The first reason was that the code still had bugs. The second was getting it to loop back to itself. The method I ended up using was to isolation test to three levels. The first was the binary in bin. The second was the binary in the source tree. The third was to generate a new binary unrelated to the previous two.
So I would create the third level binary, debug that, then copy that working binary to the second level, self compile that a few times, then copy to the bin directory. It took most of two days to get it all to work right.
However, now it works and I am moving on to fixing the script files.
pgen generated binaries use the Pascaline Annex C "Undefined program parameter binding" method. Any header file, or other kind of variable, that appears in the header is fulfilled via the command line:
program test(input, datafile);
var datafile: text;
begin
...
end.
Any file that is predefined to Pascaline is taken care of by the compiler. Here this is the "input" file, but it could also be the "output", "error", "list" and "command" files.
Any other files, or other variables, must be declared by the program. Here it is "datafile". The program is then executed as:
test mydata
Thus the name "mydata" gets assigned to "datafile", and it is a text file.
In fact, the file type of a header file could be anything. It could well be "file of integer", "file of char", or anything else, including a complex file of records.
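For example, a hypothetical little program with a binary header file:
program sum(output, datafile);
var datafile: file of integer;
    i, t: integer;
begin
   reset(datafile);
   t := 0;
   while not eof(datafile) do begin
      read(datafile, i);
      t := t + i
   end;
   writeln('total: ', t)
end.
Executed as "sum mydata", the name "mydata" gets bound to datafile just as before, only now it is read as a file of integers.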
It can also be another kind of variable:
program test(input, number);
var number: integer;
begin
...
end.
This is executed:
test 42
This will also get resolved by the compiler for you. It reads the integer for you from the command line. Any number of header parameters can be automatically assigned thus:
program test(input, file1, file2, number1, number2);
var file1, file2: text;
number1: integer;
number2: real;
begin
...
end.
Thus:
test myfile yourfile 42 1.23
The files "input" and "output" should be familiar. The "error" file is the error output file, which is the number 2 file in Unix/Linux, or "stderr" in C. The "list" file is new, but this is just an alias for "output" in Pascal-P6. It is meant as a file to connect to a printer[1].
Now the fun one. What is "command"? Command is the command line after the program name (in this case "test") turned into a file. Thus:
program test(input, command);
var a, b: integer;
begin
read(command, a, b);
...
end.
Executed as:
test 42 24
Would set a = 42 and b = 24. In other words, if the command line is a file, you can use regular read() operations to do whatever you want with the command line, including reading it character by character.
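Reading it character by character is just the usual text file loop. A sketch (the program name is made up):
program echoargs(output, command);
var c: char;
begin
  while not eoln(command) do begin
    read(command, c);
    write(c)
  end;
  writeln
end.
Executed as:
echoargs hi bob 42
it simply prints the command line back out.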
Wait, isn't the command line an array of string arguments in Unix/Linux? Well, it is, but we reconstruct it into a single string again. Thus:
test hi bob 42
Appears in the command file as:
hi bob 42
ie., with the program name stripped off and the arguments rejoined into a single line, separated by spaces.
C can't parse the command line with fscanf() calls now can it? How cool is that?
The neat thing about the command file is that it works along with automatic header parameters thus:
program test(input, datafile, number, command);
var datafile: text;
number: integer;
a: real;
begin
read(command, a);
...
end.
Executed:
test myfile 42 1.234
Would assign datafile as "myfile", set number to 42, and set variable a to 1.234.
The way it works is that any automatic header parameters are read off the command line first, then the command file gets what is left. In this case, after the automatic header file parameters are read, the command file is:
1.234
So you can mix and match automated header parameters along with manual parsing via the command file.
The neat thing about Pascaline header files is that they do the work in short programs for you:
program print(infile, output);
var c: char;
begin
  reset(infile);
  while not eof(infile) do begin
    while not eoln(infile) do begin
      read(infile, c);
      write(c)
    end;
    readln(infile);
    writeln
  end
end.
Is a text file print program. It just prints the contents of a text file on the terminal. All the work of getting the filename and opening the file is done for you.
[1] It's standard with the IP Pascal series; it's been there since 1980.
In the above program:
program print(infile, output);
var c: char;
begin
  reset(infile);
  while not eof(infile) do begin
    while not eoln(infile) do begin
      read(infile, c);
      write(c)
    end;
    readln(infile);
    writeln
  end
end.
Didya spot the flaw in that program? Didya?
The issue is that if the input file infile does not exist, it will be created empty. Then you are left scratching your head as to why the program prints nothing.
So IP Pascal had a solution for this: files whose names begin with "_" are taken to be preexisting files. Thus:
program print(_infile, output);
var c: char;
begin
  reset(_infile);
  while not eof(_infile) do begin
    while not eoln(_infile) do begin
      read(_infile, c);
      write(c)
    end;
    readln(_infile);
    writeln
  end
end.
This actually made sense in IP Pascal if you realize that the treatment for all header files is to open them under the name of the header file variable. Thus files like "input" actually have that name passed to the lowest levels, which recognize the name as a standard header file, and turn around and parse the actual name from the command line. Thus for IP Pascal, it was easy for the I/O subsystem to pull off that first "_" and use it to indicate a preexisting file.
Anyways, Pascal-P6 does not work like that, and I am actually not that fond of the scheme. I am leaning towards something like:
program print(*infile, output);
var c: char;
begin
  reset(infile);
  while not eof(infile) do begin
    while not eoln(infile) do begin
      read(infile, c);
      write(c)
    end;
    readln(infile);
    writeln
  end
end.
The "*" in front of the file indicates that it is a preexisting file, and to flag an error if it does not exist.
This method actually has some precedent in Pascal. Wirth used it to indicate a segmented file in CDC 6000 Pascal (as in "a file with multiple sections"). I don't think anyone would be confused by this, since segmented files never caught on[1].
So there you go. I haven't fixed this yet because it is not a serious problem in real world code.
[1] Segmented files make more sense on batch processing systems, which don't really exist anymore. They were a big deal in the 1950's and 1960's.
Actually, I thought about it and I was wrong about the example being incorrect. The program indicates its intent, which is to read the existing file. This is because it applies reset() to the file, meaning (de facto) that it expects to find content in it. It is true that reset() can be applied to the file any number of times, but trying to reset() a file that does not exist is a very different thing and can be detected. This applies even if the program closes the file, deletes it and reopens it.
Ok, so that header parameter change is having interesting follow-on effects. In the old method, I could stack interpreters. That is, I can run pint on itself by concatenating the input files for the two relevant sections. pint gets the intermediate code for pint, then gets the program the interpreted pint is to run on top of that:
Target program intermediate code
Runs on top of:
pint intermediate code
Of course the actual order is the opposite, because pint reads the pint intermediate in first, then the intermediate for the target program, so the order in the prd file is:
pint intermediate code
Target program intermediate code
Now, with the header change, each file in the header (that is not one of the standard compiler files like input or output) has its name parsed from the command line. The issue is that each stacked program also reads from the command line:
pint pint.p6 pint.out program.p6 program.out
So pint, the interpreted program, has its code loaded from pint.p6 as prd, and its prr output goes to pint.out. The program that the interpreted pint reads comes from program.p6 as prd, and its output, prr, goes to program.out.
There's that headache again.
In a way this is a better system, since you have four different files, instead of having to concatenate them together. It's just different.
And that difference has to come out in the script files.
As I have to do a lot of work to redo the various support scripts for iso7185 mode, it's becoming clear that a slow death of the GPC host is coming. I have tried very hard to keep GPC supported, but things like the PRD test require a whole list of various compare files, and it's just not worth it to keep up. Also, as time goes on and things change, there will be less overall testing of the GPC host mode, and bit rot sets in on untested material.
The main point here is that GPC already died a long time ago. It's getting hard to find working copies, and those copies need special workarounds. Thus I apologize to anyone who still uses GPC (anyone?), but the end is pretty much inevitable.
The P5 test ran without issue (iso7185 host). P6 is faulting with a segfault.
The next step would be to run the PRD test. After that I need to decide if I am going to push a version, or go for full up (pascaline) implementation.
The whole project has frankly been dragging quite a bit, so I am thinking the extra overhead of implementing full Pascaline compatibility is a smaller fraction of the remaining work than it used to be.
I ask myself that every day. It's kinda important, since I am standing on it. If pgen can't produce a working system, the only thing to do is fall back to either a previous copy or (gasp) GPC. Fortunately I have never come near that end. Several bugs with pgen did appear after self compile, but they were in non-critical systems like the interpreter, as indeed the current segfault is. So work proceeds. The other fallback (ugh!) is to hand modify the .s files to fix any bugs, like I had to do in 1981 with IP Pascal.
At 4771 lines of code, it's amazing to me that pgen is performing this well at all. It's smaller than all the other components save pmach and cmach. It's also been online for a matter of weeks. So yes, my confidence in it is growing, perhaps without justification.
So debug of the assembly files is pretty straightforward. It basically divides into two domains. Tracking down where the failure is in the source can be done via writeln()s placed in the code, then when/if you get to a failing statement, you drop down into gdb/assembly mode debugging.
One thing that works: I wanted to be able to force the code into the debugger from source level. So I did a divide by 0. An oldie but a goodie. It causes a processor exception. In fact, it's one of the few assembly exceptions not associated with things like memory faults.
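Something like this does the trick (a sketch; the variable keeps the constant folder from removing the divide, and depending on whether runtime checks are in force you get either the raw processor exception or a runtime error, but either way execution stops right there under gdb):
program trapme(output);
var zero, junk: integer;
begin
  writeln('about to drop into the debugger');
  zero := 0;
  junk := 1 div zero; { integer divide by zero: execution stops here }
  writeln(junk)
end.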
For grins, I compiled a C file with full debug (-g3) and debugged with gdb. And you get..... full source debugging. Yeah! I compiled to .s first, then compiled the .s to final binary, so there is no sleight of hand. The .s file really contains enough information to do source level debugging. Looking at the .s file (for the hello world program), you can see that it has included assembly pseudo-instructions for everything that matters, lots of things that don't matter, and it looks like the address of the final resting place of Jimmy Hoffa is down there somewheres.
THEN, I did the same thing with GPC. Same -g3 option, same everything. And you get.... nothing, no source level debug, nada, zip.
I seem to recall a mailing list posting to the effect that the GPC group tried it, could not make it work, or perhaps it was too hard, so there you are. In any case, it tells me that if I work hard, take my Wheaties and put the right stuff in the .s file, I might (just might!) be able to get gdb to do source mode. Someday.
Looks like testp6 is running now. This is a very slow test; it runs the entirety of Pascal-P6 interpretively. Thus it takes a day to get through all permutations.
In the meantime, I am working on the PRD test.
At present, the PRD is run interpretively. Well, to be fair that was always so, because the pgen code generator is new. So before, the tests were run by a pint as compiled by gpc. Now, it's pint as compiled by pgen. And pgen is finding errors in the run. These can be classified as true errors or false errors. ie., the failing pint run can be caused by errors not caught by GPC, or they could be false errors because of pgen bugs.
So far pgen has a pretty good track record with such bugs, so I think it's a good assumption that most or all of them are real issues. These kinds of problems in the code used to come out in the self-compiles, that is, pint run on itself or pcom run on itself, etc. These are the cpcoms and cpints scripts. They were run, but they take like 14 hours to run, so they are only run on a full regression, which is to say not that often. At some point I will run them again with pgen.
So running with pgen instead of GPC is kinda like running cpcoms and cpints, but at speed. Why "kinda"? Because pgen does not, at present, catch all ISO 7185 errors. The big one is undefined (uninitialized) storage. It doesn't do that check because it relies on having a bitmap of all initialized locations in the program's storage. This is actually doable in a binary, but the prerequisite is full control of the malloc()/free() processes. This is because we can't tell in the code where storage lies, in regular or dynamically allocated memory. Frankly I used to think this was impossible to do, but now I just think it is difficult.
In any case, the set of failures that pgen is not capable of catching is small, and thus it will run the PRD at some point.
pgen can run a full regression, indirectly. Wait, what? pgen cannot catch all errors or run Pascaline extended constructs, but its interpreters, pint, pmach and cmach, can. Pascal-P6 is written in ISO 7185 Pascal. So the next step is to run all the interpreters through the regressions.
It's impossible to completely test for everything. The purpose of compiler testing is to find the maximum number of issues and to form as complete a test as possible.
/home/samiam/projects/pascal/pascal-p6/source/AMD64/gcc/psystem.c:1778: warning: the use of `tmpnam' is dangerous, better use `mkstemp'
Yea, that one. Anyone who tries to compile gets that. The runtime needs to generate temporary filenames for common Pascal files. The rub is that they are generated when THE FILE IS ASSIGNED. That's the way it works. So the Linux library folks realized that the temp name generation method is insecure. The problem is that the recommended "solution" is to specify that a file is temporary when OPENING it. And that I cannot do and keep to the Pascaline specification.
I'm not saying there is no solution, I just haven't had the time to work one out. So in the meantime ignore the warning.
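For reference, the classic case that needs a temporary name is an unbound scratch file, something like this sketch (the names are made up; the point is only that "scratch" has no external name, so the runtime has to invent one for it):
program scratchdemo(output);
var scratch: text; { no external name anywhere: the runtime generates one }
    c: char;
begin
  rewrite(scratch);
  writeln(scratch, 'hello');
  reset(scratch);
  while not eoln(scratch) do begin
    read(scratch, c);
    write(c)
  end;
  writeln
end.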
These are PRT failures with pgen. They represent error cases in the PRT that pgen failed to flag as errors. pgen checks for almost everything except undefined variables, which was covered above as needing special malloc() handling that pgen does not have at present. There are also "use after dispose" failures, which were handled in pint by the custom malloc() and can also be handled in pgen by a fairly simple fix. That also requires the ability to process options to the binary, which is non-trivial.
Obviously some of this will be assigned as "technical debt", which is a fancy new age way to say "I'll get around to it", but frankly all of the PRT fails could be viewed that way, because correct programs can still be compiled. Thus I'll try to give some priorities to this list here.
standard_tests/iso7185prt1702c.err - No undefined check - won't fix
standard_tests/iso7185prt1704.err - Undefined location access - won't fix
standard_tests/iso7185prt1706b.err - VAR referenced file buffer modified, should be checked. - Fixed
standard_tests/iso7185prt1713.err - Cannot reset closed temp file (error message is different) - Fixed
standard_tests/iso7185prt1716.err - End of file, should be checked - Fixed
standard_tests/iso7185prt1723.err - Dispose of nil pointer, should be checked - Fixed
standard_tests/iso7185prt1724.err - Dispose of undefined - won't fix
standard_tests/iso7185prt1727.err - Undefined to pack - won't fix
standard_tests/iso7185prt1730.err - Undefined to unpack - won't fix
standard_tests/iso7185prt1732.err - Integer value overflow - Fixed
standard_tests/iso7185prt1733.err - Invalid argument to ln - Fixed
standard_tests/iso7185prt1734.err - Invalid argument to sqrt - Fixed
standard_tests/iso7185prt1735.err - Real argument too large - Fixed
standard_tests/iso7185prt1736.err - Real argument too large - Fixed
standard_tests/iso7185prt1738.err - Integer value overflow - Fixed
standard_tests/iso7185prt1739.err - Integer value overflow - Fixed
standard_tests/iso7185prt1740.err - File mode incorrect - Fixed
standard_tests/iso7185prt1741.err - File not open - Fixed
standard_tests/iso7185prt1743.err - Undefined location access - won't fix
standard_tests/iso7185prt1744.err - Zero divide - Fixed
standard_tests/iso7185prt1745.err - Zero divide - Fixed
standard_tests/iso7185prt1746A.err - Invalid divisor to mod - Fixed
standard_tests/iso7185prt1746b.err - Invalid divisor to mod - Fixed
standard_tests/iso7185prt1757.err - End of file - Fixed
standard_tests/iso7185prt1800.err - Pointer used after dispose - Won't fix
standard_tests/iso7185prt1811.err - Undefined location access - Won't fix
standard_tests/iso7185prt1834.err - This is passing - pcom does not consider it an error.
standard_tests/iso7185prt1839.err - Invalid field specification - Fixed
standard_tests/iso7185prt1850.err - This is passing - pcom does not consider it an error.
standard_tests/iso7185prt1851.err - Undefined location access - won't fix
standard_tests/iso7185prt1918.err - Undefined location access - Won't fix
Note: this list is edited in place.
I pushed a small correction: the psystem_asm.s file is renamed to psystem_asm.asm. The reason is that the make file treats .s files as a type of intermediate, that is, a file that is generated and can be erased and rebuilt. The psystem_asm.asm file is not that kind of file; it is hand-generated assembly source, and should not be touched.
Why not psystem.asm? Good question. To me this implies that psystem.asm was derived from psystem.c, which it is certainly not. It is its own file, and has different definitions in it. It is still in the psystem, as in "Pascal System support", so there is the admittedly awful name.
Going forward, the plan is to always use .asm for the source files that were not automatically generated.
Now that Pascal-P6 builds binaries, it was suggested to me that a separated source/build directory architecture would be a better idea than the old combined source directory. The idea is to keep the generated files out of the source tree.
Ok, so I did this for the main code. The outlying tests, sample_programs, standard_tests, whatnot I will have to think about.
This is more of a user note. Being able to set options in mid-source has always made me nervous. IP Pascal didn't allow it. Pascal-P does. The point is that there is always a way it can be used to cause crashes or odd situations in the compiler. Thus it is something to be avoided if possible.
The list of PRT tests left over for pgen that are not passing:
standard_tests/iso7185prt1702c.err - No undefined check - won't fix
standard_tests/iso7185prt1704.err - Undefined location access - won't fix
standard_tests/iso7185prt1724.err - Dispose of undefined - won't fix
standard_tests/iso7185prt1727.err - Undefined to pack - won't fix
standard_tests/iso7185prt1730.err - Undefined to unpack - won't fix
standard_tests/iso7185prt1743.err - Undefined location access - won't fix
standard_tests/iso7185prt1800.err - Pointer used after dispose - Won't fix
standard_tests/iso7185prt1811.err - Undefined location access - Won't fix
standard_tests/iso7185prt1834.err - This is passing - pcom does not consider it an error.
standard_tests/iso7185prt1850.err - This is passing - pcom does not consider it an error.
standard_tests/iso7185prt1851.err - Undefined location access - won't fix
standard_tests/iso7185prt1918.err - Undefined location access - Won't fix
1834 and 1850 are special cases, since they give "unreferenced" warnings, not errors. Thus they are special across all interpreter/compiler modes.
1800 Requires a modified malloc()/free(). The idea is that the malloc() gives blocks that can be identified, with a special code before the start of the allocation, then the free() just marks the entry as free and does not recycle it. The drawback is that it accumulates unused space due to no recycling, and thus can't be used as an operational support library.
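A minimal sketch of the idea (not the actual psystem code; all the names here are made up), done over a single store array the way pint manages its memory:
program disposetrack(output);
const maxstore = 1000;
      allocmark = 12345; { tag: block is live }
      freemark = 54321;  { tag: block has been disposed }
type addr = 0..maxstore;
var store: array [1..maxstore] of integer;
    top: addr;
    p: addr;
function alloc(size: integer): addr;
begin
  store[top+1] := allocmark; { identifying code just before the block }
  alloc := top+2;            { user data starts after the tag }
  top := top+1+size          { space is never recycled; the arena only grows }
end;
procedure release(a: addr);
begin
  store[a-1] := freemark { mark the entry free, but do not reuse it }
end;
procedure checklive(a: addr);
begin
  if store[a-1] <> allocmark then writeln('*** pointer used after dispose')
end;
begin
  top := 0;
  p := alloc(10);
  checklive(p); { fine: block is live }
  release(p);
  checklive(p)  { flagged: the tag says disposed }
end.
The drawback mentioned above shows up right in the sketch: release() never gives the space back, so it is fine for a checking run but not for an operational support library.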
The remaining tests are for undefined values. pint, pmach and cmach do this, but do it with a bitmap with one bit for each byte location in the program's memory. The difficulty there is that, in Linux or another operating system with virtual memory, both the heap and the stack are usually located in special memory areas and are managed by the operating system. pint and the others have all memory, including globals, heap and stack, in a single memory array, and thus this is not difficult. In pgen it would require the use of a special malloc()/free(), and likely a special stack as well. This is doable, but requires non-trivial coding. It also could not be used for common programs, since the heap and stack areas would be fixed length.
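A sketch of the flag-per-location idea in the single-array world (again the names are made up, and pint's real scheme is one bit per byte rather than one boolean per word):
program undefcheck(output);
const maxstore = 1000;
type addr = 1..maxstore;
var store: array [addr] of integer;
    defined: packed array [addr] of boolean; { one flag per store location }
    i: integer;
    x: integer;
procedure putint(a: addr; v: integer);
begin
  store[a] := v;
  defined[a] := true { a write defines the location }
end;
function getint(a: addr): integer;
begin
  if defined[a] then getint := store[a]
  else begin
    writeln('*** undefined location access');
    getint := 0
  end
end;
begin
  for i := 1 to maxstore do defined[i] := false;
  putint(10, 42);
  x := getint(10); { defined: fine }
  x := getint(11); { never written: flagged }
  writeln(x)
end.
In pint this is easy because globals, heap and stack all live in that one array; in a pgen binary the same flags would have to shadow the real heap and stack, hence the special malloc()/free() and stack handling.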
Thus the above list will stay with pgen regressions for a while, possibly a long while.
Next is Pascaline, the extension set. This has four distinct steps, referred to as #1 through #4 below.
I am a few days into #1. Why is that difficult, when the system already runs Pascaline extensions in pint/pmach/cmach? That's actually a good question: it's running the same code. Most of the errors are "invalid variant" types. The reason is that pint used to be compiled by gpc, which forgave a lot of iso7185 errors like using inactive variants. This basically makes them undiscriminated variants, which works, but is "uncool".
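For reference, this is the kind of thing gpc let slide but full checking flags (a sketch, not code from the source):
program varianterr(output);
type kind = (isint, isreal);
     u = record
           case k: kind of
             isint: (i: integer);
             isreal: (r: real)
         end;
var v: u;
begin
  v.k := isint;
  v.i := 42;
  writeln(v.r) { r is not the active variant: invalid variant reference }
end.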
The full up ISO 7185 error set was actually tested before on pcom and pint. That is the self compile test. It takes pcom and pint, and stacks them up on pint. Thus it runs all the ISO 7185 runtime checks. But the price of that is it is slow, slow, slow. It used to take 12 hours, but I have gotten faster computers since those days. Now it finishes in a couple of hours. Still not that much fun. So if it was run on itself, fully checked, why is it failing now? Well, because that was before the full set of Pascaline extensions was implemented and tested.
Next at #2, pgen gets run as a new step because at present, it only implements a subset of the Pascaline extended intermediates. After those are implemented, #3 runs a full test on it. Here we have the advantage that the Pascaline test has already passed in pint/pmach/cmach form.
Finally, #4 tests debugging using pint. This is pretty dependent on Pascaline extensions; I never separated it into ISO 7185 mode and Pascaline mode.
Actually, although that last list was a good one, I suspect that #4 follows #1, for the simple reason that it represents a milestone. That would be the first time that the iso7185 host branch runs everything that the gpc host did. This also acknowledges the fact that pgen changed everything about Pascal-P6. Having Pascal-P6 stand on its own toolset reshuffled the deck, so to speak.
Starting off the next project, which is an 80386 (32 bit) code generation facility for P6. This is very preliminary, and the code and the documentation are subject to change. If you would like to help, that would be encouraged. I have committed a preliminary code module, pgen_gcc_i80386.pas. The documentation also has a new section that discusses the project, so if you want to know more, read that.