Closed bartoldeman closed 6 years ago
There is also the strange pushw %ds
. Looking at -O1 code this replaces subw $2, %sp
. It looks like space on the stack is created for local variables but there are none.
Hello @bartoldeman ,
Thanks for the reports! I managed to hack together a peephole optimization that can generate addw $2, (%bx)
. I believe the i386 back-end uses a different approach to achieve the same thing, since it can combine the instructions even when I disable peepholes.
Now I "just" need to figure out why the code for incfar (.)
is reloading %es
, and what exactly is up with the unused stack variable. :-|
Hello @bartoldeman ,
I am making some progress in improving the output code for cases like incfar (.)
.
It seems that the problem is not with my far pointer patches, but due to some weirdness in how GCC itself was handling multi-shortword values when -fsplit-wide-types
was in effect. One of the optimization passes for splitting multi-shortwords into separate shortword variables (subreg1
) was apparently throwing away information about values stored in registers, and this was leading to the extraneous reloads and spills (the unused stack variable was for spilling and reloading %bx
; these later got elided).
After patching GCC to disable this pass (but leaving a later subreg expansion pass subreg2
in place), I can now obtain this output, even without any peephole optimization:
incfar:
pushw %es
pushw %bp
movw %sp, %bp
movw 6(%bp), %bx
movw 8(%bp), %ax
movw %ax, %es
movw %es:(%bx), %dx
addw $2, %dx
movw %dx, %es:(%bx)
popw %bp
popw %es
ret
And with an additional peephole rule to handle the loads into %es:%bx
:
incfar:
pushw %es
pushw %bp
movw %sp, %bp
lesw 6(%bp), %bx
addw $2, %es:(%bx)
popw %bp
popw %es
ret
The only problem now is that disabling the subreg1
pass also causes some previously latent bugs to show up (specifically, there is a regression in gcc/testsuite/gcc.c-torture/execute/pr60960.c
), so now I need to find these bugs and deal with them...
Thank you!
thanks for the good work. However overall the compiler is now worse for me in terms of size optimizations. I tried to get an example with -Os that shows this. It seems to be mostly that the stack is used more:
struct ab {
unsigned long a, b;
};
extern int test(struct ab __far *p, unsigned long b);
unsigned long find_b(struct ab __far *p)
{
unsigned long a = 0, b = 0;
if (p->a>64000)
{
if (p->b != 0xffffful)
b = p->b;
a = p->a;
}
while (!test(p, b))
{
b++;
if (b > a) b = 2;
}
return b;
}
before: 0x7b bytes, after 0x92 bytes (+19%)
Hello @bartoldeman ,
Thanks for the report. So it looks like turning off the subreg1
optimization pass wins for some routines but loses for others (argh!).
It seems that, as long as I manage to get the far pointer into a register pair, then GCC can easily generate good code. E.g.:
void incfar(int __far *a)
{
__asm __volatile("" : "=k" (a) : "0" (a));
*a+=2;
}
but I am not sure yet how to get the compiler itself to actually do this. I guess I need to really rethink some other ways to get rid of the reloading of %es
.
By the way, may I take a look at your setup for compiling the FreeDOS kernel using ia16-elf-gcc
, if it is convenient for you? I think it will be good for me to be able to see how any changes to GCC may affect code generation for the kernel as a whole.
Thank you!
I finally uploaded my fdkernel draft. It's in a separate branch because the patch is still one big chunk. https://github.com/bartoldeman/fdkernel/tree/ia16-elf-gcc-draft
you should be able to compile it using "make all COMPILER=gcc"
Hello @bartoldeman ,
Thank you very much! I am able to compile the code using your setup; I will see how I can improve GCC's code generation.
Hello @bartoldeman ,
I added a patch that now makes -Os
really optimize for size (for some reason, previously the effect of -Os
was limited to just a few peephole rules, such as the pushw %ds
thing).
A result is that the routine
ddt *getddt(int dev)
{
return &(((ddt *) Dyn.Buffer)[dev]);
}
in FreeDOS kernel/dsk.c
now compiles to
_getddt:
pushw %bp
movw %sp, %bp
movw $136, %ax
mulw 4(%bp)
addw $_Dyn+2, %ax
popw %bp
ret $2
whereas before, GCC would do the multiplication using shifts and adds (!).
The FreeDOS kernel code now compiles to 78,284 bytes (uncompressed), whereas before it was 79,052 bytes. Though this is still some way from the 69,828 bytes possible under the Watcom compiler. I will look into ways to further shrink the output.
The incfar (.)
routine you laid out above now also results in good code under -Os
:
incfar:
pushw %es
pushw %bp
movw %sp, %bp
lesw 6(%bp), %bx
addw $2, %es:(%bx)
popw %bp
popw %es
ret
(However, it still does the extra reloads if I use -O3
, so possibly something is still out of whack with the speed optimizations.)
Thank you!
I see we are now at 77,596 bytes so even better! Actually, as for OW, the kernel resident and init code segments can now be merged into a single segment which makes it even smaller. I'm working on an update.
Background: I managed to compile the FreeDOS kernel now but its size (81K) is bigger than both the Open Watcom C (70K) and Turbo C 2.01 (80K) compiled versions. I am looking for some optimization opportunities.
compile using
-Os -S
incnear
here the three (*) can be combined into
addw $2, (%bx)
. The i386 backend does this.There is a second issue here in that es is reloaded (**), which may be harder.
For
++
instead of+=2
the code looks fine with a singleinc
instruction.