Optimizations: low hanging fruit?

bartoldeman commented 6 years ago

Background: I managed to compile the FreeDOS kernel now but its size (81K) is bigger than both the Open Watcom C (70K) and Turbo C 2.01 (80K) compiled versions. I am looking for some optimization opportunities.

void incnear(int *a)
{
  (*a)+=2;
}

void incfar(int __far *a)
{
  (*a)+=2;
}

compile using -Os -S

incnear

incnear:
        pushw   %bp
        movw    %sp,    %bp
        movw    4(%bp), %bx
        movw    (%bx),  %ax (*)
        addw    $2,     %ax (*)
        movw    %ax,    (%bx) (*)
        popw    %bp
        ret

here the three (*) can be combined into addw $2, (%bx). The i386 backend does this.

incfar:
        pushw   %es
        pushw   %bp
        movw    %sp,    %bp
        pushw   %ds
        lesw    6(%bp), %bx
        movw    %es:(%bx),      %ax
        addw    $2,     %ax
        movw    8(%bp), %es (**)
        movw    %ax,    %es:(%bx)
        movw    %bp,    %sp
        popw    %bp
        popw    %es
        ret

There is a second issue here in that es is reloaded (**), which may be harder.

For ++ instead of +=2 the code looks fine with a single inc instruction.

bartoldeman commented 6 years ago

There is also the strange pushw %ds. Looking at -O1 code this replaces subw $2, %sp. It looks like space on the stack is created for local variables but there are none.

tkchia commented 6 years ago

Hello @bartoldeman ,

Thanks for the reports! I managed to hack together a peephole optimization that can generate addw $2, (%bx). I believe the i386 back-end uses a different approach to achieve the same thing, since it can combine the instructions even when I disable peepholes.

Now I "just" need to figure out why the code for incfar (.) is reloading %es, and what exactly is up with the unused stack variable. :-|

tkchia commented 6 years ago

Hello @bartoldeman ,

I am making some progress in improving the output code for cases like incfar (.).

It seems that the problem is not with my far pointer patches, but due to some weirdness in how GCC itself was handling multi-shortword values when -fsplit-wide-types was in effect. One of the optimization passes for splitting multi-shortwords into separate shortword variables (subreg1) was apparently throwing away information about values stored in registers, and this was leading to the extraneous reloads and spills (the unused stack variable was for spilling and reloading %bx; these later got elided).

After patching GCC to disable this pass (but leaving a later subreg expansion pass subreg2 in place), I can now obtain this output, even without any peephole optimization:

incfar:
    pushw   %es
    pushw   %bp
    movw    %sp,    %bp
    movw    6(%bp), %bx
    movw    8(%bp), %ax
    movw    %ax,    %es
    movw    %es:(%bx),  %dx
    addw    $2, %dx
    movw    %dx,    %es:(%bx)
    popw    %bp
    popw    %es
    ret

And with an additional peephole rule to handle the loads into %es:%bx:

incfar:
        pushw   %es
        pushw   %bp
        movw    %sp,    %bp
        lesw    6(%bp), %bx
        addw    $2,     %es:(%bx)
        popw    %bp
        popw    %es
        ret

The only problem now is that disabling the subreg1 pass also causes some previously latent bugs to show up (specifically, there is a regression in gcc/testsuite/gcc.c-torture/execute/pr60960.c), so now I need to find these bugs and deal with them...

Thank you!

bartoldeman commented 6 years ago

thanks for the good work. However overall the compiler is now worse for me in terms of size optimizations. I tried to get an example with -Os that shows this. It seems to be mostly that the stack is used more:

struct ab {
  unsigned long a, b;
};
extern int test(struct ab __far *p, unsigned long b);

unsigned long find_b(struct ab __far *p)
{
  unsigned long a = 0, b = 0;
  if (p->a>64000)
  {
    if (p->b != 0xffffful)
      b = p->b;
    a = p->a;
  }
  while (!test(p, b))
  {
    b++;
    if (b > a) b = 2;
  }
  return b;
}

before: 0x7b bytes, after 0x92 bytes (+19%)

tkchia commented 6 years ago

Hello @bartoldeman ,

Thanks for the report. So it looks like turning off the subreg1 optimization pass wins for some routines but loses for others (argh!).

It seems that, as long as I manage to get the far pointer into a register pair, then GCC can easily generate good code. E.g.:

void incfar(int __far *a)
{
  __asm __volatile("" : "=k" (a) : "0" (a));
  *a+=2;
}

but I am not sure yet how to get the compiler itself to actually do this. I guess I need to really rethink some other ways to get rid of the reloading of %es.

By the way, may I take a look at your setup for compiling the FreeDOS kernel using ia16-elf-gcc, if it is convenient for you? I think it will be good for me to be able to see how any changes to GCC may affect code generation for the kernel as a whole.

Thank you!

bartoldeman commented 6 years ago

I finally uploaded my fdkernel draft. It's in a separate branch because the patch is still one big chunk. https://github.com/bartoldeman/fdkernel/tree/ia16-elf-gcc-draft

you should be able to compile it using "make all COMPILER=gcc"

tkchia commented 6 years ago

Hello @bartoldeman ,

Thank you very much! I am able to compile the code using your setup; I will see how I can improve GCC's code generation.

tkchia commented 6 years ago

Hello @bartoldeman ,

I added a patch that now makes -Os really optimize for size (for some reason, previously the effect of -Os was limited to just a few peephole rules, such as the pushw %ds thing).

A result is that the routine

ddt *getddt(int dev)
{
  return &(((ddt *) Dyn.Buffer)[dev]);
}

in FreeDOS kernel/dsk.c now compiles to

_getddt:
        pushw   %bp
        movw    %sp,    %bp
        movw    $136,   %ax
        mulw    4(%bp)
        addw    $_Dyn+2,        %ax
        popw    %bp
        ret     $2

whereas before, GCC would do the multiplication using shifts and adds (!).

The FreeDOS kernel code now compiles to 78,284 bytes (uncompressed), whereas before it was 79,052 bytes. Though this is still some way from the 69,828 bytes possible under the Watcom compiler. I will look into ways to further shrink the output.

The incfar (.) routine you laid out above now also results in good code under -Os:

incfar:
        pushw   %es
        pushw   %bp
        movw    %sp,    %bp
        lesw    6(%bp), %bx
        addw    $2,     %es:(%bx)
        popw    %bp
        popw    %es
        ret

(However, it still does the extra reloads if I use -O3, so possibly something is still out of whack with the speed optimizations.)

Thank you!

bartoldeman commented 6 years ago

I see we are now at 77,596 bytes so even better! Actually, as for OW, the kernel resident and init code segments can now be merged into a single segment which makes it even smaller. I'm working on an update.

tkchia / gcc-ia16

Optimizations: low hanging fruit? #15