tkchia / gcc-ia16

Fork of Lambertsen & Jenner (& al.)'s IA-16 (Intel 16-bit x86) port of GNU compilers ― added far pointers & more • use https://github.com/tkchia/build-ia16 to build • Ubuntu binaries at https://launchpad.net/%7Etkchia/+archive/ubuntu/build-ia16/ • DJGPP/MS-DOS binaries at https://gitlab.com/tkchia/build-ia16/-/releases • mirror of https://gitlab.com/tkchia/gcc-ia16
GNU General Public License v2.0
178 stars 13 forks

Potential optimization: (AX << 8) | value => AH<-AL; AL=value #135

Open asiekierka opened 1 year ago

asiekierka commented 1 year ago

In my tests, the following code (changed a little for a minimal test case):

typedef unsigned char uint8_t;

typedef struct {
    uint8_t port;
    uint8_t bits;
} ws_eeprom_handle_t;

ws_eeprom_handle_t ws_eeprom_handle_internal(void) {
    ws_eeprom_handle_t handle = {0xBA, *((uint8_t*)0)};
    return handle;
}

compiled to the following code under -O2 -mcmodel=medium (as well as -Os -mcmodel=medium and -O3 -mcmodel=medium):

00000000 <ws_eeprom_handle_internal>:
   0:   a0 00 00                mov    0x0,%al
   3:   c1 e0 08                shl    $0x8,%ax
   6:   0c ba                   or     $0xba,%al
   8:   cb                      lret

I think the SHL/OR pair could be replaced with two MOVs, given that the low and high bytes of AX/BX/CX/DX are separately addressable:

mov    0x0,%al
mov    %al,%ah
mov    $0xba,%al

which hopefully the compiler could optimize further to:

mov    0x0,%ah
mov    $0xba,%al
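
For reference, the equivalence behind this rewrite can be checked in plain C. This is only a sketch: it models AX as a uint16_t, and the names shift_or, byte_moves and check_shift_or_vs_moves are made up for illustration; it says nothing about how the compiler represents the value internally.

```c
#include <assert.h>
#include <stdint.h>

/* Models the currently emitted pair: shl $8,%ax ; or $imm,%al */
uint16_t shift_or(uint16_t ax, uint8_t imm) {
    return (uint16_t)((uint16_t)(ax << 8) | imm);
}

/* Models the proposed pair: mov %al,%ah ; mov $imm,%al */
uint16_t byte_moves(uint16_t ax, uint8_t imm) {
    uint8_t al = (uint8_t)(ax & 0xFF);
    uint8_t ah = al;   /* mov %al,%ah */
    al = imm;          /* mov $imm,%al */
    return (uint16_t)(((uint16_t)ah << 8) | al);
}

/* Exhaustively compares both forms for every 16-bit input. */
void check_shift_or_vs_moves(uint8_t imm) {
    for (uint32_t v = 0; v <= 0xFFFFu; v++)
        assert(shift_or((uint16_t)v, imm) == byte_moves((uint16_t)v, imm));
}
```

For example, shift_or(0x1234, 0xBA) and byte_moves(0x1234, 0xBA) both give 0x34BA, since the shift discards the old high byte either way.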

codygray commented 1 year ago

Unless it is attempting to avoid clobbering flags (something you worry about when rearranging instructions to optimize for an out-of-order superscalar processor, but not likely something a DOS compiler would worry about), the code generator should never emit a MOV instruction to clear a register (i.e., set it to 0). It should instead always XOR the register with itself. Therefore, the final optimized code should be:

xor   %ah, %ah
mov   $0xba, %al

Using a MOV instead of a shift left or right by 8 is probably a missed optimization opportunity in multiple places besides this one. An XCHG could work too, especially if the accumulator is one of the operands, where the special 1-byte encoding of XCHG can be used; that can be a significant performance win on CPUs constrained by prefetching, like the 8088 and 386SX.

x *= 256

is equal to:

x <<= 8

but should be generated as a MOV that copies the low byte into the high byte, followed by clearing the low byte. The inverse should occur for division by 256.
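
As a quick sanity check of these identities, here is a small C sketch. It assumes an unsigned 16-bit operand; for signed values, division by 256 rounds toward zero and is not a plain byte move. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* x * 256 modelled as byte moves: mov %al,%ah ; then clear %al */
uint16_t mul256_bytes(uint16_t x) {
    uint8_t lo = (uint8_t)(x & 0xFF);
    uint8_t hi = lo;   /* low byte becomes the high byte */
    lo = 0;            /* low byte is cleared */
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}

/* x / 256 modelled as byte moves: mov %ah,%al ; then clear %ah */
uint16_t div256_bytes(uint16_t x) {
    uint8_t hi = (uint8_t)(x >> 8);
    uint8_t lo = hi;   /* high byte becomes the low byte */
    hi = 0;            /* high byte is cleared */
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}

/* Exhaustively compares the byte-move forms with the arithmetic forms. */
void check_mul_div_256(void) {
    for (uint32_t v = 0; v <= 0xFFFFu; v++) {
        uint16_t x = (uint16_t)v;
        assert(mul256_bytes(x) == (uint16_t)(x * 256u));
        assert(mul256_bytes(x) == (uint16_t)(x << 8));
        assert(div256_bytes(x) == (uint16_t)(x / 256u));
        assert(div256_bytes(x) == (uint16_t)(x >> 8));
    }
}
```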

asiekierka commented 1 year ago

I'd assume that turning the mov 0x0,%ah into an xor would be covered by a separate optimization rule; on the specific 8086 variant I'm targeting, I think there is no actual performance difference, so it didn't occur to me to bring it up. Sorry!