Some optimization ideas

viti95 commented 1 year ago

In graphics.asm, vga_draw_pixelfaster (lines 214-215), it's possible to replace the slow SHL reg, imm instruction with a combination of faster ones:

and dx, 0b00000000_00011111 ;3 cycles
shl dx, 11          ;5+11 cycles
;total 19 cycles, 7 bytes

with

mov dh, dl          ;2 cycles, shift 8 times to the left
and dx, 0b00011111_00000000 ;3 cycles, clean up non required bits and lower byte
lea dx, [dx*8]      ;3 cycles, shift missing 3 times
;total 8 cycles

In general, it is better to avoid shift instructions with a count other than 1 on 286 CPUs, as they are slow: 5 cycles plus 1 per bit shifted, against 2 cycles for a shift by 1.
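To put rough numbers on that rule of thumb, here is a small sketch that tallies cycles only, using the figures quoted in this thread (5 + n for a shift by an immediate n, 2 for a shift by 1). The function names are made up for illustration, and instruction fetch time is ignored, so treat it as an estimate rather than a measurement.

```python
# Cycle figures are the ones quoted in this thread for the 286.
# Function names are hypothetical; fetch time is NOT modeled.

def shift_imm_cycles(n):
    """Cycles for one 'shl reg, n' with an immediate count n."""
    return 5 + n

def repeated_shift_by_one_cycles(n):
    """Cycles for n back-to-back 'shl reg, 1' instructions."""
    return 2 * n

if __name__ == "__main__":
    for n in range(1, 12):
        print(n, shift_imm_cycles(n), repeated_shift_by_one_cycles(n))
```

On these figures the repeated single shifts win on cycles for counts below 5 and tie at 5, but they cost 2 bytes each against 3 bytes for one immediate shift, so code size grows quickly.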

EDIT: A small update on the previous idea:

mov dh, dl          ;2 cycles, shift 8 times to the left
xor dl, dl          ;2 cycles, clean up non required bits and lower byte
lea dx, [dx*8]      ;3 cycles, shift missing 3 times
;total 7 cycles and fewer bytes used
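As a sanity check on the arithmetic only (not on how the final shift by 3 gets encoded), the byte move plus a 3-bit shift can be compared against the original mask and shift for every 16-bit value. This is a sketch in Python that models DX as a 16-bit integer; the function names are hypothetical.

```python
# Models DX as a 16-bit integer. Function names are hypothetical.
MASK16 = 0xFFFF

def mask_then_shift11(dx):
    dx &= 0b0000000000011111        # and dx, 0b00000000_00011111
    return (dx << 11) & MASK16      # shl dx, 11

def byte_move_then_shift3(dx):
    dl = dx & 0xFF
    dx = (dl << 8) | dl             # mov dh, dl  (DH gets a copy of DL)
    dx &= 0xFF00                    # xor dl, dl  (clear the low byte)
    return (dx << 3) & MASK16       # shift left by the remaining 3 bits

def sequences_agree():
    return all(mask_then_shift11(v) == byte_move_then_shift3(v)
               for v in range(0x10000))

if __name__ == "__main__":
    print(sequences_agree())
```

The explicit 5-bit mask is not needed in the second version because the top 3 bits of the old DL fall off the end of the 16-bit register during the last shift.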

viti95 commented 1 year ago

Also optimizable in vga_draw_pixelfaster (lines 223-224):

and dx, 0b0000_0001_1110_0000   ; 4 bytes
shr dx, 5               ; 3 bytes
;total 7 bytes

replace with

and dh, 0b0000_0001 ; 3 bytes
shr dx, 5       ; 3 bytes
;total 6 bytes

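Why? The result is the same, because the shift right by 5 discards bits 0-4 anyway, so only the bits above bit 8 need masking. Having to fetch and decode fewer bytes is faster (in most cases), since the 286 has no cache and has to read every instruction byte directly from RAM. The same issue happens on 386SX systems.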
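For anyone who wants to double-check that the shorter mask gives the same value, here is a brute-force sketch in Python that models DX as a 16-bit integer. The function names are hypothetical and only the arithmetic is modeled, not the timing.

```python
# Models DX as a 16-bit integer. Function names are hypothetical.

def mask_word_then_shift(dx):
    dx &= 0b0000000111100000        # and dx, 0b0000_0001_1110_0000
    return dx >> 5                  # shr dx, 5

def mask_high_byte_then_shift(dx):
    dh = (dx >> 8) & 0b00000001     # and dh, 0b0000_0001
    dx = (dh << 8) | (dx & 0xFF)    # DL is left untouched
    return dx >> 5                  # shr dx, 5 drops bits 0-4

def sequences_agree():
    return all(mask_word_then_shift(v) == mask_high_byte_then_shift(v)
               for v in range(0x10000))

if __name__ == "__main__":
    print(sequences_agree())
```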

rehsd commented 1 year ago

Thanks, Victor.

When I try this change, I get the following error from NASM.

“error: invalid 16-bit effective address”

Pointing to this line of code: lea dx, [dx*8]

Thanks!

Rich

From: Victor Nieto. Sent: Thursday, June 29, 2023 8:18 AM. To: rehsd/FreeDOS_AppCode. Subject: [rehsd/FreeDOS_AppCode] Some optimization ideas (Issue #1)


rehsd commented 1 year ago

Thank you, Victor! This change resulted in a 4.6% reduction in test duration (improvement)!

Rich

From: Victor Nieto. Sent: Thursday, June 29, 2023 10:25 AM. To: rehsd/FreeDOS_AppCode. Subject: Re: [rehsd/FreeDOS_AppCode] Some optimization ideas (Issue #1)


viti95 commented 1 year ago

Well, I've found out that LEA is not as powerful on 286 CPUs as on the 386+ (I use this trick extensively in FastDoom): 16-bit addressing has no SIB (scale-index-base) byte, so a scaled form like [dx*8] cannot be encoded. Replacing that instruction with three single SHL reg, 1 instructions should work:

mov dh, dl      ;2 cycles, shift 8 times to the left
xor dl, dl      ;2 cycles, clean up non required bits and lower byte
shl dx, 1       ;2 cycles
shl dx, 1       ;2 cycles
shl dx, 1       ;2 cycles
;total 10 cycles, 10 bytes

EDIT: This uses more bytes, so it may end up slower because of the extra instruction fetches. There is another way to get smaller, and possibly faster, code:

ror dx, 5           ;5+5 cycles
and dx, 0b11111000_00000000 ;3 cycles
;total 13 cycles, 7 bytes
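The rotate works because rotating a 16-bit register right by 5 moves bits 0-4 into bits 11-15, which is where a shift left by 11 would put them. Here is a small Python sketch that checks this for every 16-bit value; `ror16` and the other names are hypothetical helpers written for this check, not anything from graphics.asm.

```python
# Models DX as a 16-bit integer. All names here are hypothetical.
MASK16 = 0xFFFF

def ror16(value, count):
    """Rotate a 16-bit value right by count bits."""
    count %= 16
    return ((value >> count) | (value << (16 - count))) & MASK16

def mask_then_shift11(dx):
    return ((dx & 0x001F) << 11) & MASK16   # and dx, 0x001F / shl dx, 11

def three_single_shifts(dx):
    dl = dx & 0xFF
    dx = dl << 8                            # mov dh, dl / xor dl, dl
    for _ in range(3):
        dx = (dx << 1) & MASK16             # shl dx, 1
    return dx

def rotate_then_mask(dx):
    dx = ror16(dx, 5)                       # ror dx, 5
    return dx & 0b1111100000000000          # and dx, 0b11111000_00000000

def sequences_agree():
    return all(mask_then_shift11(v) == three_single_shifts(v) == rotate_then_mask(v)
               for v in range(0x10000))

if __name__ == "__main__":
    print(sequences_agree())
```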

viti95 commented 1 year ago

Another optimization that can be done (vga_draw_pixel_faster, lines 250-252):

in  ax, VGA_REG             ; read the VGA register
and ax, 0b11111111_11110000 ; keep all bits except segment
or  ax, dx                  ; update segment bits

to

in  ax, VGA_REG     ; read the VGA register
and al, 0b11110000  ; keep all bits except segment
or  ax, dx          ; update segment bits

The upper mask byte is all ones, so AH is unchanged either way and masking only AL gives the same result. It is just 1 byte smaller.
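A quick way to convince yourself that the two AND forms leave AX in the same state is to try every 16-bit value. This is a sketch in Python that models AX as a 16-bit integer; the function names are hypothetical and the port read itself is not modeled.

```python
# Models AX as a 16-bit integer. Function names are hypothetical.

def mask_word(ax):
    return ax & 0b1111111111110000      # and ax, 0b11111111_11110000

def mask_low_byte(ax):
    al = (ax & 0xFF) & 0b11110000       # and al, 0b11110000
    return (ax & 0xFF00) | al           # AH is not touched

def sequences_agree():
    return all(mask_word(v) == mask_low_byte(v) for v in range(0x10000))

if __name__ == "__main__":
    print(sequences_agree())
```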
