Open viti95 opened 1 year ago
Also optimizable in vga_draw_pixelfaster (lines 223-224):
and dx, 0b0000_0001_1110_0000 ; 4 bytes
shr dx, 5 ; 3 bytes
;total 7 bytes
replace with
and dh, 0b0000_0001 ; 3 bytes
shr dx, 5 ; 3 bytes
;total 6 bytes
Why? Having to decode fewer bytes is faster (in most cases), since the 286 has no cache and has to read every instruction directly from RAM. The same issue occurs on 386SX systems.
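To check that the shorter sequence really computes the same value, here is a minimal C sketch that models both sequences on a 16-bit value and compares them for every possible DX. The function names are made up for illustration and are not part of graphics.asm.

```c
#include <stdint.h>

/* Models: and dx, 0b0000_0001_1110_0000 ; shr dx, 5 */
uint16_t extract_original(uint16_t dx) {
    dx &= 0x01E0u;                  /* keep bits 5..8 only */
    return (uint16_t)(dx >> 5);
}

/* Models: and dh, 0b0000_0001 ; shr dx, 5 */
uint16_t extract_proposed(uint16_t dx) {
    uint8_t dh = (uint8_t)(dx >> 8);
    uint8_t dl = (uint8_t)(dx & 0xFFu);
    dh &= 0x01u;                    /* clears bits 9..15, DL is untouched */
    dx = (uint16_t)((dh << 8) | dl);
    return (uint16_t)(dx >> 5);     /* bits 0..4 fall off the bottom */
}

/* Returns how many of the 65536 possible inputs give different results. */
unsigned count_mismatches_extract(void) {
    unsigned mismatches = 0;
    for (uint32_t v = 0; v <= 0xFFFFu; ++v) {
        if (extract_original((uint16_t)v) != extract_proposed((uint16_t)v))
            ++mismatches;
    }
    return mismatches;
}
```

The low five bits do not need masking because the right shift discards them anyway, so only the bits above bit 8 have to be cleared.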
Thanks, Victor.
When I try this change, I get the following error from NASM.
“error: invalid 16-bit effective address”
Pointing to this line of code: lea dx, [dx*8]
Thanks!
Rich
From: Victor Nieto; Sent: Thursday, June 29, 2023 8:18 AM; To: rehsd/FreeDOS_AppCode; Subject: [rehsd/FreeDOS_AppCode] Some optimization ideas (Issue #1)
In graphics.asm, vga_draw_pixelfaster (lines 214-215), it's possible to replace the slow SHL reg, imm instruction with a combination of faster ones:
and dx, 0b00000000_00011111 ;2 cycles
shl dx, 11 ;5+11 cycles
; total 18 cycles
with
mov dh, dl ;2 cycles, shift 8 times to the left
and dx, 0b00011111_00000000 ;2 cycles, clean up non required bits and lower byte
lea dx,[dx*8] ;3 cycles, shift missing 3 times
; total 7 cycles
In general, avoid shift instructions with a count other than 1 on 286 CPUs, as they are slow.
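As a rough illustration of that rule of thumb, the sketch below uses only the cycle figures quoted in this thread (5+n for a shift by an immediate count, 2 for a shift by 1). They are taken as given here, not measured, and the function names are hypothetical.

```c
/* Cycle figures as quoted in this thread for the 286 (assumed, not measured). */

/* shl reg, n with an immediate count: 5 + n cycles. */
unsigned shift_imm_cycles(unsigned n) { return 5u + n; }

/* n separate "shl reg, 1" instructions at 2 cycles each. */
unsigned shift_by_one_cycles(unsigned n) { return 2u * n; }

/* Largest shift count (1..15) for which repeated single shifts
   need strictly fewer cycles than one shift by an immediate. */
unsigned largest_count_where_single_shifts_win(void) {
    unsigned best = 0;
    for (unsigned n = 1; n <= 15; ++n) {
        if (shift_by_one_cycles(n) < shift_imm_cycles(n))
            best = n;
    }
    return best;
}
```

With these figures, repeated single shifts win on cycle count up to a count of 4 and tie at 5. They also take more code bytes, which matters when every instruction byte has to be fetched from RAM.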
Thank you, Victor! This change resulted in a 4.6% reduction in test duration (improvement)!
Rich
Well, I've found out that LEA is not as powerful on 286 CPUs as on 386+ (I use this trick extensively in FastDoom), as it doesn't support SIB (scale-index-base) with 16-bit addressing. Replacing that instruction with individual SHL reg, 1 instructions should work:
mov dh, dl ;2 cycles, shift left by 8 (the old DH is overwritten)
xor dl, dl ;2 cycles, clear the lower byte; the unneeded upper bits are shifted out below
shl dx, 1 ;2 cycles
shl dx, 1 ;2 cycles
shl dx, 1 ;2 cycles
;total 10 cycles, 10 bytes
EDIT: This uses more bytes, so it may be slower. There is another way that gives smaller, and possibly faster, code:
ror dx, 5 ;5+5 cycles
and dx, 0b11111000_00000000 ;3 cycles
;total 13 cycles, 7 bytes
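A quick way to convince yourself that the two 286-safe variants match the original AND + SHL by 11 is to model all three in C and compare them over every 16-bit input. This is only a sketch with made-up function names, not code from the repository.

```c
#include <stdint.h>

/* Models the original: and dx, 0b00000000_00011111 ; shl dx, 11 */
uint16_t shift_original(uint16_t dx) {
    dx &= 0x001Fu;
    return (uint16_t)(dx << 11);
}

/* Models: mov dh, dl ; xor dl, dl ; shl dx, 1 (three times) */
uint16_t shift_byte_move(uint16_t dx) {
    uint8_t dl = (uint8_t)(dx & 0xFFu);
    dx = (uint16_t)(dl << 8);           /* mov dh, dl followed by xor dl, dl */
    for (int i = 0; i < 3; ++i)
        dx = (uint16_t)(dx << 1);       /* each shl dx, 1 discards the top bit */
    return dx;
}

/* Models: ror dx, 5 ; and dx, 0b11111000_00000000 */
uint16_t shift_rotate(uint16_t dx) {
    dx = (uint16_t)((dx >> 5) | (dx << 11));   /* 16-bit rotate right by 5 */
    return (uint16_t)(dx & 0xF800u);
}

/* Returns how many of the 65536 inputs make any variant disagree. */
unsigned count_mismatches_shift(void) {
    unsigned mismatches = 0;
    for (uint32_t v = 0; v <= 0xFFFFu; ++v) {
        uint16_t expected = shift_original((uint16_t)v);
        if (shift_byte_move((uint16_t)v) != expected ||
            shift_rotate((uint16_t)v) != expected)
            ++mismatches;
    }
    return mismatches;
}
```

In the byte-move variant no explicit mask is needed: bits 5..7 of the old DL end up above bit 15 after the three shifts and are simply dropped.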
Another optimization that can be done (vga_draw_pixel_faster, lines 250-252):
in ax, VGA_REG ; read the VGA register
and ax, 0b11111111_11110000 ; keep all bits except segment
or ax, dx ; update segment bits
to
in ax, VGA_REG ; read the VGA register
and al, 0b11110000 ; keep all bits except segment (AH is left untouched)
or ax, dx ; update segment bits
It is just 1 byte smaller.
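The same kind of check works here. The sketch below models both mask-and-merge sequences in C and compares them for every register value, assuming the segment occupies the low four bits (which is what the 0xFFF0 mask suggests). The names are illustrative only.

```c
#include <stdint.h>

/* Models: and ax, 0b11111111_11110000 ; or ax, dx */
uint16_t update_original(uint16_t ax, uint16_t dx) {
    ax &= 0xFFF0u;
    return (uint16_t)(ax | dx);
}

/* Models: and al, 0b11110000 ; or ax, dx */
uint16_t update_proposed(uint16_t ax, uint16_t dx) {
    uint8_t al = (uint8_t)(ax & 0xFFu);
    al &= 0xF0u;                            /* only the low byte is masked */
    ax = (uint16_t)((ax & 0xFF00u) | al);   /* AH passes through unchanged */
    return (uint16_t)(ax | dx);
}

/* Compares both versions for every AX value and every 4-bit segment
   value in DX (assumption: the segment uses bits 0..3). */
unsigned count_mismatches_update(void) {
    unsigned mismatches = 0;
    for (uint32_t ax = 0; ax <= 0xFFFFu; ++ax) {
        for (uint16_t seg = 0; seg < 16; ++seg) {
            if (update_original((uint16_t)ax, seg) !=
                update_proposed((uint16_t)ax, seg))
                ++mismatches;
        }
    }
    return mismatches;
}
```

Because the upper byte of the mask is all ones, masking AX with 0xFFF0 never changes AH, so masking AL alone gives the same result.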
EDIT: A small update on the previous idea: