Open Awesoken2 opened 1 year ago
A similar trick using LEA works for ceilings & floors:
Existing code:
shld ebx, ecx, 22 ;386:3cc, 486:2cc, ebx:[xxxxxxxxxxUUUUUUuuuuuuuuuuVVVVVV]
shld ebx, ecx, 6 ;386:3cc, 486:2cc, ebx:[xxxxUUUUUUuuuuuuuuuuVVVVVVUUUUUU]
and ebx, ebp ;386:2cc, 486:1cc, ebx:[00000000000000000000VVVVVVUUUUUU]
add ecx, edx ;386:2cc, 486:1cc
New code:
lea edx, [ecx+ebp] ;386:2cc, 486:1cc, ecx:[VVVVVVvvvvvvvvvvUUUUUUuuuuuuuuuu]
shr ecx, 26 ;386:3cc, 486:2cc, ecx:[00000000000000000000000000VVVVVV]
shld cx, dx, 6 ;386:3cc, 486:2cc, ecx:[00000000000000000000VVVVVVUUUUUU]
.. ;use ecx
lea ecx, [edx+ebp] ;386:2cc, 486:1cc
shr edx, 26 ;386:3cc, 486:2cc
shld dx, cx, 6 ;386:3cc, 486:2cc
.. ;use edx
Notes:
Wow, it is amazing to see this kind of optimizations after 30 years of DOOM!
I have a Pentium III @ 700MHz with Windows 98SE I can make some benchmarks tests with DOOM in dos mode if needed. I do not have a 386 nor 486 though. On a Pentium III however it would really not practically matter because we would have max FPS anyway, unless FastDoom started to raise all vanilla limits and allowed loading large wads.
By the way to better display the asm code on GitHub you can use this sequence,
```asm
asm codes gores here
```
ie:
xor ebx, ebx ;386:2cc, 486:1cc shld ebx, edx, 7 ;386:3cc, 486:2cc add edx, ecx ;386:2cc, 486:1cc
You can do the same with other languages replacing asm with c cpp etc. It is really unimportant but it took me ages to figure this out by myself, so I am telling :)
Thanks for the asm formatting hint. I edited my posts above and they look much better now!
If you want to test frame rate for the above code, it really has to be on a true 386 or 486. A Pentium III will run the wall loops about the same, and the ceiling loops a lot slower due to the partial register stalls.
If you're optimizing for a Pentium or above, here are some suggestions:
Curiously, if you actually did all of the above, you would basically have the Build Engine which uses all of the above techniques. (It's strange how I would know that? ;-P)
I don't know if FastDoom is autodetecting the CPU at startup. If not, I found this simple way to do that in Watcom C (DOS):
static int getcpu (void)
{
union REGS r;
r.x.eax = 0x0400; //DPMI get version
int386(0x31,&r,&r);
return((r.x.ecx&0xff)*100+86); //Returns: (286),386,486,586,686
}
^ Much easier than installing an invalid opcode handler ; P
Curiously, I never bothered to look at Doom's assembly loops in real detail until recently. I was shocked to find this line which was present in both inner loops (walls & ceilings/floors):
dec [loopcount]
That's a read/modify/write instruction, which takes 6cc on 386, or 3cc on 486. It would have been so easy to free up a register and knock off a few cycles: -4cc on 386, or -2cc on 486 (per iteration). I see FastDoom gets around this by unrolling all the way - a great strategy for 386 & 486. For the Pentium or above though, it's generally better to write tight loops with self-modifying code if you can't fit everything into registers.
Thanks for the ideas @Awesoken2 ! Pretty interesting optimization using LEA instruction, I'll try it for sure. As for register stalls, the 486 has 1 cycle penalty when writing to a lower part of a register (AL for example) and then using it afterwards fully (EAX). Also there is 1 cycle penalty when decoding an instruction with an immediate value and either an index offset or an immediate displacement.
Right now FastDoom only focuses on low end 486s and 386 cpu's so don't care much about Pentium or similar cpu's. I'll use for sure the getcpu function, I know a trick to use the CR2 register as an scratchpad on 386 processors (386SX and 386DX) that allows 32-bit data copy that is faster compared to memory copies.
BTW I changed the original code from
mov ebx,edx
...
shr ebx,25
to
xor ebx,ebx
...
shld ebx,edx,7
because Texas Instruments 486 CPU's datasheet describe SHLD/SHRD to be 1 cycle faster than SHL/SHR. I still have to try this, as Texas Instruments 486 share same core as Cyrix and ST 486s, and the datasheet of those describe the instruction latency to be the same in both cases.
And one question, the Build engine used a backbuffer for rendering? or did it use direct rendering to VRAM?
The penalty for write AL immediately followed by read EAX can be avoided by putting an unrelated instruction between them. Remember to re-order instructions when putting them back into the real code.
That stinks about the 1cc penalty for using both index & base. I see it in the 486 manual now. I totally missed that one. It basically ruins the lea trick for 486. Oh well, there's still the 386, I guess.
Some good news: the operand size prefix (66h) for the "shld cx, dx, 6", which usually has a 1cc penalty, can be overlapped with the previous instruction if it is multi-cycle (i.e.: shr ecx, 26).
I've never heard of this faster memcpy on 386 using CR2.. how does it work?
I know very little about the non-Intel processors from back then. It's crazy that SHLD would ever be faster than SHL as it's more complicated.
The Build Engine from 1995 or before had all kinds of crazy video modes, including unchained modes with support for every resolution I could find working values for. I would render direct to VRAM when possible - including unchained mode. The code sure was ugly though. Horizontal lines had to be done in 4 passes since using outp() to change the map mask was crazy slow.
Before VESA was prevalent, I made some special modes for Tseng ET4000, S3, and Paradise cards. These modes were all 320x200, but could triple buffer using a simple linear format (like mode 0x13). Supporting these cards required reverse engineering to figure out how to change the 64K window, as well as the high bits of the start address register.
By 1996, VESA had taken over - and it was equal or better than all the other modes, especially once linear memory (usually by using UNIVBE) was supported (for hi-res modes). With some prompting from Apogee, I removed the older modes, most of which were slower anyway. I kept a fallback mode to mode 0x13. The only modes to use a memcpy were this fallback mode, and the hi-res VESA modes when linear wasn't supported.
The 386SX (and some 386DXs) suffers from huge memory bottleneck, specially if the system doesn't have any cache. Reading and writing to the CR2/CR3 register is faster than doing from memory, even though the number of cycles of the MOV instruction is a bit higher:
mov r32, CR2 ; 6 cycles
mov r32, CR3 ; 6 cycles
mov r32, mem ; 4 cycles (+ mem latency)
mov CR2, r32 ; 4 cycles
mov CR3, r32 ; 5 cycles
mov mem, r32 ; 2 cycles (+ mem latency)
In FastDoom CR2/CR3 registers aren't used, so it's possible to use them as a scratchpad (for example to access dc_source or ds_source on rendering functions). I did some small tests and proved to be faster on my 386SX and cacheless 386DX, but much slower on any other 486 based system . A similar trick can be used with Cyrix based 386/486 cpu's with the debug registers, but it's not always faster as those cpu's have integrated L1 cache.
I have a new wall loop for 386 & 486. This one has no base+index addressing penalties:
;eax:&pal, ebx:Vhi, ecx:Vlo, edx:Vinclo, esi:&tex, edi:&scr, ebp:Vinchi
add ecx, edx ;386:2cc, 486:1cc, ecx:[vvvvvvvv vvvvvvvv vvvvvvvv v0000000]
adc ebx, ebp ;386:2cc, 486:1cc, ebx:[00000000 00000000 00000000 xVVVVVVV]
and ebx, 3fh ;386:2cc, 486:1cc, ebx:[00000000 00000000 00000000 0VVVVVVV]
mov al, [esi+ebx]
mov al, [eax]
mov [edi-(LINE-1)*SCREENWIDTH], al
If you don't mind texels 128-255 being shifted one column to the right, or wasting 2x memory by repeating columns, you can remove the 'and' and change 'adc' to: 'adc bl, dl' (swapping edx & ebp in the process) to keep the upper 24 bits zeroed.
As usual, be sure to interleave instructions in the actual implementation to avoid penalties.
..and here is a new ceiling/floor loop for 386 & 486:
;eax:&pal
;esi:&tex
;edi:&scr
;ecx:[ ulo .. vhi .. ]
;edx:[uinclo .. vinchi .. ]
;ebx:[ 0 0 vhi uhi ]
;ebp:[ 0 0 0 uinchi]
add ecx, edx ;386:2cc, 486:1cc, ecx:[uuuuuuuu uu000000 VVVVVVvv vvvvvvvv]
adc ebx, ebp ;386:2cc, 486:1cc, ebx:[00000000 00000000 xxxxxx00 0xUUUUUU]
mov bh, ch ;386:2cc, 486:1cc, ebx:[00000000 00000000 VVVVVVvv 0xUUUUUU]
and ebx, 0xfc3f ;386:2cc, 486:1cc, ebx:[00000000 00000000 VVVVVV00 00UUUUUU]
mov al, [esi+ebx]
mov al, [eax]
mov [edi+PLANE+PCOL], al
Notes:
Is it possible to use the ESP register as a additional register if the interrupts are disabled? I've been trying to use it but I get random lockups.
There are 2 ways to use ESP:
Without disabling interrupts, as a small iteration counter:
espbak: dd 0
..
mov dword ptr ds:[espbak], esp
mov dword ptr ds:[endp-4], (esp-num_iters*4) ;pseudocode!
jmp short beg ;for self-modifying code
beg: .. mov ?, [esp+0x88888888] ;optionally address an array .. sub esp, 4 cmp esp, 0x88888888 ;self-modified end pointer endp: jnc short beg
mov esp, dword ptr ds:[espbak] ..
Leave ESP as is for a starting value. You can then subtract 4 from it per iteration as your counter. Compare ESP to a self-modified value (assuming the entire purpose is you're out of registers) to decide when to terminate the loop. Finally, restore to the original value. You don't have to disable interrupts here since ESP always points to a valid clobber-able stack. Just be careful not to count too high, or else you'll run out of stack space. A few hundred iterations should be safe.
2. Disabling interrupts, as a general purpose register:
```asm
espbak: dd 0
..
cli
mov dword ptr ds:[espbak], esp
mov esp, (whatever) ;safely clobber esp here
mov esp, dword ptr ds:[espbak]
sti
Notes:
Curiously, in modern Windows code, ESP can be used freely - i.e. interrupts do not need to be disabled. I think it's because hardware interrupts are running in a different thread.
While reading my trusty "i486 Microprocessor Programmer's Reference Manual", page 844 (G-8), I noticed this stinker under "PREFIX OPCODES":
"On either processor, all prefix opcodes, including OFh, segment override, operand size/addressing, bus-lock, repeat, etc. require an additional clock to decode. This clock can be overlapped with the execution of the previous instruction if it takes more than one clock to execute."
That means your favorite instructions - SHLD & SHRD - are probably taking 1cc longer than you think, at least on Intel processors.
Too bad I don't have a real 486 to actually test this with. The ancient 486SX-20 laptop I thought I had just lost its backlight - and apparently the VGA connector doesn't work either. Boo. So I installed both PCEm and 86Box, assuming they'd be more accurate than DosBox. Well, I found that they really aren't. They just stupidly do simulated delays based on the instruction pneumonic, completely ignoring things like whether parameters are register or memory form. Forget about simulating penalties and they even say they don't simulate cache behavior. Oh well.
Boo. So I installed both PCEm and 86Box, assuming they'd be more accurate than DosBox. Well, I found that they really aren't. They just stupidly do simulated delays based on the instruction pneumonic, completely ignoring things like whether parameters are register or memory form. Forget about simulating penalties and they even say they don't simulate cache behavior. Oh well.
Sorry for your hardware loss. PCem is better than DOSBox but far from perfect indeed. The point of those emulators is to have a wide software support and DOSBox does not care at all about accurate details emulation. PCem on the other hand is more focused on accuracy (at least from what they claim) but is still far form its goal especially in cache handling as you said, I saw this by testing with SpeedSys and you only get a flat line on the memory bandwidth test.
Hopefully the situation will improve for now I see PCem as a tool to use old software not really to develop new software for old hardware.
While reading my trusty "i486 Microprocessor Programmer's Reference Manual", page 844 (G-8), I noticed this stinker under "PREFIX OPCODES":
"On either processor, all prefix opcodes, including OFh, segment override, operand size/addressing, bus-lock, repeat, etc. require an additional clock to decode. This clock can be overlapped with the execution of the previous instruction if it takes more than one clock to execute."
That means your favorite instructions - SHLD & SHRD - are probably taking 1cc longer than you think, at least on Intel processors.
Too bad I don't have a real 486 to actually test this with. The ancient 486SX-20 laptop I thought I had just lost its backlight - and apparently the VGA connector doesn't work either. Boo. So I installed both PCEm and 86Box, assuming they'd be more accurate than DosBox. Well, I found that they really aren't. They just stupidly do simulated delays based on the instruction pneumonic, completely ignoring things like whether parameters are register or memory form. Forget about simulating penalties and they even say they don't simulate cache behavior. Oh well.
I've tried on a Intel 486DX4 75MHz and the difference between MOV+SHR and XOR+SHLD is basically none. Still have to try on a Cyrix based 486 to see if the datasheets are right. Also I have to test any 386 to see if there is any performance difference.
@Awesoken2 I've been able to use your idea to render columns using interleaved code, the code is in this branch: https://github.com/viti95/FastDoom/tree/ken-silverman-optimizations
For now only is implemented for high detail and Mode Y executables (LEA+SHR), but this will allow me to check on real hardware the performance difference.
Some benchmarks (only for R_DrawColumn high detail):
UMC Green 486 (50MHz): Old code: 42.732 fps New code (LEA+SHR): 43.307 fps
Intel 486DX2 (66MHz): Old code: 37.651 fps New code (LEA+SHR): 37.819 fps
ST 486 DX2 80MHz (Cyrix): Old code: 41.397 New code (LEA+SHR): 40.785
That's +1.3%, +0.4%, and -1.5% (respectively). Are those fps numbers consistent across runs? If you're using the 140Hz interrupt counter and doing a multi-minute test, that's kind of slow and crappy. With a more precise timer, you could get a number instantly and not have to wait minutes per test. If the RDTSC instruction were available, that would be perfect. Too bad it doesn't exist on a 486. A good alternative is to use the 1.193MHz PIT, for example:
//Use PIT (1193181Hz) channel 2 for precise timing. Notes:
//* will interfere with PC Speaker code that uses PIT timer 2.
//* counter is 16-bit, so FPS < 18.2Hz requires some extra code to resolve.
//* PIT counter decrements - remember to negate!
static void hitim_start(void) { outp(0x61,inp(0x61)|1); outp(0x43,0xb4); outp(0x42,0); outp(0x42,0); } //period=65536
static int hitim_get (void) { int i; outp(0x43,0x84); i = (int)inp(0x42); return (int)inp(0x42)*256 + i; }
static void hitim_stop (void) { outp(0x61,inp(0x61)&0xfc); }
You can use the 140Hz interrupt counter to resolve the upper 16 bits like this:
//int gametics = 0;
//interrupt IRQ8_timer_handler() { .. gametics++; .. } //140Hz
int hitic0 = tim_get(); int gametic0 = gametics; //read 1st time stamp (or use previous frame's time stamp)
.. /*stuff to be timed*/ ..
int hitic1 = tim_get(); int gametic1 = gametics; //read 2nd time stamp
int dt = (gametic1-gametic0)*(1193181/140); //rough approx; multiplier should match period of PIT channel 0
dt -= 32768; dt += (((hitic0-hitic1) - dt)&0xffff); //quantize to best 16-bit interval. Note: PIT counts backwards!
//printf("%.3f fps",1193181.f/(float)dt); //show immediate fps; better to fifo dt's & display median. ; )
}
BTW, I posted a 2nd wall loop (about halfway down the thread) that has the potential to be even faster, and should definitely be tested as well. Perhaps we should call them:
Yes I repeat the benchmarks multiple times, so the result is pretty much stable. And yeah benchmarks are slow on Doom, as every frame has to be rendered to get the final result.
BTW there is an issue with the later wall optimization (that's why I haven't posted results with it). EBP register is used to jump to the unrolled texture mapping, so it cannot store an initial value:
jmp [scalecalls+4+ebp*4]
Maybe it's possible to patch the instruction in realtime in order to free the EBP register, I'm still trying.
Some unexpected results for LEA+SHR. I've tested on my 386SX@33 MHz on potato detail (same code as high detail) and this is the result:
original: 14.224 flipflop: 14.085
So also a bit slower. I think is this because the 386 has this two penalties:
DECODE-AFTER-JUMP OVERHEAD
When a jump is performed, a 386 spends some extra time decoding the next instruction to execute and flushing the pipeline.
THE FIRST INSTRUCTION requires exactly ONE EXTRA CPU CYCLE for every COMPONENT ( prefixes, opcode, mod/rm , sib ,offset, immediate value ) of the instruction. Notice i said COMPONENT!!! An 8bit displacement or a 32bit one, count always as an 1 extra cycle.
SHORT INSTRUCTIONS
Short instructions are loaded and decoded faster than longer ones and since 386 has no internal cache and less memory access bandwidth than a plain 486, this helps a lot.
Ah. I wasn't thinking about that when allocating registers. Here are some easy solutions:
Combine texture increments (edx:Vinc_lo & ebp:Vinc_hi) into the same register since their bits don't overlap. The extra amount being added to the low end of Vinc_lo (starting at bit 7) will never be noticed. You could use a 'rol edx, 7' to initialize the increment nicely.
Store jump target on the stack:
..
lea ebx, [scalecalls+4+ebp*4]
push ebx
mov ebx, edx
shr ebx, 25
jmp dword ptr [esp]
..
vscale1: add esp, 4 (pop, etc.) ret
I don't see how a jump penalty would cause a noticeable slowdown, as there are no jumps other than the initial one into the big unrolled loop. Perhaps there's a code alignment problem? I wonder if sticking nop's after the align 4 would help.
Yeah, the lea instruction may have a 1cc penalty for using an index, so it wouldn't surprise me if it doesn't run any faster. For this reason, I think the 'noshift' loop is a lot more promising, especially on a 386.
I've solved the EBP issue for now storing it in a variable in ram and jumping using that variable. Now I'm trying to add the new optimized wall loop, but I don't understand well how it works, or at least well what each variable means.
I guess &pal is the palette ptr, &tex the texture ptr and &scr the VRAM ptr, but the other ones no idea 😅
Here's a picture of the texture registers I use in the noshift loop:
31...24 23...16 15.....8 7......0
ecx: V_lo: [vvvvvvvv vvvvvvvv vvvvvvvv v-------]
ebx: V_hi: [-------- -------- -------- -VVVVVVV]
edx: vInc_lo: [iiiiiiii iiiiiiii iiiiiiii i-------]
ebp: vInc_hi: [-------- -------- -------- -IIIIIII]
key:
lower case: lower bits / fractional part
upper case: upper bits / your 7-bit texel coordinate
-: don't care (could be 0 or garbage before being masked off)
The idea is the same as adding a 64-bit number using two 32-bit registers:
add eax, ecx ;add lo 32 bits
adc ebx, edx ;add hi 32 bits, using carry flag from previous instruction
The only difference is I'm using 32 bits in the middle (bits 7 to 38).
Not really an issue, but an opportunity!
For comparison, here is the existing FastDoom code for texture mapping walls, reordered for simplicity, and showing only the math for texture mapping:
You can easily eliminate 1 clock cycle (cc) on 486 (-2cc on 386!) using 'lea' to perform both an 'add' and 'mov':
You must write out 2 iterations in your unrolling macro, as ebx & edx are being swapped on each iteration. In the setup code, write out an extra pixel so the count is always even.
-Ken S.