goshhhy / 486quake

quake but it do a faster
52 stars 6 forks

Can you give more details about how it works ? (+ some ideas) #15

Open tigrouind opened 5 months ago

tigrouind commented 5 months ago

Could you elaborate a bit more on the main optimizations of this project that make it faster on the 486 and other non-Pentium CPUs? The README does not say much.

AFAIK a lot of CPU time in Quake is spent on perspective correction, as it requires floating-point division. Quake mitigates this by interleaving FPU and ALU operations (which is possible on the P5 architecture) and by only doing the division once every 16 pixels (in between, values are interpolated using the ALU).
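A minimal sketch of that subdivision scheme (hypothetical names and simplified logic, not Quake's actual `d_scan.c` code):

```c
/* Perspective-correct span drawing with 16-pixel subdivision:
 * one pair of FP divides per 16 pixels, linear interpolation in between.
 * The real engine additionally overlaps the divide with integer pixel
 * writes, since the FPU runs asynchronously from the ALU. */
#define SUBDIV 16

void draw_span(unsigned char *dest, const unsigned char *tex, int texw,
               float uz, float vz, float zi,              /* u/z, v/z, 1/z at span start */
               float uz_step, float vz_step, float zi_step, /* per-pixel steps */
               int count)
{
    while (count > 0) {
        int run = count < SUBDIV ? count : SUBDIV;

        /* exact texel coordinates at both ends of the run (the divides) */
        float z0 = 1.0f / zi;
        float z1 = 1.0f / (zi + zi_step * run);
        float u0 = uz * z0,                   v0 = vz * z0;
        float u1 = (uz + uz_step * run) * z1, v1 = (vz + vz_step * run) * z1;

        /* linear (affine) interpolation across the run -- ALU-friendly */
        float du = (u1 - u0) / run, dv = (v1 - v0) / run;
        for (int i = 0; i < run; i++) {
            *dest++ = tex[(int)v0 * texw + (int)u0];
            u0 += du; v0 += dv;
        }

        uz += uz_step * run; vz += vz_step * run; zi += zi_step * run;
        count -= run;
    }
}
```

With a 16-pixel run the perspective error in the interior is usually below a texel, which is why the approximation is invisible in practice.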

Some ideas to make it faster on the 486 (maybe these have already been tried):

goshhhy commented 5 months ago

Interleaving FPU and ALU operations is actually possible all the way back to the 8086/8088/8087 chips, so that optimization remains in use. The FPU operates asynchronously from the rest of the CPU, so interleaving integer and floating-point operations is indeed a way to improve performance. The stock engine goes to great lengths to do this where it makes the most sense, which mostly comes down to the parts that do perspective-correct texture mapping - as you suggest - and, to a lesser extent, the parts that cull objects outside the frame by walking the BSP tree. The latter is not well optimized in software Quake, so I have been working on backporting the optimized assembly routine for it from GLQuake.

The timings of the 486 differ from the Pentium's by quite a bit, and that warranted changes in critical sections to accommodate. One oddity is that the 486 is sometimes faster at accessing memory in the L1 cache than at accessing an FPU register, so there are occasions where it makes sense to move a variable that was previously stored in a register into memory for the duration of a function.

I usually do quite a bit of testing to ensure that individual changes I make actually have the effect on performance that I expect - this can be difficult, but sometimes it is easy to isolate the code in question and turn it into an "artificial benchmark" of sorts that gives me a better idea whether a particular instance of this sort of change is worth it. A collection of these is essentially what the "Qmark" binary is.

For the Pentium, an additional optimization is done, which takes advantage of the fact that the FPU itself is pipelined, able to have three operations in flight at once, but with a 3-cycle latency. The Pentium architecture accommodates this by internally fusing FXCH operations with other FPU operations and performing them in the same cycle (using register renaming, I believe). So you can swap FPU registers "for free" between operations in order to keep that pipeline filled. This optimization is documented in Michael Abrash's Black Book, Chapter 63, Heading 6, which at the time of writing is freely available online. Other parts of this book also offered crucial insights for some optimizations that I am still attempting to implement. Since FXCH is not free on a 486, but instead takes 4 cycles, undoing the FXCH optimizations done for the Pentium often results in a measurable performance increase. I attribute a substantial amount of the performance gain to this.
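A schematic illustration of the pattern (not code from this repo): on the Pentium, FXCH pairs with an adjacent FPU instruction for free, so rotating the stack between independent multiplies hides the 3-cycle latency; on a 486 there is no pipeline to fill, so each FXCH is pure overhead.

```asm
; Pentium-style scheduling: keep three independent multiplies in flight.
fmul    st(0), st(3)    ; start multiply A
fxch    st(1)           ; free on the Pentium, ~4 cycles on a 486
fmul    st(0), st(4)    ; start multiply B while A is still in the pipe
fxch    st(2)
fmul    st(0), st(5)    ; start multiply C while A and B are in the pipe

; 486-friendly version: no FPU pipeline, so drop the FXCHes and
; issue the multiplies back to back into separate stack slots.
fmul    st(1), st(0)
fmul    st(2), st(0)
fmul    st(3), st(0)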

Rather than just changing the subdivision size for the texture mapping, I've instead explored changing the overall way that texture mapping is done, and finding edge cases that I can optimize. The former has involved attempting to rewrite the renderer to use either constant-Z drawing or a form of tile-based rendering similar to early PowerVR GPUs and most modern GPUs; I haven't completed either of these alternatives yet. The latter has involved adding checks for surfaces that have a constant horizontal or vertical Z (i.e. completely level ceilings and floors), and using an alternate rendering algorithm which takes advantage of the fact that Z never has to be adjusted across the entire scanline.
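For a completely level floor or ceiling, every pixel on a given screen row sits at the same depth, so 1/z is constant across the scanline and a single divide per row gives exact (not approximated) affine stepping. A hypothetical sketch of such a fast path (names invented, not 486quake's actual routine):

```c
/* Fast path for surfaces with constant z along a scanline (e.g. a level
 * floor): one divide for the whole span, then pure affine stepping with
 * no per-run subdivision needed. */
void draw_flat_span(unsigned char *dest, const unsigned char *tex, int texw,
                    float uz, float vz, float zi,   /* u/z, v/z, 1/z (constant) */
                    float uz_step, float vz_step,   /* per-pixel steps of u/z, v/z */
                    int count)
{
    float z  = 1.0f / zi;                       /* the only divide */
    float u  = uz * z,       v  = vz * z;
    float du = uz_step * z,  dv = vz_step * z;  /* constant z => constant steps */

    while (count-- > 0) {
        *dest++ = tex[(int)v * texw + (int)u];
        u += du; v += dv;
    }
}
```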

Using fixed-point math is frequently not a win. I've tested this a number of times, and the 486 is simply not good enough at integer math, at least without very clever alternative algorithms or approximations. Integer math is usually, on average, slightly but measurably slower than the single-precision FPU math used in Quake. I have also experimented with targeting Cyrix processors. These are actually good at integer math - a multiply takes only 3 cycles, for example, where the 486 can take up to 44. Even on the Cyrix, though, relying on integer math alone isn't really faster. The right answer usually ends up being either a careful mix of FPU and integer instructions, as is already done, just with some adjustment for the timings - or, in a few cases, it actually becomes interesting to consider FPU emulation on the integer unit while also doing real FPU math.

I've experimented with this but do not have a good working version that retains proper visual fidelity. One version I tested does a very similar trick to the Quake 3 "fast inverse square root" approximation, but with the "square" part taken out, for which I had to search for an appropriate magic number. This unfortunately results in a very noticeable amount of warping on textures, so I did not end up using it.

A more recent toolchain does also contribute measurably, but not overwhelmingly.

Details of the above explanation might not be exactly right, but those are the broad strokes.

goshhhy commented 5 months ago

Oh, and this:

> Fill walls with a solid color instead of a texture to see if this is really the bottleneck.

This is supported, even in the stock engine, by setting the cvar r_drawflat to 1. There are a number of other useful cvars for debugging the performance of the renderer as well, though I've also added some things of my own, like a better version of the overlay for displaying time taken by the various subsystems during a frame (r_dspeeds), and a cvar (r_perfdebug) that I use during development to flip between two different implementations of a routine to test which one is faster in various scenarios.
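For reference, at the Quake console (`r_drawflat` and `r_dspeeds` are stock cvars, `timedemo` is the stock repeatable benchmark; `r_perfdebug` is specific to this project):

```
] r_drawflat 1
] r_dspeeds 1
] timedemo demo1
```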

goshhhy commented 5 months ago

Also, because of the above, testing on real hardware is vital. I test code in emulators like DOSBox or 86Box on occasion, mostly just to make sure I didn't completely break anything before I transfer it to my DOS system. But since so many of the optimizations are microarchitecture-focused, emulators offer limited utility, and on occasion there are surprising differences between the emulated system and the collection of real hardware I test with. By just timing the performance of a few functions, Qmark can make a pretty good guess whether it's in an emulator or running on real hardware. I once tried to modify 86Box to more accurately emulate FPU timing details, which would've made it a good alternative to testing on real hardware, but this turned out to be too big a drain on performance and it was reverted shortly after it was merged.
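One way such a guess could work (a hypothetical sketch, not Qmark's actual code): time an FP-divide-heavy loop against an integer-heavy loop. On a real 486 an FDIV costs tens of cycles while an integer add costs one, so the ratio is large; an emulator that effectively charges every instruction the same cost yields a ratio near 1.

```c
#include <time.h>

static volatile float f_sink;   /* volatile sinks keep the loops from */
static volatile int   i_sink;   /* being optimized away               */

/* Ratio of time spent in dependent FP divides vs. cheap integer adds. */
float fdiv_vs_add_ratio(int iters)
{
    clock_t t0 = clock();
    float f = 123456.0f;
    for (int i = 0; i < iters; i++)
        f = 1.0f / (f + 1.0f);          /* dependent divides: can't overlap */
    f_sink = f;

    clock_t t1 = clock();
    for (int i = 0; i < iters; i++)
        i_sink += i;                    /* cheap integer work */
    clock_t t2 = clock();

    clock_t fp_time  = t1 - t0;
    clock_t int_time = t2 - t1;
    return (float)fp_time / (float)(int_time > 0 ? int_time : 1);
}
```

A flat-cost emulator would report a ratio close to 1 here, while real hardware with a slow FDIV reports something much larger.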

tigrouind commented 5 months ago

DOSBox: AFAIK it's not cycle-accurate; all instructions are counted as one cycle (e.g. setting 3000 cycles will execute 3000 instructions every ms, regardless of instruction type). Additionally, there are many things which are virtually free in DOSBox but not on a real 486 (e.g. copying data from memory to the graphics card). So you can't rely on that emulator too much when doing optimization (as you mentioned). It might give a good approximation, but that's all.

Renalicious commented 2 months ago

I realize that this project is mainly about speeding up Quake on the 486, but I'm curious about some of the 586 and MMX build options - what exactly is new there? I'm not seeing any MMX-specific code, so is this something the compiler implements at build time? To that end, are there any MMX-specific optimizations that could be added, even outside of the rendering parts of the engine?

As far as I understand, back in the day Carmack was more interested in keeping the code platform/processor agnostic, and it wasn't until Quake 3 that he finally implemented a single MMX optimization, for memory copy or something like that. Other Q3 forks added MMX/SSE optimizations for sound mixing, etc.

I'm definitely not a seasoned programmer, more of a hobbyist tinkerer, so I find this stuff fascinating! Great stuff! :)

goshhhy commented 2 months ago

For the 586, there is very little new - the 586 build simply uses the stock-optimized code. There are a few changes elsewhere in the C portions of the codebase that may affect performance in some cases, but for the most part they won't.

For MMX, I've tested a few changes related to it, which is why I have it in the build system, but I don't think I've done anything extensive on that front - I spent a lot more time working on Cyrix optimization instead.

Over time, I'm becoming less interested in optimizing exclusively for 486 systems, and more interested in broader optimizations that are likely to affect more targets; for that reason, I'm also considering renaming the project soon.