joncampbell123 / dosbox-x

DOSBox-X fork of the DOSBox project
GNU General Public License v2.0

Voodoo and CPU emulation performance #3959

Open Torinde opened 1 year ago

Torinde commented 1 year ago

For CPU emulation there is a DOSbox original table on Host-Emulated CPU equivalency

What is the performance of the current Voodoo emulation? (some info is here and a related discussion is 86Box/86Box/discussions/2909)

Similar table of correspondence would be helpful as well, e.g. "to emulate Voodoo 2 SLI you need Zen2 at 3GHz".

Is the current Voodoo emulation multi-threaded? 86Box's is. If all 3dfx cards were added as emulation options with multi-threading, and the user has a recent multi-core CPU, the top 3dfx card might reach 1328 MPixels/s, which is equivalent to a GeForce FX 5500/Radeon X300 (the minimum requirements for Age of Empires III, one of the last games that still works on Win9x, disregarding the lack of D3D9 for the moment).

Originally posted by @Torinde in https://github.com/joncampbell123/dosbox-x/discussions/3867
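For readers wondering where a figure like 1328 MPixels/s comes from, here is a minimal sketch of the usual fill-rate arithmetic: chips times pixels per clock times clock in MHz. The board parameters are assumptions not stated in the thread (a four-chip Voodoo 5 6000 at 166 MHz with two pixels per clock per chip, and a two-board Voodoo 2 SLI at 90 MHz with one pixel per clock), and `fillRateMPixels` is a hypothetical helper name.

```cpp
#include <cassert>

// Theoretical peak fill rate in MPixels/s.
// All board parameters passed in are assumptions, not measurements.
constexpr unsigned fillRateMPixels(unsigned chips,
                                   unsigned pixelsPerClock,
                                   unsigned clockMHz) {
    return chips * pixelsPerClock * clockMHz;
}

// Assumed Voodoo 5 6000 layout: 4 chips, 2 pixels/clock each, 166 MHz.
constexpr unsigned kVoodoo5_6000 = fillRateMPixels(4, 2, 166);

// Assumed Voodoo 2 SLI layout: 2 boards, 1 pixel/clock each, 90 MHz.
constexpr unsigned kVoodoo2Sli = fillRateMPixels(2, 1, 90);
```

These are theoretical peaks only; they say nothing about how fast a software emulator on a given host can actually rasterize.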

Torinde commented 1 year ago

DOSBox-X may or may not be faster than original DOSBox. I have not benchmarked them side-by-side with the same configs.

Keep in mind that DOSBox-X has a lot of additional emulation options and added emulation accuracy, and those options and accuracy typically come at a cost. So it would not surprise me if DOSBox-X were to be slower as a result.

Regarding cycles and retro CPU types, in the DOSBox-X drop-down menus there are already options to set cycles to something that approaches a certain CPU type.

Also keep in mind that the way cycles works is completely different from the MHz or GHz rating of CPUs. See https://dosbox-x.com/wiki/Guide%3ACPU-settings-in-DOSBox%E2%80%90X#_is_dosbox_x_cycle_accurate

Originally posted by @rderooy in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4294255

OK, so the question is what host CPU is needed to provide a set amount of cycles: image

Originally posted by @Torinde in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4294426

Torinde commented 1 year ago

According to my tests, the CPU emulation is very similar in both. Sometimes one can be slightly faster, sometimes the other. DOSBox-X has slower graphics emulation (svga_s3) and this is clearly reflected in the tests (even in text mode).

Comparing Speed Test results on my old hardware (Core i5-450M) over five launches, both show the same result, 766.8 XT, four times out of five (one run differs). The results were only slightly lower during video recording: 758.2.

speed.test.dosbox-x.vs.dosbox-svn.webm

Performance is influenced by how and with what compiler both binaries were compiled. In my case both were built with the same compiler and using identical compiler flags.

edit: For comparison. QEMU (kvm) 6748.2 XT

Originally posted by @grapeli in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4296081

OK, since performance is similar, then the table from DOSbox original applies! So, only two CPUs remain to be quantified:

Originally posted by @Torinde in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4296339

Torinde commented 1 year ago

The results from this table I suspect come from 8-9 years ago. Treat them as indicative only.

Few people currently use Core2Duo. Core i5 4xxx series processors come from 2013-2014. Dosbox has changed over these 8-9 years as well as operating systems, compilers...

I suspect that any relatively new AMD or Intel processor clocked over 4.6 GHz will achieve an emulation speed corresponding to a >= 1 GHz Pentium processor.

Originally posted by @grapeli in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4298847

OK, that's encouraging! Still, to cater for hosts with less than the highest frequencies, and to push guest CPU performance to the maximum of the ISA era (Win9x and especially WinXP: P6 at above 2GHz, P4 at almost 4GHz), it would be useful to have a hypervisor core, for example based on virt86 as discussed in #1089.

GPU tests from 86Box/86Box/discussions/2909 show that 86box multithreaded emulation theoretically can reach even Voodoo 5 6000 speed on top-line modern multi-core host CPU. But it's not there yet.

Originally posted by @Torinde in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4503795

Torinde commented 1 year ago

Clock speed of the host CPU is the most important factor, as it will determine how many cycles can be emulated inside DOSBox-X.

Extra CPU cores, once you get past dual-core CPUs, typically do not help, so if you were to choose a new CPU for running DOSBox-X, clock frequency is the most important factor.

A newer CPU generation can also help, as a newer CPU can often run more instructions per clock cycle than older ones.

Originally posted by @rderooy in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4299435

A newer CPU generation can also help, as a newer CPU can often run more instructions per clock cycle than older ones.

I just checked on Google Cloud Shell. AMD EPYC 7B12 processor clocked at 2.25GHz. Speed Test result under dosbox-x is 1273 XT. Nearly 2x faster than 2010 Intel Westmere clocked at 2.4GHz (2.6 with turbo).

Originally posted by @grapeli in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4301199
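To separate clock speed from per-clock efficiency in the numbers above, the scores can be normalized by host frequency. This is a rough sketch: the helper names are hypothetical, and pairing the earlier 766.8 XT result with the 2.4 GHz (2.6 GHz turbo) Westmere host is an assumption based on the earlier Core i5-450M post.

```cpp
#include <cassert>

// Speed Test score (in "XT" units, as reported in the thread) and host clock in GHz.
struct HostResult {
    double scoreXT;
    double clockGHz;
};

// Score per GHz: a rough proxy for emulation throughput per host clock.
constexpr double scorePerGHz(HostResult r) {
    return r.scoreXT / r.clockGHz;
}

// How many times faster host `a` is than host `b`, per clock.
constexpr double perClockRatio(HostResult a, HostResult b) {
    return scorePerGHz(a) / scorePerGHz(b);
}

// Figures quoted in the thread. Attributing 766.8 XT to the Westmere
// host is an assumption, not something the thread states outright.
constexpr HostResult epyc7B12{1273.0, 2.25};
constexpr HostResult westmereBase{766.8, 2.4};
constexpr HostResult westmereTurbo{766.8, 2.6};
```

Under these assumptions the raw score ratio is about 1.66, while the per-clock ratio is about 1.77 against the base clock and about 1.92 against the turbo clock, which fits the "nearly 2x" remark if the older CPU was running at turbo.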

Torinde commented 1 year ago

In the past I have tested CPU/GPU performance on DOSBox-X extensively. I think that the biggest issue is not how fast it runs but how accurate (predictable) it is, and whether it can be guaranteed to service processing requests as a stable system. You can always try to emulate a faster system, but it is obvious what happens when it is not stable.

However, straight to the point:

Originally posted by @dodleh in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4335978

Torinde commented 1 year ago

Certainly, this is difficult to achieve due to the nature of the PC - that is, the whole variety of components that affect performance.

I'm rather concerned that DOSBox-X has lost the original graphics emulation performance presented in the main dosbox branch.

I just launched Fallout (1997). The difference in how this game works with -svn and -X is big. Caused by write_p3c9.

52.89%  dosbox-x                   [.] write_p3c9
0.32%  dosbox                      [.] write_p3c9

Many people run a benchmark, Chris' 3d Benchmark. There is a big difference there as well, caused by VGA_ChainedVGA_Slow_Handler::writed, which slows it down. That handler does bring the desired results in the Legend demo, unlike dosbox-svn, but I suspect that many more people will run this benchmark than this demo.

Originally posted by @grapeli in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4370675

Torinde commented 1 year ago

DOSBox-X emulates video memory delay. The defaults attempt to emulate ISA bus delays unless you enable PCI enumeration and PCI VGA emulation, in which case it emulates PCI bus delays.

The video memory delay is fully configurable and can be set to 0 to disable it entirely, of course.

As for why write_p3c9 is slowing things down, I'm not entirely sure, I'll look into it.

Originally posted by @joncampbell123 in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4375593

The video memory delay is fully configurable and can be set to 0 to disable it entirely, of course.

This benchmark itself is of little importance, but some people may use these results as a reference point, and this has a bad effect on the perception of DOSBox-X. With vmemdelay=0 and vmemdelay=-1 it's the same. The result for me is about 30% worse than under dosbox-svn. Rather, it is largely due to this difference.

16.41% dosbox-x                 [.] VGA_ChainedVGA_Slow_Handler::writed
2.81%  dosbox                   [.] VGA_ChainedVGA_Handler::writed

edit: For SVGA the difference to the disadvantage of dosbox-x is about 10% (here VGA_ChainedVGA_Slow_Handler::writed is not called), which is definitely less than the 25-30% in VGA mode. The most authoritative comparison is probably with SDL_VIDEODRIVER=dummy. SDL2 offers a slightly better offscreen output, but -svn is based on SDL1.

Originally posted by @grapeli in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4376782

Torinde commented 1 year ago

To put forward a theory about 3c9 by the way, there is some intricate code involved that carefully maps the VGA palette with consideration for the attribute controller, DAC mask, and oddities involved with 256-color mode as well as an alternate oddity with how Tseng ET4000 cards differ from standard VGA behavior in that way.

Originally posted by @joncampbell123 in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4375648

Torinde commented 1 year ago

If I am right, one way to optimize that is not to process the whole mapping on every single 3c9 write, but instead to process the entire palette at vertical retrace or on the next call to the VGA routine responsible for rendering the next scan line on screen.

Originally posted by @joncampbell123 in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4375656
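The deferred-remap idea described above can be sketched with a dirty flag. This is only an illustration under simplifying assumptions, not DOSBox-X's actual code: every name here is hypothetical, the attribute controller and ET4000 quirks are left out, and the derived table only applies the DAC mask.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical palette state; a simplified stand-in for the real VGA DAC logic.
struct LazyPalette {
    std::array<uint32_t, 256> dac{};   // raw colours as written through port 3C9h
    std::array<uint32_t, 256> xlat{};  // derived table the line renderer reads
    uint8_t mask = 0xFF;               // DAC pixel mask
    bool dirty = false;                // set when xlat is out of date
    unsigned rebuilds = 0;             // counts full remaps, for illustration only

    // Per-write path: store the colour and mark the table stale. No remap here.
    void write(uint8_t index, uint32_t rgb) {
        dac[index] = rgb;
        dirty = true;
    }

    void setMask(uint8_t m) {
        mask = m;
        dirty = true;
    }

    // Called once per scan line or at vertical retrace: remap only if needed.
    void flushIfDirty() {
        if (!dirty) return;
        for (unsigned i = 0; i < 256; ++i) {
            xlat[i] = dac[i & mask];
        }
        dirty = false;
        ++rebuilds;
    }
};
```

With this shape, a game that rewrites all 256 entries in a burst pays for one full remap at the next flush instead of one per write. The trade-off is that mid-scanline palette changes only become visible at the next flush point.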

Today's changes to the VGA emulation code have resulted in significant performance improvements in Fallout. The load via write_p3c9 has dropped to the level of -svn. 0.31% dosbox-x [.] write_p3c9

Chris' 3d Benchmark also saw a significant improvement in its score. Well done.

Originally posted by @grapeli in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4561848

I don't know if it's very important, but PCPBench benchmark initialization itself is very slow compared to dosbox-svn. To see this, you need to slow down dosbox very much.

systemd-run --user -G -p "CPUQuota=5%" ./dosbox-x -defaultdir . -set logfile=dosbox-x.log

On very powerful hardware CPUQuota should be even lower (1-3%).

PCPBench initialization time (before the real benchmark): 157 s with -x versus 4 s with -svn.

33.52% dosbox-x [.] VGA_TEXT_Xlat32_Draw_Line
11.23% dosbox-x [.] RENDER_StartLineHandler

6.09% dosbox [.] VGA_TEXT_Draw_Line
1.72% dosbox [.] RENDER_StartLineHandler

The time taken to initialize PCPBench is similar to that of -svn when dosbox-x is started with ttf output.

perf record -e cycles:pp --call-graph dwarf -p $(pidof dosbox-x)

Disassembly of section .text:

00000000005f4020 <VGA_TEXT_Xlat32_Draw_Line(unsigned long, unsigned long)>:
VGA_TEXT_Xlat32_Draw_Line(unsigned long, unsigned long):
static uint8_t* EGA_TEXT_Xlat8_Draw_Line(Bitu vidstart, Bitu line) {
return EGAVGA_TEXT_Combined_Draw_Line<MCH_EGA,uint8_t>(vidstart,line);
}

       │      *draw++ = vga.dac.xlat32[(font&0x100)? foreground:background];
  0.11 │ 600:┌─→test      $0x1,%ah
  1.75 │     │  mov       %rdi,%rdx
 15.54 │     │  cmovne    %r9,%rdx
  0.06 │     │  add       $0x4,%r15
       │     │font <<= 1;
  2.35 │     │  add       %rax,%rax
       │     │*draw++ = vga.dac.xlat32[(font&0x100)? foreground:background];
  4.19 │     │  mov       0x81a18(%r12,%rdx,4),%edx
 18.77 │     │  mov       %edx,-0x4(%r15)
       │     │for (Bitu n = 0; n < 9; n++) {
 15.82 │     ├──cmp       %r10,%r15
  0.02 │     └──jne       600
       │      for (Bitu n = 0; n < 8; n++) {

                5f44cb VGA_TEXT_Xlat32_Draw_Line+0x4ab (/tmp/dosbox-x/src/dosbox-x)
vga_draw.cpp:2243
                5f44cb VGA_TEXT_Xlat32_Draw_Line+0x4ab (/tmp/dosbox-x/src/dosbox-x)
vga_draw.cpp:2243
                5f1ce8 VGA_DrawSingleLine+0x68 (/tmp/dosbox-x/src/dosbox-x)
vga_draw.cpp:3383
                5f4639 VGA_TEXT_Xlat32_Draw_Line+0x619 (/tmp/dosbox-x/src/dosbox-x)
vga_draw.cpp:2456
                5f1ce8 VGA_DrawSingleLine+0x68 (/tmp/dosbox-x/src/dosbox-x)
vga_draw.cpp:3383

The perf profile was recorded half a year ago, so the relevant lines of the vga_draw.cpp source code may have changed.

Originally posted by @grapeli in https://github.com/joncampbell123/dosbox-x/discussions/3867#discussioncomment-4564780
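The hot loop in the annotation above expands one row of a 9-pixel-wide text cell into 32-bit pixels, one table lookup and one store per pixel. A standalone sketch of that pattern is below. `drawGlyphRow9` is a hypothetical helper modelled on the annotated source lines; the surrounding DOSBox-X function does considerably more.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Expand one glyph row into 9 output pixels.
// `font` holds 9 glyph bits with the leftmost pixel in bit 8.
// `foreground` and `background` are palette indexes resolved through xlat32.
inline uint32_t* drawGlyphRow9(uint32_t* draw, unsigned font,
                               const std::array<uint32_t, 256>& xlat32,
                               uint8_t foreground, uint8_t background) {
    for (unsigned n = 0; n < 9; ++n) {
        // Test the leftmost remaining bit, then emit the matching colour.
        *draw++ = xlat32[(font & 0x100u) ? foreground : background];
        font <<= 1;  // move the next glyph bit into position 8
    }
    return draw;  // points just past the last pixel written
}
```

Because every character cell on every scan line goes through a loop like this, its cost scales with the number of rendered text lines rather than with what the guest program is doing, which would be consistent with it dominating the profile when the host is heavily throttled.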

Torinde commented 1 year ago

@grapeli @joncampbell123 @rderooy @dodleh