jpd002 / Play-

Play! - PlayStation2 Emulator
http://purei.org

EE Speed thoughts #775

Open bigianb opened 4 years ago

bigianb commented 4 years ago

I've been thinking about why the Play! EE performance is so poor compared with PCSX2 (about 4 times slower in simple tests). It's annoying because they do very similar things, so one would reasonably expect similar performance. There are two things that PCSX2 does differently from Play!:

  1. It uses system protected pages to detect if a recompiled block has been modified and thus needs recompiling. This saves a check on each memory write ... but I'm not sure it would be a big win as we need to do a TLB check anyway.
  2. The recompiled block logic is different ... and this could be a win. As I understand it from the code, pcsx2 assigns a recompiled block to possibly every address (well, word address given that instructions are word aligned). This I think will result in longer recompiled blocks than Play which is a win. This is because the block will end the first time we hit a branch instruction whilst Play's will end whenever we get to a branch instruction or a branch target. Effectively pcsx2 allows overlapped recompiled blocks whilst Play always bisects them.
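To make the difference in point 2 concrete, here is a minimal sketch of the two block-ending policies. It is an illustration only: `Insn`, `blockEndOverlapping` and `blockEndBisected` are hypothetical names, not code from Play! or PCSX2, and MIPS branch delay slots are ignored for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical, simplified instruction: we only track whether it is a branch.
struct Insn {
    bool isBranch;
};

// "Overlapping" policy: a block may start at any index and runs up to and
// including the first branch at or after `start`.
// Returns the index one past the last instruction of the block.
std::size_t blockEndOverlapping(const std::vector<Insn>& code, std::size_t start) {
    assert(start <= code.size());
    for (std::size_t i = start; i < code.size(); ++i) {
        if (code[i].isBranch) return i + 1;
    }
    return code.size();
}

// "Bisecting" policy: the block additionally stops just before any known
// branch target that lies strictly after `start`.
std::size_t blockEndBisected(const std::vector<Insn>& code, std::size_t start,
                             const std::set<std::size_t>& branchTargets) {
    assert(start <= code.size());
    for (std::size_t i = start; i < code.size(); ++i) {
        if (i > start && branchTargets.count(i) != 0) return i;
        if (code[i].isBranch) return i + 1;
    }
    return code.size();
}
```

With six instructions, a branch at index 5 and a branch target at index 3, the overlapping policy yields the blocks [0, 6) and [3, 6), while the bisecting policy yields [0, 3) and [3, 6). The first block is twice as long under the overlapping policy, at the cost of compiling instructions 3 to 5 twice.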

Any other ideas? I think I need to write an elf which contains various EE typical code (no GS or anything) and then specifically see what FPS both Play and pcsx2 can achieve.

jpd002 commented 4 years ago

Hi!

We already do 1 and 2. Number 2 is recent; it was part of the whole block linking improvements. Right now, the emulator will create blocks that start from any address up until a branch.

Other theories/ideas:

bigianb commented 4 years ago

Ah very cool ... it’s been a couple of months since I looked. Did those changes make an appreciable difference?

jpd002 commented 4 years ago

It helped a bit, maybe in the 10~15% range.

How did you do your tests to compare performance against PCSX2? We could probably try to identify specific test cases that could help us focus on the right things.

bigianb commented 4 years ago

well ... not very scientifically. Just the games that run at 15fps on my laptop in play run at 60fps in pcsx2. What I need to do is write some specific scenarios in an elf and run that on both systems. Also run it across various Play! versions to quantify what effects some changes have.

rcaridade145 commented 4 years ago

@jpd002 @bigianb

The logic used by PCSX2 can have flaws

There was a discussion around the EE on PCSX2 a while back https://github.com/PCSX2/pcsx2/issues/1110 .

I'm more familiar with Dreamcast emulation, so the same logic may not apply. Considering its architecture, emulators tend to use something like this:

` // BIOS

private static final int HACK_BASE  =   0x8C000100;
private static final int HACK_ROMFONT=  0x000;
private static final int HACK_GDROM =   0x100;
private static final int HACK_SYSINFO=  0x200;
private static final int HACK_FLASHROM= 0x300;
private static final int HACK_UNKNOWN=  0x400;

private static final int SYSCALL_SYSINFO    =   Memory.getMemoryAddress(0x8C0000B0);
private static final int SYSCALL_ROMFONT    =   Memory.getMemoryAddress(0x8C0000B4);
private static final int SYSCALL_FLASHROM = Memory.getMemoryAddress(0x8C0000B8);
private static final int SYSCALL_GDROM  =   Memory.getMemoryAddress(0x8C0000BC);
private static final int SYSCALL_UNKNOWN=   Memory.getMemoryAddress(0x8C0000E0);

wvalor = 0x000B; /* RTS */
memViewWord.put(_word_index(getMemoryAddress(HACK_BASE + HACK_GDROM)), wvalor);
wvalor = (short)0xFFFF; /* BIOS_HACK */
memViewWord.put(_word_index(getMemoryAddress(HACK_BASE + HACK_GDROM + 2)), wvalor); `

(1) http://www.shared-ptr.com/sh_insns.html

unknownbrackets commented 4 years ago

One thing that helped improve the generated code quality in PPSSPP was an interface to compare blocks: given a start address, there's some UI that shows the basic block side by side, MIPS on the left, disassembled native code on the right. Hopping through random blocks, we found some that had obviously bad patterns this way.

It also helped to export the game's functions to a sampling profiler (see hrydgard/ppsspp#4692). Seeing both game and emulator functions together helps give a sense of what would really help. Might also help identify idle loops here.

-[Unknown]

bigianb commented 4 years ago

Say I wanted to write a PS2 elf that would perform a series of tests and then write out the time taken for each one (via console or saved to a memory card), how would I perform the timing? I don't care about running it on real hardware ... I'm thinking about something that could run on each build and chart the performance effects of changes given a constant target hardware. If I use a performance counter / timer I think I will get the virtualised clock (so it will always give me the same timings). What I want is to grab how much actual wall clock time has elapsed on the host. Is there a way to do this other than to add a special register only present in Play to return it?

jpd002 commented 4 years ago

Might be a long shot, but maybe reading the clock through CDVDMAN (CdReadClock) would work?

In Play! this is implemented as reading the host's clock, but I don't know about PCSX2 (if that's what we want to compare it against). Resolution is pretty bad though, so, it's probably not very good unless we do a huge number of iterations.

I like @unknownbrackets's ideas. @bigianb didn't you already do something to sample the time spent executing game functions? I vaguely remember you showing me a CSV with timing info a while back. Don't know if PCSX2 has something similar, but it might also be something interesting to use for comparison.

Zer0xFF commented 4 years ago

> I don't care about running it on real hardware

if you don't care about that, then wouldn't introducing a new call, with a better resolution be the best option?

bigianb commented 4 years ago

> > I don't care about running it on real hardware
>
> if you don't care about that, then wouldn't introducing a new call, with a better resolution be the best option?

That's what I meant by adding a new register to read. If something already exists, though, that would be better, because I don't really like adding special hacks if not necessary. Also, being able to run on pcsx2 as a comparison would be a bonus. I'll take a look at CdReadClock. As long as it has sub-second accuracy then it should be fine ... I'm happy with a test running for 10 or 20 seconds in any case to reduce noise from host machine loading. I'd be happy for it to run for minutes even, it's not like it's something that is run that often.

uyjulian commented 4 years ago

There might be other ways to do timing:

  1. Touch a file on the host filesystem and look at the creation date
  2. Use an external program to time when a string comes from stdout

bigianb commented 4 years ago

So the CD call is accurate to a second. That means running for 1 minute will have an intrinsic jitter of about 2%. If we're looking at gains of a couple of percent that means we need to be running for about 5 minutes per test to get statistically significant results. I think we need to add a bios call with millisecond accuracy ... it could just return the 32 bit epoch value.
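The arithmetic behind those estimates can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the error model counts nothing but clock quantization, and `elapsedMs` assumes the proposed call returns a millisecond timestamp truncated to 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Worst-case relative error from quantization alone, when a run lasting
// `durationSec` seconds is timed with a clock that ticks every
// `resolutionSec` seconds.
double relativeError(double resolutionSec, double durationSec) {
    assert(durationSec > 0.0);
    return resolutionSec / durationSec;
}

// Shortest run that keeps the quantization error at or below `maxError`.
double requiredDurationSec(double resolutionSec, double maxError) {
    assert(maxError > 0.0);
    return resolutionSec / maxError;
}

// Elapsed time between two 32-bit millisecond timestamps. Unsigned
// subtraction wraps modulo 2^32, so the result stays correct across a single
// counter wraparound (which happens roughly every 49.7 days).
std::uint32_t elapsedMs(std::uint32_t startMs, std::uint32_t endMs) {
    return endMs - startMs;
}
```

With a 1 second clock, a 60 second run has an error of about 1.7% and a 300 second run about 0.33%. With a 1 millisecond clock, a 10 second run is already down to 0.01%, which is why a finer-grained call makes short tests practical.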

unknownbrackets commented 4 years ago

Another thing we did was a microbenchmark tool that compiled a sequence of instructions, unrolled it ~100 times, and then ran it under both the interpreter and the jit. It then output the speed of interp vs jit.

You can see it here (you have to modify the code to test a different sequence): https://github.com/hrydgard/ppsspp/blob/0b4f60272cd9c106032b5b881650a7e5b6581053/unittest/JitHarness.cpp#L108

We used this especially to test tricky register caching or complex instructions, and also to verify performance across devices. For example, we could make sure an implementation of vrot (which calculates sin/cos of a register value and calls back into C) is actually faster than the interpreter, or that an implementation of vmmul (which does 16 dot products) is actually faster unrolled in a tight loop. But also that sequences of instructions (such as many lw/ld from a base register, each with slight offsets) are properly fast.

It allowed us to quickly test different native code implementations against known blocks or sequences.
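The general shape of such a harness could look like the sketch below. It is not the PPSSPP code linked above: `compareBackends`, `BenchResult` and the two callables are hypothetical stand-ins for "run this unrolled sequence under the interpreter" and "run it under the jit".

```cpp
#include <cassert>
#include <chrono>
#include <functional>

struct BenchResult {
    double interpNs;  // total wall-clock time of the interpreter runs
    double jitNs;     // total wall-clock time of the jit runs
    // Greater than 1.0 means the jit callable was faster. Only meaningful
    // when jitNs is non-zero, so use enough iterations.
    double speedup() const { return interpNs / jitNs; }
};

// Times `iterations` back-to-back calls of `fn` with a monotonic host clock.
static double timeNs(const std::function<void()>& fn, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) fn();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(end - start).count();
}

// Runs the same workload through both backends and reports both timings.
BenchResult compareBackends(const std::function<void()>& runInterp,
                            const std::function<void()>& runJit,
                            int iterations) {
    assert(iterations > 0);
    return BenchResult{timeNs(runInterp, iterations), timeNs(runJit, iterations)};
}
```

In practice each callable would reset the emulated CPU state and execute the same instruction sequence, so that the only variable is the execution backend.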

-[Unknown]

jpd002 commented 4 years ago

I've been playing around and here's examples of x86 code output for very simple EE code snippets: https://gist.github.com/jpd002/497e67bc57d43e75005b34ccb7066a32

Prolog/Epilog are taking a huge amount of instructions compared to the rest. There's certainly ways to improve.

unknownbrackets commented 4 years ago

From a brief look, it seems like:

-[Unknown]

bigianb commented 3 years ago

I've been writing some micro-tests to see where pcsx2 is quicker than Play! I haven't actually found any yet though - Play! seems to handily beat PCSX2 at everything so far. Any thoughts on where to look next? I'm thinking VU. Results are here: https://github.com/bigianb/ps2-speedtests/blob/main/results.md Code is all there too - all very simple stuff compiled with the ps2dev ps2sdk.

jpd002 commented 2 years ago

I've been researching this topic a bit deeper these past days and it seems the main thing that makes lots (like 75% from my tests) of games run slowly is that we don't have proper idle loop skipping. Lots of games report 100% EE usage when they only need a fraction of the time to do what they need to do.

Underclocking is an idea, but I don't like it so much, because it's not optimal and it can easily break things. I see it more like a tweak for users who want to test very specific scenarios, not a solution for a problem that touches a lot of games.

VU cycle stealing (from PCSX2) is another interesting idea that I would like to implement, but it's specific to 3D games. Lots of 2D games suffer from this problem.

I think better idle loop detection is the way to go, but it's tricky to get right. I'm open to ideas 😄.

There's also another problem I've finally been able to pinpoint which causes performance issues in some games that don't run at 60fps when the frame limiter is active. Before we had the frame limiter, the emulator was syncing using vsync and I kinda thought some of the slowdowns I was seeing were due to the GPU taking time to complete rendering the frame.

But that turned out to be wrong. The explanation is that if some frames take less than 16ms to execute and others take more, sleep time is added to the short frames, and that time is wasted when it could have gone to the frames that need more. An example of this is FFX, which runs at 30fps: one frame out of 2 does a lot of work and the other one does nothing. Disabling the frame limiter will make the game run at its full potential, but if it's on, a sleep will be added every 2 frames, slowing down the emulation.

Games that run at 30fps suffer from this, others not so much, but it's subtle. Implementing a better sync method would probably be the solution (something like https://redream.io/posts/improving-audio-video-synchronization-multi-sync, thanks @literalmente-game for the link).
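The limiter effect described above can be reproduced with a small simulation. This is a sketch under assumptions: the function names are hypothetical, frame times are made-up numbers in microseconds, and the deadline-based variant is just one simple alternative for comparison, not the multi-sync approach from the linked post nor anything Play! implements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Per-frame limiter: each frame is padded up to the budget on its own, so
// time slept after a short frame is never given back to a long one.
std::int64_t totalPerFrameLimiter(const std::vector<std::int64_t>& frameUs,
                                  std::int64_t budgetUs) {
    assert(budgetUs > 0);
    std::int64_t now = 0;
    for (std::int64_t t : frameUs) now += std::max(t, budgetUs);
    return now;
}

// Deadline-based limiter (assumed alternative): deadlines advance by a fixed
// budget per frame, and we only sleep when we are ahead of the schedule.
std::int64_t totalDeadlineLimiter(const std::vector<std::int64_t>& frameUs,
                                  std::int64_t budgetUs) {
    assert(budgetUs > 0);
    std::int64_t now = 0;
    std::int64_t deadline = 0;
    for (std::int64_t t : frameUs) {
        now += t;                            // emulate the frame
        deadline += budgetUs;                // when this frame should be done
        if (now < deadline) now = deadline;  // sleep only if ahead of schedule
    }
    return now;
}
```

For eight frames alternating between 2ms and 30ms of work with a 16ms budget, the per-frame limiter takes 184ms in total, while the deadline-based one takes 142ms. The second figure is the ideal 128ms plus a fixed 14ms lag that does not grow over time.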

uyjulian commented 2 years ago

Some examples of idle loop checking that could be checked:

Interrupt flag checking: https://github.com/ps2dev/ps2sdk/blob/859b68b59fe6b7beb58665cfb3230d34f7ddb118/ee/mpeg/src/libmpeg_core_c.c#L89

IPU flag register checking: https://github.com/ps2dev/ps2sdk/blob/859b68b59fe6b7beb58665cfb3230d34f7ddb118/ee/mpeg/src/libmpeg_core_c.c#L98

RPC status checking: https://github.com/ps2dev/ps2sdk/blob/859b68b59fe6b7beb58665cfb3230d34f7ddb118/ee/rpc/cdvd/src/scmd.c#L938
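For the simplest of these cases (polling a flag or register inline), one possible detection heuristic is sketched below. It is an assumption for illustration, not how Play! detects idle loops: `Op`, `Kind` and `looksLikeIdleLoop` are hypothetical, delay slots are ignored, and loops that poll through a function call (like the RPC status check) would not be caught by it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, heavily simplified instruction classification.
enum class Kind { Load, Alu, Store, Call, Branch };

struct Op {
    Kind kind;
    std::uint32_t target;  // only meaningful for Kind::Branch
};

// A block is treated as an idle-loop candidate when it ends with a branch
// back to its own start and everything before that branch only reads memory
// or computes on registers (no stores, no calls).
bool looksLikeIdleLoop(const std::vector<Op>& block, std::uint32_t blockStart) {
    assert(!block.empty());  // a compiled block holds at least one instruction
    const Op& last = block.back();
    if (last.kind != Kind::Branch || last.target != blockStart) return false;
    for (std::size_t i = 0; i + 1 < block.size(); ++i) {
        if (block[i].kind != Kind::Load && block[i].kind != Kind::Alu) return false;
    }
    return true;
}
```

A block that matches cannot change its own exit condition, so only an external event (an interrupt, DMA completion, another thread) can end the loop, which is what makes skipping ahead safe in principle.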

unknownbrackets commented 2 years ago

Do games commonly use a function to wait, or is it often just inline in the code like that / a pattern to be detected? It's definitely easier to isolate if it's a dedicated function.

Though, it might be good to make sure that CPU cycles are consuming appropriate amounts of time - 75% seems like a lot for idle loops. Maybe something about the memory read isn't moving forward time sufficiently? In PPSSPP, we've found correcting HLE cycle counts (based on tests) to really help performance in some areas.

If it has to be an inline pattern, I'd say you'd want a dedicated analysis class for it, that looks for patterns. If it can identify a loop, you might be able to "insert" a fake "idle_until_x" instruction inside the loop during compilation. So you'd basically be rewriting:

bool done = false;
while (!done) {
    done = SifCheckStatRpc();
}

To:

bool done = false;
while (!done) {
    done = SifCheckStatRpc();
    wait_until_sif_rpc_stat();
}

To turn the spin loop into something like a condition variable. From there, it'd depend on how cooperative PS2 threads are (I forget how much they differ from PSP threads.) If other threads can't wake during the spin, that's super easy, just need (per type) to determine when it'd wake and eat that many cycles.

If threads can wake (I assume at least IRQs, so it's probably this?), presumably only better-priority threads, so would need to get both the number of cycles until next better thread OR number of cycles until the event, and eat the lower of the two.
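The "eat the lower of the two" rule can be sketched like this. The names `skipIdle` and `SkipResult` are hypothetical, and the sketch assumes the scheduler can already report, in absolute cycle counts, when the awaited event fires and when the next better-priority thread (or IRQ) would wake.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct SkipResult {
    std::uint64_t newCycle;  // cycle counter after skipping
    bool eventReached;       // true if we skipped all the way to the event
};

// Advances the cycle counter to whichever comes first: the event the spin
// loop is waiting for, or the next point where another thread must run.
SkipResult skipIdle(std::uint64_t nowCycle, std::uint64_t eventCycle,
                    std::uint64_t nextWakeCycle) {
    assert(eventCycle >= nowCycle && nextWakeCycle >= nowCycle);
    const std::uint64_t target = std::min(eventCycle, nextWakeCycle);
    return SkipResult{target, target == eventCycle};
}
```

When `eventReached` is false, the emulator would run the woken thread and re-enter the wait later. In the fully cooperative case, `nextWakeCycle` can simply be set equal to `eventCycle`.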

As for the frame limiting: personally I like what we did in PPSSPP, which is simple. We set a flag if any framebuffer is rendered to. When it comes time to flip (during a vblank), we check if that flag was set for the displayed framebuffer. If it's not set, then we keep powering along and don't bother with a host flip. This works generally well on Android, for example. But we had to add an option to force flip for some people, some video cards were giving "microstutters."

Of course, PPSSPP also does all graphics API/driver interaction on a dedicated thread, which has been great for performance. Decoupling emulation from graphics API overheads should definitely be good. But beware you may want to make it configurable as it can add input lag.

-[Unknown]