Dynamic recompiler [$895] [BOUNTY]

inactive123 commented 7 years ago

See bounty link here -

https://www.bountysource.com/issues/48048939-dynamic-recompiler

Conditions:

A dynarec system for Beetle PSX, preferably written in C or else C++98. Portability to the various targets libretro supports is important in considering the programming language being used here. No C++11 or C++14 please,
A working backend for x86 32bit, x64 (x86 64-bit), ARMv7 (32bit). aarch64 (ARM 64bit) is optional and could maybe be made a separate bounty.
Should be engineered in such a way that new backend implementations for other architectures (like ARM and PowerPC) can be easily implemented. There is a separate bounty for a PowerPC dynarec, but first this one needs to be successfully completed before the PowerPC dynarec is feasible.
The dynarec needs to be PIC-compliant. This is essential since libretro cores predominantly are meant to run as dynamic libraries.
Should be significantly faster than the interpreter CPU core, and should lower Mednafen/Beetle PSX’s CPU system requirements considerably.

Some things to look at if you might need some inspiration -

https://github.com/daeken/beetle-psx-libretro

Daeken a long time ago started work on a dynarec for Beetle PSX. It technically works, but is too slow to be usable and probably still contains many bugs. We also in hindsight do not want to rely on generic JIT engines like libjit and LLVM, believing them to be too slow for the job at hand, and that we'd rather have just a custom-written dynarec. So look at it for inspiration but don't try to mimick what was done here or make the same design decisions like opting for libjit or other generic JIT engines.

Also - once this bounty has been completed - it would be much easier to make a Wii U dynarec. A separate bounty for this exists here -

https://github.com/libretro/RetroArch/issues/4852

https://www.bountysource.com/issues/44566014-bounty-to-add-dynarec-core-for-wii-u-port-of-beetle-mednafen-psx

rtissera commented 7 years ago

Is using PPSSPP code still considered the best approach as discussed here ? https://github.com/libretro/beetle-psx-libretro/issues/182

Or starting something from scratch ? Written a custom dynamic is quite a beast, even if you don't rely on libjit / LLVM, I think using a self-contained IR + JIT existing engine as an efficient approach. sljit looks like some small enough, self-contained, C based, BSD-licensed, good target for this task.

inactive123 commented 7 years ago

The PPSSPP proposal was just an idea. In retrospect it might not be the best idea, and we might prefer something from scratch. I think the issue with just repurposing PPSSPP's code is the C++11 codebase and the fact that it might just be overkill for such a basic CPU as the PSX's R3000A compared to PSP.

The only real requirements I have are described above; it needs to be either C or C++98, it should be possible to easily add new backends later on (for instance, the aforementioned Wii U PowerPC backend, which is a separate bounty that anybody can fetch after this one is done) and it has to be a significant performance improvement over the interpreter. The high system requirements of Mednafen/Beetle PSX so far is still what limits the reach of this emulator and userbase compared to something like ePSXe and PCSX-R, and it also prevents the Vulkan renderer from really shining on Android since the current interpreter core is too slow even on a Shield ATV.

I am fine with somebody using a generic JIT engine provided it is fast, portable and it doesn't make us run into packaging issues. For instance, TinyTiger made an LLVM RSP dynarec some time ago for Parallel N64, but it has been exceedingly difficult to package libllvm into the core, not to mention that LLVM breaks their own APIs every once in a while, so I look back on that as kind of a code experiment failure that I'd wish to avoid in the future.

If sljit is a way better offering over LLVM and libjit for this task (both approaches which I think have kinda not really worked based on past experiences), then I'm OK with a bounty hunter giving it a shot.

hrydgard commented 7 years ago

Oh just go right ahead and grab PPSSPP's, as long as you're at least GPL 2.0+. It's not overkill, in fact it's fairly lean for what it is. It's C++11 but could very easily be backported to C++98 if desired, not much needs to be changed. Just delete the FPU and VFPU parts, delete a few more instruction implementations that will no longer be necessary, and adapt its interfaces to your code base.

inactive123 commented 7 years ago

Well, I still think a custom approach might be best suited for this particular project. C++11 is certainly out of the question and the conditions state it should be either C or C++98. C++98 is already a compromise on my end since we would rather prefer C instead.

Also, very important - the dynarec needs to be PIC-compliant. I dont know if PPSSPP finally supports that but it didnt last time we did a libretro port of it, and that is a huge issue since libretro cores practically demand that. ToadKing had to make various patches for that which have long since gone out of date and which didnt cover absolutely everything.

hrydgard commented 7 years ago

@twinaphex It's now optionally PIC-compliant enough for most purposes on x86-64. Sticking to C is an ideological choice, I'm not gonna debate that except noting that it would also be technically very easy to port to C if desired, though a lot of mechanical text-editing work for little benefit, given that it's super easy to link C++ with C.

inactive123 commented 7 years ago

Well, if the PPSSPP dynarec can be made to be fully PIC compliant for all architectures, and if a C++98 port is not too much work, it could be considered an option and a bounty hunter could go for it. But full PIC compliance for all architectures here is an absolute requirement, it shouldnt be left at only x64.

hrydgard commented 7 years ago

Depends on the definition - if by PIC-compliant you mean that it doesn't need tricks to put various blocks of memory close to each other, which is the usual problem in emulators, then yeah, it was already complaint on the other architectures even before the recent changes to x86-64.

inactive123 commented 7 years ago

This is the problem at hand -

http://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries/

Libretro cores typically are shared libraries that get loaded into a frontend by way of a dynamic linker. The code needs to be completely position independent for this purpose. Dynarecs are often a huge problem from this perspective if this hasnt been specifically catered to. I am thinking of the Ari64 dynarecs in Mupen64plus and Yabause, and also PCSX2's dynarec. Dolphin up until a few months ago also was not PIC-compliant.

PPSSPP back when we ported it also did not work reliably in libretro even after some of ToadKings PIC patches and there were still many writerest errors that could be spammed at random times with a messagebox. This was years ago however. If its fixed for x86 x64 with a specific optional define, that is a good first step. However, we need to guarantee that it will work reliably on other architectures as well.

hrydgard commented 7 years ago

Ah, right. Yes PPSSPP's dynarecs should no longer have any trouble with any of that.

inactive123 commented 7 years ago

That's good to hear.

inactive123 commented 7 years ago

If so, sure, if one feels so inclined, they can give it a shot. I will leave this decision up to whomever wants to take a stab at the bounty as to which approach they want to go for.

zherczeg commented 7 years ago

(if you have questions about sljit, let me know)

BinaryFissionGames commented 7 years ago

Alright, I've been looking into this for the past few weeks and I'm at a point where I feel like I understand this enough to be able to do this.

I'm going to try to adapt PPSSPP's, because it's just so close to being what we need.

ajshell1 commented 6 years ago

Good luck you glorious bastard. Even if you fail, you'll be remembered.

meepingsnesroms commented 6 years ago

@Denu8thell I looked at your fork and it seems to be going very well (already compiling for windows x86_32), is there any chance that armv7 android will be working in the next week?

By working I mean run-able not perfect.

BinaryFissionGames commented 6 years ago

@meepingsnesroms Only the changes that needed to be made to memory stuff compile for Win32 right now, not the JIT engine itself. It's REALLY time consuming and tedious work, because I have to change all the logging, the way memory works, the way timing works, and past that a lot of the various bits specific to PSX aren't implemented (For instance, the CP0 & GTE commands, which are just interpreted at this point).

Past that, I'm trying to get it to work for x86 first, because that's the machine I have access to right now.

Plus, school is starting for me Tuesday, so I'm gonna have less time to look at this.

I think a week is VERY ambitious, but I'm trying to go as fast as I can.

wiired24 commented 6 years ago

Best of luck @Denu8thell We're all rooting for ya

BinaryFissionGames commented 6 years ago

Hey everyone, It's been a while since my last update, so I just wanted to post what's going on:

Right now, I think I'm pretty close to getting it to work for x86 machines. I'm having some trouble with the way the memory model is working (The compiler is trying to compile instructions that are out of bounds, I'm pretty sure.). I hope to get it at least running before the end of this week, but we'll see.

rz5 commented 6 years ago

@Denu8thell, your latest commit mentions that your work can be compiled on win64. Is that your preferred OS for development?

If so, and if you are dependent on e.g. the MSVC project files, do say so, maybe some libretro guys can update those files to ease your task.

inactive123 commented 6 years ago

Thanks for the feedback @Denu8thell. Looking quite forward to seeing this running. You can visit us at #retroarch and either me (Twinaphex) or r5 would be happy to have a dialogue with you and perhaps lend a helping hand where necessary.

inactive123 commented 6 years ago

@Denu8thell I have been testing the branch on your fork.

So far, these are my findings:

1 - I get around 800 to 900fps on this device (x86_64 Windows). I compiled it with mingw and set HAVE_HW=1 and HAVE_JIT=1. 2 - However, I get no BIOS splash screen, and it doesn't seem to resume ingame. When I enable logging i do see it logging the timestamps though.

Have you been able to get this to work with any game so far? If so, I'd like to reproduce it. I'm also trying to see if I can get other people like @simias to maybe chime in and help you along with the timing if possible.

rz5 commented 6 years ago

@Denu8thell - Just in case you haven't seen it yet: https://pcsxr.codeplex.com/SourceControl/latest#pcsxr/libpcsxcore/ix86_64/iR3000A-64.c

pcsxr also has a dynarec, I believe that's one of the files related to it, might be good reference material.

meepingsnesroms commented 6 years ago

It is good as a reference but should not be copied, the pcsx-rearmed dynarec has issues with 3d games and that is the main reason getting a beetle psx dynarec is important.

I know this is not the same dynarec since it is for x86-64 but it may have that issue as well.

diego-rbb-93 commented 6 years ago

Any news on this?

inactive123 commented 6 years ago

The bounty has been increased to $380.

Nevertheless, I think it's fair to conclude that development on a proposed solution has grinded to a halt? Maybe if the current approach is not working out, try another one instead of trying to get PPSSPP's dynarecs to work? Or maybe we could reduce the scope of this to make it more manageable?

piratesephiroth commented 6 years ago

I thought wiisxr's PPC source would finally get this going

simias commented 6 years ago

What follows are a few musing that may or may not be of interest for people working on PS1 dynarecs:

I've looked a bit into dynarecs lately (out of curiosity mostly) and I'm trying to figure out what would be the best way to map the PS1 MIPS CPU onto a modern PC. A potential issue is that the MIPS CPU has a ton of registers (32 in total, although R0 is hardwired to 0) which is significantly more than a PC. Admittedly amd64 CPUs have had the good idea of adding a bunch of new registers but we still have only 16 GP registers to play with. Technically they're 64bits so you could cram all 32bit MIPS registers in there but I'm not sure if that's a good idea. And it definitely won't be possible on ARM or 32bit x86 CPUs.

I've added counters in mednafen's source code to get some stats for CPU usage, here's what it looks like after playing for a minute or so:

Final Fantasy VIII
Medievil
Rollcage
Vagrant Story

Register usage seems relatively similar across these 3 games, the only obvious difference is RA being used a lot more in FFVIII than the others (and SP less).

R0, V0, and SP stand out clearly, although we can ignore R0 since it's not a "real" register but it means that it's probably worthwhile to allocate a real host register to V0, V1, A0, A1 and SP (and more if the host has enough registers left). Maybe RA as well if FFVIII is not a rare exception.

simias commented 6 years ago

So I've been looking into this the past few days and I've come to the conclusion that using a non-emulator targeted IR optimizer (something like LLVM or the like) might not be as practical or beneficial as I had hoped. The main problem is that if you don't want to risk miss-translating you have to be super paranoid with the code you're recompiling. In particular it's very difficult to know what could be a jump target and therefore shouldn't be "factored" with the rest.

Consider this rather common PSX machine code to load a 32bit literal value into a register (these are the very first two instructions executed by the BIOS on reset):

lui     t0,0x13
ori     t0,t0,0x243f

This puts 0x13243f into register t0. Every time a program needs to load a literal bigger than 16bits into a register you'll encounter code like this one. If you look at this and think about recompiling it for x86 it might be very tempting to factor it into a single instruction:

mov    $0x13243f,%some_reg

Pretty straightforward. Except that if later on you realize that there's a branch targeting the ORI then this optimization produces invalid code. If you can't be certain that there's no branch aiming at this ORI you simply can't do this very basic optimization. Same issue with any kind of operation reordering, merging, register renaming etc...

I also think that if you want a dynarec that perfoms well these might not be the optimizations you should prioritize. Consider for instance a basic SW instruction (this is instruction number 4 in the BIOS):

sw      t0,0x1010(at)

I've spent a while attempting to recompile this. I'm not done yet (I don't handle timings or exceptions) but here's what my generated assembly looks like:

; Load address + offset into %ecx
67 41 8d 88 10 10 00    lea    0x1010(%r8d),%ecx
00 
89 c8                   mov    %ecx,%eax

; Check for 4-byte alignment 
83 e0 03                and    $0x3,%eax
74 01                   je     aligned
cc                      int3   ; TODO: handle exception

aligned:
; Mask region bits
89 c8                   mov    %ecx,%eax
c1 e8 1d                shr    $0x1d,%eax
23 4c 85 08             and    0x8(%rbp,%rax,4),%ecx

; Test if the address points at RAM (mirrored 4 times)
89 c8                   mov    %ecx,%eax
81 f9 00 00 80 00       cmp    $0x800000,%ecx
73 23                   jae    not_ram

; We're in RAM, mask the address if we're in a mirror
81 e1 ff ff 1f 00       and    $0x1fffff,%ecx

; Invalidate dynarec page at address
c1 e8 0b                shr    $0xb,%eax
48 69 c0 18 08 00 00    imul   $0x818,%rax,%rax
c7 84 05 c0 00 00 00    movl   $0x0,0xc0(%rbp,%rax,1)
00 00 00 00

; Compute address in RAM buffer and store the value
03 4d 28                add    0x28(%rbp),%ecx
44 89 29                mov    %r13d,(%rcx)
eb 2e                   jmp    done

not_ram:
; We're not in RAM, see if we're in the scratchpad
2d 00 00 80 1f          sub    $0x1f800000,%eax
3d 00 04 00 00          cmp    $0x400,%eax
73 08                   jae    other

; Yes, compute address in scratchpad buffer and store the value
03 45 30                add    0x30(%rbp),%eax
44 89 28                mov    %r13d,(%rax)
eb 1a                   jmp    done

other:
; We're neither in RAM nor scratchpad, call the emulator code
44 89 ef                mov    %r13d,%edi
89 ce                   mov    %ecx,%esi
41 50                   push   %r8
41 51                   push   %r9
41 52                   push   %r10
41 53                   push   %r11
e8 00 00 00 00          callq  0x7f
41 5b                   pop    %r11
41 5a                   pop    %r10
41 59                   pop    %r9
41 58                   pop    %r8
done:

This will probably grow by a few instructions by the time it handles all the cases. That's 37 instructions, and that's for every single memory access (admittedly memory loads would be a couple of instructions shorter since they can't cause invalidation). Suddenly the need to merge the LUI/ORI opcodes above doesn't seem to matter nearly as much.

What this code does is figure out where the SW is aiming. First we need to mask the address to remove the "region bits", then:

If it's in the scratchpad we can directly compute the offset and move the value there.
If it's in RAM we can do the same thing but if we want to do things right we also need to make sure to invalidate any dynarec'd code at this address since it's outdated.
If it's elsewhere (device registers etc...) then we need to jump to the emulator's code to handle that.

That's a huge overhead, and you'll have that all over the place.

So my point is that unlike what I assumed getting into this, recompiling machine code is significantly different from JIT'ing some scripting language's bytecode. Byte-code is higher-level, it's meant to be interpreted/compiled. Machine code not so much.

simias commented 6 years ago

A little note that might be helpful to those considering porting a dynarec to mednafen: you need to be sure to handle the "cache isolation mode" even if you don't support the instruction cache: https://github.com/libretro/beetle-psx-libretro/blob/ad23401ab68d1594b941572bd4e89528bb12a6fa/mednafen/psx/cpu.cpp#L329

In this mode writes to memory don't actually make it through and are instead doing cache maintenance. If you don't implement caching you can of course ignore that but you have to make sure that the writes don't make it through to the memory otherwise you'll overwrite a bunch of stuff. In particular when the BIOS invalidates the cache it'll write a bunch of zeroes (IIRC) starting at address 0 and up to 1024 or 4096 (I'm not so sure anymore). If these go to RAM they'll overwrite critical bits of the BIOS, including the function tables so nothing will work.

A simple and cheap way of handling this would be to swap the RAM buffer for a "dummy" one in the dynarec when this bit is set, then swap back the real RAM buffer when it's clear.

inactive123 commented 6 years ago

Thanks a lot to echoscreen who chipped in an additional $100, bringing the total up to $480 now.

iOS4all commented 6 years ago

Please support Nintendo switch as well Thanks

inactive123 commented 6 years ago

@iOS4all That falls outside of the scope of this particular bounty. So please don't request it again.

If you want this, it's a separate bounty. Aarch64 support is not included as part of this bounty, it's already an awful lot that has to be covered in this specific bounty. I don't want to emburden @simias with any more.

Ryunam commented 6 years ago

I just contributed with $20 more, bringing the total amount to $500. Hoping for some interesting developments in the near future!

phire commented 6 years ago

Pretty straightforward. Except that if later on you realize that there's a branch targeting the ORI then this optimization produces invalid code. If you can't be certain that there's no branch aiming at this ORI you simply can't do this very basic optimization.

Transforming code in such a way that you can jump into it at any emulated instruction boundary is a fool's errand.
You wipe out almost every single optimisation opportunity and end up with something that is a little better than pasting interpreter instruction implementations together.

Instead, recompile blocks of instructions, starting at an entry point, until an exit point (probably a branch instruction). Apply as many optimisations to that block as you want, reorder the instructions, merge the instructions, do constant propagation, register mapping...

If you ever encounter the case where a branch instruction branches into halfway through your compiled and optimised block, then generate a new block starting at that halfway point, continuing up to that same endpoint.

Overlapping blocks are the key to good dynamic recompilation performance.

simias commented 6 years ago

@phire interesting, I dismissed overlapping blocks for the moment because they complicated the design (how do you deal with invalidation efficiently with overlapping blocks?) but I'll certainly be able to change that later if indeed it turns out that performance is not good enough. An other possibility would be to still have non-overlapping blocks but if you realize later on that you're jumping into the middle of a folded instruction (or something else that doesn't make the code jump-safe such as constant propagation) then you blacklist the optimization and recompile the block without it. That seemed like a more practical approach to me but it's not like I'm an expert.

At any rate I'm hopping that the benefits in terms of code threading, branch prediction and instruction decoding will be enough to make a significant performance difference as it is without these additional optimizations. At least it will give us a base we could improve iteratively later on even if performance is sub-par.

My current concern is less about making a fast dynarec and more about having a dynarec that works. Once all the interfaces with the rest of the emulator are working we'll see if we have to rewrite all the guts of the recompiler. In particular it'll make the problem more approachable by other contributors since it doesn't require as much in-depth knowledge of how the PSX works and how mednafen emulates it.

hrydgard commented 6 years ago

Both Dolphin and PPSSPP use overlapping blocks in their JITs, it's fine. Not having to care about handlign jumping into the middle of a block simplifies things a lot. Invalidation is also not an issue, why would it be?

simias commented 6 years ago

Suppose the code write at address 0xf000ba0 in RAM, if you want to be strict you have to take care to invalidate any recompiled blocks which contain this address. With non-overlapping "linear" blocks it's trivial, I just clear a bit in a table using the shifted address as offset. With overlapping blocks even with an optimized structure to look up the relevant blocks the overhead seems large if you do it for every RAM write.

The only other way to handle that I can think about would be not to invalidate on RAM write and instead try to be a little more clever and assume for instance that anything that actually loads code will then proceed to flush the instruction cache which could be trivially detected in the emulator and trigger a flush of the recompiled code. That being said it won't catch all cases that way. I considered having an option to do that as a "speed hack" but it seemed too rash to be the default.

As you can tell so far I've been very defensive and avoided any optimization that could potentially cause accuracy issues with the major exception of the PSX instruction cache itself which would be a pain to emulate accurately.

simias commented 6 years ago

I guess an other way would be to have a linear array which keeps track of which addresses have been recompiled as a first check before actually jumping to the expensive invalidation function since in a normal use case I expect that the overwhelming majority of writes do not actually modify code.

hrydgard commented 6 years ago

When emulating reasonably modern-ish systems you can usually mostly count on games to issue cache-invalidation instructions, Dolphin takes that approach. PPSSPP does the same but also overwrites the first instruction of each block with a special instruction containing the block number, which allows a few tricks - this is used to look up blocks, and if games overwrite with an instruction, the block is effectively no longer reachable. This is not exactly a very purist approach though :) And can have problems invalidating properly when you start linking up blocks.

meepingsnesroms commented 6 years ago

There is some api on linux that is effectively "trigger interrupt on write to address range" which could be used, I forgot the function name though.

hrydgard commented 6 years ago

Well you can always mprotect and install an exception handler, but it's only on a page level, which isn't always granular enough.

meepingsnesroms commented 6 years ago

The ps1 only has 2mb of program ram so any extra wasted memory to force a page boundary should be fine, usually there 4KB right?

simias commented 6 years ago

@hrydgard Yeah, that seems reasonable. I think the reason I'm sheepish about it is that in many aspects PSX is kind of the threshold between "modern" and "retro" architecture-wise, if that makes sense. It wouldn't surprise me at all if some PSX games out there played dirty with self-modifying code, or simply forgot to flush the I-cache in some conditions (it only contains 1024 instructions after all, a single long-ish function is enough to effectively "flush" the cache).

@meepingsnesroms I'm not sure it would help for invalidation since with non-continuous blocks you'd still have to look up the addresses of the block (in emulator virtual memory) and then tell the kernel to trigger a fault on access. It doesn't sound super fast to me.

The ps1 only has 2mb of program ram so any extra wasted memory to force a page boundary should be fine, usually there 4K right?

I currently mmap a single huge buffer on startup to contain the entire recompiled code on start (close to 1GB currently I think) so I certainly hope so! I actually think I'll have to rework that for 32bit architectures, I think I might be a bit greedy VIRT-wise.

senquack commented 6 years ago

@simias The signal-handler + mprotect method suggested by @hrydgard is a very effective way to improve performance of recompiled PS1 code. I know because I've recently implemented it in my work on a PS1->MIPS32 dynarec. Without it, you are forced to either emit slow function calls for loads/stores, or emit address range-checks and emit both inline and C calls. This bloats your code tremendously and all the emitted branches add additional BTB pressure.

The trick is having a 4KB host page size. You virtually map PS1 address space. We use Posix or tmpfs shared mem fd to mirror PS1 RAM areas 4X to the start. Then, only Expansion-ROM, Scratchpad, and ROM are mapped to the appropriate offsets. On PS1 hardware, Scratchpad starts at 0x1f80_0000, and HW I/O ports conveniently begin exactly 4KB past that at 0x1f80_1000. You leave the HW I/O port region and the cache control port at 0xfffe_0130 unmapped, and rely on your SIGSEGV signal handler to catch accesses to them. The sig handler will look at the offending load/store instruction in your emitted code, and emulate it, typically by calling the appropriate HW I/O port C function. If it was a load, it will patch up the correct host reg in the mcontext_t struct in the ucontext. Before returning, it adjusts the mcontext_t's PC to skip the offending instruction, and flags the block as accessing HW I/O ports. When it is next recompiled, it will emit more expensive address range checks or call functions for loads/stores. It won't trigger the sig handler anymore, which is good because that is quite slow.

By default, loads/stores in emitted code become much more streamlined. Only 3-4% of blocks total ended up being flagged in our dynarec. Total code size was reduced an average of 23%.

You can achieve even more by using the SIGSEGV handler to catch code modification. That involves more work, which I'd like to describe later or just show you some commits soon rather than blather on. This is where mprotect comes in.

Regarding Icache issues: I have quite a bit of experience on this issue now, and can confirm that, indeed, there are games that "forget" to do an Icache flush before executing freshly-loaded code. In fact, it is for this reason that being overly-vigilant in your detection of code modification can backfire instead of achieving anything useful at all. I'd like to speak more about this later, I have taken too much time for right now though.

hrydgard commented 6 years ago

@senquack here's your original comment that got deleted: https://gist.github.com/hrydgard/628fcf193ce7fd72b44b835b733ae119 :)

senquack commented 6 years ago

@hrydgard, thanks you rock!

@simias as a final suggestion: Don't make your blocks super-long, don't do crazy inter-linking of blocks, just make them simple and overlapped. We don't even yet support branching from inside a block to a target in the same block.. a new block is always started. Still, the performance is adequate and the resulting simple code invalidation methods have served us very well. Basically just invalidating only a block starting at a given address targeted by a store in code. There's a lot more to it, but again I took too much time/space here already.

BTW, your idea of backing up a portion of PS1 RAM when Icache is isolated is a good idea, in fact I'd implemented it already and the method works quite well. An alternative is to do it via mmap trickery, once you have sig-handler magic working. Regardless, the big benefit is to avoid having to emit cache isolation status-check-and-branch before each emitted store.

simias commented 6 years ago

Oh yeah, mprotect for handling memory access without explicit check is clever, I currently do have a bunch of ifs for every access. I had a few ideas to improve that later including trivial ones like disabling those checks for the SP register (assuming that it's indeed only used as a stack pointer which I hope is not too wild an assumption given the vast number of available GPR on MIPS). An other idea I had was to add profiling code to the 1st pass of the recompilation which would track where the loads/store actually end up going (the vast majority probably targeting RAM or scratchpad). Then once enough data is collected to recompile the code, only keeping the relevant target. At any rate my hope was that the branch predictor would be able to work its magic through these many branches, although obviously the large number of load/store operations in typical PSX code will put it through a lot of stress...

Your solution is definitively more elegant although I suppose it might hurt portability. I like that currently the only platform-specific thing about my code is the mmap call to get executable memory.

I have quite a bit of experience on this issue now, and can confirm that, indeed, there are games that "forget" to do an Icache flush before executing freshly-loaded code. In fact, it is for this reason that being overly-vigilant in your detection of code modification can backfire instead of achieving anything useful at all. I'd like to speak more about this later, I have taken too much time for right now though.

Glad to hear that my paranoia wasn't baseless but I hope you'll be able to explain what you mean by "backfire". Surely if you encounter self-modifying code without flush you want to be overly-vigilant? What's the alternative?

I definitely like the idea of non-overlapping blocks since it means more aggressive optimization but I think I'll keep with my current approach until I get to the point where I can actually run some games, it'll be easier to incrementally improve the code if I start from something that already works.

Thank you for your feedback, it's greatly appreciated.

Oh by the way if any of you have some feedback about how to best deal with timing code I'm all ears. IMO it's the clunkiest part of my code so far, I decrement a cycle counter before every instruction and eventually I'm going to add a cmp + ret to bail out if it reaches 0 to let the event handler run. I could streamline things by not doing it for every instruction and instead "buffering" it (for instance only checking on branches) at the cost of even worse timing accuracy. I wonder if there's a better way to handle this.

hrydgard commented 6 years ago

For the timing countdowns, I do the following (look at PPSSPP if you like):

at the end of each block, decrement global counter (dedicated reg or memory, your choice) by the number of cycles the block took, (sub ctr, 7 for example), right before jumping back to the dispatcher
the dispatcher checks flags, if it dropped below 0 you go process the upcoming event, reset the counter to the number of cycles to the next event, and jumps to execute the next block.

This will then also work when you start linking blocks, if you get that far - blocks should then have a tiny prologue that do the check, and that's where you'll jump to from other blocks after decreasing the counter.

Of course, I'm sure there are many other sensible ways to do it.

senquack commented 6 years ago

Oh yeah, mprotect for handling memory access without explicit check is clever, I currently do have a bunch of ifs for every access. I had a few ideas to improve that later including trivial ones like disabling those checks for the SP register (assuming that it's indeed only used as a stack pointer which I hope is not too wild an assumption given the vast number of available GPR on MIPS).

Yes, based on my experience, you can skip emitting an address range-check when base reg is $zero, $k0, $k1, $sp, or $gp, and simply assume these will always go to either RAM or scratchpad. The latter ($gp) is the riskiest, but I've never found a problem. Furthermore, you can skip code invalidation for stores for these base regs, excluding $zero (can be used to access first 32KB kernel RAM).

Glad to hear that my paranoia wasn't baseless but I hope you'll be able to explain what you mean by "backfire". Surely if you encounter self-modifying code without flush you want to be overly-vigilant? What's the alternative?

The vast majority of invalidation needed should be coming from DMA writing to RAM, which is very easy to catch and be more thorough about (if you feel the need... it's not actually necessary to be thorough., trust me ) Almost all of the rest needed can be caught by flushing recompiled blocks lying in RAM when cache control port isolation status changes. After all this, there's still a small number of games that do need individual block invalidation when a store occurs in their code. Typically, the actual PS1 code I've found doing the needed invalidation ends up being a compiler-generated memcpy, probably part of a code-overlay loading routine. But, this isn't always the case, so you can't just scan for and detect a memcpy sequence and emit code invalidations just for it.

Games that I know need this fine-grained code invalidation with stores (I'm sure there's more):

'G-Darius' (starting game from main menu)
'Elder Gate' (after beginning new game)
'Legends of Dragoon' (transitions into battle)
'Star Ocean 2' (after intro FMV)
'Valkyrie Profile' (early in game after talking to Freya and Odin)

So, without using a sig-handler + mmap/mprotect to assist you in detecting when a block stores to a page of PS1 RAM holding code, you are stuck with having to invalidate blocks for each and every store. Just because a few games drag you down. Only 1-2% of blocks total store to a page of RAM containing code, but without the sig-handler, you can't detect this. Block ptr invalidation can be expensive because for non-word-sized stores you must force word-alignment of NULL-store, etc. It's so much cheaper to let your sig-handler dictate when you must emit this crap. NOTE: some PS1 code is shared on the same page as data, which is unusual as compared to PC. So, that further complicates things and there are some programs, not a few, that do this.

There are even some that actively "modify" code per-frame etc, only to have the code always be the same by the time it's next executed.. I'm looking at you, "Looney Tunes Sheep Raider". So, you can see the topic goes on and on and on.. Here, overly-vigilant code-modification-detection can not only hurt performance, but also cause glitched rendering! BTW, it never does an Icache flush between any of these invalidations, it simply doesn't give a f*ck. Strange strange game.

Even worse, there's some games that depend on you not invalidating code, because they failed to (or intentionally didn't) do an Icache flush before executing at addresses within freshly-loaded code. For these, I believe PCSXR interpreter uses Icache-emulation, which is slow and perhaps even infeasible for recompiled code. Instead, we use a workaround for games known to need this. It involves never emitting any code invalidations for stores in recompiled code, flushing all RAM code when a DMA3 CDROM sector begins with "PS-X EXE" header, and, as usual, flushing all RAM code when PS1 Icache isolation sequence occurs. Games that need this hack:

'Formula One Arcade' (crash on load)
'Formula One 99' (crash on load) (NOTE: PAL version requires .SBI file)
'Formula One 2001' (in-game controls, AI broken.. even the PS3 supposedly has trouble emulating this.) NOTE: Formula One 2000 still broken, possibly CDROM emulation quirk in PCSXR code we use and not icache-related.

As far as I know, the sole reason PCSXR interpreter has ICache emulation, is to workaround these games. Our dynarec gets around it just as effectively and at a much lesser performance cost!

We've found that, instead of using complex inter-linked blocks, making blocks 'overlapping' allows for much simpler code-modification handling. Simply invalidating a block that begins at an address stored to in PS1 code is enough with all the aforementioned approaches (especially flushing all RAM code upon Icache status change). An improvement I am about to attempt is, instead of flushing all RAM code on Icache status change, instead allow two different versions of any given block to exist simultaneously. When the validity of a block is in question, a checksum of the PS1 code associated with it is made using a fast high-quality hash function (xxHash). Then, if it is later invalidated, the checksum can be compared to curent PS1 RAM, and it can be decided what to do. This will especially assist things like RPG battle transitions, where frequently a partial code overlay is swapped in or out of RAM on the fly, together with a Icache flush (when you're lucky hehe!). This way, the battle code and the world code can exist in harmony without having to be re-recompiled constantly.

For timing question: We do it how @hrydgard does it. Mednafen is focused very much on accuracy, but for a dynarec, users are interested in performance. We have an adjustable setting that advances the PSX clock 'X' number of cycles per emulated instruction: before a block returns, it advances it 'X' amount, based on the number of instructions executed at that point from the start of the block. The default is 2 cycles and it seems to work nicely. This is I think where the advantage of having blocks that can only be jumped to at their start shows: it really simplifies counter handling.

BTW, we don't advance counter differently for GTE or div/divu instructions etc, every instruction is just "averaged out" to '2 cycles'. 3 (or anything between 2..3) can be used for a boost in performance, but at a cost of compatibility and timing-accuracy. I want to investigate these matters further, but for now my focus is on improving performance for embedded devices.

trapexit commented 6 years ago

Forgive me if this has been discussed elsewhere but I was curious if anyone had considered some of the tips & tricks Darek Mihocka of emulators.com discussed over the years? His work on JIT vs interpreted VMs? Using JIT w/ some of the tricks above could make it difficult to port to other platforms.

Regardless... interesting read:

libretro / beetle-psx-libretro

Dynamic recompiler [$895] [BOUNTY] #214