Dynamic recompiler [$895] [BOUNTY]

inactive123 commented 7 years ago

See bounty link here -

https://www.bountysource.com/issues/48048939-dynamic-recompiler

Conditions:

A dynarec system for Beetle PSX, preferably written in C or else C++98. Portability to the various targets libretro supports is important in considering the programming language being used here. No C++11 or C++14 please,
A working backend for x86 32bit, x64 (x86 64-bit), ARMv7 (32bit). aarch64 (ARM 64bit) is optional and could maybe be made a separate bounty.
Should be engineered in such a way that new backend implementations for other architectures (like ARM and PowerPC) can be easily implemented. There is a separate bounty for a PowerPC dynarec, but first this one needs to be successfully completed before the PowerPC dynarec is feasible.
The dynarec needs to be PIC-compliant. This is essential since libretro cores predominantly are meant to run as dynamic libraries.
Should be significantly faster than the interpreter CPU core, and should lower Mednafen/Beetle PSX’s CPU system requirements considerably.

Some things to look at if you might need some inspiration -

https://github.com/daeken/beetle-psx-libretro

Daeken a long time ago started work on a dynarec for Beetle PSX. It technically works, but is too slow to be usable and probably still contains many bugs. We also in hindsight do not want to rely on generic JIT engines like libjit and LLVM, believing them to be too slow for the job at hand, and that we'd rather have just a custom-written dynarec. So look at it for inspiration but don't try to mimick what was done here or make the same design decisions like opting for libjit or other generic JIT engines.

Also - once this bounty has been completed - it would be much easier to make a Wii U dynarec. A separate bounty for this exists here -

https://github.com/libretro/RetroArch/issues/4852

https://www.bountysource.com/issues/44566014-bounty-to-add-dynarec-core-for-wii-u-port-of-beetle-mednafen-psx

senquack commented 6 years ago

Has anyone profiled Mednafen PS1 and determined where its cycles are being spent? If it's software-rendering, I bet most of them are being spent there. I know that's the case for us. In fact, after heavy optimization of the dynarec, the factor of cycles spent in dynarec vs software renderer can be anywhere from 1:10 to 1:20.

Furthermore, users might be disappointed when vertex-accuracy improvements that depend on tracking of values through memory no longer work. It's not enough to patch up GTE, no, you must track those values as they flow out of GTE through CPU calculations and RAM, and how can a dynarec do that efficiently?

If Mednafen is software-rendering, a fruitful speed improvement for multi-core devices, if it hasn't been done already (I don't keep track) is to software-render in a separate thread. Do the CPU/poly-setup code in one thread, do scanline rasterization in another. The GPU/VRAM of the PS1 should be sufficiently separated from the CPU to make this easy/worthwhile.

A dynarec will surely improve performance, but only so much and only if it's done with performance instead of "accuracy" in mind. Otherwise, you're just wasting time that'd be better spent elsewhere IMO. I put accuracy in quotation marks because cycle-accuracy on a system with unpredictable cache and a CDROM etc is a kinda silly idea.

simias commented 6 years ago

Even worse, there's some games that depend on you not invalidating code, because they failed to (or intentionally didn't) do an Icache flush before executing at addresses within freshly-loaded code.

Ah of course, I conveniently forgot about that case. I considered having a mode that would strictly emulate the icache but as you pointed out it's far from trivial and would probably cause a nosedive in performance. Simply handling the timing side of it wouldn't be too bad if you're doing thing cleverly I suppose, however actually execution from an incoherent cache would be an other thing altogether. I suppose one way to do it would be to check before each "block" if the cache is going to be coherent for that block and otherwise either bail out to the interpreter or recompile a special chunk of code representing whatever the icache contains and running that instead.

@hrydgard Regarding the timing code that seems to match what I've planned, although with the added limitations that with my non-overlapping blocks I can't just increment (or decrement) the counter once for a long stretch of instructions since I potentially have several entry points per-block. Currently I just have one sub in the prologue of each recompiled instruction which is obviously sub-optimal but I was hoping that, pipeline-willing, it would be pretty cheap on modern superscalar architectures. After all there's a dedicated register for the counter and no pipeline hazard. Maybe I'm too optimistic. At any rate it makes "linking" blocks trivial, I jump from block to block without having to call the dispatcher (which saves me from having to bank all my registers to switch back to the C calling convention).

Regarding the performance of Beetle/mednafen IIRC it's about 50/50 between the software renderer code and the rest of the emulator so clearly the dynarec won't be a silver bullet performance-wise. I still hope that it will give a notable performance boost in this configuration. Frankly my main reason for writing this dynarec is intellectual curiosity, as a user I'm perfectly happy with the interpreter.

There also are two somewhat experimental hardware renderers, one OpenGL based on the code from my Rustation PSX emulator and one Vulkan written by TinyTiger. Unfortunately neither of them are good enough to work with all games, Pete's GL plugin is still far superior in almost every aspect (we do emulate dithering though!)

I considered threading the software renderer and I think I know how I'd do it, unfortunately mednafen's GPU code is not very thread-friendly (a significant amount of global state in particular) and it would require a lot of butchering to make it happen. I decided that I would try it on my Rustation emulator instead but that's been on the backburner for a while now.

Regarding the GTE "accuracy" hacks those will break with the dynarec but I'm not too bothered about it, they're nice to have but they don't work with all games and I doubt it'll be a deal breaker not to support them for most users. I guess there's always the possibility of emitting the high-precision tracking code as part of the dynarec if somebody really wants to have their cake and eat it too.

I put accuracy in quotation marks because cycle-accuracy on a system with unpredictable cache and a CDROM etc is a kinda silly idea.

I don't completely agree with that, obviously the CDROM timings are wildly unpredictable (I even ended up implementing a PRNG in Rustation for that reason...) but for most games the CD is only accessed sporadically when assets are loaded, then it's either idle or streaming audio to the SPU.

When the game is loaded and running from RAM the timings are mostly deterministic. Of course you have the RAM refresh cycles, GPU<->CPU asynchronicity and a few other things that could shift a bit from console to console but I expect it's relatively minor.

Most games seem happy with wildly inaccurate timings though, so it's not a huge deal but after discussing with Ryphecha I gather that it's not difficult to find PSX games that rely on precise timings in some parts. Given the use size of the PSX library it's pretty difficult to know where one is supposed to draw the line, I'm sure there's a Pizza Hut demo disc out there that relies on perfect icache emulation and recursive branch delay slots...

senquack commented 6 years ago

You don't need perfect icache emulation. There's literally only 3 games (maybe a few more? I don't know) that abuse it and you already know the secret work-around. You'll never make a dynarec accurate, but Mednafen is as close as you can get in open-source, which is nice. But they don't know any more than you do how many actual cycles it took to get anywhere ;)

If you look at Mednafen PSX code too long you start to get the idea it's somehow a gameboy that needs cycle-perfect emulation. It's not, and you actually have a lot of leeway on the CPU side. IMO, it's the CDROM you have to worry about, and thankfully Mednafen has an awesome CDROM subsystem. The SPU can be troublesome too, but thanks to Notaz, at least in PCSXR, almost all bugs there are gone (Valkyrie Profile is picky with timings, can't use 'tempo fix' IIRC)).

I'll help you with whatever actual edge-cases you can identify, and would like to know of any actually, but I would focus 100% on performance. The timing ghosts IMO are an illusion, too much time spent looking at and worrying over Mednafen source code!

Weird things you actually need to worry about in a PS1 dynarec (as I update and remember them):

Branch/jump targets reading an address written in the BD slot of the jump.
An ALU op reading a MFC2 (gte) dest reg immediately after the mfc2.. it should see the old contents (Front Mission 3 battle leg destruction animation)
GTE instruction occuring twice when an exception occurs. If this is new to you, I can try to explain further later today.

simias commented 6 years ago

You're probably right, I think I'm aiming for more accuracy mainly because it's generally easier to make things less accurate and faster later on than the other way around. If I start cutting too many corners and I actually break something it'll be a pain to debug it whereas if I manage to get a (potentially slow) dynarec to work correctly I can iteratively improve it and make sure that it still works correctly at every step.

My current design might be a bit naive (especially the blocking) but it has the advantage of being rather straightforward to debug and I hope I'll be able to produce a working PoC soon enough.

By the way do you have any clever tricks for dealing with branch/load delay slots? I currently try to reorder them whenever possible but there are a few (hopefully very rare) edge cases that are non trivial to recompile, the worst being arguably a jump-in-jump-delay scenario like:

   j _foo
   j _bar

IIRC that should cause the first jump to prepare the execution of _foo but then the jump in the delay slot overrides that and targets _bar instead, however since the first instruction of _foo was being loaded it's executed as the jump delay slot of the 2nd jump only to immediately continue with _bar. I always though it was a funny "hack" because it lets you execute any random single instruction anywhere in memory and immediately get the control back regardless of the context (well, unless said instruction is also a jump...). It's trivial to emulate in an interpreter but not so much in a recompiler...

Obviously such a sequence seems rather unlikely but Ryphecha mentioned that Threads of Fate actually relies on it:

If you want to debug it, select "Mint" as your character, the graphics in the hallway will be spazzing out if you're doing recursive branch delay slots wrong

senquack commented 6 years ago

We only handle a few edge-cases in terms of branches/jumps in BD slots. I'll get back to you on that, but I know that it's only a few one or two games that need any sort of special BD treatment. I think 'Shadow Master' is one of them.

There is certainly no need to go overboard trying to handle edge cases that you'd never encounter and be able to test anyway. It's an overblown concept, PS1 branches/jumps in BD slots. There's more that I need to look up, but it's not much believe me. I'll dig up what I can later..

'Threads of Fate' took a good bit of work on Dmitry's part, it's merely worrying about instructions at a branch target reading a load or MFC2 etc executed in the BD slot of the branch, i.e. standard R3000A MIPS load delay. There was no automatic stalling of the pipeline back then, like there is on a modern MIPS chip. So, when this situation occurs, the instruction at the branch target is emulated via the interpreter, and then I believe the BD slot is then as well (I'll need to look at it more closely) Still, a very very unusual case and only maybe seen once. We do check for it all the same.

simias commented 6 years ago

@senquack yeah I'm sure it's not worth bothering too much with that, I just wondered if there was a clever trick I might have missed to handle delay slot in a more generic way. One can dream.

@trapexit it's a very interesting paper but most of it isn't really applicable to a PSX dynarec. The clever trick for byte swapping is irrelevant since PSX is LE (and I'm not sure there's a strong incentive to limit yourself to the 80368 IA these days). The XOR page table translation is clever but I wonder if it would be massively cheaper than an indirect reference like I do currently. It also wouldn't play very nice with my cache isolation trickery. @senquack's trick with remapping + mprotect is superior although not quite as portable.

The TLB thing to cache accesses to speed up subsequent access might have some merit but I'd have to implement it to figure out how much of a speedup it gives in practice. I'll have to consider it later on, I suppose if it's clever enough it might give a significant boost to memory accesses. At any rate @senquack's method is still superior since it removes the need to do range checking in the first place.

descent-ru commented 6 years ago

Hello there. I'd like to start with how I appreciate your discussion :3 You guys are badass!

Now, I've never done anything useful in the field, but I've been looking into this for a few days and there are a few ideas I'd like to share. If anything, you can have a laugh at my naivete.

1) Why choose between an interpreter and a recompiler? Given that the host will almost certainly have more cores than the emulated device, you can run them in parallel, with the interpreter running in real time and the recompiler doing a slow but thorough job optimizing hot spots using spare cores. A few minutes in the game the main loop will be completely recompiled and optimized a few times over, then the interpreter will point its branch targets to recompiled code. Finally you can store the compiled code into a cache file so next time the game will run at full speed from the start.

2) If you can make a versatile enough opcode table that can completely describe the load-store differences between the dozen xor opcodes in the x86 then making the rest of the (universal!) recompiler will be a breeze. You'll just have to describe some special snowflakes like "count_leading_zeros" as intrinsics common to all backends then do some rudimentary liveliness analysis using the aforementioned table as a reference and finally put the code back together using the target's opcode table and a generic register allocator.

3) Handling delay slots shouldn't be hard: if the instruction in the delay slot doesn't affect the branch condition, they can be swapped and then recompiled normally; otherwise you can make a thunk for a branch target with the instruction in the delay slot immediately followed by an unconditional jump to the real branch target.

4) You really should look into using memory protection to emulate the psx memory map. Those branches in 'readmemory' and such are an absolute murder for the cpu pipeline. Sure, there are some platforms that don't have an mmu, but shouldn't those be treated as stretch goals? It's also highly likely that memory accesses between different memory segments won't overlap so there can be some optimizations done.

Cheers!

simias commented 6 years ago

Why choose between an interpreter and a recompiler? Given that the host will almost certainly have more cores than the emulated device, you can run them in parallel, with the interpreter running in real time and the recompiler doing a slow but thorough job optimizing hot spots using spare cores. A few minutes in the game the main loop will be completely recompiled and optimized a few times over, then the interpreter will point its branch targets to recompiled code. Finally you can store the compiled code into a cache file so next time the game will run at full speed from the start.

I think that's a bit similar to the 2-pass approach I described somewhere earlier, where the recompiler would first emit "slow" code with various profiling tools to collect stats and then later a 2nd pass that would optimize it further using the profiling data. I suppose you could use the interpreter for the first pass but then the problem is that you have to switch to interpreter mode every time you hit a portion of code that's not been recompiled yet, and on top of that the interpreter will be even slower than usual because of the profiling overhead. It's probably doable but I'm not sure if it's massively more convenient. I think caching the profiling data (or even the recompiled code) on disc in such an approach is a good idea however.

Regarding running the interpreter and recompiler in parallel I'm not sure what it would achieve and it's not as simple as "let's just use threads". Consider that the emulated CPU interacts with the emulated hardware constantly, if you have two CPUs running in parallel and doing the same thing they're going to step on each other's toes. One will read one byte from the controller port and the other will read the next one, none of them getting complete data. Or pushing the same data to the GPU twice, or incrementing a variable in RAM twice. You'd have to basically run the entire emulator twice for this to work, but then what's the point?

If you can make a versatile enough opcode table that can completely describe the load-store differences between the dozen xor opcodes in the x86 then making the rest of the (universal!) recompiler will be a breeze. You'll just have to describe some special snowflakes like "count_leading_zeros" as intrinsics common to all backends then do some rudimentary liveliness analysis using the aforementioned table as a reference and finally put the code back together using the target's opcode table and a generic register allocator.

I think making a generic intermediary representation that would target optimally any architecture is a bit too ambitious for my little brain. Unifying between the wildly different amounts of registers, calling conventions, instruction sets, the different ways tests and branches are implemented... For instance right now since I'm targeting amd64 I use static register allocation using r8-r15 (storing the statistically most used PSX registers in these, the rest in RAM). It might not be optimal for any given block of instructions but at least I don't have to worry about having to match conventions when jumping from one block to the next.

But the biggest difficulty wouldn't be the registers, for instance x86 can reference memory directly in operands for something like xor, so if you want to xor a register into RAM you can do it in a single instruction. On load/store architectures you can't. That difference alone seems tricky to abstract over.

Handling delay slots shouldn't be hard: if the instruction in the delay slot doesn't affect the branch condition, they can be swapped and then recompiled normally; otherwise you can make a thunk for a branch target with the instruction in the delay slot immediately followed by an unconditional jump to the real branch target.

That's how I do it but it doesn't handle all the cases unfortunately. Problems arise in the (unlikely) case where you end up with nested delay slots, be it branch or load delay.

Consider:

    jal   _foo
    jal   _bar

Or even the snickier load-delay-in-branch-delay:

    jal   _foo
    lw    a0, 0(t0)

These two sequences are very tricky to recompile efficiently. Fortunately they're also rather useless and unlikely to be very common in the wild.

You really should look into using memory protection to emulate the psx memory map. Those branches in 'readmemory' and such are an absolute murder for the cpu pipeline. Sure, there are some platforms that don't have an mmu, but shouldn't those be treated as stretch goals? It's also highly likely that memory accesses between different memory segments won't overlap so there can be some optimizations done.

My main concern isn't MMU-less machines (is there even one wildly available MMU-less machine capable of emulating a PSX?) it's code simplicity and portability. I agree that it's a much better approach but I think it's something that can be fixed later once my basic "dumb-and-slow" dynarec works.

descent-ru commented 6 years ago

I suppose you could use the interpreter for the first pass but then the problem is that you have to switch to interpreter mode every time you hit a portion of code that's not been recompiled yet, and on top of that the interpreter will be even slower than usual because of the profiling overhead.

Oh, I failed to elaborate. My bad.

The recompiler won't actually run any code, it would produce entire basic blocks that the interpreter can jump into. The overhead will be that of a function call (setting up a stack frame, spilling the registers, etc). And when more blocks are recompiled you can do some obvious stuff like inlining or eliminating extra calls, as those won't be visible to the interpreter. One advantage of the approach you can't beat is that the recompiler won't have any time constraints whatsoever, so it can go all out optimizing the code. The low-hanging fruit of an ability to validate the recompiled code against the interpreter during development won't hurt either.

There are some obvious ways to reduce the profiling overhead. You can do sampling: have another core check the virtualized program counter a few thousand times a second then match it against known memory regions. You can limit profiling to branch targets by keeping a separate sparse (with pages allocated on write) branch hits table and having another core do an exponential falloff at fixed time intervals, like linux loadavg on steroids.

By the way, keeping a real program counter would allow some crazy stuff like recording user inputs and then doing frame-accurate replays for regression testing on a buff server.

I think making a generic intermediary representation that would target optimally any architecture is a bit too ambitious for my little brain.

Well, that's the problem :( There's actually a complete machine-readable ARM specification (google for "ARM’s ASL Specification Language"), but it's so dauntingly verbose I'm afraid of even looking into it. But allegedly, with its help you can patch hack together a working ARMv8 interpreter over a weekend.

Unifying between the wildly different amounts of registers, calling conventions, instruction sets, the different ways tests and branches are implemented...

You can, however, make a couple of architecture-specific tables with things like "ADD takes (reg, [reg] or imm) does ADD op then overwrites (reg, [reg] or imm) and flags" and make a table-driven compiler that will use liveliness analysis to minimize spills and dataflow analysis to do necessary operations on data through a "path of least resistance".

On a side note, the more I think of an intermediary layer between the emulated code and the host the more I see it like some stack-based machine with 1bit to 128bit values and common intrinsics (like add/sub/mul or count_leading_zeros) as opcodes. The codegen will then consist of a register allocator and a pattern matcher with a table-based code emitter. One can dream, huh?

Problems arise in the (unlikely) case where you end up with nested delay slots, be it branch or load delay.

Oh, I didn't know about that. My MIPS book says it's an undefined behavior, though. I can't think of a better solution yet, sorry.

it's code simplicity and portability.

Well, at least on Windows and Linux this can be abstracted away. Are there any other systems I don't know about?

By the way, what's the baseline target platform for emulation host? I.e. can I buy a 512M 1-core 40x-slower-than-a-Celeron RPi Zero and be sure all others are faster than it?

trapexit commented 6 years ago

"Well, at least on Windows and Linux this can be abstracted away. Are there any other systems I don't know about?"

Theoretically all platforms libretro targets. https://docs.libretro.com/#which-platforms-are-retroarch-available-for

RetroArch runs and is supported on GNU/Linux, BSD, Windows, Mac OSX (PPC/Intel), Haiku, PlayStation 3, Playstation Vita, Playstation Portable, XBox 360, XBox 1, Raspberry Pi, Nintendo Gamecube, Nintendo Wii, Nintendo Wii U, Nintendo 3DS, Android, iOS, Open Pandora, and Blackberry.

simias commented 6 years ago

@descent-ru Yeah you make sense, both the interpreter/profiler and the architecture abstraction could definitely be a great improvement if they can be made to work correctly. I just prefer to keep a less ambitious objective for the moment otherwise I'll just get overwhelmed by the complexity of the task at hand.

My MIPS book says it's an undefined behavior, though

That's fortunate because it's even less likely to have been used purposefully by a coder or compiler, that being said when you code an emulator there's no such thing as an "undefined behavior"...

By the way, what's the baseline target platform for emulation host? I.e. can I buy a 512M 1-core 40x-slower-than-a-Celeron RPi Zero and be sure all others are faster than it?

I don't understand, all other what?

The target performance target is not well defined as far as I know, I guess in the end we'll mostly bench against other dynarecs out there. Looks like the RPi has a 1.4GHz Cortex-A53, seems like it should be powerful enough to emulate a PSX although it might require some more optimization passes and an OpenGL ES renderer. Maybe the GTE could be efficiently emulated using NEON if we're not too picky with accuracy. At any rate, that'll be for later.

legoboy0109 commented 6 years ago

If you people haven't seen this yet, it might be helpful. http://drhell.web.fc2.com/ps1/ This is the emulator I personally use to play PSX games, and it has a Dynarec.

simias commented 6 years ago

@legoboy0109 I can't read Japanese, is it open source?

legoboy0109 commented 6 years ago

Google translate is a helpful tool. I'm not sure, but there is quite a bit of info about building a dynarec for psx right there on the website for anyone to read.

piratesephiroth commented 6 years ago

Xebra? I think that emulator is ancient, at least on Windows. Also it's freeware but not open source so probably not helpful at all.

legoboy0109 commented 6 years ago

Well, I know it's still being developed because the latest windows build is from 2017. I'm not sure how actively, but it runs better than this emulator and has no audio stuttering or even that many bugs. It's only disadvantages are it's inability to upscale and user unfriendly interface.

inactive123 commented 6 years ago

Bounty has increased to $720 now courtesy of saftle and meepingsnesroms!

rcaridade145 commented 6 years ago

Considering this has been taking a while, wouldn't it be easier and perhaps a middle ground to port the cached interpreter from mupen64? Or you believe it isn't worth it?

inactive123 commented 6 years ago

@simias is still busy with active development on his fork.

Cached interpreter could be a nice additional feature, but we want the real dynarec as well.

inactive123 commented 6 years ago

@simias If you are stuck on any particular part of the dynarec, are there ways you could split up tasks to external members so we might get this sooner to completion?

andres-asm commented 6 years ago

just wondering, why can't we use WiiSX-R instead?

roflcopter777 commented 6 years ago

I thought WiiSX-R doesn't have a dynarec at all, no?

rcaridade145 commented 6 years ago

@roflcopter777 https://wiki.gbatemp.net/wiki/WiiSX_compatibility_list_(beta_2) according to here it does.

roflcopter777 commented 6 years ago

Odd. But still.

simias commented 6 years ago

@twinaphex sorry for being awol for a couple of months, I didn't have a lot of spare time for emulation. Things are calming down though, I hope to get back to within a week or two.

If anybody needs help understanding my code or writing their own dynarec feel free to ask.

inactive123 commented 6 years ago

OK, cool. You think you are getting close to getting this working and giving some kind of measurable performance improvement over the interpreter?

inactive123 commented 6 years ago

What if we reduced/narrowed the scope and initially aimed here for a working x86 x64 backend with good performance? Would that seem less daunting? I could put in additional funds towards this purpose. Then the other backends could follow later, and maybe even in separate bounties.

simias commented 6 years ago

I have a decent subset of the MIPS instruction recompiled for amd64 but I still have a significant amount of work ahead of me until I actually manage to run some games. I've also started implementing unit tests because I wasn't feeling super comfortable having such a huge slab of code without tests. Hopefully it'll make it easier to validate the recompiler and debug it.

inactive123 commented 6 years ago

OK, encouraging to hear,

simias commented 6 years ago

So I've revisited the issue of block patching and linking. My current idea is the following:

All blocks are always linked, any kind of branching in a block is implemented by a CALL *target where target is the beginning of a recompiled memory block. If the target has not yet been recompiled when the CALL is emitted then a "dummy" target is used that directly invokes the dynarec.
After the target has been recompiled and since the caller jumped there using a CALL I can retrieve the address of the calling instruction from the stack and hot-patch it. If several locations jump into the same block (by calling a common function for instance) they'll be patched incrementally one by one.
If the target is actually valid I just have to pop the unused return address from the stack in the prologue
When a block is invalidated I just overwrite the prologue with code that recovers the caller's SP and calls the dynarec.

The advantage of this method is that I don't have to keep track of who-links-to-whom, I fix the links "just in time" when a block attempts to call an invalid block. Abusing CALL to save the caller's address also means that I don't have to explicitly add code to identify the linker when patching is due, however it means that I will have to add a POP to remove the address in the prologue of the linkee.

The inconvenient is that I can never get rid of recompiled blocks since I never know when all the potential callers have been patched, so I effectively leak memory until a full cache flush is encountered. An easy workaround is to trigger a full flush if the memory consumption goes above a reasonable limit which will cause the dynarec to restart from scratch (and therefore probably cause stuttering). Alternatively I could implement a simple reference counter on each block that causes it to be free when all references have been patched. That being said unless I map each segment individually it won't help much because of fragmentation. I guess I should probably map a few MBs at a time and release/reuse them when all blocks are freed.

Beyond that and thanks to @hrydgard and @senquack's advice I've decided to rework my code to use overlapping blocks which ought to simplify things and allow more aggressive optimization. The first thing I plan to implement are dynamic register allocation and constant propagation, but that's for later.

simias commented 6 years ago

Actually I just realized that I'll have to implement an other algorithm to deal with indirect calls (JR and the like) but that shouldn't be too difficult. Maybe instead of doing a CALL I'll just set wanted target address in, say, EAX, and the address to be patched (if there's one) in ESI for instance, then the dispatcher can work its magic. It'll add a pair of MOVes befor every JUMP but that's not too bad I suppose.

simias commented 6 years ago

I've been toying with using memory mappings, mprotect and segfault handlers to handle memory accesses as described by @senquack. For 64 bit architectures I can start by mmaping 4GB of address space for the entire PSX addressable memory and then remap specific subsection as needed (RAM, BIOS, scratchpad etc...).

However am I right to assume that this approach can't be trivially implemented on 32bit platforms? After all on those I can't just remap 4GB since, well, that's the whole address space of the host. I can use region masking to reduce the address space to 512MB (or maybe even less if I compress the address space further) but I still need to map chunks of memory at arbitrary absolute addresses in the virtual space of the process which sounds tricky, especially with the address space randomization used in my modern OSs.

I guess the follow up question is: do we want to have a dynarec that works on 32bit hosts? I suppose I could leave the current "slow" logic available as a fallback for these.

inactive123 commented 6 years ago

Yeah, we definitely want to keep the door open for 32bit. Those systems will need the jump in performance the most.

simias commented 6 years ago

Yeah after thinking a little bit more it's probably a good idea to keep the "simple" code available anyway, if only for testing purpose.

simias commented 5 years ago

So I have make some progress, I can display the BIOS boot screen but the textures are messed up. I also have no sound. So... it's something. I think I could also display the PlayStation logo but I need to implement the GTE first.

dynarec

inactive123 commented 5 years ago

Awesome to see it beginning to run something!

simias commented 5 years ago

So the textures are actually uploaded correctly to the GPU but then some (but not all) drawing commands get mangled. In particular all triangles are drawn at with the right coordinates but the texture mapping coordinates are all wrong, which explains why they look completely messed up.

nayslayer commented 5 years ago

Hey, @twinaphex! Now that PS1's finally getting some of the attention it deserves, can you please point out some other core that could use a new dynarec, but lacks a commited developer? For the past week I've been thinking up ideas on how to simplify writing recompilers, and now I'm itching to put them to good use. Otherwise I'll just go get myself stuck on the Quixotic quest of refactoring pcsx2 before I can do any meaningful things.

Oh, a glimpse of what I'm trying to achieve: mips_alu.isa. The idea is to make a machine-readable ISA spec using ops common to all architectures, then use a pattern-driven translator to turn it into the host code, like golang compiler does. To keep it real I'll probably start by writing an interpreter and then turning it into a recompiler by using luajit's dynasm (a preprocessor-based emitter for x86, x86_64, arm, arm64 and ppc).

inactive123 commented 5 years ago

Hi there @nayslayer !

Yeah, I could certainly think of another core. Mednafen/Beetle Saturn is a Sega Saturn emulator that, much like Mednafen/Beetle PSX here, lacks a dynarec. It has even more steep system requirements than even Mednafen/Beetle PSX however, so it would be very beneficial if a dynarec could drastically lower system requirements.

The Sega Saturn uses two Hitachi SH-2 CPUs running at about 28MHz each. Let me know if this is something that interests you. Otherwise I can think of another core.

The repo can be found here -

https://github.com/libretro/beetle-saturn-libretro

nayslayer commented 5 years ago

Yeah, that looks good, thanks! I'm in the process of acquiring specs/sdks/games, once I do, I'll post an issue on the repo you provided.

inactive123 commented 5 years ago

Awesome to hear! Dynarec in Beetle Saturn would be a definite game changer I'm sure.

pcercuei commented 5 years ago

Link to my own dynarec: https://github.com/pcercuei/lightrec

It's kinda slow as it calls back to C for each read/write, but with @senquack's trick and some proper high-level optimization (constant propagation, SSA) it shouldn't be too bad.

The nice thing with my dynarec is that it uses GNU Lightning as the code generator. Think about something like LLVM but much lower level. As such, it will work on x86, ARM, MIPS and PPC, all of them in 32 and 64-bit variants.

The other interesting thing is that I added a debug feature that allows you to run two instances in parallel and get detailed information as soon as a recompiled block does not behave correctly.

inactive123 commented 5 years ago

Put another $100 into this bounty.

pcercuei commented 5 years ago

Are there somebody already working on this? I'm willing to throw two months of full-time work on this, but I don't want to step on someone else's toes.

inactive123 commented 5 years ago

Hi there @pcercuei,

that's very gracious of you, thanks. I guess this all depends on where @simias stands right now. He has a dynarec branch on his repo. The last activity has been in September though.

If you guys can both bring this currently existing work by @simias to completion through a joint effort, would you guys be fine if we split the bounty 50/50?

All we ultimately care about is that we get this dynarec working sooner rather than later, and at this point after having waited over a year or so or longer, I think it might be best if we work with what currently already exists instead of having to build everything up from scratch and losing even more months.

So, what do both of you think about my proposal to work together on simias' dynarec and then split the bounty 50/50? It won't work of course if both of you don't agree on this, but I hope we can find some common ground.

Klauserus commented 5 years ago

hi. I spend 40 dollars to the bounty and i hope i can see a good beetle Emu on my Raspberry or Handheld console in Future (not a makeshift). Dear programmer, drink a Beer for me after work! I love your work. Greetings

inactive123 commented 5 years ago

Thanks very much for the stimulus! We are now at $895!

pcercuei commented 5 years ago

I think @simias is working on a "classic" mips32 to x86_64 dynarec. I want to work on a radically different design, loosely based on the one I already have, and honestly I think it would take me as much time to understand @simias' dynarec and learn x86 assembly than writing a first version of my improved dynarec.

inactive123 commented 5 years ago

Hi there,

OK, so both of you working on the same codebase is not going to happen then.

I don't want to upset either of you since I know @simias has worked long and hard on his dynarec too (although it is not yet working), so how about this instead -

@pcercuei works on his own dynarec independently, upon completion he can get half of whatever this bounty is at right now. Plus you will get another $200 from the house (libretro) upon completion of this dynarec just as an additional thank you. So you'd be looking at $450 + $200. And maybe the bounty will still go up in the intervening time.

The other half of the bounty will still be kept whenever @simias reaches completion on his own dynarec. I am fine with alternate dynarec systems especially if they can later be repurposed to target other systems/devices as well. So I'd still very much appreciate simias' dynarec being completed.

Does this sound agreeable?

On our end, we just want something to start working as soon as possible. The dynarec bounty started August 2017, I would be very glad if by end of December we finally have some games starting to run with either of both dynarecs that is at least 2 to 4x as fast as the current interpreter core.

inactive123 commented 5 years ago

@pcercuei @simias Can I get a response from either of you on what I wrote above?

libretro / beetle-psx-libretro

Dynamic recompiler [$895] [BOUNTY] #214