simias / rustation

PlayStation emulator in the Rust programming language

Subpixel precision (GTE accuracy) #28

Open simias opened 8 years ago

simias commented 8 years ago

I'm developing a prototype in the subpixel branch. More details to come...

Nucleoprotein commented 8 years ago

After removing the garbage, some games are not affected by this hack at all, such as the previously mentioned Tomb Raider II (SLUS-00437) and Silent Hill (SLUS-00707), i.e. they look exactly the same as without it. Also, here is a TR2 screenshot with broken coords: https://imgur.com/a/a0XZo (the ones that are not broken are fixed point) :stuck_out_tongue:

ADormant commented 8 years ago

I believe Crash Team Racing and Wipeout are problematic too.

simias commented 8 years ago

Okay, finally got my debugger up and running.

So the first thing I looked into was how the BIOS displayed the "PlayStation" logo at the start. I used the SCPH7003 BIOS (NA, version 3.0).

The first GTE XY FIFO access is an SWC2 at PC 0x8004e7d0:

=> 0x8004e7d0:  swc2    $12,0(a3)       /* $a3: 0x80086eb8 */
   0x8004e7d4:  swc2    $13,0(t0)
   0x8004e7d8:  swc2    $14,0(t1)
   0x8004e7dc:  swc2    $8,0(t2)
   0x8004e7e0:  nop
   0x8004e7e4:  c2      0x158002d
   0x8004e7e8:  lw      t1,28(sp)
   0x8004e7ec:  mfc2    t0,$7
   0x8004e7f0:  mfc2    t0,$7
   0x8004e7f4:  nop

I then expected the BIOS to use the DMA to upload the completed commands to the GPU, but that's not the case. Instead, this code uploads the data to the GPU:

=> 0x80050b38:  lw      t6,0(a0)        /* $a0: 0x80086eb8 */
   0x80050b3c:  move    v0,a1
   0x80050b40:  addiu   a1,a1,-1
   0x80050b44:  addiu   a0,a0,4
   0x80050b48:  bnez    v0,0x80050b38
=> 0x80050b4c:  sw      t6,0(v1)        /* $v1: 0x1f801810 (GPU GP0) */

So we can see that instead of using the DMA the BIOS copies the commands from the RAM towards the GPU in software using regular LW/SW.

This is an interesting situation for subpixel precision because in order to handle this situation we need to tie the enhanced precision vertex data with one of the CPU's general purpose registers ($t6 here).
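To make the idea concrete, here is a minimal sketch of what "tying" precise vertex data to a register could look like. It is not the code from the subpixel branch: `PreciseXy`, `Tagged`, `Cpu` and all method names are hypothetical, RAM is modelled as a flat array of words, and the float-to-integer conversion simply truncates. Every RAM word and every general purpose register carries an optional high-precision pair, so the pair survives the LW/SW copy loop and is dropped as soon as the value is modified.

```rust
/// High-precision screen coordinates as produced by the emulated GTE (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
struct PreciseXy {
    x: f32,
    y: f32,
}

/// A 32-bit value plus the precise vertex it was derived from, if any.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Tagged {
    raw: u32,
    precise: Option<PreciseXy>,
}

/// Hypothetical CPU state where registers and RAM words both carry a tag.
struct Cpu {
    regs: [Tagged; 32],
    ram: Vec<Tagged>,
}

impl Cpu {
    fn new(ram_words: usize) -> Cpu {
        let zero = Tagged { raw: 0, precise: None };
        Cpu { regs: [zero; 32], ram: vec![zero; ram_words] }
    }

    /// SWC2 of a GTE XY register: store the packed 16-bit pair (X low, Y high)
    /// and remember the float pair next to it. Truncation is an assumption.
    fn swc2_xy(&mut self, word_index: usize, p: PreciseXy) {
        let x = p.x as i16 as u16 as u32;
        let y = p.y as i16 as u16 as u32;
        self.ram[word_index] = Tagged { raw: (y << 16) | x, precise: Some(p) };
    }

    /// LW: the tag follows the value into the register.
    fn lw(&mut self, reg: usize, word_index: usize) {
        self.regs[reg] = self.ram[word_index];
    }

    /// SW to GP0: what the GPU would receive, tag included.
    fn sw_gp0(&self, reg: usize) -> Tagged {
        self.regs[reg]
    }

    /// Any arithmetic on the value invalidates the tag.
    fn addiu(&mut self, rd: usize, rs: usize, imm: i16) {
        let raw = self.regs[rs].raw.wrapping_add(imm as i32 as u32);
        self.regs[rd] = Tagged { raw, precise: None };
    }
}
```

With this layout the BIOS loop above would carry the tag through `$t6` (register 14) from RAM to GP0 without any special casing, at the cost of extra work on every load and store.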

Of course the BIOS is not the most interesting test case for subpixel precision and it's not really a big deal if it breaks for the PlayStation logo, but I wouldn't be surprised if some games did something similar.

i30817 commented 8 years ago

Ehhh, can that situation be detected and logged? If only a few games do that then, no offense, I'd rather have them broken, or at least have a fast path and a slow path (for those games), than slow everything down significantly. I know it's a hack, but the feature itself is a hack.

simias commented 8 years ago

Yeah maybe, I'm going to test more games. I haven't really settled on a solution yet. I was just interested to test the BIOS because I noticed that my current implementation didn't work there and wanted to figure out why it didn't.

simias commented 8 years ago

Also maybe it could be made optional, the hack could have various levels of complexity which could be turned on and off depending on the game and the capabilities of the host computer.
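As a rough illustration of what such levels might look like, here is a sketch with made-up names (`SubpixelLevel`, `Config` and the individual levels are assumptions, nothing like this exists in the branch yet). The levels are ordered from cheapest to most expensive so that each check is a single comparison.

```rust
/// Hypothetical complexity levels for the subpixel hack, cheapest first.
/// The derived ordering follows the declaration order of the variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum SubpixelLevel {
    /// Hack disabled: plain integer coordinates.
    Off,
    /// Track precise values only in RAM (GTE store, then DMA to the GPU).
    MemoryOnly,
    /// Also track precise values through the CPU's general purpose registers.
    CpuRegisters,
}

/// Hypothetical per-session configuration.
struct Config {
    level: SubpixelLevel,
}

impl Config {
    /// True if GTE stores should record precise coordinates next to RAM.
    fn tracks_memory(&self) -> bool {
        self.level >= SubpixelLevel::MemoryOnly
    }

    /// True if loads and stores should also propagate tags through registers.
    fn tracks_registers(&self) -> bool {
        self.level >= SubpixelLevel::CpuRegisters
    }
}
```

A game like the BIOS logo case would need `CpuRegisters`, while a game that only uses the DMA could presumably get away with `MemoryOnly`.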

simias commented 8 years ago

I managed to get it working with Crash Bandicoot but not Spyro for some reason.

@i30817 Do you also try to get perspective correct mapping working or just subpixel precision? Since I'm using an OpenGL renderer I thought I might try to get the z-coordinate with the floating point coordinates but that doesn't seem to work well so far.

i30817 commented 8 years ago

I have no good idea of graphical programming so I can't answer that about the z-coordinate precision on the GTE.

In general I guess if you manage to surpass the other emulators at graphical enhancement of ps1 games it would be a powerful draw to users, but making the feature optional and with as many fast vs slow paths as possible seems best for the final solution (simpler prototyping is ok).

If you manage to detect when the simpler technique fails and replace it with the more complex one, without false positives or missed events, that would be best (certainly better than per-game configs, which sound troublesome given the size of the PS1 library, and which are too coarse a measure anyway, since surely most games that need the more complex technique don't need it everywhere?).
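One way such detection could work, sketched under assumptions: when a vertex reaches the GPU, compare the integer coordinates in the command word with the precise pair that was tracked for it. If they disagree, or nothing was tracked, fall back to the integer values and count the miss so it can be logged. `Vertex`, `Stats` and `resolve` are hypothetical names, and the comparison assumes the integer value is the truncated float, which may not match the real GTE rounding.

```rust
/// Result of resolving one vertex at GPU submission time (hypothetical type).
#[derive(Debug, PartialEq)]
enum Vertex {
    /// The tracked high-precision pair matched the integer coordinates.
    Precise(f32, f32),
    /// No usable precise data: plain integer coordinates converted to float.
    Fallback(f32, f32),
}

/// Counters that could be logged per frame or per game.
#[derive(Default)]
struct Stats {
    hits: u64,
    misses: u64,
}

/// Pick the precise pair only if it is consistent with the raw 16-bit pair.
fn resolve(raw: (i16, i16), tracked: Option<(f32, f32)>, stats: &mut Stats) -> Vertex {
    match tracked {
        // Assumption: the raw value is the truncation of the precise one.
        Some((x, y)) if x as i16 == raw.0 && y as i16 == raw.1 => {
            stats.hits += 1;
            Vertex::Precise(x, y)
        }
        _ => {
            stats.misses += 1;
            Vertex::Fallback(raw.0 as f32, raw.1 as f32)
        }
    }
}
```

A high miss count with the cheap tracking enabled would be a hint that the game moves vertex data in a way the cheap path does not follow, which is the point where a more expensive path could be switched on.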

simias commented 8 years ago

I see. Currently I manage to run Crash at full speed with the expensive version of the hack but I'm almost maxing out my CPU.

I think I'm going to try to get better compatibility with my emu before I continue with this hack, I can't really test all the games I want.

ADormant commented 8 years ago

Is it possible to make this emulator multithreaded? For example, CPU and GTE on one thread and SPU and MDEC on the other, or even CPU and GTE on different threads? Though it'd be better if the GTE could be emulated on a GPU. By the way, could you add an option to switch between these hacks?

simias commented 8 years ago

I implemented it in a way that would make it possible to make it an option (with no performance hit when the option is disabled) but I haven't actually implemented the option yet.

The GPU could be multithreaded but I'm not sure if there's a point since it's already de-facto offloaded to the host GPU through OpenGL so it shouldn't take too much CPU time.

For the rest it's more difficult, the GTE is so tightly coupled with the CPU that it's going to be hard to make it run in a separate thread. The MDEC is coupled with the CPU and the DMA which is itself coupled with RAM (and CPU) so it's going to be pretty difficult too, although probably less so than the GTE.

The MDEC has pretty specific use cases (FMV, pre-rendered backgrounds...) and generally it runs during loading times or while video is being displayed and the rest of the system is pretty much idle (except for SPU and CD-ROM, probably) so I'm not sure you'd see any significant improvement by threading the MDEC.

The SPU might be doable, I'm not sure at this point if it's worth it.

ADormant commented 8 years ago

Audio is rather power hungry in many emulated consoles.

simias commented 8 years ago

Yeah but in order for threading to give us performance we must offset the cost of the resynchronizations. If the average game tinkers with the SPU very frequently (reading registers, uploading audio, waiting for interrupts...) the thread might spend all of its time resync'ing which might well end up being slower than optimized single threaded code. There is no such thing as a free lunch.
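The trade-off can be put into a toy model. This is only a back-of-the-envelope sketch: the function names and the numbers in the usage note are illustrative, not measurements, and the model assumes the two threads overlap perfectly between synchronization points.

```rust
/// Time per frame when CPU and SPU emulation run one after the other.
fn single_threaded_us(cpu_us: f64, spu_us: f64) -> f64 {
    cpu_us + spu_us
}

/// Time per frame when the SPU runs on its own thread: the two workloads
/// overlap, but every resynchronization adds a fixed cost.
fn threaded_us(cpu_us: f64, spu_us: f64, syncs: u32, sync_cost_us: f64) -> f64 {
    cpu_us.max(spu_us) + syncs as f64 * sync_cost_us
}

/// Threading only pays off if the overlap saves more than the syncs cost.
fn threading_wins(cpu_us: f64, spu_us: f64, syncs: u32, sync_cost_us: f64) -> bool {
    threaded_us(cpu_us, spu_us, syncs, sync_cost_us) < single_threaded_us(cpu_us, spu_us)
}
```

With made-up figures of 10000 µs of CPU work, 3000 µs of SPU work and 5 µs per resync, 100 resyncs per frame still come out ahead (10500 µs against 13000 µs), while 1000 resyncs per frame are slower than the single threaded version (15000 µs).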

Nucleoprotein commented 8 years ago

I get about 25FPS on BIOS screen using master branch, so rustation is very slow for me, but I have old CPU - Q6600.

ADormant commented 8 years ago

Vulkan and DX12 are supposed to have better multithreading anyway, but in regards to performance with OpenGL consider using these in the future:

https://www.opengl.org/registry/specs/ARB/vertex_attrib_binding.txt
https://www.opengl.org/registry/specs/ARB/vertex_attrib_64bit.txt

I heard this extension is essential to performance. There is also bindless drawing.

https://www.opengl.org/registry/specs/ARB/conservative_depth.txt
https://www.opengl.org/registry/specs/ARB/indirect_parameters.txt
https://www.opengl.org/registry/specs/ARB/buffer_storage.txt
https://www.opengl.org/registry/specs/ARB/multi_bind.txt
https://www.opengl.org/registry/specs/ARB/multi_draw_indirect.txt
https://www.opengl.org/registry/specs/NV/bindless_multi_draw_indirect.txt
https://www.opengl.org/registry/specs/NV/bindless_multi_draw_indirect_count.txt
https://www.opengl.org/registry/specs/ARB/fragment_shader_interlock.txt
https://developer.nvidia.com/sites/default/files/akamai/opengl/specs/GL_NV_fragment_shader_interlock.txt
https://www.opengl.org/registry/specs/INTEL/fragment_shader_ordering.txt
https://www.opengl.org/registry/specs/ARB/shader_atomic_counters.txt
http://blog.icare3d.org/2014/09/maxwell-gm204-opengl-extensions.html

ADormant commented 8 years ago

@simias By the way, did you already implement perspective-correct texture mapping in that subpixel branch, since you mentioned it here:

Do also try to get perspective correct mapping working or just subpixel precision?

simias commented 8 years ago

@tapcio Ouch, that is pretty slow. I haven't really spent time optimizing yet, hopefully that will be improved in the future. It still runs decently well on my Core i5-2450M @ 2.5GHz.

@ADormant I tried. In the subpixel branch I store the Z coordinate of the vertex alongside the precise X and Y values and I feed them to OpenGL. I don't really know if it works though, Crash Bandicoot doesn't have a lot of obvious texture warping going on. I'm going to get more games working and give it another try.
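For reference, one common way to get perspective-correct interpolation out of OpenGL from already-projected coordinates is to hand the depth back as the clip-space `w`. This is a sketch of that general trick, not the code in the subpixel branch; `to_clip`, its parameters and the fallback for non-positive depth are assumptions. Screen-space X/Y are converted to normalized device coordinates and multiplied by `w`, so the hardware divide restores the same screen position while texture coordinates get interpolated with `1/w`.

```rust
/// Build a clip-space position from precise screen coordinates and a depth.
///
/// `x_px`/`y_px` are in pixels with the origin at the top left, `width` and
/// `height` are the size of the drawing area, `z` is the depth kept alongside
/// the vertex. All of these names are hypothetical.
fn to_clip(x_px: f32, y_px: f32, z: f32, width: f32, height: f32) -> [f32; 4] {
    // Map pixels to OpenGL normalized device coordinates in [-1, 1],
    // flipping Y because NDC has +Y pointing up.
    let ndc_x = 2.0 * x_px / width - 1.0;
    let ndc_y = 1.0 - 2.0 * y_px / height;

    // Without a usable depth, fall back to w = 1, i.e. affine mapping.
    let w = if z > 0.0 { z } else { 1.0 };

    // After the divide by w the vertex lands on (ndc_x, ndc_y) again, but
    // varyings are now interpolated perspective-correctly.
    [ndc_x * w, ndc_y * w, 0.0, w]
}
```

A vertex shader would pass this value straight to `gl_Position`. Primitives that never went through the GTE (2D sprites, for instance) would take the `w = 1` path and render exactly as before.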

ADormant commented 8 years ago

@simias @iCatButler This looks like a better implementation:

https://github.com/iCatButler/pcsxr
https://github.com/iCatButler/pcsxr/commit/b1f5a6ce4d7b9156910078300bfdf4ff0fd8ccf0
https://github.com/iCatButler/pcsxr/commit/7767ea4acbae995cd8e6302bdb7c97e89748dfd8
https://github.com/iCatButler/pcsxr/commit/e3df273095a5800e3dcdcb63bd66e269c0c2d3a8
https://github.com/iCatButler/pcsxr/commit/0c06f5ebc604f909096e97b06ba19c2df412e813

Not perfect but works better: https://drive.google.com/file/d/0Bz8IYcLfu84zQmd2ZHJlNkhsdjA/view?pref=2&pli=1

It's a version created by @iCatButler: http://ngemu.com/threads/peteopengl2tweak-tweaker-for-peteopengl2-plugin-w-gte-accuracy-hack.160319/page-42

simias commented 8 years ago

What is this exactly? @tapcio's code or yet another implementation? It looks similar to what we were trying to do here as far as I can tell at a glance from the code.

ADormant commented 8 years ago

@simias iCatButler implemented perspective-correct texture mapping. Trilinear and anisotropic filtering should be doable with it. https://github.com/iCatButler/pcsxr/commit/216c2ff3aefc9e0295ed9b1486935d65f6c13f55 https://github.com/iCatButler/PeteOpenGL2Tweak/commit/a32aba6d2cb13a4648760d18b8bd464e6cbf7587 https://github.com/iCatButler/pcsxr/commit/153c8eb4997d21d3b2965cf38d4348f05c29860f https://drive.google.com/file/d/0Bz8IYcLfu84zNVVBeVQ5VHk1R0E/view?pref=2&pli=1

ADormant commented 8 years ago

@simias Regarding iCat's implementation, PGXP: it's still not perfect, and an even more advanced implementation may be required. Getting the remaining vertex data will either mean much more widespread mirroring of CPU operations or some form of mesh reconstruction that will make a best guess at the exact 3D position from the low precision coordinates. CPU: https://github.com/iCatButler/pcsxr/commit/f70082329d751ee8a358437feb34134e283b27d8

http://ngemu.com/threads/peteopengl2tweak-tweaker-for-peteopengl2-plugin-w-gte-accuracy-hack.160319/page-47

ADormant commented 7 years ago

Dynarec for PGXP https://github.com/iCatButler/pcsxr/commit/36ef7277127e1010c296ce8792f74934c2ee9d2f