jpd002 / Play-

Play! - PlayStation2 Emulator
http://purei.org

sse version of psm4 #1031

Closed bigianb closed 3 years ago

bigianb commented 3 years ago

Speeds up Xenosaga II titles. Only SSE at the moment; ARM still needs to be done (the build will fail on ARM for now, and I'll hopefully have time to fix that this coming week).

Zer0xFF commented 3 years ago

The few games I've tested seem to have PSM4-related texture corruption: image

bigianb commented 3 years ago

> The few games I've tested seem to have PSM4-related texture corruption: image

Thanks. I've only looked at Xenosaga II, which seems OK. Which games do you see the issue with? I'll check whether any are in my collection. I'll also extend my auto tests a bit.

Zer0xFF commented 3 years ago

> Thanks. I've only looked at Xenosaga II, which seems OK. Which games do you see the issue with? I'll check whether any are in my collection. I'll also extend my auto tests a bit.

I've tested FFXII, Kingdom Hearts, Megaman X8 & Champions of Norrath (I imagine BGDA might have similar issues to CoN); each seems to have some issue. For CoN, the issue is most prominent in the memory card prompt screen and in the in-game pause menu: image

KH: image. It can be seen in-game in KH as well, but this is from the intro. This one is particularly interesting, since you can see that everything is shuffled around.

bigianb commented 3 years ago

Looks better now - getPixelAddress in the indexer returns the wrong value for the 4 bit specialisation. Rather than fix it, I used getColumnAddress instead as that's more efficient and is the address we need anyway.

Zer0xFF commented 3 years ago

That seems to do the trick, thanks.

I'll do some quick FPS tests later on and report back.

Zer0xFF commented 3 years ago

FPS report: comparing CI local build https://github.com/jpd002/Play-/commit/f77eebb67fe092787e74d33d4cf2f3a8b947860b vs a local build based on the same branch + this PR, both with the "RelWithDebInfo" config, running at 4x resolution.

I seem to be getting mixed FPS results: most in-game scenes show no change, while the pause menu (FFXII/MX8) seems faster. I also noticed a remaining graphical glitch in Champions of Norrath (screenshot below), though I couldn't see any issues in other games.

FFXII in game:

Scene 1: without/with PR: 85 fps

Scene 2: without/with PR: 66 fps

Scene 3 (Shop): without PR: 122 fps, with PR: 142 fps

Scene 3 (selection): without PR: 216 fps, with PR: 270 fps

Scene 1/2: image

Scene 3/4: image

Kingdom Hearts in game:

Scene 1 waterway: without/with PR: 255 fps

Scene 1 seaside: without/with PR: 110 fps

Scene 3 pause: without/with PR: 420 fps

Scene 1/2: image

Scene 3: image

MX8 in game:

Scene 1 pause menu: without PR: 130 fps, with PR: 140 fps

Scene 2: without/with PR: 90 fps

image

Champions of Norrath:

Scene 1 text: without/with PR: 74 fps

Scene 2 inventory: without/with PR: 208 fps

image image

bigianb commented 3 years ago

Couple of thoughts here:

  1. This should not cause a slowdown in any case. I wonder if the CI is built with different optimisations. Try running with both compiled locally with the same options. Just uncommenting line 31 will drop back to the non SSE version for comparison: https://github.com/jpd002/Play-/pull/1031/files#diff-986dd4ba0b1730f6b4e59121c257bf7897ded5daa9efea3c23b9e8e56f9cd5d2R31
  2. I would expect the effect to be small in most games. The big one is the Xenosaga II title screen, which seems to use PSM4 a lot.
  3. I'll do some more testing to try and snag that remaining issue.
  4. Code review of the actual SIMD logic would be appreciated :)

Zer0xFF commented 3 years ago

> I wonder if the CI is built with different optimisations.

Indeed, out of habit my local build uses the "RelWithDebInfo" config, which will of course add overhead. So I redid the "CI" test with "RelWithDebInfo" to compare like with like, and the FPS normalised: no perf lost, while we still have FPS gains in the MX8/FFXII pause/shop views.

> I would expect the effect to be small in most games.

I expected as much (besides its use in text textures?); that's why I was initially puzzled.

> Code review of the actual SIMD logic would be appreciated :)

Jean, I think he's talking to you runs

bigianb commented 3 years ago

> > Code review of the actual SIMD logic would be appreciated :)
>
> Jean, I think he's talking to you runs

image

It's not too scary. Look at this project https://github.com/bigianb/ps2-speedtests/tree/main/hostTests/HostTests/GSTransferTests which is just the swizzle extracted for easy testing and transforming. A good start would be to figure out why my PSM4 test passes when we can see there are times when it does not work. That's really odd.

bigianb commented 3 years ago

corruption on small textures should be fixed now.

jpd002 commented 3 years ago

Very nice work again! Tested Xenosaga 2 and I'm getting a very nice speed boost in OpenGL mode :)

As for the SIMD logic, all I can say is that it makes my head hurt 😄 Before you created your PR, I was trying to devise a sequence of instructions that would work for unswizzling PSMT4, but I didn't get to the end of that. From what I can see, my strategy looks similar to yours, but I'll try to do some more work on my end to see if I get the same result as you.

I just looked at the disassembly real quick and I saw that the loads were not aligned. Don't think it makes a huge difference, but I think we could use aligned loads for this.

00007FF79A7B4E30  movdqu      xmm0,xmmword ptr [rsi]  
00007FF79A7B4E34  lea         rax,[rsp+60h]  
00007FF79A7B4E39  movdqu      xmm1,xmmword ptr [rsi+10h]  
00007FF79A7B4E3E  lea         r9,[rsp+90h]  
00007FF79A7B4E46  movdqu      xmm4,xmmword ptr [rsi+20h]  
00007FF79A7B4E4B  movdqa      xmm8,xmm11  
00007FF79A7B4E50  movdqu      xmm7,xmmword ptr [rsi+30h]  
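
For reference, the difference between the two load forms can be sketched with SSE2 intrinsics. This is only an illustration under stated assumptions, not code from the PR: `IsAligned16` and `Copy16` are hypothetical names, and the non-SSE2 branch falls back to `memcpy` so the snippet builds on any target. `_mm_load_si128` requires a 16-byte-aligned address, while `_mm_loadu_si128` (the `movdqu` form seen above) accepts any address.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

// Hypothetical helper: true when p sits on a 16-byte boundary.
inline bool IsAligned16(const void* p)
{
	return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Hypothetical helper: copies 16 bytes from src to dst, picking the
// aligned load only when the source address allows it.
inline void Copy16(const std::uint8_t* src, std::uint8_t* dst)
{
	assert(src != nullptr && dst != nullptr);
#ifdef SKETCH_HAS_SSE2
	__m128i v = IsAligned16(src)
	    ? _mm_load_si128(reinterpret_cast<const __m128i*>(src))   // needs 16-byte alignment
	    : _mm_loadu_si128(reinterpret_cast<const __m128i*>(src)); // any alignment
	_mm_storeu_si128(reinterpret_cast<__m128i*>(dst), v);
#else
	std::memcpy(dst, src, 16); // portable fallback for non-SSE2 targets
#endif
}
```

A real kernel would probably not branch per load; it would guarantee the alignment of the source buffer up front (for example with `alignas(16)`) and use the aligned form throughout.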

bigianb commented 3 years ago

The neon implementation hurt my brain but got there in the end. Would be interesting to see it on a device (tests pass on my PI but not tested any games). The perf test itself shows a big speedup on the pi though ... something like 50 quicker - so should have an impact on the parts of games that use 4 bit textures.

bigianb commented 3 years ago

Fixed up the formatting issues with trailing whitespace. This should be ready to go (it should be tested on a real device).

jpd002 commented 3 years ago

Tested FF12 on my iPhone, it's working well 😃 I'm getting improvements of a few FPS (like 3 or 4) in menus.

About the Linux failure, I guess it doesn't like the use of pshufb since it's an SSSE3 instruction and we're targeting SSE2. The minimum requirement for the emulator is SSE2. Is there a way to use the intrinsics even if we don't specify the -mssse3 in the build flags?

bigianb commented 3 years ago

> About the Linux failure, I guess it doesn't like the use of pshufb since it's an SSSE3 instruction and we're targeting SSE2. The minimum requirement for the emulator is SSE2. Is there a way to use the intrinsics even if we don't specify the -mssse3 in the build flags?

Yes, that would do it. It would be difficult to do it in pure SSE2, I think. Given that SSSE3 was introduced almost 15 years ago, are there practically any chips out there that do not have it and can still run Play?
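
On the question of using the intrinsics without a global `-mssse3`: GCC and Clang also accept a per-function `target` attribute, which can be paired with a runtime CPU check. The sketch below only illustrates that option and is not what this PR does (the thread goes on to add the flag instead). `Lookup16`, `Lookup16_Scalar` and `Lookup16_Ssse3` are hypothetical names, and the operation shown (a 16-entry byte table lookup, which is what `pshufb` performs) is just a stand-in for the real unswizzle.

```cpp
#include <cassert>
#include <cstdint>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <tmmintrin.h>
#define SKETCH_HAS_SSSE3_PATH 1

// Only this function is compiled with SSSE3 enabled; the rest of the
// translation unit keeps the baseline instruction set (e.g. SSE2).
__attribute__((target("ssse3")))
static void Lookup16_Ssse3(const std::uint8_t* table, const std::uint8_t* idx, std::uint8_t* out)
{
	__m128i t = _mm_loadu_si128(reinterpret_cast<const __m128i*>(table));
	__m128i i = _mm_loadu_si128(reinterpret_cast<const __m128i*>(idx));
	i = _mm_and_si128(i, _mm_set1_epi8(0x0F)); // keep the low nibble, like the scalar path
	_mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_shuffle_epi8(t, i)); // pshufb
}
#endif

// Scalar reference: out[n] = table[low nibble of idx[n]].
static void Lookup16_Scalar(const std::uint8_t* table, const std::uint8_t* idx, std::uint8_t* out)
{
	for(int n = 0; n < 16; n++)
	{
		out[n] = table[idx[n] & 0x0F];
	}
}

// Dispatcher: uses the SSSE3 path only when the running CPU reports support.
static void Lookup16(const std::uint8_t* table, const std::uint8_t* idx, std::uint8_t* out)
{
	assert(table != nullptr && idx != nullptr && out != nullptr);
#ifdef SKETCH_HAS_SSSE3_PATH
	if(__builtin_cpu_supports("ssse3"))
	{
		Lookup16_Ssse3(table, idx, out);
		return;
	}
#endif
	Lookup16_Scalar(table, idx, out);
}
```

The trade-off is an extra branch and a second code path to maintain, which is presumably why simply raising the baseline is attractive if SSSE3 can be assumed everywhere.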

jpd002 commented 3 years ago

Fair enough, let's add the -mssse3 flag here: https://github.com/jpd002/Play-/blob/master/Source/CMakeLists.txt#L38.

rcaridade145 commented 3 years ago

https://walbourn.github.io/directxmath-sse-sse2-and-arm-neon/

I would prefer to target SSE4.1.

Edit: From my tests, gcc can vectorize most of the same code with both flags. However, there are some differences. SSSE3 / SSE4.1

jpd002 commented 3 years ago

Oh, sorry, I misled you with my tip. The -mssse3 flag needs to be set here: https://github.com/jpd002/Play-/blob/master/Source/gs/GSH_OpenGL/CMakeLists.txt#L20. TARGET_PLATFORM_UNIX includes all archs, so the condition needs special care to exclude ARM and ARM64.

jpd002 commented 3 years ago

Thanks a lot for this!

Just to be sure, I've tested Xenosaga 2 on Android (nVIDIA Shield) and I'm getting a 5fps improvement in the title screen, everything looking good.

I can't wrap my head around the SIMD logic (I'm having a hard time focusing), so, I can't give you feedback about it, sorry. I'm still keeping a note in my task list about the potential gains of using SIMD for texture swizzling/unswizzling so I can take a look at it later. I think the gains are also heavily dependent on whether the game is GS bound or not since all that logic happens on the GS thread.

Maybe another opportunity for optimization would be to parallelize the swizzling/unswizzling on multiple threads, but I have no idea if it would be viable or not (ex.: split processing of pages per CPU core instead of doing everything in a single CPU core).
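
As a rough illustration of the per-core split floated above, the following sketch divides a range of page indices into contiguous chunks and runs each chunk on its own thread. Everything here is an assumption for illustration: `ForEachPageParallel` and the `processPage` callback are hypothetical names, pages are assumed to be fully independent, and whether this pays off inside the emulator is untested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical driver: calls processPage(page) for every page in
// [0, pageCount), split into contiguous ranges across worker threads.
// Pages are assumed independent, so no locking is used.
inline void ForEachPageParallel(std::size_t pageCount, unsigned workerCount,
                                const std::function<void(std::size_t)>& processPage)
{
	assert(workerCount > 0);
	// Never use more workers than there are pages (but keep at least one).
	workerCount = static_cast<unsigned>(
	    std::min<std::size_t>(workerCount, std::max<std::size_t>(pageCount, 1)));
	const std::size_t chunk = (pageCount + workerCount - 1) / workerCount; // ceiling division

	std::vector<std::thread> workers;
	for(unsigned w = 0; w < workerCount; w++)
	{
		const std::size_t begin = std::min<std::size_t>(pageCount, w * chunk);
		const std::size_t end = std::min<std::size_t>(pageCount, begin + chunk);
		if(begin == end) break; // nothing left to hand out
		workers.emplace_back([begin, end, &processPage]() {
			for(std::size_t page = begin; page < end; page++)
			{
				processPage(page);
			}
		});
	}
	for(auto& worker : workers)
	{
		worker.join();
	}
}
```

This version spawns threads on every call, which is likely too costly per transfer; a real implementation would probably keep persistent workers and hand them ranges instead.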

bigianb commented 3 years ago

For multi-threading it would be easy to do each column on a separate thread as they are fully independent. You would need some dispatch mechanism to avoid the thread creation overhead (or use a lightweight thread). There are 2 big overheads I see now:

  1. the swizzle when transferring into the GS. This is harder because it is not aligned to a column ... so we would need to do a central aligned chunk the quick way and use the current method for the unaligned edges.
  2. The VIF and GIF FIFOs are very inefficient. we could use neon instructions to extract bit ranges very quickly and avoid the memcopy operations.
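
The split described in point 1 (a central aligned chunk handled the quick way, with the unaligned edges left to the current method) comes down to simple range arithmetic. A minimal one-dimensional sketch follows; `SpanSplit` and `SplitAligned` are hypothetical names, and `unit` stands for whatever alignment the fast path needs (for example a column width in pixels), which is an assumption rather than a value taken from the code base.

```cpp
#include <cassert>
#include <cstdint>

// Half-open ranges [begin, end) for each part of the transfer.
struct SpanSplit
{
	std::uint32_t leftBegin, leftEnd;     // unaligned leading edge (slow path)
	std::uint32_t centerBegin, centerEnd; // whole aligned units only (fast path)
	std::uint32_t rightBegin, rightEnd;   // unaligned trailing edge (slow path)
};

// Hypothetical helper: splits [begin, end) around multiples of unit.
inline SpanSplit SplitAligned(std::uint32_t begin, std::uint32_t end, std::uint32_t unit)
{
	assert(unit > 0 && begin <= end);
	const std::uint32_t alignedBegin = ((begin + unit - 1) / unit) * unit; // round up
	const std::uint32_t alignedEnd = (end / unit) * unit;                  // round down
	if(alignedBegin >= alignedEnd)
	{
		// No whole unit fits: the entire range goes through the slow path.
		return {begin, end, end, end, end, end};
	}
	return {begin, alignedBegin, alignedBegin, alignedEnd, alignedEnd, end};
}
```

For a real transfer the same split would be applied on both axes, so the fast path only ever sees complete units.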

On Fri, 12 Mar 2021 at 16:16, Jean-Philip Desjardins < @.***> wrote:

> Thanks a lot for this! [...]