Closed bigianb closed 3 years ago
the few games i've tested seem to have PSM4 related texture corruption
Thanks. I've only looked at xenosaga II which seems ok. What games do you see the issue with and I'll look to see if any are in my collection. I'll also extend my auto tests a bit.
I've tested FFXII, Kingdom Hearts, Megaman X8 & Champions of Norrath (I imagine BGDA might have similar issues to CoN); each seems to have some issue. For CoN the issue is most prominent in the memory card prompt screen and in the in-game pause menu.
KH: it can be seen in-game in KH as well, but this is from the intro. This one is particularly interesting, since you can see that everything is shuffled around.
Looks better now - getPixelAddress in the indexer returns the wrong value for the 4 bit specialisation. Rather than fix it, I used getColumnAddress instead as that's more efficient and is the address we need anyway.
That seems to do the trick, thanks.
I'll do some quick FPS tests later on and report back.
FPS report:
Comparing a local build of the CI commit https://github.com/jpd002/Play-/commit/f77eebb67fe092787e74d33d4cf2f3a8b947860b against a local build based on the same branch + this PR, both with the "RelWithDebInfo" config, running at 4x resolution.
I'm getting mixed FPS results: most in-game scenes show no change, while the pause menus (FFXII/MX8) seem faster.
I also noticed a remaining graphical glitch in Champions of Norrath (screenshot below), though I couldn't see any other issue in other games.
FFXII in game:
Scene 1: without/with PR: 85 fps
Scene 2: without/with PR: 66 fps
Scene 3 (shop): without PR: 122 fps, with PR: 142 fps
Scene 4 (selection): without PR: 216 fps, with PR: 270 fps
Kingdom Hearts in game:
Scene 1 (waterway): without/with PR: 255 fps
Scene 2 (seaside): without/with PR: 110 fps
Scene 3 (pause): without/with PR: 420 fps
MX8 in game:
Scene 1 (pause menu): without PR: 130 fps, with PR: 140 fps
Scene 2: without/with PR: 90 fps
Champions of Norrath:
Scene 1 (text): without/with PR: 74 fps
Scene 2 (inventory): without/with PR: 208 fps
Couple of thoughts here:
I wonder if the CI is built with different optimisations.
Indeed, out of habit my local build uses the "RelWithDebInfo" config, which will of course add overhead. So I redid the "CI" test with "RelWithDebInfo" to compare like with like, and the FPS normalised: no perf lost, while we still have FPS gains in the MX8/FFXII pause/shop views.
I would expect the effect to be small in most games.
I expected as much (besides use in text textures?), that's why I was initially puzzled.
Code review of the actual SIMD logic would be appreciated :)
Jean, I think he's talking to you runs
It's not too scary. Look at this project https://github.com/bigianb/ps2-speedtests/tree/main/hostTests/HostTests/GSTransferTests which is just the swizzle extracted for easy testing and transforming. A good start would be to figure out why my PSM4 test passes when we can see there are times when it does not work. That's really odd.
corruption on small textures should be fixed now.
Very nice work again! Tested Xenosaga 2 and I'm getting a very nice speed boost in OpenGL mode :)
As for the SIMD logic, all I can say is that it makes my head hurt 😄 Before you created your PR, I was trying to devise a sequence of instructions that would work for unswizzling PSMT4, but I didn't get to the end of that. From what I can see, my strategy looks similar to yours, but I'll try to do some more work on my end to see if I get the same result as you.
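For anyone following along without the GS format in their head: a PSMT4 texture packs two 4-bit palette indices into each byte. Leaving the block/column swizzle out entirely, the expansion to one index per byte can be sketched in scalar form, which is also handy as a reference to compare a SIMD version against. This is only a sketch: `expandNibbles` is a hypothetical name, not a function from the PR, and the "low nibble is the first pixel" ordering is an assumption stated in the code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference: expands packed 4-bit indices to one byte each.
// Assumption: the low nibble of each byte holds the first of the two pixels.
// The GS block/column swizzle is deliberately NOT handled here.
void expandNibbles(const uint8_t* packed, uint8_t* out, std::size_t packedCount)
{
    assert(packed != nullptr && out != nullptr);
    for (std::size_t i = 0; i < packedCount; i++)
    {
        out[2 * i + 0] = packed[i] & 0x0F; // first pixel: low nibble
        out[2 * i + 1] = packed[i] >> 4;   // second pixel: high nibble
    }
}
```

A vectorised routine can be checked byte-for-byte against the output of a scalar loop like this on random input.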
I just looked at the disassembly real quick and I saw that the loads were not aligned. Don't think it makes a huge difference, but I think we could use aligned loads for this.
00007FF79A7B4E30 movdqu xmm0,xmmword ptr [rsi]
00007FF79A7B4E34 lea rax,[rsp+60h]
00007FF79A7B4E39 movdqu xmm1,xmmword ptr [rsi+10h]
00007FF79A7B4E3E lea r9,[rsp+90h]
00007FF79A7B4E46 movdqu xmm4,xmmword ptr [rsi+20h]
00007FF79A7B4E4B movdqa xmm8,xmm11
00007FF79A7B4E50 movdqu xmm7,xmmword ptr [rsi+30h]
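For reference, the `movdqu` instructions above are what an unaligned-load intrinsic typically compiles to; the aligned form usually becomes `movdqa` and requires a 16-byte aligned address. A minimal sketch of the two forms follows. `copyBlock64` and `isAligned16` are hypothetical names for illustration, and whether the source pointer in the PR is always 16-byte aligned is an assumption that would need checking.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Hypothetical helper: true if p is 16-byte aligned.
bool isAligned16(const void* p)
{
    return (reinterpret_cast<std::uintptr_t>(p) & 15) == 0;
}

// Hypothetical helper: copies 64 bytes using four 128-bit loads.
void copyBlock64(const uint8_t* src, uint8_t* dst)
{
    assert(src != nullptr && dst != nullptr);
#ifdef SKETCH_HAVE_SSE2
    const __m128i* in = reinterpret_cast<const __m128i*>(src);
    __m128i* out = reinterpret_cast<__m128i*>(dst);
    const bool aligned = isAligned16(src);
    for (int i = 0; i < 4; i++)
    {
        // Aligned load faults on a misaligned address, so it is only used
        // when the address has been checked; otherwise use the unaligned form.
        __m128i v = aligned ? _mm_load_si128(in + i) : _mm_loadu_si128(in + i);
        _mm_storeu_si128(out + i, v);
    }
#else
    std::memcpy(dst, src, 64); // non-x86 fallback so the sketch still builds
#endif
}
```

In real code the alignment would be guaranteed by how the buffer is allocated rather than tested per call; the runtime check is only there to keep the sketch safe.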
The NEON implementation hurt my brain but I got there in the end. It would be interesting to see it on a device (tests pass on my Pi but I have not tested any games). The perf test itself shows a big speedup on the Pi though ... something like 50 quicker - so it should have an impact on the parts of games that use 4 bit textures.
Fixed up the formatting issues with trailing whitespace. This should be ready to go (it should be tested on a real device).
Tested FF12 on my iPhone, it's working good :smiley: I'm getting improvements of a few FPS (like 3 or 4) in menus.
About the Linux failure, I guess it doesn't like the use of pshufb since it's an SSSE3 instruction and we're targeting SSE2. The minimum requirement for the emulator is SSE2. Is there a way to use the intrinsics even if we don't specify the -mssse3 in the build flags?
Yes, that would do it. It would be difficult to do it in pure SSE2 I think. Given that SSSE3 was introduced almost 15 years ago, are there practically any chips out there that do not have it which can run Play?
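On the question of using the intrinsics without `-mssse3`: with GCC and Clang, one option is the `target` function attribute, which enables an instruction set for a single function, combined with a runtime check via `__builtin_cpu_supports`. This is only a sketch of that general technique, not what the PR does. `reverse16` and its helpers are hypothetical names, and the byte reversal is just a stand-in for a real `pshufb` shuffle mask.

```cpp
#include <cassert>
#include <cstdint>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <tmmintrin.h> // SSSE3 intrinsics (_mm_shuffle_epi8)
#define SKETCH_HAS_TARGET_ATTR 1
#endif

// Scalar fallback (hypothetical): out[i] = in[15 - i].
static void reverse16Scalar(const uint8_t* in, uint8_t* out)
{
    for (int i = 0; i < 16; i++) out[i] = in[15 - i];
}

#ifdef SKETCH_HAS_TARGET_ATTR
// SSSE3 is enabled for this function only, so the rest of the
// translation unit can still be compiled for plain SSE2.
__attribute__((target("ssse3")))
static void reverse16Ssse3(const uint8_t* in, uint8_t* out)
{
    const __m128i mask = _mm_setr_epi8(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_shuffle_epi8(v, mask));
}
#endif

// Dispatches at runtime: pshufb path if the CPU has SSSE3, scalar otherwise.
void reverse16(const uint8_t* in, uint8_t* out)
{
    assert(in != nullptr && out != nullptr);
#ifdef SKETCH_HAS_TARGET_ATTR
    if (__builtin_cpu_supports("ssse3"))
    {
        reverse16Ssse3(in, out);
        return;
    }
#endif
    reverse16Scalar(in, out);
}
```

The trade-off is that a fallback path has to exist and be maintained, which is the hard part here if a pure SSE2 version is difficult to write.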
Fair enough, let's add the -mssse3 flag here: https://github.com/jpd002/Play-/blob/master/Source/CMakeLists.txt#L38.
https://walbourn.github.io/directxmath-sse-sse2-and-arm-neon/
I would prefer to target SSE4.1.
Edit: From my tests, gcc can vectorize most of the same code with both flags. However, there are some differences (SSSE3 vs. SSE4.1).
Oh, sorry, I misled you with my tip. The -mssse3 flag needs to be set in here: https://github.com/jpd002/Play-/blob/master/Source/gs/GSH_OpenGL/CMakeLists.txt#L20. TARGET_PLATFORM_UNIX includes all archs, so the condition needs special care to exclude ARM and ARM64.
Thanks a lot for this!
Just to be sure, I've tested Xenosaga 2 on Android (nVIDIA Shield) and I'm getting a 5fps improvement in the title screen, everything looking good.
I can't wrap my head around the SIMD logic (I'm having a hard time focusing), so, I can't give you feedback about it, sorry. I'm still keeping a note in my task list about the potential gains of using SIMD for texture swizzling/unswizzling so I can take a look at it later. I think the gains are also heavily dependent on whether the game is GS bound or not since all that logic happens on the GS thread.
Maybe another opportunity for optimization would be to parallelize the swizzling/unswizzling on multiple threads, but I have no idea if it would be viable or not (ex.: split processing of pages per CPU core instead of doing everything in a single CPU core).
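To make the idea concrete, splitting independent work items (pages, in the suggestion above) across cores could look roughly like the sketch below. Everything here is hypothetical: `parallelFor` is not an existing function in the code base, and it spawns threads on every call purely for brevity, where a persistent worker pool would avoid the thread-creation cost.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: calls fn(i) for every i in [0, itemCount), with the
// items interleaved across workerCount threads. Items must be independent,
// since each index is handled by exactly one thread with no synchronisation.
void parallelFor(std::size_t itemCount, unsigned workerCount,
                 const std::function<void(std::size_t)>& fn)
{
    assert(fn);
    workerCount = std::max(1u, workerCount);
    std::vector<std::thread> workers;
    workers.reserve(workerCount);
    for (unsigned w = 0; w < workerCount; w++)
    {
        workers.emplace_back([=, &fn]() {
            // Worker w handles items w, w + workerCount, w + 2 * workerCount, ...
            for (std::size_t i = w; i < itemCount; i += workerCount) fn(i);
        });
    }
    for (auto& t : workers) t.join(); // wait until every item is done
}
```

A caller would pass the number of pages as `itemCount` and a lambda that processes page `i`; whether this pays off depends on how large the transfers are relative to the dispatch overhead.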
For multi-threading it would be easy to do each column on a separate thread as they are fully independent. You would need some dispatch mechanism to avoid the thread creation overhead (or use a lightweight thread). There are 2 big overheads I see now:
speeds up xenosaga II titles. Only SSE at the moment - arm still needs to be done (build will fail on arm for the moment - I'll hopefully have time to fix that up this coming week).