icculus / DirkSimple

A dirt-simple Dragon's Lair player
zlib License
41 stars 7 forks source link

NEON-based YUV->RGB conversion #23

Open icculus opened 1 year ago

icculus commented 1 year ago

This could really benefit ARM-based things like the Raspberry Pi, which might not have shaders in the libretro core to accelerate the YUV->RGB conversion, but have NEON, so it can lessen the CPU cost.

This is less important to the immediate concerns on Intel processors, but SSE could be nice, too.

icculus commented 1 year ago

Okay, status update:

I had written some NEON code for this for the Nintendo Switch before, which we shipped in Star Wars Republic Commando. It made video playback go from unusable to full speed.

BUT: it required aarch64 instructions, which we can't rely on for the Raspberry Pi, since the things are usually running in arm32 mode! Also, I got the NEON version running on an M1 Mac, and it didn't make one bit of difference to the speed, so there really wasn't any platform where this was worth keeping.

So I spent some time with the existing C code, and optimized a bunch of inefficiencies out of it.

This is still a CPU killer, but it's significantly faster now. On x86-64, it's probably 30-40% faster now.

On the Raspberry Pi, it's not such a dramatic improvement (but maybe my test is eating too much memory, so there are other bottlenecks?), but I think it's maybe dropped from 14 milliseconds per frame to 12. It might be juuuuust enough to keep this ahead of the game enough to keep audio flowing.

It may still be worth seeing if there's a NEON approach that works with this now that the normal C code has changed so much, too.

@vanfanel, if you want to try the latest in revision control again, I'd be curious if this helps (but it might not).

vanfanel commented 1 year ago

I have just tried latest GIT code just now, and the same audio dropouts in the Libretro core are still present. However, please incorporate 64bit NEON code: 64bit is the standard now in Aarch64, included the Pi3 and Pi4, the 64bit Raspberry Pi OS is not a BETA anymore, it's considered stable. There are many orther aarch64 SBCs, and Debian and derivatives run on that by now.

Don't look back at 32bit armhf as the standard, that's legacy for Pi1 and Pi2 now which won't run the core properly anyway.

icculus commented 1 year ago

I'm working on rewriting the NEON code; I've got something faster than the previous attempt that I'm still debugging. Stay tuned!

vanfanel commented 1 year ago

Woho! Thanks for the update! I for one will stay veeeeery tuned for this :dancers:

icculus commented 1 year ago

Okay, I pushed it. This is a good improvement over the existing code, and in aarch64 mode it'll likely be better since it won't run out of NEON registers.

This was tested on a Pi4 in 32-bit mode. I had to add -mfpu=neon to the compiler command line, I'm not sure if this will be necessary for aarch64. It worked on aarch64 macOS, too.

vanfanel commented 1 year ago

I just rebuilt from latest sources. These are the flags I use (I am on Aarch64):


cmake .. -DCMAKE_BUILD_TYPE=Release -DDIRKSIMPLE_SDL_DEFAULT=OFF \
-DCMAKE_CXX_FLAGS="-march=native -mtune=native" \
-DCMAKE_C_FLAGS="-march=native -mtune=native"

However, I see no difference with this version: I can still hear sound drop-outs during gameplay, very noticeable in the begining of the "Dirk resurrection" scene, for example.

icculus commented 1 year ago

Okay, we're going to move on to OpenGL then. 🤷‍♀️

vanfanel commented 1 year ago

I suspect OpenGL won't change much and the same dropouts will be present.

Just an idea: wouldn't it be possible to have the game logic converted to C? Really, I am sure it's LUA causing those audio dropouts on low-end ARM.

icculus commented 1 year ago

I really don't think Lua is the culprit here, but in case I'm wrong, let's test it: here's a patch that makes it only run the Lua logic once every 10 frames:

diff --git a/dirksimple.c b/dirksimple.c
index ebe9a66..412845b 100644
--- a/dirksimple.c
+++ b/dirksimple.c
@@ -1587,6 +1587,18 @@ static void push_inputs_table(lua_State *L, const uint64_t curbits)

 static void call_lua_tick(lua_State *L, uint64_t ticks, uint64_t clipstartticks, uint64_t inputbits)
 {
+    static int skipped_calls = 1000;
+    static uint64_t pendinginputbits = 0;
+    if (skipped_calls >= 10) {
+        skipped_calls = 0;
+        inputbits |= pendinginputbits;
+        pendinginputbits = 0;
+    } else {
+        skipped_calls++;
+        pendinginputbits |= inputbits;
+        return;
+    }
+

In theory, this will be a 90% reduction on Lua overhead. This isn't a solution, as it often won't accept correct moves in time, so expect to die a lot, but the question is whether the game stops dropping audio in this case.

icculus commented 1 year ago

Also, one more sanity check: this should cause the build to fail if NOT compiled with NEON support, just to make sure that definitely made it into the build:

diff --git a/thirdparty/theoraplay/theoraplay.c b/thirdparty/theoraplay/theoraplay.c
index a1e40f5..4eba68d 100644
--- a/thirdparty/theoraplay/theoraplay.c
+++ b/thirdparty/theoraplay/theoraplay.c
@@ -35,6 +35,8 @@
 #ifdef __ARM_NEON__
 #include <arm_neon.h>
 #define THEORAPLAY_HAVE_NEON_INTRINSICS 1
+#else
+#error We are definitely NOT building with NEON support.
 #endif

 #ifndef THEORAPLAY_ONLY_SINGLE_THREADED
vanfanel commented 1 year ago

First, NEON code wasn't building. No idea why __ARM_NEON__ isn't defined on my system (Raspberry Pi OS aarch64, updated to the latest kernel and packages). I had to enable NEON code building by commenting out #ifdef __ARM_NEON__ and it's corresponding #endif, and then it built. I can see the CPU usage difference, but the audio dropouts are still there.

Then, disabling most of the LUA calls in call_lua_tick using your patch didn't make any difference in the audio dropouts either, so it seems you are probably right about it, and it's not the culprit. Sorry to insist on that route!

Since these dropouts seem to be related to the beginning of video clips (there's a long drop when a new video scene plays, so audio starts very late) maybe it's the RetroArch audio system being disabled and enabled between the clips? That would cause the audio backend to stop receiving audio for a moment, which always results in a long dropout in RetroArch (like when you toggle the menu on and off, for example). Maybe that could be avoided somehow?

icculus commented 1 year ago

(Yeah, everything has different defines to indicate NEON support. It's infuriating. I'll find a better test for that.)

Hmm...we don't explicitly disable the audio system, but we don't feed it anything until there's audio to feed it...maybe we should feed it silence in these cases. I'll try that.