LUA is too slow for Pi4

vanfanel commented 1 year ago

Hi there, @icculus

I was amazed to see this core today, so I went and built it for testing on the Pi4 on RetroArch. Thing is, video decompression doesn't take much CPU (TOP reports ~30% among all CPU cores), but sections where LUA code kicks in take up as much as 80% of the CPU, causing massive sound drop-outs.

So, since this is still a very young project, would there be any possibility of using lua-jit or another technology that doesn't waste as much CPU as LUA normally does, so it's usable on mid-range aarch64 machines like the Pi4?

Thanks!

icculus commented 1 year ago

Hmm, this worked on a Pi4 last time I tried it, a month or so ago. I'll have to check again, in case I broke something.

In the meantime: are you sure it built with optimizations? It can make a huge difference on the Pi.

Also, I'm reworking the audio code right now, which will buffer a little better, which might solve the problem too. That work is almost done!

icculus commented 1 year ago

Just realized I tried the standalone SDL app on the Pi4 and not the libretro core.

I do think my audio changes will fix the problem, but I'll verify either way.

vanfanel commented 1 year ago

I will be testing the new audio code as soon as it's available, too! But from what I see here, CPU usage spikes come from LUA code, not from audio/video. As long as I leave the attract mode running, there are no CPU usage spikes; they only appear during gameplay, due to LUA code being run. LUA is very slow on ARM in general: the LUA machine is veeeery slow.

icculus commented 1 year ago

I have not tried this on a Pi4 yet, but if you're bored and want to see if it has improved, go ahead and pull the latest.

Don't forget to replace the contents of the RetroArch/system/DirkSimple directory in your install with the latest in revision control, as the Lua code has changed and .wav files were added so Dragon's Lair can make beeps and buzzes.

vanfanel commented 1 year ago

I have been trying this on the Pi4 this morning. I still get those audio drop-outs when LUA code kicks in: during active gameplay, or when Dirk "rises from the dead" after he gets killed, etc. Same as with the previous code, I can see in TOP how CPU usage goes from 30% during the attract sequence to 80% when LUA code/game logic runs.

Too bad since this looks fantastic on the Pi4 on KMS/DRM and on Wayland, with basic shaders and all.

icculus commented 1 year ago

Ok, I'll try it over here. Are you using stock RetroArch, or is this a RetroPie install?

vanfanel commented 1 year ago

I use the latest RetroArch code, built against the latest Wayland/wlroots using the labwc compositor (also Weston), against the latest stable Mesa. I also run RetroArch on KMS/DRM. Always the ALSA audio backend. And everything on aarch64.

icculus commented 1 year ago

(Sorry, this is a giant pile of text to read.)

Okay, just as a sanity check, I ran this on a Pi4 running RetroPie (because it's what I had available). I simply quit EmulationStation and ran the core from the command line:

/opt/retropie/emulators/retroarch/bin/retroarch -L ./dirksimple_libretro.so ./lair.ogv

...and it played at full speed without audio dropouts. So something else is going on, but I don't know what yet.

I think the assertion that it's Lua causing the problem is a red herring. The Pi4 should be powerful enough to handle Lua in general, and we don't run very much Lua code per-frame (maybe 50 lines of code?)...and notably, the attract sequence runs the exact same Lua code as the rest of the game (attract mode just looks like a level where the only correct input is the start button). It wouldn't surprise me if Lua is the majority of our CPU time per-frame, because we don't do a whole lot per-frame on the main thread.

There are, in the libretro core, other places that would suffer from high CPU usage, though.

Notably, it has to convert the YUV video data from the .ogv file into RGB format, which is very CPU intensive. The SDL app gets around this by using an OpenGL shader to do the work at render time behind the scenes, but right now the libretro core is just being fed a software-rendered buffer of pixels. A debug build might not be able to do it in real time. The conversion happens in a background thread, so it won't show up in a profile, but audio can't be sent to the core for processing until that thread gets through the video frames that are in the way of more audio frames being decoded...which might explain the audio issues.
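To give a rough idea of why this step is expensive when done in software, here is a minimal sketch of a scalar YUV-to-RGB conversion. This is not DirkSimple's actual converter: the function names are hypothetical, and it assumes planar 4:2:0 input with BT.601 limited-range coefficients. Every output pixel costs several multiplies plus clamping, and a 480p video at roughly 30fps means millions of these per second.

```c
#include <stdint.h>

/* Clamp an intermediate value into the 0..255 range of one 8-bit channel. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t) (v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Convert one pixel using fixed-point math (coefficients scaled by 256).
   Assumes BT.601 limited-range input; hypothetical helper for illustration. */
static void yuv_to_rgb_pixel(uint8_t y, uint8_t u, uint8_t v,
                             uint8_t *r, uint8_t *g, uint8_t *b)
{
    const int c = (int) y - 16;
    const int d = (int) u - 128;
    const int e = (int) v - 128;
    *r = clamp_u8((298 * c + 409 * e + 128) / 256);
    *g = clamp_u8((298 * c - 100 * d - 208 * e + 128) / 256);
    *b = clamp_u8((298 * c + 516 * d + 128) / 256);
}

/* Convert a whole planar 4:2:0 frame to packed RGB24.
   Each chroma sample is shared by a 2x2 block of luma samples. */
static void yuv420_to_rgb24(const uint8_t *yplane, const uint8_t *uplane,
                            const uint8_t *vplane, int w, int h, uint8_t *rgb)
{
    const int chroma_w = (w + 1) / 2;
    for (int row = 0; row < h; row++) {
        for (int col = 0; col < w; col++) {
            const int ci = (row / 2) * chroma_w + (col / 2);
            uint8_t *px = rgb + ((row * w) + col) * 3;
            yuv_to_rgb_pixel(yplane[(row * w) + col], uplane[ci], vplane[ci],
                             &px[0], &px[1], &px[2]);
        }
    }
}
```

A shader does the same per-pixel arithmetic, but on the GPU and in parallel, which is why the SDL app doesn't pay this cost on the CPU.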

It might be worth trying the SDL app on this hardware, and seeing if it works better than the libretro core here, just as a data point. Improving this in the libretro core, to optionally use OpenGL so I can move this to the GPU, is on my wishlist.

Also, I'm working from a 480p resolution Dragon's Lair video; there are HD (even 4K!) upscales people have produced on the internet (and I think Digital Leisure shipped a 1080p Blu-Ray at some point?)...if you happen to be using these, they take a lot more CPU to decode, which compounds the cost of the YUV/RGB conversion that the libretro core is doing on the CPU. If you want to make sure we're working from the same video file just to rule this out, drop me an email and I'll get you what you need to test.

vanfanel commented 1 year ago

Hi again, Ryan! Don't worry about the wall of text, that was a pleasure to read, thanks for taking the time to test this, really :)

Some "special" configurations I use in RetroArch that may explain the differences between what you see there and what I see here with regards to CPU usage and audio dropouts:

- I use 2 buffers in Settings->Video. RetroArch uses 3 buffers by default, which adds another frame of input lag. Tried 3 buffers but I don't see much difference in the dropouts.
- I use a basic shader with Dirksimple. Any simple shader (crt-pi, zfast...) will do, but I use Fakelottes. After all, that's one of the motivations for a libretro core, isn't it? :) Tried with no shaders, but well, it didn't make a difference either.

I also tried the SDL2 version, and there are no dropouts there, you are right. Everything LUA I have tried in RetroArch has massive CPU usage spikes, audio dropouts, etc. It's not only LUA's fault, though: the proverbial Pi memory bandwidth limitations also kick in when LUA is combined with RetroArch. In SDL2, it seems that hardware video conversion does a good job and alleviates part of the problem, hence the dropouts do not appear.

I am using a 480p video source, so that can be ruled out; I am sure we are already using a similar source.

Making the core optionally use OpenGL will probably be the best move possible for platforms like the Pi! That should give it the same performance as the standalone SDL2 program has. The Pi4 also has good Vulkan support, and in fact that's what I use as my daily driver instead of OpenGL, because it's noticeably less laggy. So, if you are moving the core to OpenGL, please consider Vulkan instead.

icculus commented 1 year ago

Next thing to try, just as a guess:

Let's buffer more audio per-frame and see if it helps:

diff --git a/dirksimple_libretro.c b/dirksimple_libretro.c
index d0e8547..8e62c2c 100644
--- a/dirksimple_libretro.c
+++ b/dirksimple_libretro.c
@@ -622,7 +622,7 @@ static void feed_audio(void)
     // mix 3K samples of audio and attempt to feed it to the system.
     // 3K == 1.5 stereo frames per frame, which at 30fps is slightly more
     // than you need to not starve at 44.1KHz.
-    static float fl32buf[1024 * 3];   // make this static in case the system has a pathologically small stack. This function isn't ever reentered afaik.
+    static float fl32buf[1024 * 6];   // make this static in case the system has a pathologically small stack. This function isn't ever reentered afaik.
     int space_left = sizeof (fl32buf);
     int space_used = 0;
     for (DirkSimpleDiscAudioQueue *i = disc_audio_queue_head.next; space_left && (i != NULL); i = i->next) {
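The arithmetic behind the comment in that diff can be checked with a small sketch. The helper names below are hypothetical and only restate the numbers from the comment (44.1KHz, stereo, 30fps); they are not part of dirksimple_libretro.c.

```c
#include <stddef.h>

/* Interleaved samples consumed by playback during one video frame.
   Hypothetical helper, for illustration only. */
static size_t samples_per_video_frame(int sample_rate_hz, int channels, int video_fps)
{
    return ((size_t) sample_rate_hz * (size_t) channels) / (size_t) video_fps;
}

/* How many samples a buffer holds beyond what one video frame consumes.
   A negative result would mean the buffer can't cover a full frame. */
static long buffer_headroom_samples(size_t buffer_samples, int sample_rate_hz,
                                    int channels, int video_fps)
{
    return (long) buffer_samples
         - (long) samples_per_video_frame(sample_rate_hz, channels, video_fps);
}
```

With these numbers, one video frame needs 2940 interleaved samples, so the original 1024 * 3 buffer leaves only 132 samples of headroom, while 1024 * 6 holds a little over two video frames' worth of audio.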

After trying this, I'm going to dig up the NEON version of the YUV->RGB conversion code I wrote before, which will definitely speed up that bottleneck on a Pi.

If all else fails, I'll try to get an OpenGL renderer in there (probably not Vulkan, because I just need the thing to draw some basic things on the GPU and don't want to add a thousand lines of code to do that).

One of these things is bound to fix the problem! :)

vanfanel commented 1 year ago

I built a version with that modification and it doesn't seem to make any difference, sadly :(

icculus commented 1 year ago

I don't have progress to show yet, but the gameplan for dealing with this is twofold:

First, I'm writing some NEON code for the YUV->RGB converter, which will speed up the biggest bottleneck in the video decoding on a Pi. That's going on in #23.

Second, I'm going to have the libretro core use OpenGL if possible (and fall back to the existing software code if the system doesn't have a GPU). That's in #24.

vanfanel commented 1 year ago

Sounds great! May I ask that you make OpenGL/Software an option, please? I for one use Vulkan as my daily driver for RetroArch, and it provides precise buffer control on Wayland for low video lag (while GL does not), so using OpenGL would cause more lag on Wayland (triple buffer by default on OpenGL on Wayland, no way to make it 2 buffers).

icculus commented 1 year ago

Moving back over here, since the NEON work didn't pan out.

It was suggested that maybe the dropping of audio when seeking is causing issues, and that's worth exploring:

Hmm...we don't explicitly disable the audio system, but we don't feed it anything until there's audio to feed it...maybe we should feed it silence in these cases. I'll try that.
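One way the "feed it silence" idea could look, as a minimal sketch only: the helper name and its parameters are hypothetical, and this is not the change that actually went into the core. It assumes an interleaved float buffer like the one in feed_audio, and relies on all-zero bytes being 0.0f, which holds for IEEE 754 floats.

```c
#include <stddef.h>
#include <string.h>

/* Pad the unused tail of an interleaved float buffer with silence, so the
   frontend always receives a full buffer even when nothing was mixed
   (for example while seeking). Returns the number of samples padded.
   Hypothetical helper, for illustration only. */
static size_t pad_with_silence(float *buf, size_t total_samples, size_t samples_mixed)
{
    if (samples_mixed >= total_samples) {
        return 0;  /* buffer is already full of real audio. */
    }
    const size_t missing = total_samples - samples_mixed;
    memset(buf + samples_mixed, 0, missing * sizeof (float));
    return missing;
}
```

The caller would then submit the whole buffer every frame instead of only the part that was mixed.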

icculus / DirkSimple

LUA is too slow for Pi4 #17