snes9xgit / snes9x

Snes9x - Portable Super Nintendo Entertainment System (TM) emulator
http://www.snes9x.com

Looking for the last little stretch in ARM optimization #265

Closed vanfanel closed 5 years ago

vanfanel commented 6 years ago

Hi there, guys

I had forgotten about the possibility of running snes9x (not the -next branch) at full speed on the Raspberry Pi 3 in low-latency mode (i.e., using a double-buffer scheme on dispmanx+OpenGL or KMSDRM+OpenGL, or just plain KMSDRM). I simply thought the CPU and memory speed on this thing weren't up to the task, and left it at that: it was impossible to get games that make extensive use of transparencies, like the Seiken Densetsu 3 intro, running at full speed and get a truly lag-free experience on the device... or so I believed after my initial attempts when the Pi3 came out. Yes, I am using Rpi3-specific GCC flags.

However, after some unrelated coding tests, it occurred to me that I should try profile-guided optimization (PGO) in the snes9x build process, so I went and did a PGO build cycle, and to my surprise every game not equipped with special chips was running great, even the Seiken 3 intro and the most demanding effects in Chrono Trigger and FF6... Wow! A low-latency snes on the Pi3, great! But eventually I found that games doing specialized treatment of the Mode-7 PPU effects were still having slight trouble keeping up, causing small audio dropouts. Not games like F-ZERO or Super Mario Kart, which run perfectly well with the CPU sitting at ~70% on an isolated core (I use isolated cores for emulation): the offending game is Contra 3: The Alien Wars, or Super Probotector, on the top-down stages. That's the last thing I expected, since I thought Mode-7 effects weren't so demanding, but there it is, and I would like to fix it and produce a binary I can hand to the Lakka people and tell them: "with this, Snes9x is perfect in low-latency mode! Include it for the Rpi3 distro."

So, here are my questions:

- I am not an snes internals expert, but you snes9x guys sure are: what is Contra III doing on the top-down stages that takes so much CPU in comparison with, let's say, Super Mario Kart?
- What could I try in order to produce an optimized build that would overcome this, apart from PGO, which is already being applied in my experiments? I am guessing that the function that performs these rotations is S9xSetPPU() in ppu.cpp, right?

Thanks for your attention!

bearoso commented 6 years ago

I don’t know in particular what Contra 3 does that causes Snes9x to use more CPU time, but I can tell you that mode 7 is simply an affine transform applied when outputting a pixel that selects from a particular background layer. It’s not a discrete rotation that could be optimized separately.
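To make the affine-transform point concrete, here is a minimal C++ sketch of the commonly documented Mode 7 coordinate mapping. It is not Snes9x's renderer: the struct, field, and function names are hypothetical, and it ignores hardware details such as intermediate-value clipping, screen flipping, and the out-of-bounds modes. In this sketch the same multiply-add runs for every pixel, whatever the matrix values are.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register snapshot; names are illustrative, not Snes9x's.
struct Mode7Regs {
    std::int16_t a, b, c, d;   // matrix entries, signed 8.8 fixed point
    std::int16_t cx, cy;       // center of rotation/scaling
    std::int16_t hofs, vofs;   // background scroll offsets
};

struct Texel { int x, y; };

// Drop the 8 fractional bits and wrap onto the 1024x1024 map.
// The unsigned cast keeps the shift well defined for negative values.
inline int wrap_to_map(std::int64_t fixed) {
    return static_cast<int>((static_cast<std::uint64_t>(fixed) >> 8) & 1023u);
}

// Map one screen pixel (sx, sy) to a coordinate on the Mode 7 map.
inline Texel mode7_lookup(const Mode7Regs& r, int sx, int sy) {
    assert(sx >= 0 && sx < 256 && sy >= 0 && sy < 256);
    const std::int64_t dx = sx + r.hofs - r.cx;
    const std::int64_t dy = sy + r.vofs - r.cy;
    const std::int64_t fx = r.a * dx + r.b * dy + r.cx * 256;
    const std::int64_t fy = r.c * dx + r.d * dy + r.cy * 256;
    return { wrap_to_map(fx), wrap_to_map(fy) };
}
```

With an identity matrix (a = d = 256, b = c = 0) and zero offsets, screen pixel (10, 20) maps to map coordinate (10, 20); a rotation or scale only changes the four matrix entries, not the amount of work per pixel.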

I can also say that the PPU code in Snes9x is very heavily optimized, so your best bet would be to optimize elsewhere in order to give the PPU more time to run. Our use of byuu’s SPC added some extra power requirements that can be lowered if you use one of libretro’s legacy cores that use the old blargg SPC—those are a middle ground between current mainline Snes9x and the older, hack-filled Snes9x-next.

vanfanel commented 6 years ago

@bearoso : I didn't explain it well, since I didn't remember my previous measurements, but the Contra III top-down stages are NOT using any more CPU than, let's say, Super Mario World on the map stage (60-70% of a single core according to TOP). Yet the game crawls slightly on those stages, while it's full speed during the rest of the game. That would rule out the SPU emulation code as the culprit, since it's the same in both cases.

What other factors could be at play with equal CPU usage? Memory access? I read that ARM processors have big trouble accessing non-aligned data. Which data structures (structs, arrays, etc. related to PPU emulation that are accessed very frequently during Mode 7 sequences) should I look at when investigating this? I believe it's not about calculations but memory access on ARM.

bearoso commented 6 years ago

That's pretty bizarre. The CPU should be the only real influence in this case. The data isn't so far apart as to cause cache misses, and even so, both cache and memory wait states would be regarded by the OS as CPU "usage". The video output is just a dumb framebuffer, so that would be consistent all around.

I wonder if maybe the video driver is busy-waiting on vsync and making it seem like the CPU usage is higher than it actually is in those other games. Maybe you could time the buffer swap and see if it's waiting less time on Contra 3.
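One way to time the buffer swap, sketched in C++ with `std::chrono`. This is only an illustration: `time_call_ms` and `SwapStats` are hypothetical helpers, not part of Snes9x or libretro, and where the swap actually happens depends on the front end and video driver in use.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>

// Run a callable and return how long it took, in milliseconds.
template <typename F>
double time_call_ms(F&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Running min/max/mean of the measured durations.
struct SwapStats {
    std::size_t count = 0;
    double total_ms = 0.0;
    double min_ms = 0.0;
    double max_ms = 0.0;

    void add(double ms) {
        if (count == 0 || ms < min_ms) min_ms = ms;
        if (count == 0 || ms > max_ms) max_ms = ms;
        total_ms += ms;
        ++count;
    }

    double mean_ms() const {
        assert(count > 0);
        return total_ms / static_cast<double>(count);
    }
};
```

Wrapping the swap would look like `stats.add(time_call_ms([&] { swap_buffers(); }));`, where `swap_buffers` is a placeholder for whatever call blocks on vsync. If the mean wait is clearly shorter in the Contra 3 stages than in other games, the emulator is eating into the frame budget there even though the reported CPU usage looks the same.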

qwertymodo commented 6 years ago

Potentially stupid question, but did you run any Contra III top-down levels during your profile generation?

vanfanel commented 6 years ago

@bearoso I will measure vsync time then. Anyway, can you point me to the function blocking for vsync in snes9x when using the libretro code/port? It would save me some time since I don't know libretro from the programs side (I have done video drivers for it, but I don't know how games use it... strange but true!)

@qwertymodo : Yes, I ran these levels specifically for a long time while doing the profile generation for a build. There IS a difference, but not enough. Almost there... If only the Pi was a little bit faster! In fact, these levels are fullspeed in low-latency mode most of the time, but once in a while they are not for half a second. That's why I thought it could be a cache miss "problem".

joepogo commented 6 years ago

@bearoso Is there a way to communicate more directly with you? I have some questions about snes9x optimizations and wanted to talk to you instead of clogging up this issue with them. Please let me know if you don't mind. Hope all is well! :)

Romain-Piquois commented 6 years ago

@vanfanel I just posted this a few hours ago... It might be of interest to you. Can you help me with building for the RPi and benchmarking? https://github.com/snes9xgit/snes9x/issues/278

vanfanel commented 6 years ago

@Romain-Piquois : Of course I can help you! What do you need?

Romain-Piquois commented 6 years ago

1/ For now, I made a fork here, and will put my modifications in various branches: https://github.com/Laxer3a/snes9x

2/ My mail address is [nickname of my git in lower case] at hotmail.com. Please send me a test mail so we can get in touch asap. I would like you to build and run the code on the RPi with various games and benchmark it (milliseconds or microseconds would be best, but I guess frame rate would be OK too; make sure there is no VSync involved). There is no need for animation or for playing the game. Just stand still and measure a given game state before and after the change in a specific part (different mode 7 setups). I would like you to contact me by mail for all the discussion. (I can provide the game states, I guess...)

3/ If everything is fine and stands the test of time (i.e. no bugs compared to the current Snes9X), I will let the team here know. After that, it will be their decision whether to integrate the changes or not. I write clean and optimized code, but I am not really interested in politics in my free time :-P.

bearoso commented 6 years ago

@Romain-Piquois That sounds like a good plan.

Romain-Piquois commented 6 years ago

@vanfanel

I committed the code to my fork, for now only for mode 7. Please send me a mail and contact me as soon as you read this; it is important to be able to contact you without bothering people here about every detail: [nickname of my git in lower case] at hotmail.com

https://github.com/Laxer3a/snes9x/tree/OPTIMIZE_MODE_7

vanfanel commented 6 years ago

@bearoso : I have found an emulation core that exhibits a GREAT performance difference on the Raspberry Pi when using memory alignment for LONG variables. On a Pi3, CPU usage goes up by a good ~30% with long alignment disabled. Long story short: "modern" ARM processors support unaligned access, but it has a high performance cost.

As you can see here, the rpi platform activates the -DALIGN_LONG flag: https://github.com/libretro/Genesis-Plus-GX/blob/master/Makefile.libretro

And here you can see how long alignment is implemented for the VDP emulation: https://github.com/libretro/Genesis-Plus-GX/blob/master/core/vdp_render.c

Since the PPU is what seems to be causing the massive slowdowns on the Pi, maybe long transfers can be aligned in snes9x too. I believe this would give the expected results on ARM.
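For readers unfamiliar with the aligned-versus-unaligned distinction, here is a small generic C++ sketch. It is not the ALIGN_LONG code from Genesis-Plus-GX and not anything in Snes9x; `is_aligned` and `load_u32` are hypothetical helpers that only show how to check a pointer's alignment and how to read a 32-bit value from any address without assuming alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// True if p is a multiple of `alignment`, which must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Read 32 bits from any address. memcpy makes no alignment assumption,
// so the compiler must emit a load sequence that is valid for p,
// unlike dereferencing a casted std::uint32_t pointer.
inline std::uint32_t load_u32(const std::uint8_t* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```

Logging `is_aligned(ptr, 4)` at the hot 32-bit accesses of a renderer would show whether unaligned long accesses happen at all before any alignment work is attempted.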

bearoso commented 6 years ago

I don't think that would be of much benefit here. We're mostly dealing with 16-bit offsets into memory, so there are hardly any 32-bit reads, if any. The only case I can think of is on the port end, when converting to a different bit depth for output. Even then, 32-bit color would be naturally aligned, and 24-bit color is unlikely.

vanfanel commented 6 years ago

@bearoso : I have upgraded from a Pi3b to a Pi3b+, and the CPU usage on the 2nd stage demo (split-screen mode 7) is only ~48% with this latest Pi model, yet the game is getting massive slowdowns and crackling audio (I build the libretro core). TOP and HTOP only report 1 core in use. So, it doesn't seem to be a CPU power problem or a RAM bottleneck... Even so, it only happens on ARM and not on X86.

qwertymodo commented 6 years ago

Snes9x is single-threaded, which explains why you're only seeing one core in use. My guess is that the one core is at 100% and you are, in fact, hitting a CPU bottleneck.

bearoso commented 6 years ago

Top and htop don’t aggregate the cpu usage, so 48% is of one core. I’m not sure what could be the culprit. CPU usage is pretty much all that matters.

vanfanel commented 6 years ago

@qwertymodo : You can see individual CPU usage in TOP by hitting '1' on the keyboard of the computer accessing via ssh to the Pi, so it's 48% on ONE core, the other three are showing a 0% usage in this case.

@bearoso: have you seen recent snes9x on other ARM systems like Android phones, etc?

vanfanel commented 5 years ago

I have very good news on this. These strange slowdowns on the Contra 3 2nd stage are gone! How? Well, I am now running a 64-bit system on the Pi3, and I have built RetroArch and the snes9x core with the "-march=armv8-a+crc -mtune=cortex-a53" CFLAGS. Snes9x is now full speed even with max_swapchain=2 on the Pi3b+, using the KMS/DRM+GLES video driver and the ALSA audio driver. No PGO involved. So, I finally got the last stretch on ARM I was looking for!