Slowdown snes9x 1.57 libretro

marleyap commented 5 years ago

Hello users snes9x Retroarch , wanted to do a little information about a drop of frames in snes9x 1.57 in the game Megaman X2, more precisely in the DEEP SEA BASE stage, in the part where I left that big robot fish, this fall of frames happens in TV box amilogic s905x 1.5 GHz, this drop is in the range of 2 fps less than full speed, now I do not know if it is due to lack of hardware of my tv box or lack of core optimization snes9x in tv box s905x, I wonder if programmers could solve this problem , even because the FX games run smoothly, without dropping frames ... Thank you all!

marleyap commented 5 years ago

I'm having the core updates snes9x 1.57 in the download options nightly Linux ARMHF, I use the Lakka OS 2.1.1 version for TV android box with AMILOGIC S905X processors, 1.51 GHZ AND with 1 GB RAM, I tested on two TV box models, the two present slowndonw in the certain bubble crab phase of the game Megaman X2, in the part when it appears that when Big robot fish the game has fall in up to 58 fps, and also tested with old standard SNES9X core that comes in the LAKKA OS in the case of Snes9x 1.54 .1 frame crashes are worse, reaching up to 54 fps, now my doubt is if this happens for lack of optimization of the updated snes9x core for TV box s905x or lack of TV box hardware that I use. Stranger than games with Fx chips that are much heavier to emulate than Megaman X2's Cx4 chips and without the use of the video threading option run like a feather without gagging anything at all.

marleyap commented 5 years ago

(update) I just tested the update snes9x libretro, which is already version 1.58, slowndonw has decreased slightly, now it is in 1.2 fps less than full speed, I try to drop up to 58.8 fps, still give a feeling slight noise and slowness in the audio in the game and part of the game I quoted earlier in my TV box. I hope they correct it soon I think it is a small lack of optimization of the core for the processor of my device ...

bearoso commented 5 years ago

I don't know why that part would be CPU-intensive. The optimization level could probably be bumped up to -O3 without any problems, though.

Dwedit commented 5 years ago

Briefly looked at a profiler log, unless there is something going on that triggers a degenerate case of graphics rendering, I don't see CX4 stuff in the profiler.

orbea commented 5 years ago

@marleyap Could you possibly bisect this issue and see if any specific commits introduced this issue?

marleyap commented 5 years ago

Worse than this fall of FPS in my Tv box android, it only happens in this particular phase of Megaman X2, just when the big robot fish appears, and when I end up destroying this enemy the FPS crashes stop immediately, I hope this problem is solved of performance in this part, and remembering use snes9x core libretro in LAKKA OS 2.1.1 S905X and thank you for the possible solution.

bearoso commented 5 years ago

@marleyap Could you possibly bisect this issue and see if any specific commits introduced this issue?

I think he's suggesting that it's always been there, and his box has struggled to keep up at this part.

Did you guys ever try -O3 with the libretro build? I'm wondering if it was tried and broke something, or if the original builder was just being careful. I run -O3 with my personal builds and haven't encountered any problems that I haven't already fixed.

joepogo commented 5 years ago

Just wanted to add some info here. It also does this on the xbox port too on (somewhere between 1.53 and 1.57). Any of the stages that render water. It also does this in Launch Octopus in megaman x. All other levels are fine. IS there something that can be done or ANY optimization for this? Only levels with water seem to be dropping frames.

I wonder if it is the same thing that affects biometal. The first stage on biometal chugs pretty badly as well.

Dwedit commented 5 years ago

Just profiled it...

The function DrawTile16AddS1_2_Normal1x1 has the greatest CPU usage out of all, followed by DrawTile16_Normal1x1.

This is the code of the function, DrawTile16AddS1_2_Normal1x1. It's generated entirely by macros, including the name.

static void MAKENAME(NAME1, AddS1_2_, NAME2) (ARGS)
{
#define MATH(A, B, C)   MATHS1_2(ADD, A, B, C)
    DRAW_TILE();
#undef MATH
}

When built with O2 and Link-time code generation in MSVC 2017 on X64, the disassembly is 1762 instructions long.

If there's any code that's suboptimal in there, maybe it could be replaced.

Edit: for anyone who wants to try optimizing the function, here is the ASM code that MSVC generates: https://pastebin.com/uG6QBe4L

joepogo commented 5 years ago

Thanks for doing that Dwedit! I can tell you after testing different versions of snes9x, this goes way back in builds. Will it be easy to optimize that function?

Also, do you know off hand is the same code that affects pretty much all the water stages in the megaman x games and the background stage in biometal? Seems like it is a pretty intensive layer or something.

Dwedit commented 5 years ago

Just rehosted the macro-expanded version and disassembly on Github Gist as well...

Expanded CPP code Disassembly

bearoso commented 5 years ago

You could try replacing the manual unrolling with for loops and see if the compiler can vectorize it then. Otherwise, you're not going to find any low-hanging fruit.

bearoso commented 5 years ago

Nope, Clang 9 and GCC 9 don't vectorize it, and they both automatically unroll. Clang's assembly is a bit shorter, though.

bearoso commented 5 years ago

GCC, manual unrolling: 3702 ops Clang, manual unrolling: 3356 ops GCC, for loops: 608 ops Clang, for loops: 566 ops

I guess this might be a hit if the instruction cache is pretty small. You'd have to test it, though.

bearoso commented 5 years ago

Ok, I measured and saw a small speed improvement with the for loop version, so I went and changed it in 2d97d2dd985c868b32d28dab74c584444c376633.

Dwedit commented 5 years ago

Usually when trying to optimize things, you look for unnecessary branches that could be conditional moves instead, or things being evaluated multiple times when they only need to be evaluated once.

bearoso commented 5 years ago

Feel free to try. Anomie wrote the renderer ages ago, and he was a crazy optimizer. I looked through the assembly, and both GCC and clang were using cmov where appropriate. And there are seemingly no unduplicatable branches. The instructions in the unrolled version would take up nearly a quarter of a modern x64 CPU's instruction cache, though, so I can imagine it being a performance hit on a smaller CPU.

Dwedit commented 5 years ago

Well I just rewrote COLOR_ADD to use fewer operations, but I'm not sure if it's any faster or not... Seems to use more registers and fewer instructions?

It figures out the carry result of R G and B, then multiplies by 0x1F, then ORs that with the sum in order to perform saturation, so there's no use of the lookup table anymore. R and B are done together, then G separately so the bits don't interfere.

(from gfx.h)

inline uint16 COLOR_ADD(uint16 C1, uint16 C2)
{
    const int RED_MASK = 0x1F << RED_SHIFT_BITS;
    const int GREEN_MASK = 0x1F << GREEN_SHIFT_BITS;
    const int BLUE_MASK = 0x1F;

    int a = C1 & (RED_MASK | BLUE_MASK);
    int b = C2 & (RED_MASK | BLUE_MASK);
    int c = a + b;
    int d = c & ((0x20 << RED_SHIFT_BITS) | (0x20 << 0));
    int e = C1 & (GREEN_MASK);
    int f = C2 & (GREEN_MASK);
    int g = e + f;
    int h = (g & (0x20 << GREEN_SHIFT_BITS)) | d;
    int i = h >> 5;
    int j = i * 0x1F;
    int k = (c & (RED_MASK | BLUE_MASK)) | (g & GREEN_MASK) | j;
    return (uint16)k;
}

bearoso commented 5 years ago

That is faster, but it won't work, it's doing what the old code did, and that's wrong. We have to saturate at the max brightness, not at 0x1f, hence the table.

bearoso commented 5 years ago

The previous code used a 64k table, so we could use that to clamp the final value. Give me a bit and I'll try it.

bearoso commented 5 years ago

Nope, a 64k lookup table slows it down drastically. Your version is a little faster than what's in git, and getting rid of the lookup table would have been great, but we need the brightness clamping.

As is, with the smaller 64 byte brightness table instead of the 65k lookup table in 1.59.2, we're a little faster, and with the removal of the excessive unrolling we're pretty far ahead, so don't worry about it unless you enjoy experimenting.

Dwedit commented 5 years ago

Okay silly idea... Two versions of the code, one for a cap of 1F, one for a cap of everything else, then select which function gets executed. I'd be willing to guess that 1F would be the most common cap.

Also I think rolling up the function helps with branch prediction too, so that could explain how it got faster.

bearoso commented 5 years ago

Two versions of the code, one for a cap of 1F, one for a cap of everything else, then select which function gets executed.

I thought about it, but making a function selectable prevents inlining, so function call overhead would wreck performance. We could add another expansion to the macros in tile.cpp, but you've seen how that is, and it would double the code size.

bearoso commented 5 years ago

Ok, I think I've cracked it.

Dwedit commented 5 years ago

Also redid COLOR_SUB. For the first operand, it ORs each channel with 0x20, does the subtraction, then the value of the 0x20 bit indicates whether it borrowed or not (0 bit means borrow, 1 bits means no borrow). The three 0x20 bits together get shifted right, then multiplied by 0x1F. This can be ANDed with the result to saturate to zero.

inline uint16 COLOR_SUB (uint16 C1, uint16 C2)
{
    int a = (C1 & (THIRD_COLOR_MASK | FIRST_COLOR_MASK)) | ((0x20 << 0) | (0x20 << RED_SHIFT_BITS));
    int b = C2 & (THIRD_COLOR_MASK | FIRST_COLOR_MASK);
    int c = a - b;
    int d = c & ((0x20 << RED_SHIFT_BITS) | (0x20 << 0));
    int e = (C1 & (SECOND_COLOR_MASK)) | (0x20 << GREEN_SHIFT_BITS);
    int f = C2 & (SECOND_COLOR_MASK);
    int g = e - f;
    int h = (g & (0x20 << GREEN_SHIFT_BITS)) | d;
    int i = h >> 5;
    int j = i * 0x1F;
    int k = ((c & (THIRD_COLOR_MASK | FIRST_COLOR_MASK)) | (g & SECOND_COLOR_MASK)) & j;
    return (uint16)k;
}

bearoso commented 5 years ago

Thanks, I committed the changes. I restructured your COLOR_ADD variables to be less rust-like and have more obvious names, and I might do the same with COLOR_SUB. The COLOR_ADD without a table makes things about 2% faster globally than the one with the lookups, which is now only used when needed.

bearoso commented 5 years ago

I fixed up the math to twiddle the extra green bit correctly. No real speed loss.

joepogo commented 5 years ago

Thanks for doing that Bearoso and Dwedit! Do you know if that layer DrawTile16AddS1_2_Normal1x1 affects the launch octopus stage or biometal stage 1 as well? All the water levels in the megaman x series seem to use some weird cpu intensive graphic layer or something that did not lag before in other versions. Any other stage is perfectly fine! It drops massively on xbox retroarch and if I recall correctly wii too when it did not before in like 1.51 I think. Starting from 1.53 those games took a huge hit for some odd reason while the majority of the others did not and are still fine.

Also, can anything be done to possibly optimize DrawTile16_Normal1x1 as well? For us users on the lower end of the spectrum, anything helps, including these 2 optimizations you just pushed through! I think that's about a 6% or so speedup from both commits from optimizing that code in that area now right? Figured I would put this in here while you guys are looking at it. Thanks for all that you do!

Dwedit commented 5 years ago

The only thing I can think of now is redoing how the Z buffering works. I'd look at the Z for 8 pixels at once, and evaluate the entire sliver for: not occluded, partially occluded, completely occluded, then pick one code path. Not sure if that would even improve speed or not, since it would add branches. Maybe also do it for the entire height as well.

If you know it's completely occluded, you get a fast code path (just skip everything), and if it's not occluded, you get a faster code path too (copy multiple pixels at a time, set multiple Z values at a time, etc). If it's partially occluded, you get the current code.

bearoso commented 5 years ago

Starting from 1.53 those games took a huge hit for some odd reason while the majority of the others did not and are still fine.

Everything took a hit in 1.53. We got cycle-level H-events then. In 1.56 we removed a huge part of the IRQ overhead, so we're likely faster than 1.53 now. We'll likely never get back to 1.51 levels, aside from using something like a parallel PPU, because that version simply didn't emulate the timing of the sound subsystem.

I'd look at the Z for 8 pixels at once, and evaluate the entire sliver for: not occluded, partially occluded, completely occluded

I'm sure it wouldn't be much of a performance benefit, despite complicating the code quite a bit.

joepogo commented 5 years ago

Thank you for responding! So just to clarify, on older consoles the 1.53 build is probably actually slower than the newer 1.56 due to irq changes? I think the versions on the consoles only go up to around 1.53 which is when a lot of changes were made and most devs never updated it past that due to the performance issues. So it might actually be worthwhile to update to at least 1.56 and then add these 2 optimizations to see what happens?

It really is only a small handful of games that took a hit, seiken densetsu III (I guess because it uses hi-res in its menus?), super bases loaded, megaman x-x3 water stages, biometal, and obviously the fx games, (starfox, Mario world 2, etc) Would it be feasible to try to update to the newer irq then? I had thought I read somewhere where the newer IRQ was actually more intensive than the older and affected consoles worse. Just wanted to clarify that.

And by the way, would it be hard to use a parallel ppu version? Or have an option to do so?

snes9xgit / snes9x

Slowdown snes9x 1.57 libretro #461