hrydgard / ppsspp

A PSP emulator for Android, Windows, Mac and Linux, written in C++. Want to contribute? Join us on Discord at https://discord.gg/5NJB6dD or just send pull requests / issues. For discussion use the forums at forums.ppsspp.org.
https://www.ppsspp.org

CPU bicubic texture upscaler speed-up(?) #15774

Closed fp64 closed 2 years ago

fp64 commented 2 years ago

What should happen

Disclaimer: not sure if these ramblings actually belong in issues. Also it's possible that I messed something up, and the following is moot.

I'm only tackling (bi)cubic upscaling, not xBRZ or hybrid.

So, after reading some claims that "even most powerful CPU's are not powerful enough to upscale textures in real time", I decided to take a look at PPSSPP's CPU texture upscaler (this is it, right?), and... it does not seem very fast. Obviously, all cool kids use GPU to upscale their textures, but that's not the topic here. So I wrote an implementation of a cubic upscaler, and benchmarked it against PPSSPP's (and stb_image_resize.h for good measure). Here are the results on a test machine (Intel(R) Xeon(R) E-2286G CPU @ 4.00GHz):

```
Timings for cubic upscale.
  cubic  = cubic_upscale.c
  stb    = https://github.com/nothings/stb/blob/master/stb_image_resize.h
  ppsspp = https://github.com/hrydgard/ppsspp/blob/master/GPU/Common/TextureScalerCommon.cpp

Size       |Scale|Time, ns/pixel 
           |     |cubic  |stb    |ppsspp 
-----------+-----+-------+-------+-------
   16x16   |    2|    4.2|   15.2|   22.4
   16x16   |    3|    2.4|    9.6|   22.4
   16x16   |    4|    2.4|    8.8|   22.4
   16x16   |    5|    2.4|    7.2|   22.4
   32x32   |    2|    2.8|   12.0|   22.4
   32x32   |    3|    2.4|    8.8|   19.3
   32x32   |    4|    2.4|    8.0|   22.5
   32x32   |    5|    2.4|    6.4|   22.8
   64x64   |    2|    2.8|   10.4|   22.5
   64x64   |    3|    2.4|    8.0|   19.7
   64x64   |    4|    2.4|    7.2|   22.5
   64x64   |    5|    2.4|    6.5|   22.8
  128x128  |    2|    2.4|   10.4|   22.5
  128x128  |    3|    2.4|    7.4|   23.7
  128x128  |    4|    2.0|    8.0|   22.9
  128x128  |    5|    2.0|    7.3|   19.5
  256x256  |    2|    2.8|   10.4|   19.1
  256x256  |    3|    2.5|    7.6|   20.3
  256x256  |    4|    2.5|    7.6|   22.9
  256x256  |    5|    2.4|    6.5|   22.0
```

Timings are reported in nanoseconds per dst pixel (otherwise, the metric is scale-dependent). All tests are single-threaded. The benchmark is compiled with -march=native -O3.
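For clarity, the metric is just wall time divided by the number of output pixels; a minimal sketch of how such a number could be computed (not the actual benchmark harness):

```cpp
#include <chrono>

// Time one upscale and normalize by *destination* pixel count,
// so results for different scale factors are directly comparable.
template <typename F>
double NsPerDstPixel(F &&upscale, int srcW, int srcH, int scale) {
    auto t0 = std::chrono::steady_clock::now();
    upscale();
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / (double(srcW) * scale * double(srcH) * scale);
}
```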

Some observations, in no particular order:

- The speed-up may not be enough to entirely eliminate the stutter during upscaling, but it may help.
- I'm not confident enough to muck with the PPSSPP C++ codebase, so I tried to make my upscaler reasonably self-contained. Hopefully, if someone is interested, it should be trivial to integrate.

Here is the code for both the upscaler (cubic_upscale.c, cubic_upscale.h) and the benchmark (stb_image* omitted, to fit into the online compiler's code size limit): https://onlinegdb.com/7g36Bdzhco. Note: to get a meaningful benchmark in the online compiler, you need to enable optimizations (cog button -> Extra Compiler Flags), e.g. -march=native -O3.


LunaMoo commented 2 years ago

> Obviously, all cool people use texture replacement with AI algorithms to upscale their textures

Fixed that sentence for you. It also pretty much explains why most people have lost interest in CPU texture scaling nowadays, and not just because scaling moved to the GPU, which can still have some future simply due to performance; even that fades away compared to texture replacement, which works great and for a few years now can be done easily without artistic skills, by just using any AI trained to upscale a specific graphic style.

Automatic texture upscaling is mostly a quick and lazy fix which many people don't like, but sometimes use anyway, as it's one of the easier ways to work around texture glitches when rendering PSP games above native res. As far as I know, bicubic does nothing for those problems.

To avoid requiring future CPUs, PPSSPP limits the upscaling both per frame and by texture longevity (textures that refresh too often are just skipped). Textures that do get scaled are also cached, so the reason it's fast enough for many people is that it does nothing 99% of the time, which is also why people who do know about it will continue to say it's not fast enough even for modern hardware.

IMO, from all my testing, bicubic doesn't do much as a texture scaler for PSP games other than adding cost, and really the only reason it exists seems to be that it's part of one of the xBRZ hybrid variants. :c Would be cool to have something that makes no visible change other than dealing with those common UI issues when rendering games above x1 res, but again, bicubic unfortunately doesn't help with that.

fp64 commented 2 years ago

So the bicubic texture upscaler doesn't get much love simply because it's just not that important? That... makes sense. It indeed does not seem to do much for textures in my limited testing.

hrydgard commented 2 years ago

This CPU scaling stuff is very old, and the comments along with the functionality itself were contributed, so I don't really endorse the statement "even most powerful CPU's are not powerful enough to upscale textures in real time".

We already have a GPU-based scaling path, although it's currently Vulkan-only; it performs much better. Future scaling work should probably go in that direction for the other backends too...

Also, I don't think any of our current scaling algorithms are generally very good. Some of them do look pretty good in 2D games, but overall the benefit just isn't that big IMHO...

I would merge a PR that just speeds up our bicubic as-is though :)

fp64 commented 2 years ago

> I would merge a PR that just speeds up our bicubic as-is though :)

Non-separable filter and all? Hm, might be doable. Looking into it.

hrydgard commented 2 years ago

No, I mean it's fine to replace the implementation too, of course, as long as it looks about the same. I meant that I won't require new work in this area to be for the GPU.

fp64 commented 2 years ago

> as long as it looks about the same

So I grabbed a random DOOM2-ish door texture, and upscaled it x4.

| Results | Original at 400% | Bilinear | |
|---|---|---|---|
| | ![output](https://user-images.githubusercontent.com/106717720/182200337-328862f4-37e8-4691-b468-d94b2c7c5bf7.png) | ![output_b](https://user-images.githubusercontent.com/106717720/182208059-4986d7d1-26d6-4890-927d-cd243463c990.png) | |
| `PPSSPP` | `scaleBicubicBSpline` | `scaleBicubicMitchell` | |
| | ![output_p0](https://user-images.githubusercontent.com/106717720/182198365-cd63b852-054e-4ecf-a7aa-f04e2464cd9d.png) | ![output_p1](https://user-images.githubusercontent.com/106717720/182198368-d5cdcc4a-0712-4a5e-98a4-d8a2aae714f6.png) | |
| `cubic_upscale.c` | B-spline (B=1, C=0) | "The" Mitchell-Netravali (B=1/3, C=1/3) | Catmull–Rom (B=0, C=1/2) |
| | ![output_c0](https://user-images.githubusercontent.com/106717720/182198341-6c488434-56e0-47ca-a9a7-9d20bd8b77e4.png) | ![output_c1](https://user-images.githubusercontent.com/106717720/182198360-78a8544b-97ed-4479-8c43-a48d99439c51.png) | ![output_c2](https://user-images.githubusercontent.com/106717720/182198363-0f274a91-e61d-4e29-98c0-3158baab9ac4.png) |

You might want to open images in a new tab.

I think scaleBicubicBSpline is used for "hybrid", and plain "bicubic" uses scaleBicubicMitchell.
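For reference, all of the cubic filters above belong to the Mitchell-Netravali (B, C) family; here is a standalone sketch of the kernel (the textbook formula, not PPSSPP's actual code):

```cpp
#include <cmath>

// Mitchell-Netravali cubic kernel, parameterized by (B, C); support is |x| < 2.
//   B=1, C=0   -> cubic B-spline (soft, no ringing)
//   B=C=1/3    -> "the" Mitchell-Netravali filter
//   B=0, C=1/2 -> Catmull-Rom (sharpest of the three, interpolating)
float CubicBC(float x, float B, float C) {
    x = std::fabs(x);
    float x2 = x * x, x3 = x2 * x;
    if (x < 1.0f)
        return ((12 - 9 * B - 6 * C) * x3 + (-18 + 12 * B + 6 * C) * x2 + (6 - 2 * B)) / 6.0f;
    if (x < 2.0f)
        return ((-B - 6 * C) * x3 + (6 * B + 30 * C) * x2 + (-12 * B - 48 * C) * x + (8 * B + 24 * C)) / 6.0f;
    return 0.0f;
}
```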

> I would merge a PR that just speeds up our bicubic as-is though :)

As I said, I'm not confident touching the C++ codebase, but I may try...

hrydgard commented 2 years ago

I like the sharpness of Catmull-Rom best for sure! But compared to just using the original, it's not really that much of an improvement...

fp64 commented 2 years ago

Added 'bilinear' to the table above, which is what "just using the original" would probably look like if the game itself decided to set texture filtering to linear.

unknownbrackets commented 2 years ago

I think it's great to improve these, and I think the CPU scaling can be improved and probably can't be removed.

That said, just to set expectations, a lot of games have complex effects, use palettes in ways that are annoying for texture scaling or texture caching, or perform realtime CPU readback effects (like a sepia filter.) Many of these aren't throughout the game, but might just be at certain points - like only right after you defeat this boss, or only when you're walking around in that lava area, etc.

It would require a greater reduction than 20ns -> 2ns for me to claim that any top-end PC can handle realtime texture scaling throughout games. And I've definitely told people the opposite: that the current implementation (really xBRZ, anyway) can't handle realtime scaling even on the 8-core 4GHz CPU they're so proud of. Because it can't.

That doesn't mean it's a waste to improve it. I wouldn't say the current software renderer is gonna run most games at full, realtime speed, either - even on someone's nifty new PC. That didn't stop me from making it a lot faster recently. Maybe it'll eventually run full speed (especially with more work and more core counts), and the same is possibly true of CPU texture scaling.

> As I said, I'm not confident touching the C++ codebase, but I may try...

PPSSPP doesn't go crazy with complex C++, generally. There are only a few templates and basic STL usage.

The scaler code I think uses some template params, mostly for optimization. At the end of the day, as long as you can give it a function that takes an array of pixels in, an array of pixels to write to, a width, and a y range (for threading), you'll be fine. If you've got it mostly working but have questions, you can ask here or on Discord.
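To make that concrete, the contract described above looks roughly like this (a sketch; the exact signatures in TextureScalerCommon.cpp may differ in details, and the parameter names here are illustrative):

```cpp
#include <cstdint>
typedef uint32_t u32;  // PPSSPP's own typedef lives in its Common headers

// RGBA8888 pixels in, scaled pixels out, plus a [l, u) row range
// so callers can split the work across threads.
void scaleBicubicMitchell(int factor, const u32 *data, u32 *out, int w, int h, int l, int u);
```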

Note: for the purposes of PPSSPP, you can assume SSE2 is supported if Intel. Supporting a non-SSE2 Intel CPU would require changes in a lot of places.

Also it looks like your benchmark doesn't use threads, currently. Probably scales just fine, but a PPSSPP implementation should ideally use threads or else a 12-thread (or better) CPU might be worse off after all. That's the main thing I'd suggest trying to support.

Though, maybe there's something to be said about considering an alternate threading model for that. Is it better to upscale 12 textures on 12 threads at once, or 1 texture on 12 threads? Might be an interesting thing to benchmark, especially with varied texture sizes.
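For illustration, a minimal sketch of the current "split one texture across threads" model using plain std::thread (PPSSPP actually uses its own thread pool; this only shows the shape of the work division):

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Split one texture's rows across threads. The alternative model would instead
// hand whole textures to a pool. scaleRows follows the contract sketched above:
// (src, dst, width, height, firstRow, lastRow).
void ScaleOnThreads(void (*scaleRows)(const uint32_t *, uint32_t *, int, int, int, int),
                    const uint32_t *src, uint32_t *dst, int w, int h, int numThreads) {
    const int minLines = 4;  // analogous to MIN_LINES_PER_THREAD
    int rows = std::max(minLines, (h + numThreads - 1) / numThreads);
    std::vector<std::thread> workers;
    for (int y = 0; y < h; y += rows)
        workers.emplace_back(scaleRows, src, dst, w, h, y, std::min(h, y + rows));
    for (auto &t : workers)
        t.join();
}
```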

Might also be interesting to see if clang can be coaxed into vectorizing well enough for NEON, etc.

-[Unknown]

fp64 commented 2 years ago

> PPSSPP doesn't go crazy with complex C++, generally.

It's the 'codebase' part that deters me -- the build process and all ('C++' meant 'code that actually needs compilation').

> Probably scales just fine, but a PPSSPP implementation should ideally use threads or else a 12-thread (or better) CPU might be worse off after all.

I would probably just replace the insides of scaleBicubicBSpline and scaleBicubicMitchell with calls to upscale_cubic (which does support subregions), and let the higher-level PPSSPP code (TextureScalerCommon::ScaleBicubicBSpline and TextureScalerCommon::ScaleBicubicMitchell) handle the rest.
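That is, something like the following sketch (the upscale_cubic signature here is hypothetical, for illustration only; the real cubic_upscale.h API may differ):

```cpp
#include <cstdint>
typedef uint32_t u32;

// Hypothetical signature for the upscaler entry point (illustration only).
void upscale_cubic(u32 *dst, const u32 *src, int w, int h, int factor,
                   float B, float C, int rowStart, int rowEnd);

// Glue: keep PPSSPP's entry point, delegate the [l, u) row subregion.
void scaleBicubicMitchell(int factor, const u32 *data, u32 *out, int w, int h, int l, int u) {
    // B=C=1/3 selects "the" Mitchell-Netravali filter, matching the old behavior.
    upscale_cubic(out, data, w, h, factor, 1.0f / 3.0f, 1.0f / 3.0f, l, u);
}
```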

That said, the multi-threaded performance was indeed not benchmarked.

fp64 commented 2 years ago

Since my code works on 16x16 dst blocks (anything smaller is temporarily padded), there is a perf hit for both small textures and thin slices. Multi-threading might produce such slices, since MIN_LINES_PER_THREAD is only 4 (is that src lines?). I can trivially change the blocks to 8x8 though, with a slight perf hit.

fp64 commented 2 years ago

> If you've got it mostly working but have questions, you can ask here or on Discord.

I'll take you up on that, I guess.

So I compiled PPSSPP from master on my 32-bit Linux (with gcc 10.2.1) by simply running ./b.sh. It compiled with a single compile-time warning:

```
In function ‘void GPURecord::EmitTextureData(int, u32)’,
    inlined from ‘void GPURecord::FlushPrimState(int)’ at /[...]/ppsspp/GPU/Debugger/Record.cpp:392:19:
/[...]/ppsspp/GPU/Debugger/Record.cpp:349:9: warning: ‘void* memcpy(void*, const void*, size_t)’ specified size 4294967280 exceeds maximum object size 2147483647 [-Wstringop-overflow=]
  349 |   memcpy(&framebufData[sizeof(framebuf)], p, bytes);
      |   ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

and a single link-time warning:

```
/usr/bin/ld: ../ffmpeg/linux/x86/lib/libavformat.a(allformats.o): warning: relocation in read-only section `.text'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
```

The resulting PPSSPPSDL works fine, but texture upscaling shows a bunch of artifacts which are not present in the 1.13 release (win32, which I run under Wine). It happens in multiple games (e.g. some text is replaced by rectangles in Valkyria Chronicles III (English patch), etc.). This is plain master without any of my changes, and it also happens with xBRZ, not just bicubic. The window title also says PPSSPP v1.12.3-1620-g9af1f18d7 for whatever reason.

All settings are default, other than

That said, I did implement my proposed changes, and they seem to work, i.e. the output looks about the same as without them (so the artifacts are still present). I can create a pull request if desired.

hrydgard commented 2 years ago

I have been messing around with texture loading lately, trying to reduce the amount of duplicated code between the backends. I might have made a mistake causing those glitches, will take a look.

fp64 commented 2 years ago

While waiting for the artifacts thing to be resolved, I'll comment on:

> Might also be interesting to see if clang can be coaxed into vectorizing well enough for NEON, etc.

I have a rather limited knowledge of ARM, but it seems that both GCC and clang do autovectorize the upscaler with -mfpu=neon -O3 -funsafe-math-optimizations. Plenty of things like vmla.f32. This possibly works even without -funsafe-math-optimizations, which is strange considering what the GCC docs say:

> If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC’s auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.

You can press Ctrl-F10 on a line to see which assembly "corresponds" to it (not that it's straightforward with optimized code).
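For the curious, a toy example of the kind of inner loop involved (not the actual upscaler code); with -mfpu=neon -O3 (plus -funsafe-math-optimizations for GCC), both compilers turn this into NEON vmla.f32 sequences:

```cpp
// Weighted accumulate: the multiply-add pattern cubic filtering is built from.
void WeightedAccumulate(float *dst, const float *src, float w, int n) {
    for (int i = 0; i < n; i++)
        dst[i] += src[i] * w;
}
```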

unknownbrackets commented 2 years ago

> Since my code works on 16x16 dst blocks (anything smaller is temporarily padded), there is a perf hit for both small textures and thin slices. Multi-threading might produce such slices, since MIN_LINES_PER_THREAD is only 4 (is that src lines?). I can trivially change the blocks to 8x8 though, with a slight perf hit.

It's been a while since that was set; it actually started life as a larger minimum dependent on thread count: https://github.com/hrydgard/ppsspp/blame/85722511984a96dfe9c3931543fc6566a419ce9d/thread/threadpool.cpp#L63

It might actually be better in all cases for it to be 16.

> So I compiled PPSSPP from master on my 32-bit Linux (with gcc 10.2.1) by simply running ./b.sh. It compiled with a single compile-time warning:
>
> ```
> In function ‘void GPURecord::EmitTextureData(int, u32)’,
>     inlined from ‘void GPURecord::FlushPrimState(int)’ at /[...]/ppsspp/GPU/Debugger/Record.cpp:392:19:
> /[...]/ppsspp/GPU/Debugger/Record.cpp:349:9: warning: ‘void* memcpy(void*, const void*, size_t)’ specified size 4294967280 exceeds maximum object size 2147483647 [-Wstringop-overflow=]
>   349 |   memcpy(&framebufData[sizeof(framebuf)], p, bytes);
>       |   ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ```

That just seems weird. bytes should never be that high, since it's constrained by ValidSize(). Personally, I've been using clang most of the time these days, though.

> ```
> /usr/bin/ld: ../ffmpeg/linux/x86/lib/libavformat.a(allformats.o): warning: relocation in read-only section `.text'
> /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
> ```

You can recompile FFmpeg by running pushd ffmpeg && ./linux_x86.sh && popd; otherwise it may be using the precompiled libraries (FFmpeg is harder to compile, so we have pre-compiled libraries for many platforms to make it easier for people to contribute... on Linux it's not so bad, you might just need yasm.)

> While waiting for the artifacts thing to be resolved, I'll comment on:
>
> > Might also be interesting to see if clang can be coaxed into vectorizing well enough for NEON, etc.
>
> I have a rather limited knowledge of ARM, but it seems that both GCC and clang do autovectorize the upscaler with -mfpu=neon -O3 -funsafe-math-optimizations.

Great, that sounds positive.

I think there were some fixes for texture scaling bugs; I have been a bit too busy to keep up the last couple of weeks, but it might already be better now.

-[Unknown]

fp64 commented 2 years ago

> That just seems weird. bytes should never be that high, since it's constrained by ValidSize().

Yes, this is surprising to me too. GCC also somehow figured out the value of bytes at compile time. 4294967280 is, of course, just (u32)-16. I could try logging suspicious instances, but at what points does GPURecord::FlushPrimState even get called? Do I need to enable some option?

fp64 commented 2 years ago

It might just be a false positive in -Wstringop-overflow=. A Google search finds several bug reports for that.

hrydgard commented 2 years ago

The corrupted texture scaling should be fixed now.

fp64 commented 2 years ago

I'm still seeing the glitches, both in PPSSPPSDL (locally compiled) and in the win32 build from the buildbot. Here are screenshots from the first mission of 'Valkyria Chronicles 3 (English v2)' (there are some glitches even before that):

Screens:

Old version works fine (v1.13, win32, Wine, Upscale x2, Bicubic):
![Screenshot from 2022-08-03 13-15-20](https://user-images.githubusercontent.com/106717720/183150260-970b7367-2b1b-4212-b364-528b37fcca30.png)

New version shows glitches (PPSSPPSDL, Linux, Upscale x2, Bicubic):
![Screenshot from 2022-08-05 21-08-01](https://user-images.githubusercontent.com/106717720/183150510-a1d82c8d-7a47-4211-83e2-32569184301c.png)

New version without upscaling is fine (PPSSPPSDL, Linux, Upscale off):
![Screenshot from 2022-08-03 13-03-08](https://user-images.githubusercontent.com/106717720/183150604-665ec7e7-2799-42cb-be38-163be1606900.png)
fp64 commented 2 years ago

The issue seems to be fixed now, so the implementation is done.

8x8 blocks, edges in 'clamp' mode (because that's what PPSSPP did earlier; would 'wrap' be better?). The restriction of maximum upscale to x5 is lifted, but, obviously, xBRZ still has it (and so does hybrid), and there is no GUI for anything higher.
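For clarity, 'clamp' here means that out-of-range source taps are pinned to the nearest edge texel; a one-line sketch:

```cpp
// Clamp addressing: taps outside [0, w) reuse the edge texel.
// 'Wrap' mode would instead use ((x % w) + w) % w.
static inline int ClampX(int x, int w) {
    return x < 0 ? 0 : (x >= w ? w - 1 : x);
}
```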

fp64 commented 2 years ago

Well, can be closed.