Obviously, all cool people use texture replacement with AI algorithms to upscale their textures
Fixed that sentence for you. It also pretty much explains why most people have lost interest in CPU texture scaling these days. It's not just because texture scaling can be done on the GPU, which could still have some future simply due to performance; even that fades away compared to texture replacement, which works great and for a few years now can be done easily without artistic skills by just using any AI trained to upscale a specific graphic style.
Automatic texture upscaling is mostly a quick and lazy fix which many people don't like, but some use it anyway, as it's one of the easier ways to work around texture glitches when rendering PSP games above native resolution. As far as I know, bicubic does nothing for those problems.
To avoid requiring future CPUs, PPSSPP limits the upscaling both per frame and by texture longevity (textures that refresh too often are simply skipped). Scaled textures are also cached, so the reason it's fast enough for many people is that it doesn't do anything 99% of the time, which is also why people who do know about it will continue to say it's not fast enough even for modern hardware.
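To make that gating idea concrete, here is a rough sketch of that kind of logic; the names, thresholds, and structure are invented for illustration and are not PPSSPP's actual texture-cache code:

```cpp
// Rough sketch of the gating described above (illustrative only): textures that
// keep changing are skipped, and scaled results are cached so the expensive
// scaler runs only on a cache miss.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct ScaledTex {
    uint64_t contentHash = 0;
    int changeCount = 0;                 // how often the content has changed
    std::vector<uint32_t> pixels;        // cached upscaled pixels
};

std::unordered_map<uint32_t, ScaledTex> g_scaleCache;  // keyed by texture address

const uint32_t *GetOrScale(uint32_t texAddr, uint64_t hash,
                           const uint32_t *src, int w, int h, int factor) {
    ScaledTex &e = g_scaleCache[texAddr];
    if (!e.pixels.empty() && e.contentHash == hash)
        return e.pixels.data();          // cache hit: nothing to do 99% of the time
    if (e.contentHash != 0 && e.contentHash != hash)
        e.changeCount++;
    e.contentHash = hash;
    if (e.changeCount > 16)              // refreshes too often: don't bother scaling
        return src;                      // fall back to the unscaled texture
    e.pixels.assign((size_t)w * h * factor * factor, 0);
    // ... invoke the actual upscaler here, writing into e.pixels ...
    return e.pixels.data();
}
```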
IMO, from all my testing, bicubic doesn't do much as a texture scaler for PSP games other than adding cost; really, the only reason it exists seems to be that it's part of one of the xBRZ variants. :c It would be cool to have something that makes no visible change other than dealing with those common UI issues when rendering games above x1 resolution, but again, bicubic unfortunately doesn't help with that.
So, the bicubic texture upscaler doesn't get much love simply because it's just not that important? That... makes sense. It indeed does not seem to do much for textures in my limited testing.
This CPU scaling stuff is very old, and the comments along with the functionality itself were contributed, so I don't really endorse the statement "even most powerful CPU's are not powerful enough to upscale textures in real time".
We already have a GPU-based scaling path; it's currently Vulkan-only, but it performs much better. Future scaling work should probably go in that direction for the other backends too...
Also, I don't think any of our current scaling algorithms are generally very good. Some of them do look pretty good in 2D games, but overall the benefit just isn't that big IMHO...
I would merge a PR that just speeds up our bicubic as-is though :)
I would merge a PR that just speeds up our bicubic as-is though :)
Non-separable filter and all? Hm, might be doable. Looking into it.
No I mean it's fine to replace the implementation too of course, as long as it looks about the same. I meant I won't require new work in this area to be for the GPU.
as long as it looks about the same
So I grabbed a random DOOM2-ish door texture, and upscaled it x4.
You might want to open images in a new tab.
I think scaleBicubicBSpline is used for "hybrid", and plain "bicubic" uses scaleBicubicMitchell.
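For reference, both of those are presumably instances of the standard Mitchell-Netravali cubic family: B=1, C=0 gives the cubic B-spline, B=1/3, C=1/3 the Mitchell filter, and B=0, C=0.5 Catmull-Rom. A minimal sketch of the kernel (the general textbook formula, not code taken from PPSSPP):

```cpp
#include <cmath>

// Mitchell-Netravali cubic kernel k(x) for filter parameters B and C.
// B=1,C=0: cubic B-spline; B=1/3,C=1/3: Mitchell; B=0,C=0.5: Catmull-Rom.
double MitchellNetravali(double x, double B, double C) {
    x = std::fabs(x);
    if (x < 1.0)
        return ((12 - 9 * B - 6 * C) * x * x * x +
                (-18 + 12 * B + 6 * C) * x * x +
                (6 - 2 * B)) / 6.0;
    if (x < 2.0)
        return ((-B - 6 * C) * x * x * x +
                (6 * B + 30 * C) * x * x +
                (-12 * B - 48 * C) * x +
                (8 * B + 24 * C)) / 6.0;
    return 0.0;
}
```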
I would merge a PR that just speeds up our bicubic as-is though :)
As I said, I'm not confident touching the C++ codebase, but I may try...
I like the sharpness of Catmull-Rom best for sure! But compared to just using the original, it's not really that much of an improvement...
Added 'bilinear' to the table above, which is what "just using the original" would probably look like if the game itself decided to set texture filtering to linear.
I think it's great to improve these, and I think the CPU scaling can be improved and probably can't be removed.
That said, just to set expectations, a lot of games have complex effects, use palettes in ways that are annoying for texture scaling or texture caching, or perform realtime CPU readback effects (like a sepia filter.) Many of these aren't throughout the game, but might just be at certain points - like only right after you defeat this boss, or only when you're walking around in that lava area, etc.
It would require a greater reduction than 20ns -> 2ns for me to claim that any top-end PC can handle realtime texture scaling in games throughout. And I've definitely told people the opposite - that the current implementation (really xBRZ, anyway) can't handle realtime scaling even on the 8-core 4GHz CPU they're so proud of. Because it can't.
That doesn't mean it's a waste to improve it. I wouldn't say the current software renderer is gonna run most games at full, realtime speed, either - even on someone's nifty new PC. That didn't stop me from making it a lot faster recently. Maybe it'll eventually run full speed (especially with more work and more core counts), and the same is possibly true of CPU texture scaling.
As I said, I'm not confident touching the C++ codebase, but I may try...
PPSSPP doesn't go crazy with complex C++, generally. There are only a few templates and basic STL usage.
The scaler code I think uses some template params, mostly for optimization. At the end of the day, as long as you can give it a function that takes an array of pixels in, and array of pixels to write to, a width, and a y range (for threading), you'll be fine. If you've got it mostly working but have questions, you can ask here or on Discord.
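As a concrete illustration of that shape (the name and the nearest-neighbour body are placeholders, not PPSSPP's actual declarations):

```cpp
#include <cstdint>

// A function of the shape described above: source pixels in, destination pixels
// out, a width, and a y range so multiple threads can each handle a slice.
// The nearest-neighbour body is just a stand-in for a real scaler.
void ScaleSlice(const uint32_t *src, uint32_t *dst, int width, int factor,
                int yStart, int yEnd) {
    for (int y = yStart; y < yEnd; ++y) {
        const uint32_t *inRow = src + (size_t)y * width;
        for (int dy = 0; dy < factor; ++dy) {
            uint32_t *outRow = dst + ((size_t)y * factor + dy) * width * factor;
            for (int x = 0; x < width; ++x)
                for (int dx = 0; dx < factor; ++dx)
                    outRow[(size_t)x * factor + dx] = inRow[x];
        }
    }
}
```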
Note: for the purposes of PPSSPP, you can assume SSE2 is supported if Intel. Supporting a non-SSE2 Intel CPU would require changes in a lot of places.
Also it looks like your benchmark doesn't use threads, currently. Probably scales just fine, but a PPSSPP implementation should ideally use threads or else a 12-thread (or better) CPU might be worse off after all. That's the main thing I'd suggest trying to support.
Though, maybe there's something to be said about considering an alternate threading model for that. Is it better to upscale 12 textures on 12 threads at once, or 1 texture on 12 threads? Might be an interesting thing to benchmark, especially with varied texture sizes.
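A minimal sketch of the "1 texture on N threads" option, splitting rows across workers (plain std::thread here for illustration; PPSSPP's own thread pool and its MIN_LINES_PER_THREAD handling aren't shown):

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Splits the source rows [0, srcHeight) into contiguous slices and runs
// scaleSlice(y0, y1) for each slice on its own thread.
void RunSliced(const std::function<void(int, int)> &scaleSlice,
               int srcHeight, int numThreads, int minLinesPerThread) {
    int maxThreads = std::max(1, srcHeight / std::max(1, minLinesPerThread));
    int threads = std::max(1, std::min(numThreads, maxThreads));
    int linesPerThread = (srcHeight + threads - 1) / threads;

    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        int y0 = t * linesPerThread;
        int y1 = std::min(srcHeight, y0 + linesPerThread);
        if (y0 >= y1)
            break;
        workers.emplace_back(scaleSlice, y0, y1);
    }
    for (auto &w : workers)
        w.join();
}
```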
Might also be interesting to see if clang can be coaxed into vectorizing well enough for NEON, etc.
-[Unknown]
PPSSPP doesn't go crazy with complex C++, generally.
It's the 'codebase' part that deters me -- build process and all ('C++' here meaning 'code that actually needs compilation').
Probably scales just fine, but a PPSSPP implementation should ideally use threads or else a 12-thread (or better) CPU might be worse off after all.
I would probably just replace the insides of scaleBicubicBSpline and scaleBicubicMitchell with calls to upscale_cubic (which does support subregions), and let the higher-level PPSSPP code (TextureScalerCommon::ScaleBicubicBSpline and TextureScalerCommon::ScaleBicubicMitchell) handle the rest.
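Something along these lines, where the upscale_cubic signature and the wrappers' parameter lists are only my guess for illustration (the real declarations in cubic_upscale.h and the TextureScaler code may differ); the B,C constants are the standard values for each filter:

```cpp
#include <cstdint>

// Assumed declaration for illustration -- the real prototype lives in cubic_upscale.h.
void upscale_cubic(const uint32_t *src, int srcW, int srcH, uint32_t *dst,
                   int factor, float B, float C, int yStart, int yEnd);

// The low-level scalers keep their old role but just forward to upscale_cubic.
// Parameter lists are simplified; l/u stand for the y range used for threading.
static void scaleBicubicMitchell(uint32_t *data, uint32_t *out, int factor,
                                 int w, int h, int l, int u) {
    upscale_cubic(data, w, h, out, factor, 1.0f / 3.0f, 1.0f / 3.0f, l, u);  // Mitchell: B=C=1/3
}

static void scaleBicubicBSpline(uint32_t *data, uint32_t *out, int factor,
                                int w, int h, int l, int u) {
    upscale_cubic(data, w, h, out, factor, 1.0f, 0.0f, l, u);  // B-spline: B=1, C=0
}
```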
That said, the multi-threaded performance was indeed not benchmarked.
Since my code works on 16x16 dst blocks (anything smaller is temporarily padded), there is a perf hit for both small textures and thin slices. Multi-threading might produce such slices, since MIN_LINES_PER_THREAD is only 4 (is that src lines?). I can trivially change the blocks to 8x8, though, with a slight perf hit.
If you've got it mostly working but have questions, you can ask here or on Discord.
I'll take you up on that, I guess.
So I compiled PPSSPP from master on my 32-bit Linux (with gcc 10.2.1) by simply running ./b.sh.
It compiled with a single compile-time warning:
In function ‘void GPURecord::EmitTextureData(int, u32)’,
inlined from ‘void GPURecord::FlushPrimState(int)’ at /[...]/ppsspp/GPU/Debugger/Record.cpp:392:19:
/[...]/ppsspp/GPU/Debugger/Record.cpp:349:9: warning: ‘void* memcpy(void*, const void*, size_t)’ specified size 4294967280 exceeds maximum object size 2147483647 [-Wstringop-overflow=]
349 | memcpy(&framebufData[sizeof(framebuf)], p, bytes);
| ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
and a single link-time warning:
/usr/bin/ld: ../ffmpeg/linux/x86/lib/libavformat.a(allformats.o): warning: relocation in read-only section `.text'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
The resulting PPSSPPSDL works fine, but texture upscaling shows a bunch of artifacts which are not present in the 1.13 release (win32, which I run under Wine). It happens in multiple games (e.g. some text is replaced by rectangles in Valkyria Chronicles III (English patch), etc.). This is plain master without any of my changes, and it also happens with xBRZ, not just bicubic. The window title also says PPSSPP v1.12.3-1620-g9af1f18d7 for whatever reason.
All settings are default, other than
That said, I did implement my proposed changes, and it seems to work, i.e. looks about the same as without them (so artifacts still present). I can create a pull request if desired.
I have been messing around with texture loading lately, trying to reduce the amount of duplicated code between the backends. I might have made a mistake causing those glitches, will take a look.
While waiting for the artifacts thing to be resolved, I'll comment on:
Might also be interesting to see if clang can be coaxed into vectorizing well enough for NEON, etc.
I have rather limited knowledge of ARM, but it seems that both GCC and clang do autovectorize the upscaler with -mfpu=neon -O3 -funsafe-math-optimizations. Plenty of things like vmla.f32. This possibly works even without -funsafe-math-optimizations, which is strange, considering what the docs say:
If the selected floating-point hardware includes the NEON extension (e.g. -mfpu=neon), note that floating-point operations are not generated by GCC’s auto-vectorization pass unless -funsafe-math-optimizations is also specified. This is because NEON hardware does not fully implement the IEEE 754 standard for floating-point arithmetic (in particular denormal values are treated as zero), so the use of NEON instructions may lead to a loss of precision.
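For what it's worth, the kind of loop that ends up as vmla.f32 after auto-vectorization is a plain multiply-accumulate like the one below (a generic example built with the flags mentioned above, not a snippet from the upscaler):

```cpp
// Compile for 32-bit ARM with e.g.:
//   g++ -O3 -mfpu=neon -funsafe-math-optimizations -c madd.cpp
// Both GCC and clang can turn this loop into NEON vmla.f32 instructions.
void MulAdd(float *dst, const float *src, float weight, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] += src[i] * weight;   // per-element multiply-accumulate
}
```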
You can press Ctrl-F10 on a line to see which assembly "corresponds" to it (not that it's straightforward with optimized code).
Since my code works on 16x16 dst blocks (anything smaller is temporarily padded), there is a perf hit for both small textures and thin slices. Multi-threading might produce such slices, since MIN_LINES_PER_THREAD is only 4 (is that src lines?). I can trivially change the blocks to 8x8, though, with a slight perf hit.
It's been a while since that was set; it actually started life as a larger minimum dependent on threads: https://github.com/hrydgard/ppsspp/blame/85722511984a96dfe9c3931543fc6566a419ce9d/thread/threadpool.cpp#L63
It might actually be better in all cases for it to be 16.
So I compiled PPSSPP from master on my 32-bit Linux (with gcc 10.2.1) by simply running ./b.sh. It compiled with a single compile-time warning:
In function ‘void GPURecord::EmitTextureData(int, u32)’,
inlined from ‘void GPURecord::FlushPrimState(int)’ at /[...]/ppsspp/GPU/Debugger/Record.cpp:392:19:
/[...]/ppsspp/GPU/Debugger/Record.cpp:349:9: warning: ‘void* memcpy(void*, const void*, size_t)’ specified size 4294967280 exceeds maximum object size 2147483647 [-Wstringop-overflow=]
349 | memcpy(&framebufData[sizeof(framebuf)], p, bytes);
That just seems weird. bytes should never be that high, since it's constrained by ValidSize(). Personally, I've been using clang most of the time these days, though.
/usr/bin/ld: ../ffmpeg/linux/x86/lib/libavformat.a(allformats.o): warning: relocation in read-only section `.text' /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
You can recompile FFmpeg by running pushd ffmpeg && ./linux_x86.sh && popd, otherwise it may be using precompiled libraries (FFmpeg is harder to compile, so we have pre-compiled libraries for many platforms to make it easier for people to contribute... on Linux it's not so bad, you might need yasm.)
While waiting for the artifacts thing to be resolved, I'll comment on:
Might also be interesting to see if clang can be coaxed into vectorizing well enough for NEON, etc.
I have rather limited knowledge of ARM, but it seems that both GCC and clang do autovectorize the upscaler with -mfpu=neon -O3 -funsafe-math-optimizations.
Great, that sounds positive.
I think there were some fixes for texture scaling bugs, I have been a bit busy to keep up the last couple weeks, but it might already be better now.
-[Unknown]
That just seems weird. bytes should never be that high, since it's constrained by ValidSize().
Yes, this is surprising to me too. GCC also somehow figured out the value of bytes at compile time. 4294967280 is, of course, just (u32)-16. I could try logging suspicious instances, but at what points does GPURecord::FlushPrimState even get called? Do I need to enable some option?
It might just be a false positive in -Wstringop-overflow=. A Google search finds several bug reports for that.
The corrupted texture scaling should be fixed now.
I'm still seeing the glitches, both in PPSSPPSDL (locally compiled) and in the win32 build from the buildbot. Here are screenshots from the first mission of 'Valkyria Chronicles 3 (English v2)' (there are some glitches even before):
The issue seems to be fixed now, so the implementation is done.
8x8 blocks, edges in 'clamp' mode (because that's what PPSSPP did earlier; would 'wrap' be better?). The restriction to a maximum upscale of x5 is lifted, but, obviously, xBRZ still has it (and so does hybrid), and there is no GUI for it.
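For clarity on the two edge modes: 'clamp' repeats the border texel, while 'wrap' tiles the texture. A generic sketch of the index mapping (not the code from the PR):

```cpp
// Map an out-of-bounds sample index i into [0, n) for an n-texel row or column.
static inline int ClampIndex(int i, int n) {
    return i < 0 ? 0 : (i >= n ? n - 1 : i);   // clamp: repeat the edge texel
}

static inline int WrapIndex(int i, int n) {
    int m = i % n;                             // wrap: tile the texture
    return m < 0 ? m + n : m;
}
```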
What should happen
Disclaimer: not sure if these ramblings actually belong in issues. Also it's possible that I messed something up, and the following is moot.
I'm only tackling (bi)cubic upscaling, not xBRZ or hybrid.
So, after reading some claims that "even most powerful CPU's are not powerful enough to upscale textures in real time", I decided to take a look at PPSSPP's CPU texture upscaler (this is it, right?), and... it does not seem very fast. Obviously, all cool kids use the GPU to upscale their textures, but that's not the topic here. So I wrote an implementation of a cubic upscaler and benchmarked it against PPSSPP's (and stb_image_resize.h for good measure). Here are the results on a test machine (Intel(R) Xeon(R) E-2286G CPU @ 4.00GHz):
Timings are reported in nanoseconds per dst pixel (otherwise, the metric is scale-dependent; a rough sketch of this metric follows the list below). All tests are single-threaded. The benchmark is compiled with -march=native -O3.
Some observations in no particular order:
- My implementation supports wrap and clamp modes (as well as zero) for out-of-bounds samples (I think PPSSPP only does clamp). It also supports arbitrary positive integer scale (not limited to x2..x5 like PPSSPP's current implementation). The upscaling is done with a separable Mitchell-Netravali class filter with runtime-configurable B,C parameters.
- The code (cubic_upscale.c) actually compiles fine as both C and C++.
- For upscaling only a subregion, stbir_resize_region can be used.
- The speed-up may not be enough to entirely eliminate the stutter during upscaling, but it may help.
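To make the timing metric concrete, here is a rough sketch of how a "nanoseconds per destination pixel" number can be computed (a generic harness, not the actual benchmark code linked below):

```cpp
#include <chrono>

// Times 'iterations' runs of an upscale callable and reports the average cost per
// destination pixel, so different texture sizes and scale factors stay comparable.
template <typename Fn>
double NsPerDstPixel(Fn upscale, int dstWidth, int dstHeight, int iterations) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        upscale();
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / ((double)iterations * dstWidth * dstHeight);
}
```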
I'm not confident enough to muck with the PPSSPP C++ codebase, so I tried making my upscaler reasonably self-contained. Hopefully, if someone is interested, it should be trivial to integrate.
Here is the code for both the upscaler (cubic_upscale.c, cubic_upscale.h) and the benchmark (stb_image* omitted, to fit into the online compiler's code size limit): https://onlinegdb.com/7g36Bdzhco
Note: to get a meaningful benchmark in the online compiler you need to enable optimizations (cog button -> Extra Compiler Flags), e.g. -march=native -O3.
Who would this benefit
-
Platform (if relevant)
No response
Games this would be useful in
-
Other emulators or software with a similar feature
No response