pspdev / pspdev

pspdev master repository
https://pspdev.github.io/
MIT License

Does the docker image build pspsdk in optimized mode? #51

Closed glebm closed 8 months ago

glebm commented 11 months ago

We've noticed rather slow performance of the WIP DevilutionX PSP port and I wondered if the SDK is actually built in optimized mode (e.g. -O2) in the docker image.

Looking at the scripts:

  1. configure.ac defaults to unoptimized mode: https://github.com/pspdev/pspsdk/blob/713aef7c52e1835b79f4d5c9e23a1ab470092dcc/configure.ac#L72-L74
  2. https://github.com/pspdev/pspdev/blob/master/scripts/002-pspsdk.sh doesn't pass/set any CFLAGS

When we fixed a similar issue for another homebrew SDK (nxdk, for the original Xbox), we saw DevilutionX FPS almost double. If this is indeed an issue, it'd be great if you could fix it! Thanks!

/cc @AJenbo @fjtrujy

fjtrujy commented 11 months ago

I think that it is using the -O2 flags already. Not sure from where they are being generated, but look at the generated Makefile

///
CFLAGS = -g -O2 -G0 -Wall
///
CXXFLAGS = -g -O2
///
PSPSDK_CFLAGS = -g -O2 -G0 -Wall
PSPSDK_CXXFLAGS = -g -O2 -G0 -Wall -fno-exceptions -fno-rtti
///

glebm commented 11 months ago

I see, makes sense.

It'd be good to add -fipa-pta to these flags because this optimization is not enabled at any level but can have significant (2-5%) impact on low-end systems (at the cost of build time but that's fine for building the Docker image).

fjtrujy commented 11 months ago

> I see, makes sense.
>
> It'd be good to add -fipa-pta to these flags because this optimization is not enabled at any level but can have significant (2-5%) impact on low-end systems (at the cost of build time but that's fine for building the Docker image).

Cool for sure! If it brings performance, I'm always super happy.

@davidgfnet do you agree as well?

davidgfnet commented 11 months ago

What kind of performance issues are you guys seeing? I have the feeling that the SDK might not be the bottleneck, given that most of the time it just calls OS syscalls (so there's limited code/logic). I'm not opposed to adding more flags like the one suggested. In fact, we/you might wanna try LTO for some extra performance, particularly since binaries are more or less static (hence a lot of room for LTO optimization).

glebm commented 11 months ago

For sure it's not the bottleneck yeah. We just had a similar issue with nxdk where the lack of SDK optimization was indeed the issue but since this already has -O2 the SDK is not the bottleneck.

We're seeing 7 FPS in the DevilutionX (Diablo 1) WIP PSP port https://github.com/diasurgical/devilutionX/pull/5869

AJenbo commented 11 months ago

7 fps for the full adapted view. At most 15 fps for a re-cropped version with custom UI.

We'd like to hit 20+ fps before calling it a working port. The hunch is that most of the time is spent in the software 2D renderer, but we don't have developers with hardware, so it's hard to accurately gauge where exactly the time is being spent.

Writing a custom PSP renderer for it might help, but that is a lot of work as well, so magically finding a 33-300% performance boost would be preferred...

sharkwouter commented 11 months ago

I'm pretty sure the performance bottleneck is in SDL2. The way the SDL2 renderer deals with textures that are made by writing each pixel to the texture individually is not optimized.

What needs to happen for this to work better is for the texture to be stored in VRAM, and I think it should not be swizzled. Do keep in mind that the PSP only has 2 MB of VRAM. Let me know if you need more info. There is another port available from @fjtrujy which might perform a bit better.

glebm commented 11 months ago

DevilutionX internally renders at 640x480x8 (8-bit indexed) to a software surface. We then render that software surface to textures, and then to screen (hoping that downscaling is done in hardware at that point).

640x480 is above PSP max texture size (which is 512x512), so we render it to two 512x480x16 textures (ABGR1555, left half and right half) and then blit those to screen.

That comes down to < 1 MiB texture size:

(2 * 2 * 512 * 480) / 1024.0 / 1024.0 => 0.9375
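The arithmetic above can be reproduced with a small helper. This is only a sketch: the function names are hypothetical and not part of SDL or the PSP SDK, and it assumes 2 bytes per pixel for a 16-bit format such as ABGR1555.

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes for `count` textures of width x height pixels.
 * bytes_per_pixel is 2 for a 16-bit format such as ABGR1555. */
size_t texture_bytes(size_t count, size_t width, size_t height,
                     size_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return count * width * height * bytes_per_pixel;
}

/* Convert a byte count to MiB (1 MiB = 1024 * 1024 bytes). */
double bytes_to_mib(size_t bytes)
{
    return (double)bytes / 1024.0 / 1024.0;
}
```

Calling `bytes_to_mib(texture_bytes(2, 512, 480, 2))` corresponds to the two 512x480 16-bit textures described in this comment.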

We create the textures like this:

SDL_CreateTexture(renderer, SDL_PIXELFORMAT_ABGR1555, SDL_TEXTUREACCESS_STREAMING, 512, 480)

Are we creating them wrong, i.e. does this not result in vram-stored textures?

Regarding swizzled textures, we don't do anything special for that. Are these textures swizzled currently?
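For readers unfamiliar with the approach described in this comment, the per-frame conversion from the 8-bit indexed surface into two 16-bit texture buffers might look roughly like the sketch below. It is not the DevilutionX code. The `Rgb` type, the function names, and the choice to split the 640 columns at x = 320 are all assumptions made for illustration. The bit layout follows SDL's ABGR1555 packing (alpha in bit 15, then blue, green, red).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical palette entry, 8 bits per channel. */
typedef struct { uint8_t r, g, b; } Rgb;

/* Pack 8-bit RGB into opaque 16-bit ABGR1555:
 * bit 15 = alpha, bits 10-14 = blue, bits 5-9 = green, bits 0-4 = red. */
uint16_t pack_abgr1555(Rgb c)
{
    return (uint16_t)(0x8000u | ((c.b >> 3) << 10) | ((c.g >> 3) << 5) | (c.r >> 3));
}

/* Convert columns [x0, x0 + cols) of an 8-bit indexed surface into a
 * 16-bit buffer. Both pitches are measured in pixels, not bytes. */
void convert_columns(const uint8_t *src, size_t src_pitch, size_t rows,
                     size_t x0, size_t cols, const Rgb palette[256],
                     uint16_t *dst, size_t dst_pitch)
{
    assert(x0 + cols <= src_pitch && cols <= dst_pitch);
    for (size_t y = 0; y < rows; ++y) {
        const uint8_t *s = src + y * src_pitch + x0;
        uint16_t *d = dst + y * dst_pitch;
        for (size_t x = 0; x < cols; ++x)
            d[x] = pack_abgr1555(palette[s[x]]);
    }
}
```

Under the assumed split, the left texture would be filled with `convert_columns(src, 640, 480, 0, 320, palette, left, 512)` and the right one with `x0 = 320`. Every pixel is touched once per frame on the CPU, which is the cost the later comments in this thread discuss.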

sharkwouter commented 11 months ago

The issue is that using streaming textures is really slow on PSP with SDL2 currently. Streaming textures are not swizzled and they are stored in ram. They are copied into the screen buffer in vram after that and then rendered. This means that a GBC emulator There might also be some artifacts, since there are some bugs in the streaming texture implementation too.

I'm really sorry it is like that. We have not had someone with the knowledge to address this take a go at this in a while. I do think it's fixable, though.

davidgfnet commented 11 months ago

If you are internally rendering to a software buffer (relatively big! 640x480), it is likely that the CPU just can't keep up. Just back-of-the-envelope math: that's 9M pixels per second (at 30fps), which means around (best case) ~37 instructions per pixel. This is quite tight on a CPU like the Allegrex.

Just for reference, the GBA emulator (gpsp) renders around 2.3M pixels per second (240x160 @ 60fps) and consumes around 50% of its run time just in pixel rendering (i.e. blitting tiles and sprites), for a platform that only has 3-4 rendering layers. Any change to the rendering code that results in a couple more instructions and the performance already drops 5-10% easily.

I think you are just giving the poor device too much to chew on :P I would start by severely cutting down on pixels. Given that half of the bottom screen is just some "UI widget" that barely changes, I assume that using 3 or 4 textures (and letting the GPU do the overlay for you) could greatly help. You would only render 50% of the area and save on data moves (the PSP bus is quite slow, even when using the fastest, most-optimized copying techniques).
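The back-of-the-envelope estimate above can be written out as a short sketch. Two things here are assumptions not stated in the comment: a 333 MHz CPU clock, and an idealized one instruction per cycle with all cycles spent on rendering. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pixels that must be produced each second at a given resolution and frame rate. */
uint64_t pixels_per_second(uint32_t width, uint32_t height, uint32_t fps)
{
    return (uint64_t)width * height * fps;
}

/* Best-case cycles available per pixel if every CPU cycle went to rendering.
 * cpu_hz is an assumption supplied by the caller (e.g. 333000000). */
uint64_t cycles_per_pixel(uint64_t cpu_hz, uint64_t pixels_per_sec)
{
    assert(pixels_per_sec > 0);
    return cpu_hz / pixels_per_sec;
}
```

With the exact pixel count for 640x480 at 30 fps (9,216,000 per second), the integer result at the assumed 333 MHz is 36, in line with the "~37" figure obtained when the pixel count is rounded to 9M.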

Going back to the issue itself, I do not see any problem with the SDK build; you won't squeeze any perf out of it for your project. Feel free to chime in on the PSP Discord, you might find some knowledgeable people who can steer you in the right direction for your port :)

AJenbo commented 11 months ago

Thanks, that's a good way to think about it. Testing seems to indicate that the PSP is only able to push about 3MP in the way we do things.

The game runs at 480p and target FPS is 20, so in full screen that would be 853x480x20 = 8MP, not that it's a lot better.

If we did a custom UI we can cut the game down to 640x360 (4.6MP).

At width <= 640 the UI is not re-rendered, so most of the saving is already done there (except the copying).
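As a rough cross-check of these numbers, the sketch below computes the pixel throughput each configuration needs at the 20 fps target, and the frame rate a fixed throughput could sustain. The function names are hypothetical, and the 3 MP per second figure is simply the measured estimate quoted in this comment, taken as given.

```c
#include <assert.h>
#include <stdint.h>

/* Pixels per second needed to hit `fps` at the given resolution. */
uint64_t required_pixels_per_second(uint32_t width, uint32_t height, uint32_t fps)
{
    return (uint64_t)width * height * fps;
}

/* Frame rate sustainable at a fixed pixel throughput (pixels per second). */
double achievable_fps(double pixels_per_sec, uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    return pixels_per_sec / ((double)width * (double)height);
}
```

At an assumed 3,000,000 pixels per second, `achievable_fps` gives a little over 7 fps for 853x480 and about 13 fps for 640x360, which is in the same range as the 7 and 15 fps figures reported earlier in the thread.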

diamant3 commented 8 months ago

Hello guys, I will convert this into a discussion now, feel free to continue!