raspberrypi / firmware

This repository contains pre-compiled binaries of the current Raspberry Pi kernel and modules, userspace libraries, and bootloader/GPU firmware.

GL_REPEAT + high UV multiplier = 5FPS #1461

Open oomek opened 4 years ago

oomek commented 4 years ago

When I draw a fullscreen quad with a texture whose wrap mode is set to GL_REPEAT, I get a very low frame rate as soon as I start increasing the UV multiplier. It goes as low as a few fps. Could someone please explain why that happens and whether there is a way to fix it?

Raspberry PI3, latest firmware. Tested drawing using gles 1.1 and f/kms

popcornmix commented 4 years ago

Can you post minimal example code that shows the issue?

oomek commented 4 years ago

Is it ok to post an example that uses sfml-pi, or would you prefer a pure OpenGL example? I believe this can be reproduced by taking any "draw textured quad" example, changing the wrap mode to GL_REPEAT, and multiplying the UVs by a factor in the 1-100 range, for example.

popcornmix commented 4 years ago

The simpler the example code, with the least dependencies the better.

oomek commented 4 years ago

Ok, I'm gonna write something simple when I'm back home.

oomek commented 4 years ago

I've modified hello_triangle to show the problem. If you set the multiplier in cube_texture_and_coords.h back to 1.0f, it runs at 60fps again.

hello_triangle_slowdown.zip

oomek commented 4 years ago

I've just set

   glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
   glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
   glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
   glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);

for each texture and multiplied the UVs by:

static const float mul = 64.f;

popcornmix commented 4 years ago

Ah you said "Tested drawing using gles 1.1 and f/kms" which I interpreted as using the arm side driver with fkms / kms.

The test app is using the legacy gl interface. Have you tried with the arm side gl driver?

oomek commented 4 years ago

I've tested all backends using SFML-PI and it yielded the same results. This example is legacy for convenience in building.

oomek commented 4 years ago

Backends I've tested:
Dispmanx OpenGL ES 1.1
DRM Fake KMS OpenGL 2.1
DRM Full KMS OpenGL 2.1

oomek commented 4 years ago

Here is a video of what it looks like on my end: https://youtu.be/wbTTzCIuYpU The minimum frame rate depends on the texture dimensions; in this example the minimum is 15fps.

oomek commented 4 years ago

I forgot to mention: generating mipmaps helps with the performance, with some dips in the frame rate between mipmap levels 0-2, but unfortunately I cannot use mipmaps. I would have to regenerate the tiled rendertexture's mipmaps on each frame, which is another performance killer. My Signed Distance Field font renderer has to have mipmaps disabled to work properly.

popcornmix commented 4 years ago

That's surprising as legacy and arm gl drivers don't share any code.

We don't have anyone to support the legacy driver, but there are people actively working on the arm side one, so example code that uses that would be more useful.

But it sounds more like this code may be requesting the hardware to do something it can't handle well - e.g. exceeding its texture cache size.

oomek commented 4 years ago

That's what I thought. If the portion of the texture used to draw a quad first needs to be transferred to the cache, it will choke on large textures that are scaled down significantly. So I suspect there is no way to scale down big textures and expect a decent frame rate on any Pi with any backend. That would be a disaster for my plans.

pelwell commented 4 years ago

Is it possible for you to do ad-hoc mip-mapping, with one or two reduced-size textures? Or at least replace the original texture with a small one (that will look poor at large scales) to see if that is indeed the bottleneck?

oomek commented 4 years ago

The rendertarget on which I draw a rotated logo is now 64x64. Not even a single hiccup, and the number of pixels drawn on the screen is still the same. https://youtu.be/0uaqjZCfM6c

oomek commented 4 years ago

Have I just discovered a huge flaw in the design of the Video Core?

pelwell commented 4 years ago

Without prejudging the outcome of the investigation, one person's huge flaw could easily be another person's strange corner case.

oomek commented 4 years ago

I really hope that in this case it is the latter.

oomek commented 4 years ago

To avoid redoing all that boilerplate I could just modify the kms cube and do the same. Would that be ok?

popcornmix commented 4 years ago

I think if you are the first person to notice a huge flaw that's been present for 8 years, then maybe it is a corner case.

I don't think what you are trying to do (if I understand correctly, fetch a texture larger than 3d hardware's cache size and use it multiple times) is physically possible without huge sdram bandwidth on any platform.

oomek commented 4 years ago

The textures aren't huge: the render target for the zooming logo was 512x512 before, and the texture with glyphs for my SDF is 1024x512. Would you consider that too big? Do you know the size of the cache that the QPUs are operating on?

pelwell commented 4 years ago

It's unfair to take away the tool designed to solve this problem - mipmapping - and then complain when performance is affected.

oomek commented 4 years ago

As I said, mipmapping doesn't work on SDF font renderer unfortunately. It makes the font blurry like it was not using SDF at all.

oomek commented 4 years ago

And please don't mistake a desperate need to find a solution for complaining. I've spent a lot of time perfecting the readability of that font renderer, but when I ran it on the Pi it was like a wet slap in the face.

pelwell commented 4 years ago

That sounds like a flaw in the renderer.

My suggestion is still to switch between two or more texture sizes depending on the scale factor, DIY mipmapping.
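The "DIY mipmapping" suggestion can be sketched roughly as follows. This is only an illustration, not code from the thread: `pick_texture_level` is a hypothetical helper, and it assumes the application keeps a few pre-scaled copies of the texture, each half the width and height of the previous one, and knows how many full-size texels one screen pixel spans.

```c
#include <assert.h>

/* Hypothetical helper: choose which of `num_levels` pre-scaled copies of a
 * texture to bind. Level 0 is full size; each further level is half the
 * width and height of the one before. `texels_per_pixel` is how many
 * level-0 texels one screen pixel spans along an axis. */
static int pick_texture_level(double texels_per_pixel, int num_levels)
{
    assert(num_levels > 0);
    int level = 0;
    /* Step down while the current copy is still at least 2x oversampled
     * and a smaller copy is available. */
    while (level + 1 < num_levels && texels_per_pixel >= 2.0) {
        texels_per_pixel *= 0.5;
        ++level;
    }
    return level;
}
```

The caller would then bind the chosen copy with glBindTexture before drawing. With three copies, a ratio of 1.0 selects level 0, 2.0 selects level 1, and anything from 4.0 upwards selects level 2.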

oomek commented 4 years ago

This is what SDF looks like without and with mipmaps. [image]

oomek commented 4 years ago

And this is the source bitmap. [image: BarlowCondensedRegular]

pelwell commented 4 years ago

Is mipmapping a global (or per-scene) switch? Can't you enable it for some objects but not others? (I'm familiar with the concepts, just not the specifics of OpenGL).

oomek commented 4 years ago

Mipmaps are defined per texture generated with

glBindTexture(GL_TEXTURE_2D, m_texture);
GLEXT_glGenerateMipmap(GL_TEXTURE_2D);

The rest is handled by the driver
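For a sense of scale, here is a small sketch of what a full mip chain contains. It relies only on the standard rule that each level halves both dimensions (never going below 1) until 1x1 is reached; the function names are made up for this example, and the real cost of regenerating mipmaps every frame depends on the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Number of levels in a full mip chain: halve the larger side until it is 1. */
static uint32_t mip_level_count(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    uint32_t largest = width > height ? width : height;
    uint32_t levels = 1;
    while (largest > 1) {
        largest /= 2;
        ++levels;
    }
    return levels;
}

/* Total texels over all levels, including the base level. */
static uint64_t mip_chain_texels(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    uint64_t total = 0;
    for (;;) {
        total += (uint64_t)width * height;
        if (width == 1 && height == 1)
            break;
        if (width > 1)  width /= 2;
        if (height > 1) height /= 2;
    }
    return total;
}
```

For a 1024x512 texture this gives 11 levels and 699051 texels in total, about a third more than the 524288 texels of the base level alone.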

pelwell commented 4 years ago

So why would mipmapping your large PacMan texture affect your SDF font renderer?

oomek commented 4 years ago

No, it would not. I would just have to regenerate the rendertarget's mipmaps on each frame. That's less expensive than drawing without mipmaps, but the SDF text suffers from the same slowdowns, since, as you can see, the glyphs have their UVs scaled up quite significantly.

oomek commented 4 years ago

I've been studying the VC IV Reference Guide, hoping it would help me work out whether those slowdowns are caused at the driver level or the hardware level, but I think it's above my pay grade.

pelwell commented 4 years ago

I think you'd be better off assuming it is a fundamental hardware limitation and looking for a workaround.

popcornmix commented 4 years ago

I think the TMU (texture memory lookup unit) has a 4K L1 cache per slice and a shared 16K L2 cache. Your 1024x256 (x 32bpp?) texture is massively larger, so effectively every fetch of that texture is going straight to sdram. If you have dozens or hundreds of instances of it rendered to framebuffer that is going to be a lot of bandwidth. Assume you have a few GB/s of sdram bandwidth and look at what you need for 60fps of rendering your scene.
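The back-of-envelope calculation suggested above might look like this. It is a sketch under stated assumptions: the function names are invented here, the 1920x1080 output size in the note below is hypothetical (the thread does not state a resolution), and the worst case simply assumes every texel fetch misses the cache. Real traffic can be higher still, because memory is fetched in whole cache lines.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one texture level in bytes. */
static uint64_t texture_bytes(uint32_t width, uint32_t height,
                              uint32_t bits_per_pixel)
{
    assert(bits_per_pixel % 8 == 0);
    return (uint64_t)width * height * (bits_per_pixel / 8);
}

/* Worst-case texture traffic per second if no fetch ever hits the cache:
 * every screen pixel reads `texels_per_sample` texels on every frame. */
static uint64_t uncached_bytes_per_second(uint32_t screen_w, uint32_t screen_h,
                                          uint32_t texels_per_sample,
                                          uint32_t bytes_per_texel,
                                          uint32_t fps)
{
    return (uint64_t)screen_w * screen_h * texels_per_sample
           * bytes_per_texel * fps;
}
```

A 1024x256 texture at 32bpp comes to 1048576 bytes, 64 times the 16K L2 size quoted above. For an assumed 1920x1080 output with bilinear filtering (4 texels per sample, 4 bytes each) at 60fps, the uncached worst case is 1990656000 bytes per second, roughly 2 GB/s for a single full-screen layer.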

oomek commented 4 years ago

I'm slowly beginning to understand the underlying issue. I think the size of the source texture is not as important as how much of it needs to be fetched per slice by the tiled renderer. If my UVs cover a lot of the source texture then we have a problem as only a fraction of it can fit into the cache. Am I going in the right direction?

When I draw a scene with UVs and quad size in 1:1 ratio I can draw easily 1000s of quads.

popcornmix commented 4 years ago

Neither I nor @pelwell have done a lot of 3d programming, so there may be other readers who know more. None of this is specific to Raspberry Pi/VideoCore - the same concepts and optimisations would be needed with any tile based renderer (i.e. anything mobile/arm based).

Looks like framebuffer is rendered in 64x64 pixel tiles, one tile at a time. Ideally pixels fetched from textures to render that tile are comparable to pixels in tile (and using mipmaps you pick a resolution of texture that approximates that). If to render each tile you need a significant amount of a very large texture then you have a lot of sdram bandwidth required and it will be slow.
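To illustrate the per-tile argument, the sketch below estimates how large a region of the texture one tile's samples are scattered across. It is a simplification with invented names: it takes the 64x64 tile size mentioned above, assumes an axis-aligned quad, and ignores filtering and cache-line granularity. The caches can only help much when this region is small.

```c
#include <assert.h>
#include <stdint.h>

/* Width (or height) in texels of the texture region covered by one tile.
 * With GL_REPEAT the region wraps around, so it can never be larger than
 * the texture itself. */
static uint32_t tile_span_texels(uint32_t tile_px, double texels_per_pixel,
                                 uint32_t tex_dim)
{
    assert(texels_per_pixel > 0.0 && tex_dim > 0);
    double span = tile_px * texels_per_pixel;
    if (span >= (double)tex_dim)
        return tex_dim;
    return span < 1.0 ? 1u : (uint32_t)span;
}

/* Bytes of texture data in the region one tile samples from. */
static uint64_t tile_region_bytes(uint32_t tile_px, double texels_per_pixel,
                                  uint32_t tex_w, uint32_t tex_h,
                                  uint32_t bytes_per_texel)
{
    return (uint64_t)tile_span_texels(tile_px, texels_per_pixel, tex_w)
           * tile_span_texels(tile_px, texels_per_pixel, tex_h)
           * bytes_per_texel;
}
```

At one texel per pixel, a 64x64 tile of a 32bpp texture covers 16384 bytes. At 8 texels per pixel on a 512x512 texture, every tile already spans the whole texture, 1048576 bytes.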

oomek commented 4 years ago

I've done some tests on a non-tile-based renderer, a very old AMD card that is supposed to be slower than the Pi, and those issues don't occur there. I also wish I could hear from someone who knows tile based rendering and its limitations from the ground up, and who could confirm what we've discussed here. Thanks guys for your input.

oomek commented 4 years ago

None of this is specific to Raspberry Pi/VideoCore - the same concepts and optimizations would be needed with any tile based renderer (i.e. anything mobile/arm based).

True, but there may be some differences if the other tiled GPU has larger cache or dedicated VRAM.