Open oomek opened 4 years ago
Can you post minimal example code that shows the issue?
Is it OK to post an example that utilizes sfml-pi, or would you prefer a pure OpenGL example? I believe this can be reproduced by taking any "draw a textured quad" example, setting the wrap mode to GL_REPEAT, and multiplying the UVs by a factor in the 1-100 range.
The simpler the example code and the fewer the dependencies, the better.
Ok, I'm gonna write something simple when I'm back home.
I've modified a hello_triangle to show the problem. You can adjust the multiplier in cube_texture_and_coords.h back to 1.0f and it will run at 60fps again.
I've just set
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);
for each texture and multiplied the UVs by:
static const float mul = 64.f;
Ah, you said "Tested drawing using gles 1.1 and f/kms", which I interpreted as using the arm side driver with fkms/kms.
The test app is using the legacy GL interface. Have you tried with the arm side GL driver?
I've tested all backends using SFML-PI and they all yielded the same results. The example uses the legacy interface only because it is convenient to build.
Backends I've tested:
- Dispmanx, OpenGL ES 1.1
- DRM Fake KMS, OpenGL 2.1
- DRM Full KMS, OpenGL 2.1
Here is a video of how it looks on my end: https://youtu.be/wbTTzCIuYpU The minimum frame rate depends on the texture dimensions; in this example the minimum is 15fps.
I forgot to mention: generating mipmaps helps with the performance, with some dips in the framerate between mip levels 0-2, but unfortunately I cannot use mipmaps. I would have to regenerate the tiled render texture's mipmaps on each frame, which is another performance killer, and my signed distance field (SDF) font renderer has to have mipmaps disabled to work properly.
That's surprising as legacy and arm gl drivers don't share any code.
We don't have anyone to support the legacy driver, but there are people actively working on the arm side one, so example code that uses that would be more useful.
But it sounds more like this code may be requesting the hardware to do something it can't handle well - e.g. exceeding its texture cache size.
That's what I thought. If the portion of the texture used to draw a quad first needs to be transferred to the cache, it will choke on large textures that are scaled down significantly. So I suspect there is no way to scale down big textures and expect a decent framerate on any Pi with any backend. That would be a disaster for my plans.
Is it possible for you to do ad-hoc mip-mapping, with one or two reduced-size textures? Or at least replace the original texture with a small one (that will look poor at large scales) to see if that is indeed the bottleneck?
The render target on which I draw the rotated logo is now 64x64. Not even a single hiccup, and the number of pixels drawn on screen is still the same. https://youtu.be/0uaqjZCfM6c
Have I just discovered a huge flaw in the design of the Video Core?
Without prejudging the outcome of the investigation, one person's huge flaw could easily be another person's strange corner case.
I really hope that in this case it's the latter.
To avoid redoing all that boilerplate I could just modify the kms cube and do the same. Would that be ok?
I think if you are the first person to notice a huge flaw that has been present for 8 years, then maybe it is a corner case.
I don't think what you are trying to do (if I understand correctly, fetch a texture larger than 3d hardware's cache size and use it multiple times) is physically possible without huge sdram bandwidth on any platform.
The textures aren't huge; the render target for the zooming logo was 512x512 before, and the texture with the glyphs for my SDF font is 1024x512. Would you consider that too big? Do you know the size of the cache the QPUs operate on?
It's unfair to take away the tool designed to solve this problem - mipmapping - and then complain when performance is affected.
As I said, mipmapping unfortunately doesn't work with the SDF font renderer. It makes the font blurry, as if it weren't using SDF at all.
And please don't mistake complaining for a desperate need to find a solution. I've spent a lot of time perfecting the readability of that font renderer, and when I ran it on the Pi it was like a wet slap in the face.
That sounds like a flaw in the renderer.
My suggestion is still to switch between two or more texture sizes depending on the scale factor, DIY mipmapping.
This is what SDF looks like without and with mipmaps.
And this is the source bitmap
Is mipmapping a global (or per-scene) switch? Can't you enable it for some objects but not others? (I'm familiar with the concepts, just not the specifics of OpenGL).
Mipmaps are defined per texture and generated with:
glBindTexture(GL_TEXTURE_2D, m_texture);
GLEXT_glGenerateMipmap(GL_TEXTURE_2D);
The rest is handled by the driver
So why would mipmapping your large PacMan texture affect your SDF font renderer?
No, it would not; I would just have to regenerate the render target's mipmaps on each frame. It's less expensive than drawing without mipmaps, but the SDF text suffers from the same slowdowns, since, as you can see, the glyphs have their UVs scaled up quite significantly.
I've been studying the VC IV Reference Guide in the hope that it would lead me to an answer as to whether those slowdowns are caused at the driver or the hardware level, but I think it's above my pay grade.
I think you'd be better off assuming it is a fundamental hardware limitation and looking for a workaround.
I think the TMU (texture and memory lookup unit) has a 4K L1 cache per slice and a shared 16K L2 cache. Your 1024x256 (x 32bpp?) texture is massively larger, so effectively every fetch of that texture goes straight to SDRAM. If you have dozens or hundreds of instances of it rendered to the framebuffer, that is a lot of bandwidth. Assume you have a few GB/s of SDRAM bandwidth and compare that with what you need to render your scene at 60fps.
I'm slowly beginning to understand the underlying issue. I think the size of the source texture is not as important as how much of it needs to be fetched per slice by the tiled renderer. If my UVs cover a lot of the source texture, then we have a problem, since only a fraction of it can fit into the cache. Am I going in the right direction?
When I draw a scene with UVs and quad size in 1:1 ratio I can draw easily 1000s of quads.
Neither I nor @pelwell have done a lot of 3D programming, so there may be other readers who know more. None of this is specific to Raspberry Pi/VideoCore - the same concepts and optimisations would be needed with any tile based renderer (i.e. anything mobile/arm based).
It looks like the framebuffer is rendered in 64x64-pixel tiles, one tile at a time. Ideally the number of pixels fetched from textures to render a tile is comparable to the number of pixels in the tile (and with mipmaps you pick the texture resolution that approximates that). If rendering each tile requires a significant amount of a very large texture, a lot of SDRAM bandwidth is needed and it will be slow.
I've done some tests on a non-tile-based renderer, a very old AMD card that is supposed to be slower than the Pi, and those issues don't occur. I also wish I could hear from someone who knows tile-based rendering from the ground up, including its limitations, and could confirm what we've discussed here. Thanks guys for your input.
> None of this is specific to Raspberry Pi/VideoCore - the same concepts and optimizations would be needed with any tile based renderer (i.e. anything mobile/arm based).
True, but there may be some differences if the other tiled GPU has a larger cache or dedicated VRAM.
When I try to draw a fullscreen quad with a texture whose wrap mode is set to GL_REPEAT, I get a very low frame rate as soon as I start increasing the UV multiplier; it can drop to just a few fps. Could someone please explain why that happens and whether there is a way to fix it?
Raspberry Pi 3, latest firmware. Tested drawing using gles 1.1 and f/kms.