vindar / tgx

tgx - a tiny/teensy 3D graphics library
GNU Lesser General Public License v2.1
184 stars 31 forks

flat shading slower than Gouraud? #12

Closed belm0 closed 2 years ago

belm0 commented 2 years ago

On the ESP32 / naruto demo, I noticed that drawMesh() takes 37 ms for TGX_SHADER_FLAT vs. 25 ms for TGX_SHADER_GOURAUD. Does it seem odd, or is it just that the flat path hasn't been optimized?

vindar commented 2 years ago

Hi,

This is weird indeed. I did not do much testing on ESP32 as I mostly use a Teensy 4.

I will try to look into this later this evening...

vindar commented 2 years ago

Hi,

I just tested the Naruto example on a generic ESP32 board but I am not seeing the same results as you do. In my case, I find:

The time reported above is that of the drawMesh() method, not screen drawing or anything else.

So both methods take approximately the same time, which I think is not very surprising: the screen frame buffer has few pixels in this example, so most of the time must be spent in the vertex shader (i.e. computing vertex positions on screen) rather than in the pixel shader (i.e. drawing the actual triangles), which is where Gouraud and flat shading differ.

belm0 commented 2 years ago

Thank you for checking. I agree with the rationale for why the performance of the two shaders should be roughly the same, but...

(There was a mistake in my original measurement: I was only measuring drawMesh() when the shader mode changed. Now that I've corrected that, the result is still strange, and flat shading is still significantly slower.)

Here is average drawMesh() time across the animation (spinning and zoom-in / out):

texture / Gouraud: 47 ms
wireframe:  18 ms
flat: 32 ms
Gouraud:  27 ms

And more strangely, when I disable the zoom-in part of the animation, it's even more skewed.

texture / Gouraud: 39 ms
wireframe:  18 ms
flat: 45 ms
Gouraud:  26 ms

Note that the Gouraud modes are a little faster as expected (because the triangles are smaller), yet flat has become even slower. It seems like there is some huge per-triangle performance penalty for TGX_SHADER_FLAT.

I'm using an M5Stack Core, which I believe is a fairly stock ESP32 and has the same size screen as your setup.

vindar commented 2 years ago

Hi,

This is strange... :-)

I did some more testing and indeed, Gouraud seems to be a bit faster than flat shading in some cases but I did not see such a big discrepancy as the one you observe.

If you have a little time, can you run the script attached to this post so we use the exact same code, and tell me how long it takes when running with either SHADER = TGX_SHADER_FLAT or SHADER = TGX_SHADER_GOURAUD?

For me, on my ESP32 board, I get:

Thanks.

NarutoTest.zip

belm0 commented 2 years ago
  • GOURAUD = 23 ms average
  • FLAT = 27 ms average

I get the same with your code. It's about 15% difference in the wrong direction, so may be worth investigating.

As for why the numbers are different in my fork of the demo, I'll try to narrow it down.

belm0 commented 2 years ago

As for why the numbers are different in my fork of the demo, I'll try to narrow it down.

It seems to be related to calling setShaders(TGX_SHADER_FLAT) on every frame. The original demo code does that in the switch (loopnumber % 4) block.

calling setShaders(TGX_SHADER_FLAT) once at init: 26 ms
calling setShaders(TGX_SHADER_FLAT) every frame: 38 ms

vindar commented 2 years ago

Hi,

Ok, I did some tests and I think I understand now why Gouraud shading is slightly faster than flat shading in our case.

In fact, using Gouraud shading instead of flat shading only adds 2 ms to the drawing time of the triangles in the pixel shader. This can be checked by simply replacing line 867 in shader.h, buf[bx] = interpolateColorsTriangle(col2, C2, col3, C3, col1, aera), by buf[bx] = col2. This turns the Gouraud shading method into a (pseudo) flat shading method (i.e. no more linear interpolation along the triangle pixels). With this change, the previous code draws the mesh in 21 ms with TGX_SHADER_GOURAUD.
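The substitution above can be illustrated with a simplified sketch (the color type, weight names, and functions here are hypothetical, not tgx's actual implementation): Gouraud blends the three vertex colors with barycentric weights per pixel, while the pseudo-flat variant just reuses one vertex color.

```cpp
#include <cassert>

// Hypothetical RGB color; tgx uses packed 16/32-bit color types instead.
struct Color { float r, g, b; };

// Gouraud-style shading: blend the three vertex colors with barycentric
// weights (w1 + w2 + w3 equals the triangle area, so dividing normalizes).
Color interpolate(Color c1, Color c2, Color c3,
                  float w1, float w2, float w3, float area) {
    float inv = 1.0f / area;
    return { (c1.r * w1 + c2.r * w2 + c3.r * w3) * inv,
             (c1.g * w1 + c2.g * w2 + c3.g * w3) * inv,
             (c1.b * w1 + c2.b * w2 + c3.b * w3) * inv };
}

// Pseudo-flat shading: ignore the weights and reuse one vertex color,
// mimicking the buf[bx] = col2 substitution discussed above.
Color pseudoFlat(Color c2) { return c2; }
```

The per-pixel difference is three multiply-adds per channel plus a division, which is why it only cost about 2 ms here.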

However, when using 'real' flat shading, the vertex shader must compute the color of the triangle. To do so, it needs to compute the normal of the triangle and then normalize it in order to compute the Phong lighting and obtain the face color. On the other hand, when using Gouraud shading, the three colors at the vertices of the triangle are computed with Phong lighting using the normals already provided by the mesh. These normals are already unit length, so the vertex shader does not need to normalize them.
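For reference, here is a minimal sketch of the extra per-triangle work the flat path does (hypothetical names, not tgx's code): the face normal is the cross product of two edges, and it comes out with arbitrary length, so it must be normalized before lighting.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

Vec3 sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }

Vec3 cross(Vec3 a, Vec3 b) {
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

// Per-triangle work done only by flat shading: the cross product gives a
// normal of arbitrary length, so it must be normalized. The sqrtf here is
// the operation that turned out to be so costly on the ESP32.
Vec3 faceNormal(Vec3 p1, Vec3 p2, Vec3 p3) {
    Vec3 n = cross(sub(p2, p1), sub(p3, p1));
    float len = sqrtf(n.x * n.x + n.y * n.y + n.z * n.z);
    return { n.x / len, n.y / len, n.z / len };
}
```

Gouraud skips all of this because the mesh already stores unit-length vertex normals.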

It turns out that normalizing a vector is VERY slow on the ESP32 (it is much faster on a Teensy with a dedicated FPU) and accounts for all the time lost between flat and Gouraud shading. To see this, just comment out line 865 in renderer.inl: faceN.normalize();. Then flat shading only takes 18 ms (but the result is not accurate anymore, since the color is no longer computed with the correct algorithm).

I will try to see if there is a way to speed up the normalization method...

As for the difference in drawing times in your case, it seems to be some optimization bug: setShader() takes virtually no time but may prevent the compiler from optimizing the code path... Can you post the exact code you are using? Thanks.

vindar commented 2 years ago

Thanks to you, I just realized that my vector normalization method normalize() was using doubles instead of floats as the base type (it called sqrt() instead of sqrtf() when using float)! This is the reason why normalizing was so slow. I fixed the bug, and I also added Quake's famous inverse square root trick for an additional speedup. Now, the code from the previous message runs in 18 ms with flat shading... Nice ;-)

belm0 commented 2 years ago

Thank you!

(I guessed right here: "It seems like there is some huge per-triangle performance penalty for TGX_SHADER_FLAT")

It turns out that the normalization of a vector is VERY slow on the ESP32 (it is much faster on a Teensy with dedicated FPU)

The original ESP32 does have an FPU for 32-bit floats, though I hear some later variants may not. I also read that although it has an FPU, management of the FPU context may not be that efficient -- observed as large overhead for the first FPU operation following a thread context switch.

(Aside: it would be interesting to have a math / graphics library supporting fixed-point number types, for microcontrollers lacking FPU. Perhaps not too hard with tgx due to the heavy templating?)

(I wonder if there's a way to trap double operations on the ESP32, to prevent performance bugs.)

As for the difference in drawing times in your case, It seems to be some optimization bug because setShader() takes virtually no time but may prevent the compiler from optimizing the code path... Can you post the exact code you are using ?

I confirmed that setShaders() itself takes no time.

It's just a one line change to the code you've attached to this issue: call setShaders(TGX_SHADER_FLAT) just before drawMesh().

The interesting thing here again is that only TGX_SHADER_FLAT is affected. Setting the other shaders every frame appears harmless.

belm0 commented 2 years ago

(I wonder if there's a way to trap double operations on the ESP32, to prevent performance bugs.)

tgx seems to diligently use the f suffix on literals, but there is risk of omitting it and having the compiler generate double operations.

There are some 0.0 literals in tgx without the f suffix. Not sure if the compiler might generate some double conversion.

Mat4.h:                R.M[i + j*4] = 0.0;
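The risk described above is easy to demonstrate: an unsuffixed floating-point literal has type double, so mixing one into a float expression silently promotes the whole expression to double arithmetic, which falls back to software emulation on an FPU that only handles 32-bit floats. A small sketch (standard C++ type rules, independent of any particular library):

```cpp
#include <type_traits>

// An unsuffixed floating-point literal has type double...
static_assert(std::is_same<decltype(0.0), double>::value,
              "0.0 is a double literal");
static_assert(std::is_same<decltype(0.0f), float>::value,
              "0.0f stays single precision");

// ...so mixing one into a float expression promotes the whole
// expression to double arithmetic:
static_assert(std::is_same<decltype(1.0f * 0.5), double>::value,
              "float * double -> double");
static_assert(std::is_same<decltype(1.0f * 0.5f), float>::value,
              "float * float -> float");
```

A plain assignment such as the Mat4.h line above is converted at compile time, so only mixed arithmetic pays a runtime cost; gcc's -Wdouble-promotion warning can help spot such spots.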

Some compiler options that could help:

https://stackoverflow.com/questions/24348227/how-to-disable-double-precision-math-while-compiling-with-gcc-or-and-iar

vindar commented 2 years ago

Hi,

Yes, I know about the -fsingle-precision-constant option. It is enabled by default when compiling for Teensy MCUs. Since I do not use the ESP32 very often, I never checked whether this option is enabled...

As for the 0.0 literals, I will try to change them (though in the case of an assignment there is no time penalty, because the conversion is performed at compile time). Still, writing 0 instead of 0.0 is safer and will prevent a possible compiler warning. I do not want to add the f suffix everywhere because I also use TGX on CPUs where some computations can use double without much additional cost.

vindar commented 2 years ago

I found a few other instances of mismatched floating-point literals throughout the library (but none that would impact speed). It should be good now, thanks!

I also tried to duplicate the speed drop you see when calling setShaders() before each drawing operation, but I cannot. Attached here is the code I used; it still draws the mesh in 18 ms... Does it work for you? NarutoTest2.zip

belm0 commented 2 years ago

The problem is highly nondeterministic. I don't know if you can reproduce it, but in this file there are extra loaded shaders and some junk at the end of setup(), including a printf().

With the printf disabled, the timing is 18ms. When it's enabled, the timing is 34ms...

NarutoTest.ino.zip

belm0 commented 2 years ago

By expanding the junk at the end of setup() a little, the timing went up to 42 ms.

My sense is that the code could be anything, and that it's affecting the performance by shifting data or code around, affecting alignment or caching of something important.

    renderer.setTextureQuality(TGX_SHADER_TEXTURE_NEAREST);
    renderer.setTextureWrappingMode(TGX_SHADER_TEXTURE_WRAP_POW2);
    renderer.drawMesh(&naruto_1, false);                 
    renderer.drawWireFrameMesh(&naruto_1, true);
    renderer.setTextureQuality(TGX_SHADER_TEXTURE_NEAREST);
    renderer.setTextureWrappingMode(TGX_SHADER_TEXTURE_WRAP_POW2);
    renderer.drawMesh(&naruto_1, false);                 
    renderer.drawWireFrameMesh(&naruto_1, true);
    printf("sizeof(long): %d\n", sizeof(long));
vindar commented 2 years ago

This is certainly a very strange bug... I tried your example, and I also tried with the additional change suggested in the last message, but no luck: it always takes 18 ms for me... I agree that cache or memory alignment could cause some speed drop, but going from 18 ms to 42 ms seems like a lot. I also do not understand why this would happen with flat shading and not with Gouraud shading...

I do not know how flash memory is accessed/cached on the ESP32. As I understand it, the PROGMEM qualifier (which I use for Teensy) is unneeded here but should still be harmless, right? It seems that the ESP32 has some cache to speed up access to flash memory. Maybe this is a case of bad cache misses?

Attached here is your previous example, but with the PROGMEM and const qualifiers removed from the arrays accessed with flat shading (i.e. the vertex array and the face arrays) so that they are now located in RAM. Does the slowdown still occur in this case? This code runs in 16 ms for me (a 2 ms improvement compared to having the arrays in flash). NarutoTest.zip

belm0 commented 2 years ago

Thank you. I'm new to ESP32-- the memory management certainly looks complicated.

From what I read and seem to have confirmed, PROGMEM does nothing on ESP32. But the data merely being const will put it in flash, and be cached to SRAM by the MMU. (While Teensy has an opt-in for running constant data from flash with PROGMEM, ESP32 is opt-out using DRAM_ATTR.)

I tried replacing PROGMEM with DRAM_ATTR on the naruto static arrays. The worst case I'd observed did drop from 42 ms to 35 ms. And with the problematic code removed, I get 16 ms as you do.

I'm wondering if there isn't some thrashing of the cache on the program side, rather than data.

Note that I'm running on a board with meager SRAM-- only 520K.

belm0 commented 2 years ago

Do you have a guess of what the hottest loop would be for the flat ~shader~ geometry? I'd like to try tagging the function with IRAM_ATTR so that the code does not use the cache.

It seems again to be a per-triangle issue, rather than per-pixel. Because when I zoom in on the model, the performance increases significantly, apparently due to clipped triangles, and despite 100% screen fill with pixels.

--> I tried putting IRAM_ATTR on relevant functions in Rasterizer / Shaders / Renderer3D, but it had no effect. Apparently IRAM_ATTR is not supported on templated functions.

belm0 commented 2 years ago

Though it's been a good learning experience, the remaining issue is not critical for me and seems to be a cache thrashing issue -- OK to close this.

vindar commented 2 years ago

Hi,

Yes, I agree it looks cache-related, so I'll close this in a few days if nothing new comes out of it. I do not really know what I can do about it. Thank you again for all your help (and we did uncover a bug anyway)! :-)

Just in case you are interested, here is the call stack for drawing a mesh (with flat shading):

  1. drawMesh() [Renderer3D.inl 728] calls
  2. rasterizeTriangle() [Rasterizer.h 101] calls
  3. shader_select() [Shaders.h 1685] calls
  4. shader_Flat_Zbuffer() [Shaders.h 505]

With another shader, steps 1-3 remain unchanged; only the fourth, shader-specific function changes.

belm0 commented 2 years ago

Thank you. I don't know if it can be trusted, but the z-buffering seems to push things over the edge (or maybe this is just shifting code around again).

original code: 34 ms

ZBUFFER_t& W = zbuf[bx];
const ZBUFFER_t aa = (ZBUFFER_t)(cw);
if (W < aa)
    {
    W = aa;
    buf[bx] = col;
    }

substitute: 15 ms

buf[bx] = col;
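For readers unfamiliar with the original snippet above: it is a standard z-buffer depth test. A self-contained reading of it (the wrapper function and buffer types here are illustrative, not tgx's actual code):

```cpp
#include <cstdint>

using ZBUFFER_t = uint16_t;

// Write a pixel only if it is closer than what is already stored at that
// position; here larger depth values mean closer, matching the W < aa
// comparison in the snippet above. Returns true if the pixel was written.
bool depthTestAndWrite(ZBUFFER_t* zbuf, uint16_t* buf, int bx,
                       float cw, uint16_t col) {
    ZBUFFER_t& W = zbuf[bx];
    const ZBUFFER_t aa = (ZBUFFER_t)(cw);
    if (W < aa) {
        W = aa;         // update the depth buffer
        buf[bx] = col;  // and the color buffer
        return true;
    }
    return false;       // pixel hidden: skip both writes
}
```

The substitute above removes the depth read-modify-write entirely, so the 34 ms vs 15 ms gap covers both the extra memory traffic and whatever cache effects it triggers.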
belm0 commented 2 years ago

If I artificially inflate the scanline stride, it's still 34 ms, which suggests the slowness isn't actually due to rasterizing cycles -- more evidence for instruction caching.

buf += stride * 4;
belm0 commented 2 years ago

The problem also goes away if I change "Core Debug Level" to None. Any other value like Error, Warn, Info has the problem.

belm0 commented 2 years ago

I was using an older version of esp32 core (1.0.6). I upgraded to latest (2.0.3) and can't reproduce the problem. Recent versions boast "footprint smaller than 1.0.6" so it may just be due to different management of memory-- hard to know without understanding the original problem.

vindar commented 2 years ago

Hi,

Yes, it really looks related to memory/caching, but I do not know how to zero in on the problem. At least it is good to know that updating the esp32 core can mitigate the problem somewhat...