ptitSeb / gl4es

GL4ES is a OpenGL 2.1/1.5 to GL ES 2.0/1.1 translation library, with support for Pandora, ODroid, OrangePI, CHIP, Raspberry PI, Android, Emscripten and AmigaOS4.
http://ptitseb.github.io/gl4es/
MIT License
696 stars 159 forks source link

SuperTuxKart 0.6.2a #116

Closed kas1e closed 5 years ago

kas1e commented 5 years ago

Hi !

Today tried to port pretty old version of SuperTuxKart over gl4es : 0.6.2a (as it latest version which i was able to build with usage of MiniGL on amigaos4, so can compare the same builds).

For first, game works ! I.e. all renders correctly (that mean that at least we at the level with amigaos4 drivers that there almost no needs to polish bugs). Through, at begining (where it show menu), it draws only first entry, and didn't draw others , but if i go to that first menu, and back, then whole menu renderes correctly, see what i mean:

gl4es build first run:

alt text

And then after it running, i go to that "single player" menu, and exit from : now it renders correctly as how it should renders when you just run game first time:

alt text

Dunno what can cause it.. but that bug at moment probabaly not that important , what is important, is that for my surprise, there almost no speed differences :( In some tracks, in some places it can be a little bit faster than minigl build. But in other places of other tracks minigl build is faster. But in whole speed differences is vague :(

Take a look for sake of interst on videos where i run the same track called "racetrack":

minigl build: https://youtu.be/e6GUwU2OANk

gl4es build: https://youtu.be/6Pf7WoZvSEo

On those videos gl4es builds a little bit faster (expectually under the bridge), but as you can see there is no big difference , game pretty "jerky" by fps anyway :)

What i tried to speed things up, is to use LIBGL_BATCH 5 , 10, 15 and 20 , and it make things be slower and slower with each increase. I.e. 5 and 10 more or less the same, then 15 was made it slower on 3-4 fps, and 20 make it slower even more. The best result is not setting LIBGL_BATCH at all or to 0 (i.e. default value).

Maybe have any idea what else can be checked / tried , to made gl4es build be faster ? I will try in meanwhile to play with all settings, maybe something will be of any help..

ptitSeb commented 5 years ago

So, first, no idea why you are missing some button at first ?! Very strange. Unfortunatly, I don't have a SuperTuxKart 0.6.2 build to test (I only have 0.8.0 and 0.8.1).

About the speed. It seems on the video that it get smooth when other cars are not on the screen. That seems strange, because the cars doesn't look that high detailled. But still, I would guess you are CPU limited here? Also, BATCH mode doesn't work here because Irrlicht is already optimize to Batch by itself.

In case the issue is with the texture size too bing, you can try some LIBGL_SHRINK=7 to downsize larger texture.

kas1e commented 5 years ago

Yeah, with missing buttons looks strange, but that probabaly again on our side something :)

And as i see with 0.6.2a there wasn't IrrLicht, they use some Bullet library for physic (they start to use IrrLicht after 062a).

I also tried LIBGL_SHRINK=7, and nope, also didn't make difference. Intersting ! :)

ptitSeb commented 5 years ago

Also, did you tried with LIBGL_BEGINEND=0 to see if the menu is fixed?

I'll try to see if I can peek at the source code of the rendering, see if I can understand roughly how it work and why it would slowdown.

kas1e commented 5 years ago

Currently one of our devs finishing adding to glsnoop (tracing tool for ogles2 and warp3dnova) profiling (i.e. callcount of functions, how much of which takes in % from all time, etc). So we probabaly can see what SuperTuxKart mostly do , which, maybe can bring us something.

kas1e commented 5 years ago

Ah, and LIBGL_BEGINEND set to 0 didn't fix that issue with first menu, but, it did fix some very minor issue in other parts of menu later, but that one probabaly even not worth of mention at moment.

ptitSeb commented 5 years ago

That is odd. And I used some other games that use PLIB with no real issues before (like TORCS or Secret Maryo Chronicles).

kas1e commented 5 years ago

It can be not exactly PLIB of course (at least one compiled with MiniGL render correctly there). If you will have time, try plz that 0.6.2a version, its easy to build and intersting to know is it slower for you on Pandora will be , than for example 0.8.1 one used with IrrLicht.

kas1e commented 5 years ago

Btw, seein your yesterday's commits it seems you was able to build supertuxkart 0.6.2a , and trying to speed things up ?:) I also see changes in LIBGL_BATCH, what is the best settings for to try first now ?

ptitSeb commented 5 years ago

I didn't built the old supertuxkart, I used TORCS for testing. TORCS also use PLIB so it should be similar. What I noticed is that the engine try it hard to use glDrawElements with GL_TRIANGLE_FAN. That make a lot of small drawing command. That why I changed BATCH mode to make it a bit more efficient. It worked, but now there a re a lot of memcpy (because of how BATCH mode works), making the game not faster. I should try to optimize BATCH mode and try to avoid a memcpy, but it's difficult, not sure it'll works. You can try (try with LIBGL_BATCH=0-40 or something like that), maybe it's better for you.

kas1e commented 5 years ago

At least with LIBGL_BATCH 0-40 i see no differences at all :(

As our profiler is kind of working, i made some test by running the game, starting track, playing few seconds, and then exit from. Done both profiling for ogles2 and warp3dnova at the same time (so same level, same tracks, cars, etc). That what i have:

OGLES2:

OpenGL ES 2.0 profiling results:
--------------------------------
-> DrawArrays callcount 137928, duration 4977.818102 milliseconds, 59.32 % of total
-> DrawElements callcount 7109, duration 1727.744436 milliseconds, 20.59 % of total
-> CompileShader callcount 32, duration 1073.860889 milliseconds, 12.80 % of total
-> TexImage2D callcount 1256, duration 409.614268 milliseconds, 4.88 % of total
-> VertexAttribPointer callcount 488300, duration 179.157637 milliseconds, 2.14 % of total
-> SwapBuffers callcount 538, duration 11.084613 milliseconds, 0.13 % of total
-> BindTexture callcount 3734, duration 5.976421 milliseconds, 0.07 % of total
-> GenTextures callcount 142, duration 2.095481 milliseconds, 0.02 % of total
-> DeleteTextures callcount 142, duration 1.718170 milliseconds, 0.02 % of total
-> EnableVertexAttribArray callcount 7787, duration 1.053254 milliseconds, 0.01 % of total
-> TexParameteri callcount 426, duration 0.485383 milliseconds, 0.01 % of total
-> ShaderSource callcount 32, duration 0.234230 milliseconds, 0.00 % of total
Total recorded duration 8390.842884 ms, 13.04 % of total 64337.708425 ms
--------------------------------------------------------

Warp3DNOVA:

Warp3D Nova profiling results:
------------------------------
-> DrawArrays callcount 57256, duration 1588.027950 milliseconds, 56.87 % of total
-> BufferUnlock callcount 64253, duration 501.094117 milliseconds, 17.95 % of total
-> DrawElements callcount 3343, duration 338.133456 milliseconds, 12.11 % of total
-> CreateVertexBufferObject callcount 669, duration 140.691583 milliseconds, 5.04 % of total
-> VBOSetArray callcount 209122, duration 84.408429 milliseconds, 3.02 % of total
-> VBOLock callcount 58728, duration 70.160685 milliseconds, 2.51 % of total
-> DestroyVertexBufferObject callcount 669, duration 50.385371 milliseconds, 1.80 % of total
-> BindVertexAttribArray callcount 207526, duration 17.670089 milliseconds, 0.63 % of total
-> Destroy callcount 1, duration 1.587200 milliseconds, 0.06 % of total
-> DestroyFrameBuffer callcount 2, duration 0.040703 milliseconds, 0.00 % of total
-> FBBindBuffer callcount 2, duration 0.037053 milliseconds, 0.00 % of total
-> CreateFrameBuffer callcount 1, duration 0.002647 milliseconds, 0.00 % of total
-> SetRenderTarget callcount 2, duration 0.000441 milliseconds, 0.00 % of total
Total recorded duration 2792.239724 ms, 4.20 % of total 66428.862092 ms
--------------------------------------------------------

As i see from both reports, the timewaster are DrawArrays which takes about 60% of everything. And only 12-20% is taken by glDrawElements.

kas1e commented 5 years ago

And to be sure that its glDrawArrays which is timewaster for us there, i doing the same profiling tests with Quake3. And there we cleary can see that everything limited by how fast works glDrawElements, see:

OGLES2:

OpenGL ES 2.0 profiling results:
--------------------------------
-> DrawElements callcount 16661, duration 15241.116213 milliseconds, 94.09 % of total
-> TexImage2D callcount 2620, duration 445.739704 milliseconds, 2.75 % of total
-> DrawArrays callcount 442, duration 350.937282 milliseconds, 2.17 % of total
-> CompileShader callcount 4, duration 69.808558 milliseconds, 0.43 % of total
-> TexSubImage2D callcount 23, duration 43.077355 milliseconds, 0.27 % of total
-> BindTexture callcount 16160, duration 21.526567 milliseconds, 0.13 % of total
-> SwapBuffers callcount 559, duration 12.450375 milliseconds, 0.08 % of total
-> GenTextures callcount 477, duration 7.757589 milliseconds, 0.05 % of total
-> DeleteTextures callcount 477, duration 4.823114 milliseconds, 0.03 % of total
-> TexParameteri callcount 1431, duration 1.179853 milliseconds, 0.01 % of total
-> ShaderSource callcount 4, duration 0.024903 milliseconds, 0.00 % of total
-> VertexAttribPointer callcount 10, duration 0.005694 milliseconds, 0.00 % of total
-> EnableVertexAttribArray callcount 4, duration 0.001764 milliseconds, 0.00 % of total
Total recorded duration 16198.448971 ms, 58.19 % of total 27836.338613 ms
--------------------------------------------------------

Warp3DNOVA:

Warp3D Nova profiling results:
------------------------------
-> DrawElements callcount 16661, duration 2020.462445 milliseconds, 86.40 % of total
-> BufferUnlock callcount 19708, duration 125.265309 milliseconds, 5.36 % of total
-> CreateVertexBufferObject callcount 675, duration 65.347476 milliseconds, 2.79 % of total
-> DrawArrays callcount 442, duration 51.715724 milliseconds, 2.21 % of total
-> VBOLock callcount 17507, duration 27.010988 milliseconds, 1.16 % of total
-> VBOSetArray callcount 58002, duration 26.906725 milliseconds, 1.15 % of total
-> DestroyVertexBufferObject callcount 675, duration 17.456069 milliseconds, 0.75 % of total
-> BindVertexAttribArray callcount 51165, duration 3.733328 milliseconds, 0.16 % of total
-> Destroy callcount 1, duration 0.560212 milliseconds, 0.02 % of total
-> DestroyFrameBuffer callcount 2, duration 0.065405 milliseconds, 0.00 % of total
-> FBBindBuffer callcount 2, duration 0.036251 milliseconds, 0.00 % of total
-> CreateFrameBuffer callcount 1, duration 0.002807 milliseconds, 0.00 % of total
-> SetRenderTarget callcount 2, duration 0.000441 milliseconds, 0.00 % of total
Total recorded duration 2338.563179 ms, 8.98 % of total 26053.620002 ms
--------------------------------------------------------

So probably (if take in account that i understand those logs correctly), our issue with plib based games (at least with SuperTuxKart), is glDrawArrays, not glDrawElements.

kas1e commented 5 years ago

I also tried LIBGL_BATCH be 0-5, 0-10, 0-15, 0-20 , 0-25, 0-30, 0-40, 0-50, 0-100 : in all those conditions (and i see from gl4es log, that it working) i see no differences by FPS (even a little bit). Tried even something like 20-40 , 10-50, etc , but none do any changes visually.

What is more intersting, that setting LIBGL_BEGINEND to 0 , also made no visuall differences absolutly, exactly the same FPS as if it enabled.

ptitSeb commented 5 years ago

Interresting.

I'm surprised, in the first profile run, to see the CompileShader that high. Maybe the prcompile shader will help here, even if it stays in OGLES2 driver, because I don't see any Shader creation stuff on NOVA side.

Now, glDrawArray and glDrawElements are the drawing command. Almost eveyrthing to is drawn used this 2 commands (the only only other thing is glClear(...) ...), so it's not a surprise that it's on top of the list. What is interesting is that there is (summing both glDrawXXX) 137928+7109 draw commands, for 538 frames (number of SwapBuffer). That makes 269 draw command per frames (for comparison, quake3 gets 30 draw commands per frame!). But of courses, that's for the whole session, so including menu that must have a much lower draw command count per frame.

What is also strange, is that the glDrawXXX count on OGLES2 side and NOVA side don't match (on Quake3 they do match). Are the statistics from different run or is OGLES2 doing some merging too?

kas1e commented 5 years ago

It's statistic from different runs (one time ogles2, another nova), by doing the same in the game (as if i run logging of both ogles2 and nova at the same time, its pretty much slow down the game so hard to move even). I ask our glSnoop mantainer to add flag which will do only profiling, without logging to a file, so that will speed things up for profiling.

As for shaders : i think its can be false alarm, as that what i do : run the game (so some shader for menu), then go to the menu when i need to choice a car which rotated (that another shader), then start a track, and when everything shows (so another shaders), i then exit after few frames.

Maybe i need to do profiling with a bit longer gameplay, so we will see how it all will be in compare with just run/exit.

As for no "CompileShader" call in nova profiling, its probabaly just didn't catched/patched on our side

kas1e commented 5 years ago

I just made capthure only gldrtawarrays,gldrawelements and glshadersource/glshadercompile, so run everything at the same time, and this time played more than few frames in one track, that the result i have:

OpenGL ES 2.0 profiling results:
--------------------------------
-> DrawArrays callcount 442991, duration 44155.796888 milliseconds, 76.99 % of total
-> DrawElements callcount 27062, duration 11955.737980 milliseconds, 20.85 % of total
-> CompileShader callcount 36, duration 1214.908850 milliseconds, 2.12 % of total
-> SwapBuffers callcount 960, duration 25.837952 milliseconds, 0.05 % of total
-> ShaderSource callcount 36, duration 1.551229 milliseconds, 0.00 % of total
Total recorded duration 57353.832899 ms, 50.53 % of total 113512.275173 ms
--------------------------------------------------------

Warp3D Nova profiling results:
------------------------------
-> DrawArrays callcount 442991, duration 7280.778201 milliseconds, 74.13 % of total
-> DrawElements callcount 27062, duration 2538.918434 milliseconds, 25.85 % of total
-> Destroy callcount 1, duration 2.555720 milliseconds, 0.03 % of total
Total recorded duration 9822.252356 ms, 8.71 % of total 112793.552432 ms
--------------------------------------------------------

As you see now compileshader take pretty small amount (and i assume if play in the game more, then it will be less and less, just when it compiles it take 100% for one second, and then always in memory so % in profiling decreased with every second played in actual game).

Also drawarrays and drawelements the same in both cases (so kind of show there is nothing wrong between them in that terms), and yeah, as you say: (442991+27062) / 960 = 489 ! 489 draw commands per fram ! Even more when i play more in the game, and when there is less menu and stuff. Dunno if it good or bad, but if all those 489 draw calls used as different calls, and not batches, then sure that pretty much can slow things down

kas1e commented 5 years ago

I also got some answer from glSnoop author, that what he say:

In STK case, it seems that OGLES2 spent 13 % of "context lifetime" in known functions (that figure also includes W3DNova). This amount may change once we add the rest of functions so glSnoop becomes more aware of everything. But it's also possibile that 87% of time is done something else than drawing. Compare to Q3 case where OGLES2 spent 58 % of context lifetime so my interpretation is that Q3 is more GPU-oriented than STK, based on these rough stats.

But that high glCalls per frame probabaly the key there imho (?)

ptitSeb commented 5 years ago

Yes. Also, what is interesting is that it has spent 50% of CPU time in GLES2 driver, but less than 9% in NOVA. (keep in mind that you don't see the time spent in gl4es there, as you are cupturing calls to GLES2 or NOVA only). I guess most of the time is spent in BigEndian -> LittleEndian conversion or is it done in NOVA?

ptitSeb commented 5 years ago

So, I have spent a large part of my weekEnd on the last 2 commits. I have done some optimisation (hopefully), so may start to see some improvement. What you can try is play with GL_BATCH and see in your capture if the number of draw commands per frame get lower.

I haven't built that old STK yet, but I will do so, probably.

kas1e commented 5 years ago

So doing some tests. Sadly to say, i see no differences visually. I also tried to build for sake of interest some other game with new way of optimisation : neverball 1.6.0, there i also can't see any differences in speed (as in that game i use LIBGL_BATCH for speed).

But i anyway profile supertuxkart0.6.2a with old way of optimisation (2 commits back) and with your last changes , over last glSnoop (we added some more functions). What i do there is just play one track till finish line (through functions callcounts and time can be different because i can't pass the track all the time the same, sometime i alone on track (so it 20 fps), sometine i together with other 3 cars (so it 5 fps) ).

There results:

LIBGL_BATCH 0-20 , 2 commits back:

OpenGL ES 2.0 profiling results for Shell Process 'supertuxkart_gl4es_1913':
--------------------------------------------------------
-> DrawElements callcount 165466, duration 28368.489674 milliseconds, 57.67 % of total
-> DrawArrays callcount 757832, duration 16361.853270 milliseconds, 33.26 % of total
-> Clear callcount 7195, duration 1776.936119 milliseconds, 3.61 % of total
-> CompileShader callcount 36, duration 1199.285399 milliseconds, 2.44 % of total
-> VertexAttribPointer callcount 2749595, duration 634.155833 milliseconds, 1.29 % of total
-> TexImage2D callcount 1256, duration 378.992220 milliseconds, 0.77 % of total
-> UseProgram callcount 43050, duration 337.917071 milliseconds, 0.69 % of total
-> BindTexture callcount 50337, duration 73.644584 milliseconds, 0.15 % of total
-> SwapBuffers callcount 1693, duration 51.408790 milliseconds, 0.10 % of total
-> EnableVertexAttribArray callcount 29305, duration 4.400610 milliseconds, 0.01 % of total
-> GenTextures callcount 142, duration 1.965192 milliseconds, 0.00 % of total
-> DeleteTextures callcount 142, duration 0.985363 milliseconds, 0.00 % of total
-> TexParameteri callcount 426, duration 0.428520 milliseconds, 0.00 % of total
-> ShaderSource callcount 36, duration 0.255885 milliseconds, 0.00 % of total
Total recorded duration 49190.718531 ms, 47.12 % of total 104396.671412 ms
--------------------------------------------------------

Warp3D Nova profiling results for Shell Process 'supertuxkart_gl4es_1913':
--------------------------------------------------------
-> BufferUnlock callcount 1073795, duration 12564.124714 milliseconds, 41.01 % of total
-> DrawElements callcount 165466, duration 5683.124714 milliseconds, 18.55 % of total
-> DrawArrays callcount 757832, duration 5013.406625 milliseconds, 16.36 % of total
-> Submit callcount 111022, duration 2219.604443 milliseconds, 7.24 % of total
-> Clear callcount 7195, duration 1681.457072 milliseconds, 5.49 % of total
-> CompileShader callcount 36, duration 1099.687372 milliseconds, 3.59 % of total
-> VBOLock callcount 1016557, duration 1050.415447 milliseconds, 3.43 % of total
-> VBOSetArray callcount 2949549, duration 897.105907 milliseconds, 2.93 % of total
-> CreateVertexBufferObject callcount 851, duration 194.819545 milliseconds, 0.64 % of total
-> BindVertexAttribArray callcount 2851858, duration 134.753539 milliseconds, 0.44 % of total
-> DestroyVertexBufferObject callcount 851, duration 79.890604 milliseconds, 0.26 % of total
-> BindTexture callcount 55278, duration 14.653246 milliseconds, 0.05 % of total
-> SetShaderPipeline callcount 43051, duration 4.369732 milliseconds, 0.01 % of total
-> Destroy callcount 1, duration 2.287645 milliseconds, 0.01 % of total
-> FBBindBuffer callcount 2, duration 0.035249 milliseconds, 0.00 % of total
-> DestroyFrameBuffer callcount 2, duration 0.019008 milliseconds, 0.00 % of total
-> CreateFrameBuffer callcount 1, duration 0.002687 milliseconds, 0.00 % of total
-> SetRenderTarget callcount 2, duration 0.000602 milliseconds, 0.00 % of total
Total recorded duration 30639.758151 ms, 29.56 % of total 103656.293379 ms
--------------------------------------------------------

So, there we have 545 draw commands per frame.

LIBGL_BATCH 0-20, latest commit with new optimisation:

OpenGL ES 2.0 profiling results for Shell Process 'supertuxkart_gl4es_1915':
--------------------------------------------------------
-> DrawElements callcount 74480, duration 22945.887035 milliseconds, 52.07 % of total
-> DrawArrays callcount 756131, duration 16610.502587 milliseconds, 37.69 % of total
-> Clear callcount 9238, duration 1912.093676 milliseconds, 4.34 % of total
-> CompileShader callcount 36, duration 1213.173237 milliseconds, 2.75 % of total
-> VertexAttribPointer callcount 2496236, duration 622.852228 milliseconds, 1.41 % of total
-> TexImage2D callcount 1256, duration 372.808116 milliseconds, 0.85 % of total
-> UseProgram callcount 40571, duration 270.352568 milliseconds, 0.61 % of total
-> BindTexture callcount 44677, duration 63.850022 milliseconds, 0.14 % of total
-> SwapBuffers callcount 1668, duration 50.284116 milliseconds, 0.11 % of total
-> EnableVertexAttribArray callcount 28745, duration 5.195052 milliseconds, 0.01 % of total
-> GenTextures callcount 142, duration 1.959859 milliseconds, 0.00 % of total
-> DeleteTextures callcount 142, duration 0.999719 milliseconds, 0.00 % of total
-> TexParameteri callcount 426, duration 0.458596 milliseconds, 0.00 % of total
-> ShaderSource callcount 36, duration 0.256526 milliseconds, 0.00 % of total
Total recorded duration 44070.673337 ms, 46.65 % of total 94468.514537 ms
--------------------------------------------------------

Warp3D Nova profiling results for Shell Process 'supertuxkart_gl4es_1915':
--------------------------------------------------------
-> BufferUnlock callcount 915660, duration 10813.930104 milliseconds, 39.16 % of total
-> DrawArrays callcount 756131, duration 5356.431848 milliseconds, 19.40 % of total
-> DrawElements callcount 74480, duration 4598.563340 milliseconds, 16.65 % of total
-> Submit callcount 100750, duration 1939.222882 milliseconds, 7.02 % of total
-> Clear callcount 9238, duration 1788.698520 milliseconds, 6.48 % of total
-> CompileShader callcount 36, duration 1111.835987 milliseconds, 4.03 % of total
-> VBOSetArray callcount 2641358, duration 784.641617 milliseconds, 2.84 % of total
-> VBOLock callcount 864879, duration 780.323375 milliseconds, 2.83 % of total
-> CreateVertexBufferObject callcount 904, duration 220.871276 milliseconds, 0.80 % of total
-> BindVertexAttribArray callcount 2596178, duration 117.415728 milliseconds, 0.43 % of total
-> DestroyVertexBufferObject callcount 904, duration 88.351807 milliseconds, 0.32 % of total
-> BindTexture callcount 49878, duration 8.715403 milliseconds, 0.03 % of total
-> SetShaderPipeline callcount 40572, duration 4.732967 milliseconds, 0.02 % of total
-> Destroy callcount 1, duration 2.321490 milliseconds, 0.01 % of total
-> FBBindBuffer callcount 2, duration 0.031920 milliseconds, 0.00 % of total
-> DestroyFrameBuffer callcount 2, duration 0.022697 milliseconds, 0.00 % of total
-> CreateFrameBuffer callcount 1, duration 0.002687 milliseconds, 0.00 % of total
-> SetRenderTarget callcount 2, duration 0.000682 milliseconds, 0.00 % of total
Total recorded duration 27616.114328 ms, 29.48 % of total 93682.178811 ms
--------------------------------------------------------

So there we have 497 draw calls per frame.

LIBGL_BATCH 0-40 , 2 commits back:

OpenGL ES 2.0 profiling results for Shell Process 'supertuxkart_gl4es_1913':
--------------------------------------------------------
-> DrawElements callcount 209672, duration 33069.658499 milliseconds, 56.58 % of total
-> DrawArrays callcount 933718, duration 20385.736616 milliseconds, 34.88 % of total
-> Clear callcount 8307, duration 2037.443437 milliseconds, 3.49 % of total
-> CompileShader callcount 36, duration 1198.295465 milliseconds, 2.05 % of total
-> VertexAttribPointer callcount 3583973, duration 813.952200 milliseconds, 1.39 % of total
-> UseProgram callcount 50138, duration 416.812688 milliseconds, 0.71 % of total
-> TexImage2D callcount 1256, duration 373.822112 milliseconds, 0.64 % of total
-> BindTexture callcount 58091, duration 86.195372 milliseconds, 0.15 % of total
-> SwapBuffers callcount 1934, duration 55.154870 milliseconds, 0.09 % of total
-> EnableVertexAttribArray callcount 33748, duration 6.359787 milliseconds, 0.01 % of total
-> GenTextures callcount 142, duration 2.001404 milliseconds, 0.00 % of total
-> DeleteTextures callcount 142, duration 0.982556 milliseconds, 0.00 % of total
-> TexParameteri callcount 426, duration 0.422946 milliseconds, 0.00 % of total
-> ShaderSource callcount 36, duration 0.257208 milliseconds, 0.00 % of total
Total recorded duration 58447.095160 ms, 47.96 % of total 121877.825200 ms
-------------------------------------------------------- 

Warp3D Nova profiling results for Shell Process 'supertuxkart_gl4es_1913':
--------------------------------------------------------
-> BufferUnlock callcount 1332986, duration 14895.015760 milliseconds, 40.93 % of total
-> DrawElements callcount 209672, duration 6508.139431 milliseconds, 17.88 % of total
-> DrawArrays callcount 933718, duration 6335.959698 milliseconds, 17.41 % of total
-> Submit callcount 131319, duration 2692.369451 milliseconds, 7.40 % of total
-> Clear callcount 8307, duration 1925.917592 milliseconds, 5.29 % of total
-> VBOLock callcount 1264480, duration 1299.288006 milliseconds, 3.57 % of total
-> VBOSetArray callcount 3841080, duration 1140.127682 milliseconds, 3.13 % of total
-> CompileShader callcount 36, duration 1100.741067 milliseconds, 3.02 % of total
-> CreateVertexBufferObject callcount 875, duration 225.894614 milliseconds, 0.62 % of total
-> BindVertexAttribArray callcount 3714512, duration 170.865702 milliseconds, 0.47 % of total
-> DestroyVertexBufferObject callcount 875, duration 78.774351 milliseconds, 0.22 % of total
-> BindTexture callcount 63284, duration 10.920841 milliseconds, 0.03 % of total
-> SetShaderPipeline callcount 50139, duration 5.063761 milliseconds, 0.01 % of total
-> Destroy callcount 1, duration 2.292898 milliseconds, 0.01 % of total
-> FBBindBuffer callcount 2, duration 0.035810 milliseconds, 0.00 % of total
-> DestroyFrameBuffer callcount 2, duration 0.018567 milliseconds, 0.00 % of total
-> CreateFrameBuffer callcount 1, duration 0.002647 milliseconds, 0.00 % of total
-> SetRenderTarget callcount 2, duration 0.000602 milliseconds, 0.00 % of total
Total recorded duration 36391.428480 ms, 30.04 % of total 121137.292657 ms
-------------------------------------------------------- 

So there we have 591 draw call per frame.

LIBGL_BATCH 0-40, latest commit with new optimisation:

OpenGL ES 2.0 profiling results for Shell Process 'supertuxkart_gl4es_1915':
--------------------------------------------------------
-> DrawElements callcount 94606, duration 28722.920640 milliseconds, 50.61 % of total
-> DrawArrays callcount 1016825, duration 22311.293700 milliseconds, 39.31 % of total
-> Clear callcount 12143, duration 2780.711874 milliseconds, 4.90 % of total
-> CompileShader callcount 36, duration 1208.309059 milliseconds, 2.13 % of total
-> VertexAttribPointer callcount 3399766, duration 802.878293 milliseconds, 1.41 % of total
-> UseProgram callcount 50476, duration 390.844087 milliseconds, 0.69 % of total
-> TexImage2D callcount 1256, duration 373.238160 milliseconds, 0.66 % of total
-> BindTexture callcount 56723, duration 79.214340 milliseconds, 0.14 % of total
-> SwapBuffers callcount 2366, duration 70.471067 milliseconds, 0.12 % of total
-> DeleteTextures callcount 142, duration 5.362273 milliseconds, 0.01 % of total
-> EnableVertexAttribArray callcount 34347, duration 5.205077 milliseconds, 0.01 % of total
-> GenTextures callcount 142, duration 1.960581 milliseconds, 0.00 % of total
-> TexParameteri callcount 426, duration 0.439026 milliseconds, 0.00 % of total
-> ShaderSource callcount 36, duration 0.287364 milliseconds, 0.00 % of total
Total recorded duration 56753.135542 ms, 47.12 % of total 120434.271123 ms
-------------------------------------------------------- 

Warp3D Nova profiling results for Shell Process 'supertuxkart_gl4es_1915':
--------------------------------------------------------
-> BufferUnlock callcount 1212921, duration 13976.371496 milliseconds, 39.21 % of total
-> DrawArrays callcount 1016825, duration 7000.779364 milliseconds, 19.64 % of total
-> DrawElements callcount 94606, duration 5845.702049 milliseconds, 16.40 % of total
-> Clear callcount 12143, duration 2608.353250 milliseconds, 7.32 % of total
-> Submit callcount 132308, duration 2595.274893 milliseconds, 7.28 % of total
-> CompileShader callcount 36, duration 1110.067811 milliseconds, 3.11 % of total
-> VBOLock callcount 1143981, duration 1038.540442 milliseconds, 2.91 % of total
-> VBOSetArray callcount 3576550, duration 1037.991138 milliseconds, 2.91 % of total
-> CreateVertexBufferObject callcount 826, duration 181.778321 milliseconds, 0.51 % of total
-> BindVertexAttribArray callcount 3531305, duration 163.687051 milliseconds, 0.46 % of total
-> DestroyVertexBufferObject callcount 826, duration 70.044272 milliseconds, 0.20 % of total
-> BindTexture callcount 62278, duration 10.223042 milliseconds, 0.03 % of total
-> SetShaderPipeline callcount 50477, duration 6.143923 milliseconds, 0.02 % of total
-> Destroy callcount 1, duration 2.250832 milliseconds, 0.01 % of total
-> FBBindBuffer callcount 2, duration 0.032883 milliseconds, 0.00 % of total
-> DestroyFrameBuffer callcount 2, duration 0.018607 milliseconds, 0.00 % of total
-> CreateFrameBuffer callcount 1, duration 0.002647 milliseconds, 0.00 % of total
-> SetRenderTarget callcount 2, duration 0.000802 milliseconds, 0.00 % of total
Total recorded duration 35647.262822 ms, 29.79 % of total 119664.569234 ms
-------------------------------------------------------- 

So there we have 469 draw calls per frame.

From those results , we can say that:

  1. BufferUnlock() is the one which take all the time mostly (at least if take in account that we add already most usable functions of warp3dnova and of ogles2 to glSnoop)

  2. if not just pure luck that i pass both tracks when test latest optimisatin without many other cards on tracks (which may descrease calls per frame), then with new optimisation we reduce number of calls. In first case with 0-20 its less on about 50, in second case less on about 120. But that can be easyly because of "no other cars close" when pass tracks. At least visually i see no difference on start , when all 4 cars stays on position : in all cases i always have 5 fps.

kas1e commented 5 years ago

Btw , as saying author of glSnoop, in those figures and % need take in account that time spent inside OGLES2 functions includes also time spent inside Warp3D Nova functions.

ptitSeb commented 5 years ago

Btw , as saying author of glSnoop, in those figures and % need take in account that time spent inside OGLES2 functions includes also time spent inside Warp3D Nova functions.

Yes, I know. Still a lot of time spent in the creation of buffers. Does this operation really necessary? Can't NOVA works without buffer (with just main CPU memory).

About SuperTuxKart: I have finally succeeded in building, and after fixing a bug (memory overwrite in items because an array was too small), it ran on the Pandora. I reproduce the 1st time menu issue, and I also have slow speed in game (like 6 fps or less with other cars, a bit better when alone). And the BATCH mode have no effect. I have made some GLES capture, and I have no explanation for the menu issue yet. The items are mostly not here, not sure why. For the slow speed, I see that cars are made of a lot of little glDrawArrays(..) of 6 vertices or less. And the BATCH mode doesn't seem able to merge them. I have made a GL capture that I have yet to analyse, but it seems there are a lot of glCallList(...) involved. If each glDrawArrays(...) is in a list, then the BACH mode will not merge them.

My advise: move to SuperTuxKar 0.8.1. Not only it's faster (because Irrlicht is a more efficient engine than PLIB), but it's also graphicaly better, and the Gameplay is way better (AI is superious, there are a lot more track and more game mode). The 0.9+ changed again the engine and make it graphically even more intensive (but there is native support for GLES2), so maybe stick to 0.8.1 for a start.

kas1e commented 5 years ago

Still a lot of time spent in the creation of buffers. Does this operation really necessary? Can't NOVA works without buffer (with just main CPU memory).

I asking Hans about, but i think the main issue there, that we don't have GART support : Data is copied to VRAM when the buffer is unlocked, so this is probably the slowest operation. And as we still lack GART support, the transfer is slow...

About SuperTuxKart: I have finally succeeded in building, and after fixing a bug (memory overwrite in items because an array was too small), it ran on the Pandora.

Oh good ! Can you point me out on that bug-fix, so i can also apply it for amigaos4 build too ? On my side, i also have initially crashes on exit, when thing start to be freed (and that happens also with MiniGL version too), so i think maybe your fix can be related to it.

I reproduce the 1st time menu issue, and I also have slow speed in game (like 6 fps or less with other cars, a bit better when alone). And the BATCH mode have no effect.

good as well (at least its the same as on my side, so no amigaos4 drivers bug as it was usually:) ) ! And at least with the menu bug it will help to impove gl4es (as on minigl build there is no that issue with menu, so probably its something in gl4es then).

My advise: move to SuperTuxKar 0.8.1. Not only it's faster (because Irrlicht is a more efficient engine than PLIB), but it's also graphicaly better, and the Gameplay is way better (AI is superious, there are a lot more track and more game mode). The 0.9+ changed again the engine and make it graphically even more intensive (but there is native support for GLES2), so maybe stick to 0.8.1 for a start.

I wish i could use 0.8.1, but they introduce that mess with the pathes, with all those "//" in one place , and just "/" in another, and that all spreading all over the source code, not just in one single config file as it with 0.6.2a, where i only change root dir to "." and compile everything with -lunix flag.

Also loading start to be really slower. I.e. 0.6.2a starts right away, you wait just few seconds after run, but 0.8.1 take a minute or something for loading..

But yeah, anyway i need to deal with 0.8.1 soon or later, but making 0.6.2a be bug-free mean impoved gl4es (that menu bug) , and understanding of why it all so slow can help to impove gl4es as well imho (?).

ptitSeb commented 5 years ago

Oh good ! Can you point me out on that bug-fix, so i can also apply it for amigaos4 build too ? On my side, i also have initially crashes on exit, when thing start to be freed (and that happens also with MiniGL version too), so i think maybe your fix can be related to it.

I don't have my pandora with me, but from memory, it was in item_manager.hpp some array ???_item[] that needed to be define as XXX[ITEM_LAST] istead of what it was. I'll grab something more precise tonight.

I asking Hans about, but i think the main issue there, that we don't have GART support : Data is copied to VRAM when the buffer is unlocked, so this is probably the slowest operation. And as we still lack GART support, the transfer is slow...

Mmm. I see, but that's unfortunate. Most of those legacy games are not using VBO, so need to transfert data every times (and even if they were using VBO, gl4es would not anyway). VBO are bit special. Testing on the Pandora showed me that use of VBO actualy slowed down things on graphic intensive game (that was Doom III). And because gl4es is targetted to embedded SoC and SoC mostly have shared memory for VRAM, I'm still not convinced VBO would speed up anything. I still have some idea on were use VBO easily, but I don't much incentive to spent time on something that I'm not confident will improve anything (except complexity on gl4es).

good as well (at least its the same as on my side, so no amigaos4 drivers bug as it was usually:) ) ! And at least with the menu bug it will help to impove gl4es (as on minigl build there is no that issue with menu, so probably its something in gl4es then).

Is your minigl build also a fresh build using latest gcc? Anyway, the OpenGL trace will probably help me understand what's goin'on.

But yeah, anyway i need to deal with 0.8.1 soon or later

That 0.8.1 version include so much better driving handling and gameplay enhancement in my opinion, that make that old version useless and dumb to play.

kas1e commented 5 years ago

Is your minigl build also a fresh build using latest gcc? Anyway, the OpenGL trace will probably help me understand what's goin'on.

Yes, exactly the same source code, exactly the same gcc, even all the 3d party libs the same, the only differences its usage of libgl4es.a instead of libgl.a (minigl) , and builds of plib's libplibfnt.a and libplibssg.a different for minigl and gl4es. But all other the same 1:1.

That 0.8.1 version include so much better driving handling and gameplay enhancement in my opinion, that make that old version useless and dumb to play.

Is also FPS high enough on your side with 0.8.1 ?

ptitSeb commented 5 years ago

Yes, exactly the same source code, exactly the same gcc, even all the 3d party libs the same, the only differences its usage of libgl4es.a instead of libgl.a (minigl) , and builds of plib's libplibfnt.a and libplibssg.a different for minigl and gl4es. But all other the same 1:1.

Ok.

Is also FPS high enough on your side with 0.8.1 ?

It's a bit borderline, but I get from 10fps on heavy tracks (like the Scottish one), to 20fps on lighter tracks like the Egyptian one). It's mainly the number of triangle drawn, pushing the poor Pandora hardware to its limits...

kas1e commented 5 years ago

So on our side can be a bit faster then..

Btw, what i think is that if in end it will be indeed "but it seems there are a lot of glCallList(...) involved. If each glDrawArrays(...) is in a list, then the BACH mode will not merge them.", then, is ti possible to batch things when drawarrays in a list ? I mean currently no, but in theory ?

If not, and if it will let say made overcompilcate gl4es, maybe its better/easy to change that in Plib itself, like , replacing glDrawArrays internally in plib on glDrawElements, or , remove usage of glCallList() .. if of course, is that easy as it sounds.. (then it can help not only STK, but any other plib based games).

If again, its not STK itself guilty (as in end TORCS and SpeedDream are fast?)

kas1e commented 5 years ago

Btw, our glSnoop author works fast too , and there is adding more things in recent glSnoop, and that how looks like STK0.6.2a plaing one track now:

OpenGL ES 2.0 profiling results for Shell Process 'supertuxkart_gl4es_1915': 
-------------------------------------------------------- 
Total recorded duration 47195.816137 ms, 46.94 % of total context life-time 100551.372218 ms 
Drawcalls (glDraw*) per frame 565.925512 
Frames (buffer swaps) per second 16.110000 

-> DrawElements callcount 62262, duration 23267.922445 milliseconds, 49.30 % of recorded time (23.14 % of total context life-time) 
-> DrawArrays callcount 849444, duration 19443.886474 milliseconds, 41.20 % of recorded time (19.34 % of total context life-time) 
-> Clear callcount 6697, duration 1674.416850 milliseconds, 3.55 % of recorded time (1.67 % of total context life-time) 
-> CompileShader callcount 36, duration 1211.990055 milliseconds, 2.57 % of recorded time (1.21 % of total context life-time) 
-> VertexAttribPointer callcount 2792074, duration 781.801620 milliseconds, 1.66 % of recorded time (0.78 % of total context life-time) 
-> TexImage2D callcount 1256, duration 371.775033 milliseconds, 0.79 % of recorded time (0.37 % of total context life-time) 
-> UseProgram callcount 40110, duration 316.792758 milliseconds, 0.67 % of recorded time (0.32 % of total context life-time) 
-> BindTexture callcount 46597, duration 67.699122 milliseconds, 0.14 % of recorded time (0.07 % of total context life-time) 
-> SwapBuffers callcount 1611, duration 50.966556 milliseconds, 0.11 % of recorded time (0.05 % of total context life-time) 
-> EnableVertexAttribArray callcount 27326, duration 4.514256 milliseconds, 0.01 % of recorded time (0.00 % of total context life-time) 
-> GenTextures callcount 142, duration 2.333761 milliseconds, 0.00 % of recorded time (0.00 % of total context life-time) 
-> DeleteTextures callcount 142, duration 1.032843 milliseconds, 0.00 % of recorded time (0.00 % of total context life-time) 
-> TexParameteri callcount 426, duration 0.429242 milliseconds, 0.00 % of recorded time (0.00 % of total context life-time) 
-> ShaderSource callcount 36, duration 0.255123 milliseconds, 0.00 % of recorded time (0.00 % of total context life-time) 
-------------------------------------------------------- 

Warp3D Nova profiling results for Shell Process 'supertuxkart_gl4es_1915': 
-------------------------------------------------------- 
Total recorded duration 29230.469784 ms, 29.50 % of total context life-time 99092.776356 ms 

-> BufferUnlock callcount 988562, duration 11852.596824 milliseconds, 40.55 % of recorded time (11.96 % of total context life-time) 
-> DrawArrays callcount 849444, duration 5851.102619 milliseconds, 20.02 % of recorded time (5.90 % of total context life-time) 
-> DrawElements callcount 62262, duration 4598.397843 milliseconds, 15.73 % of recorded time (4.64 % of total context life-time) 
-> Submit callcount 102350, duration 2036.381602 milliseconds, 6.97 % of recorded time (2.06 % of total context life-time) 
-> Clear callcount 6697, duration 1583.779083 milliseconds, 5.42 % of recorded time (1.60 % of total context life-time) 
-> CompileShader callcount 36, duration 1111.113967 milliseconds, 3.80 % of recorded time (1.12 % of total context life-time) 
-> VBOSetArray callcount 2927999, duration 906.110158 milliseconds, 3.10 % of recorded time (0.91 % of total context life-time) 
-> VBOLock callcount 935958, duration 849.793359 milliseconds, 2.91 % of recorded time (0.86 % of total context life-time) 
-> CreateVertexBufferObject callcount 858, duration 203.448530 milliseconds, 0.70 % of recorded time (0.21 % of total context life-time) 
-> BindVertexAttribArray callcount 2897608, duration 144.847536 milliseconds, 0.50 % of recorded time (0.15 % of total context life-time) 
-> DestroyVertexBufferObject callcount 858, duration 75.912419 milliseconds, 0.26 % of recorded time (0.08 % of total context life-time) 
-> BindTexture callcount 51488, duration 8.746882 milliseconds, 0.03 % of recorded time (0.01 % of total context life-time) 
-> SetShaderPipeline callcount 40111, duration 4.861852 milliseconds, 0.02 % of recorded time (0.00 % of total context life-time) 
-> Destroy callcount 1, duration 3.291454 milliseconds, 0.01 % of recorded time (0.00 % of total context life-time) 
-> FBBindBuffer callcount 2, duration 0.064563 milliseconds, 0.00 % of recorded time (0.00 % of total context life-time) 
-> DestroyFrameBuffer callcount 2, duration 0.017965 milliseconds, 0.00 % of recorded time (0.00 % of total context life-time) 
-> CreateFrameBuffer callcount 1, duration 0.002727 milliseconds, 0.00 % of recorded time (0.00 % of total context life-time) 
-> SetRenderTarget callcount 2, duration 0.000401 milliseconds, 0.00 % of recorded time (0.00 % of total context life-time) 
--------------------------------------------------------
ptitSeb commented 5 years ago

If again, its not STK itself guilty (as in end TORCS and SpeedDream are fast?)

I don't notice the same effect in TORCS (I haven't tried SpeedDream lately), so I assume it's STK only.

then, is ti possible to batch things when drawarrays in a list ? I mean currently no, but in theory ?

That will be very tricky. Inside the same glList, glDrawXXXX can be merged. But there are not accross different glList, and doing soo will overcomplicate gl4es I'm afraid. And again, it will mainly help only that old version of STK, so I don't think it's worth it.

OpenGL ES 2.0 profiling results for Shell Process 'supertuxkart_gl4es_1915':


Total recorded duration 47195.816137 ms, 46.94 % of total context life-time 100551.372218 ms Drawcalls (glDraw*) per frame 565.925512

so, mostly 50% of time in GLES2 lib

Warp3D Nova profiling results for Shell Process 'supertuxkart_gl4es_1915':


Total recorded duration 29230.469784 ms, 29.50 % of total context life-time 99092.776356 ms

and 30% of actual drawing stuff.

ptitSeb commented 5 years ago

Oh good ! Can you point me out on that bug-fix, so i can also apply it for amigaos4 build too ? On my side, i also have initially crashes on exit, when thing start to be freed (and that happens also with MiniGL version too), so i think maybe your fix can be related to it.

So it was in src/items/item_manager.hpp and line 41 now is ssgEntity *m_item_model[ITEM_LAST]; (instead of ssgEntity *m_item_model[ITEM_SILVER_COIN+1];)

kas1e commented 5 years ago

Thanks !

so, mostly 50% of time in GLES2 lib and 30% of actual drawing stuff.

As author explain its like this : 50% of whole lifetime (known patched functions, other 50% of lifetime is unnnown at moment) are in gles2 , and that include warp3d time. So, is warp3d one are 30% , then gles2 by itself take only 20%. Probabaly we need simplify a bit final output :)

ptitSeb commented 5 years ago

Yes, I understood that.

Btw, I have just pushed a fix the glList issue that made the menu behave badly. But nothing for the performances, as it is small batch of draw command in many individual glList...

kas1e commented 5 years ago

Yeah, bug fixed.

There is another little cosmetic issue which you probabaly want to fix just to made gl4es be bug-free: when you run game, then hit "single player" , then "quick race", then choice a driver , then choice a track, you then have a screen, with some info, and "start race" button in the middle. Now, the bug part: if you press "start race" as it now, then that info screen didn't erased, and "loading race..." words come up on top of "Start race" words.

Then, if you doing the same, but with LIBGL_BEGINEND 0 , then, it reacts correctly : after pressing on start race, whole screen clears, and only "loading race ..." are shown.

That of course not big deal there, just seems there some issue flying around in gl4es

ptitSeb commented 5 years ago

Yes, I noticed that too, but I'm unsure what's going on there. I need to study the GL trace more carrefully to understand why gl4es would redraw everything.

kas1e commented 5 years ago

That one we can close at moment too imho. That last issue seems disappeared once you made LIBGL_USEVBO 1 as default (or maybe it was some other fix after we check it), so .. it works as should, speed are issues on our side with bufferunlock() speeds probabaly.