ptitSeb / gl4es

GL4ES is an OpenGL 2.1/1.5 to GL ES 2.0/1.1 translation library, with support for Pandora, ODroid, OrangePI, CHIP, Raspberry PI, Android, Emscripten and AmigaOS4.
http://ptitseb.github.io/gl4es/
MIT License

real VBO optimisation #124

Closed kas1e closed 5 years ago

kas1e commented 5 years ago

As we keep jumping between closed issues here and there, it seems time to open a new ticket about this :)

So, I tested the last commit, where you added real VBO use, on all the Irrlicht examples. Results for most of them didn't change (maybe 1-2 fps less in some, 1-2 fps more in others). But two examples, 12.TerrainRendering and 26.OcclusionQuery, are now really faster, both on Linux and on AmigaOS4.

Linux:

12.terrainrendering : vbo0 : 877, vbo1: 1263 (so, + 386fps, yeah!)
26.occlusionquery:  vbo0: 1750 , vbo1: 1945 (so, +195 fps, yeah as well).

amigaos4:

12.terrainrendering : vbo0 : 56, vbo1: 554 (so, + 500fps, more than just "yeah")
26.occlusionquery:  vbo0: 1168 , vbo1: 1335 (so, +167 fps, yeah as well).

What I noticed though with 12.TerrainRendering: once you press "w", the speed drops a lot again. At least on AmigaOS4 it drops from 500 fps to 28. I will test on Linux now.

kas1e commented 5 years ago

Btw, which games are worth testing now with this version? All of them? Maybe that old SuperTuxKart 0.6.2a is worth checking? Foobillard maybe?

ptitSeb commented 5 years ago

Foobillard no, there will be no changes. SuperTuxKart yes, both super old 0.6.2a or old 0.8.1 use VBO IIRC, so may see some changes (I have seen a 10~20% speedup in 0.8.1, not sure in 0.6.2a).

Old games don't use VBO. Search the source for "glBindBuffer" (in .c/.cpp, not just .h) to see if they use it.

ptitSeb commented 5 years ago

And yeah speed x10 on the Terrain Sample, that's impressive!!!

kas1e commented 5 years ago

If it currently only helps software that uses glBindBuffer, then SuperTuxKart 0.6.2a is also out: I didn't find that function in its sources.

But I see that friking-shark, for example, uses it. Barony probably does too, a little. Neverball 1.6.0 also seems to use it (see share/part.c in the #ifdef PARTICLEVBO blocks, and share/solid_draw.c).

As for the terrain speedup, yeah, indeed a lot :) Though it's strange that when you press "w", everything drops back to the same old slow fps (it only adds a wireframe skeleton on top of the terrain).

ptitSeb commented 5 years ago

The Wireframe will not use VBO. It would make gl4es too complicated for a barely used function. So let's say it's normal, and that will not change.

SuperTuxKart delegates most drawing functions to a drawing library (PLIB), so maybe that lib uses it (I don't remember).

If you want to see if VBO are used, go to src/gl/fpe.c, line 1162, and remove the DBG macro call to keep the "printf", or just add a single printf("Yeah!\n"); and you'll see if it's used.

Barony has the function in the cpp but doesn't actually use it, if I remember correctly.

kas1e commented 5 years ago

Ah yeah, old SuperTuxKart uses the PLIB library, will check

kas1e commented 5 years ago

No luck with old SuperTuxKart, no differences. So it seems the PLIB library is too old to use glBindBuffer and stuff :)

ptitSeb commented 5 years ago

Yep. But Irrlicht games should use VBO: Minetest, SuperTuxKart, IrrLamb or H-Craft Championship.

kas1e commented 5 years ago

Yeah, I will for sure need to deal with all of them later. They all look decent enough and worth porting.

Btw, I tried neverball/neverputt 1.6.0, and it also makes no difference. In share/part.c they have those #ifdef PARTICLEVBO blocks and a commented-out define at the top, so I tried both with and without it, and the results are the same (no changes).

ptitSeb commented 5 years ago

Friking-Shark should have some speed improvement I guess, it seems to be using VBO for a lot of stuff.

kas1e commented 5 years ago

Tried friking-shark as well: +7-8 fps across the whole gameplay! Without VBO (and with the previously released version) I have 112-113 fps at start. With the new VBO version I have 120 at start, and in the game itself I can see that it adds ~10 fps overall, everywhere.

Not that big , but still cool :)

kas1e commented 5 years ago

Eldritch also uses VBO, but Hans still didn't fix that issue in irrlicht's fragment shaders, so that's for later. EDIT: oops, I forgot it uses ogles2 directly, without gl4es :)

ptitSeb commented 5 years ago

Yep. I have tried Arx Libertatis, which uses VBO, and also OpenAstroMenace on the Pandora. While it worked, it didn't bring any significant speed improvement. I also tried OpenMW, but I'm not sure it uses VBO (I think it does, but I haven't checked).

ptitSeb commented 5 years ago

So, I have just pushed a new VBO optimization: this time when using glNewList(...). That should help (especially on Amiga) games like foobillard++.

EDIT: this optim is activated by default, and can be disabled with the same LIBGL_USEVBO=0
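
As an illustration of an "on by default, off with LIBGL_USEVBO=0" switch, here is a minimal C sketch. The helper names are hypothetical and this is not the gl4es parsing code; it only assumes that an unset variable leaves the optimisation enabled and that the value "0" turns it off.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not taken from gl4es: decide from the raw
   variable value whether the VBO optimisation is enabled. */
int vbo_enabled_from_value(const char *value) {
    if (value == NULL) return 1;        /* unset: optimisation stays on */
    return strcmp(value, "0") != 0;     /* only "0" disables it in this sketch */
}

/* Reads the environment variable named in the thread. */
int use_real_vbo(void) {
    return vbo_enabled_from_value(getenv("LIBGL_USEVBO"));
}
```

Running a game with `LIBGL_USEVBO=0` and again without it is then enough to compare both paths on the same build.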

kas1e commented 5 years ago

Oh, that's good! Sadly I can't test it till tomorrow (not at home). But very interesting for sure! Did you try foobillard on the Pandora with it yet? Did it give any improvements on your side?

ptitSeb commented 5 years ago

I haven't tried yet on the Pandora. I tested on a Linux VM and the speed increase was minimal, but I wasn't expecting much there.

kas1e commented 5 years ago

"but I wasn't expecting much there."

Because of the VM?

ptitSeb commented 5 years ago

Yes, it's a VM, and the "graphic card" is emulated (VirtualBox 3D support), so its behaviour is a bit odd sometimes (especially performance-wise).

ptitSeb commented 5 years ago

So, I tested on the Pandora. I get something like a 7% FPS increase in foobillard++. Not much, but still something (remember that the Pandora is not very sensitive to VBO, as the GPU uses memory shared with the CPU).

ptitSeb commented 5 years ago

But in ManiaDrive, I go from around 16fps to 27fps! That's strangely good!

kas1e commented 5 years ago

Even +7fps sounds good, fingers crossed for tomorrow's tests. Is ManiaDrive open source, so I can port it too?

ptitSeb commented 5 years ago

Yes, Maniadrive is opensourced. But it may be tricky to port. http://maniadrive.raydium.org/ it's all there.

ptitSeb commented 5 years ago

I tried Critical Mass http://criticalmass.sourceforge.net/critter.php as it uses glList. But getting reliable fps is difficult, and it's pretty smooth already, so I haven't noticed any changes with or without VBO on the Pandora.

ptitSeb commented 5 years ago

Cubosphere https://sourceforge.net/projects/cubosphere/ also uses glList. On tutorial level 6, I go from 45 to 49fps... so that's almost 10%.

kas1e commented 5 years ago

While I'm still not at home, we are trying to understand what is wrong with the speed in some Irrlicht examples, and Daniel asks: all Irrlicht examples now use VBO, but only for vertex data, not for indices? I.e. indices never go in a VBO right now, right?

ptitSeb commented 5 years ago

So, 2 things:

  1. The 02.Quake3Map sample doesn't use VBO. It's not that gl4es filters it, it's the sample itself: there is no glBindBuffer in the OpenGL trace of it.
  2. You are right, for now, for software that uses VBO, only GL_ARRAY_BUFFER is used, the indices (GL_ELEMENT_ARRAY_BUFFER) are still emulated. It's a bit more complicated to use VBO there, but I have to think about whether I can find some easy solution. Note that for glList, when VBO are used, indices can also go in VBO there :)

kas1e commented 5 years ago

So Irrlicht doesn't use VBO all the time, only sometimes? I.e. in TerrainRendering there was glBindBuffer, while in Quake3Map there is not?

ptitSeb commented 5 years ago

Yep. I guess for the Quake3Map sample the rendering is similar to quake3: the loop filters out non-visible triangles, sorts them, groups them by material, then does large draws of them, like in quake3 (and this seems confirmed by the gles2 capture I have done of it). This kind of rendering doesn't use VBO, because the vertex data can be considered ever changing, so putting it in a VBO for a single access is a waste (well, truly, the vertices could be put in a VBO and only the indices could be dynamic, but it seems that's not the case).

ptitSeb commented 5 years ago

So I just pushed handling of VBO for indices. On the Pandora, the effect is really minimal: the TerrainSample goes from 97fps to 98fps, so that's roughly a 1% speed increase... You'll probably see a larger effect on the Amiga.

kas1e commented 5 years ago

Started testing. First impression after I tested foobillard++: OMG! I mean OMG!!

foobillard++ gives a +100% increase. Before, it was 30 fps here, 40 there. Now the FPS stays around 70 nearly all the time, and less than that is quite rare. It also started to react a bit differently than before: previously, when I increased the size of the balls/table (to make them very big) and then started to play, everything could drop to 17-20 fps. Now it stays at 60-70 all the time.

That is just VERY COOL. I even can see visually, how smooth everything start to play. Now its like you play on windows, when all smooth, etc :) Even those left/top panels now shows pretty fast.

Next thing I tried was quake3. No difference at all, not a single fps (but that is expected, imho?)

Next thing I tried was friking-shark. +1-2fps, no more.

Next thing I tried was the Irrlicht examples. Here is a table (this time with libgl_fps, taking the average result after ~30 seconds, to be more exact):

Example                     5 commits back (fps)    last commit (fps)

02.Quake3Map                        145                  145 (no change)
03.CustomSceneNode                 2026                 2040 (+24 fps)
04.Movement                        1257                 1252 (-5 fps)
08.SpecialFX                        259                  262 (+3 fps)
09.MeshViewer                       182                  179 (-3 fps)
10.Shaders                          697                  687 (-10 fps)
11.PerPixelLighting                 420                  412 (-8 fps)
12.TerrainRendering                 546                  590 (+45 fps)
13.RenderToTexture                  673                  672 (-1 fps)
15.LoadIrrFile                      767                  784 (+18 fps)
16.Quake3MapShader                   59                   59 (no change)
18.SplitScreen                       34                   34 (no change)
20.ManagedLights                    242                  245 (+3 fps)
26.OcclusionQuery                  1374                 1533 (+159 fps)

So, the ones that already gained a lot from the first VBO update (12.TerrainRendering and 26.OcclusionQuery) got a boost again: TerrainRendering +45 fps (!) and OcclusionQuery +160 fps (!). Pretty good. Some other examples gain 10-15 fps, and the shader one loses 10 fps for some reason. But on the whole the examples are faster, and those two are again clearly faster.

It can also be seen that those 3 "slow" examples didn't change even a little bit, so that's another piece of info about them for the other thread.

I also checked the ones I have already released:

Prototype: a few fps less compared with the release I did half a year ago, so I can say no changes.

Neverball:

In the release version I saw 47 fps max in the menu animation and 75 fps max in game. Now, with the latest gl4es and with setenv LIBGL_BATCH 0-40 (I used that for the release version too), I see values of 62 in the menu animation screen, and in the game itself I now have 90 fps instead of 75.

In Neverputt I see no differences.

So all in all pretty good work already. But Foobillard especially is just the bomb :)

ptitSeb commented 5 years ago

Yeah, quake3 doesn't use VBO or glList, so not much was expected (unless the small shader optim does something, but your graphics card is pretty good so this optim probably doesn't change anything in a measurable way).

For the Irrlicht samples, not all of them use VBO, so the ones not using them get nothing. I don't think Irrlicht uses glList.

Prototype, I don't think it use VBO or glList, so it shouldn't change.

As expected, foobillard++ makes heavy use of glList, so you can see the nice boost.

Neverball should use a few glList, so you see some boost too.

The interesting point is foobillard++: here, there is a heavy use of glList (mostly everything drawn is in some glList), so using LIBGL_USEVBO=0 and =1, you can see exactly the cost of the BigEndian->LittleEndian conversion of the vertex and index data. That can give you an indication of the time spent during the conversion in the Quake3Map Irrlicht sample (or even Quake3).
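
To make the conversion cost more concrete, here is a minimal C sketch of the kind of per-element byte swapping a big-endian host has to do before little-endian hardware can read the data. The function names are hypothetical and this is not the gl4es code path; it only assumes 16-bit indices and 32-bit vertex components.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit value (e.g. one GL_UNSIGNED_SHORT index). */
uint16_t swap16(uint16_t v) {
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Swap the four bytes of a 32-bit value (e.g. the raw bits of one float). */
uint32_t swap32(uint32_t v) {
    return ((v >> 24) & 0x000000FFu) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | ((v << 24) & 0xFF000000u);
}

/* Convert a whole index array; the work grows linearly with count. */
void swap_indices(uint16_t *dst, const uint16_t *src, size_t count) {
    for (size_t i = 0; i < count; ++i) dst[i] = swap16(src[i]);
}

/* Convert vertex data viewed as 32-bit words (3 words per xyz position). */
void swap_words(uint32_t *dst, const uint32_t *src, size_t count) {
    for (size_t i = 0; i < count; ++i) dst[i] = swap32(src[i]);
}
```

If this loop runs on every draw, its cost scales with draws times data size. With the data held in a real VBO it only needs to run when the buffer is filled, which is consistent with the difference described above.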

kas1e commented 5 years ago

Are there any other functions which could benefit from that "real VBO" usage? As I understand it, you have tried 3 so far: glNewList, glBindBuffer (which help a lot) and glLockArrays (that one sadly didn't help). Maybe there are others which can be tried?

I checked the Lugaru sources, for example, and found only a single call of glNewList, and it's just for font creation, so it will be no big help for sure. And the whole game is full of glBegin/glEnd blocks. Maybe it's possible to use the logic of libgl_batch, but then put it all in a real VBO instead of glDrawElements() or so?

ptitSeb commented 5 years ago

No, batch mode cannot really use VBO.

VBO is useful if the data is used multiple times. That happens with glNewList(...), which is why it's so effective with foobillard++. Software that uses VBO also reuses the data, which is why it's effective with glBindBuffer. For glLockArrays(...), the data is only used a few times (like between 1 and 6), and this seems not to be enough. Using the OGLES2 "vbo cache" is more effective there.
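
The reuse argument can be sketched as a small break-even model in C. All cost figures are made-up units chosen for illustration, not measurements from gl4es or from this thread, and the function name is hypothetical.

```c
/* Illustrative cost model (arbitrary units, assumed values):
     without VBO: the data is converted and sent on every draw
     with VBO:    one conversion plus an upload, then a cheap bind per draw
   Returns the smallest number of draws for which the VBO path is cheaper,
   or -1 if it never wins within max_draws. */
int vbo_break_even(double convert, double upload, double bind, int max_draws) {
    for (int n = 1; n <= max_draws; ++n) {
        double direct = convert * n;
        double with_vbo = convert + upload + bind * n;
        if (with_vbo < direct) return n;
    }
    return -1;
}
```

With the assumed values `vbo_break_even(10.0, 15.0, 1.0, 100)` the VBO path only wins from the third draw on, so data that is drawn once or twice per lock would not benefit, while display lists replayed every frame would.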

And I don't see any other practical use of VBO.

(There are some old versions of Lugaru that used VBO, and the source code seems to be almost ready to use VBO, but it's not there.)

kas1e commented 5 years ago

I have been playing foobillard for half an hour. I even put everything at maximum, I mean full details, very high everywhere, cubosphere look, etc.: everything is still fast and smooth.

But I found a bug! Not sure of course whether it's gl4es or our drivers, as usual. This kind of bug also happens with the previous version (without VBO), but with VBO (or maybe because of other changes done since then) it almost disappears. The issue: if you choose "View options / Ball Detail / Very High" and then move the camera very close, you can see that in the version without VBO the ball starts to be distorted (only when the camera is very close to it, so the ball takes almost 1/4 of the screen). Once you move the camera back to make the ball smaller, it is OK again.

With the VBO version, that bug almost completely disappears, but some triangles in all balls are still "out". See what I mean:

Distorted ball in "Very high" details when moving too close in older gl4es:

http://kas1e.mikendezign.com/aos4/gl4es/games/foobillardplusplus/foo_very_hight_old.jpg

Distorted ball in "Very high" details when moving too close in latest gl4es:

http://kas1e.mikendezign.com/aos4/gl4es/games/foobillardplusplus/foo_very_hight_new.jpg

That happens on all balls, not only on the main one. On win32 all seems OK. And it may be that it is "almost" fixed not because of VBO but because of some previous fixes. But it's probably worth noting.

ptitSeb commented 5 years ago

Yeah, when I did some captures of foobillard++ to check VBO, I noticed that, with ball detail very-high, some glList had indices that need GL_UNSIGNED_INT, but for now, glList only uses GL_UNSIGNED_SHORT. So for arrays with more than 65535 indices, I know, there are some artifacts.

Adding GL_UNSIGNED_INT in glList is a lot of work, and will probably slow down glList handling a bit. So I'm unsure if I'll add the support.
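
A short C sketch of why 16-bit indices produce stray triangles: any index value above 65535 wraps around when stored as GL_UNSIGNED_SHORT and ends up pointing at an unrelated vertex. The helper names are hypothetical, and the sketch assumes the artifacts come from index values beyond the 16-bit range.

```c
#include <stddef.h>
#include <stdint.h>

/* Narrow a 32-bit index to 16 bits; values above 65535 wrap modulo 65536. */
uint16_t narrow_index(uint32_t idx) {
    return (uint16_t)idx;
}

/* Returns 1 if every index fits in a 16-bit index type, 0 otherwise. */
int indices_fit_ushort(const uint32_t *idx, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        if (idx[i] > UINT16_MAX) return 0;
    }
    return 1;
}
```

For example, index 70000 becomes 70000 - 65536 = 4464, so the triangle is drawn with vertex 4464 instead of vertex 70000.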

kas1e commented 5 years ago

Then yep, better to skip it: the artifact is very small, and there is no need to lose speed for an unimportant fix.

ptitSeb commented 5 years ago

You can, maybe, change the code a bit to avoid the "Very high" settings to have such large set.

In src/option.h I see

#define options_max_ball_detail_LOW 3
#define options_max_ball_detail_MED 4
#define options_max_ball_detail_HIGH 5
#define options_max_ball_detail_VERYHIGH 7

So maybe just change 7 to 6?

kas1e commented 5 years ago

Will check

Btw, I talked with Daniel about those VBO things. He says that glLockArrays is indeed the first and most natural thing to do, and that it must give a speedup. Maybe it only fails with quake3 for some reason, but other software may benefit from it?

ptitSeb commented 5 years ago

Well, the problem with quake3 and glLockArrays(...) is that it locks with, like, 3 active arrays (vertex, color and texture), so I create a VBO with all 3, but then it changes the color and texture arrays, invalidating part of the VBO. So I made some changes to create a VBO only for what seems relevant... But that still doesn't seem to be correct. I may have to trace stuff a bit again, to be sure that what is happening is what I had in mind...

ptitSeb commented 5 years ago

I tried Minetest, a minecraft clone made with irrlicht. On my old build, I go (roughly, benchmark is not easy there) from 18fps without VBO to 25fps with VBO. So, nice boost on the Pandora here.

ptitSeb commented 5 years ago

@kas1e I don't know if you retried, but I think SuperTuxKart 0.6.2a uses glList, so you may see some speed boost with latest gl4es.

kas1e commented 5 years ago

Yeah, Minetest is on my todo list as well (I already have a binary, but I had to disable some parts of the code to make it compile, which needs to be dealt with first).

And yep, I will try SuperTuxKart now. I didn't try it yesterday, but I was thinking about it :)

About glLockArrays: is the issue with 3 active arrays, where the color and texture arrays can change, a general one, or is it quake3 only? I mean, glLockArrays is probably used everywhere, but is it always used the same way as in quake3, or differently? If it differs, it may be worth (at least for testing purposes now) keeping LIBGL_USEVBO 2 for plain glLockArrays, without splitting into 3 different VBOs or changing some of them, and LIBGL_USEVBO 3 for the "quake3 kind of behaviour"?

ptitSeb commented 5 years ago

For glLockArrays(...) I was thinking of something like that (the LIBGL_USEVBO=3), but I still need to understand why the current method doesn't work as intended. I need some GLES2 captures to analyse...

kas1e commented 5 years ago

Tested SuperTuxKart: it probably adds a little something, but nothing really worth noting. Maybe 1 fps here and there. I think with that old SuperTuxKart we may be hitting the same issue we have with those few Irrlicht examples.

As for glLockArrays, Daniel was really sure it should speed things up. He says: "Hm, having VBOs for lockarrays is the most natural thing to do and it should give the expected speedup". Like, it's something which for sure should speed things up.

But it could also be that it gives no speedup on our hardware. In that case, though, it should at least give one on the Pandora.

kas1e commented 5 years ago

Checked old SuperTuxKart for glNewList: I found it only in gui/widget.cpp, so it is probably of no big use there, and that's why I can't see much speedup.

ptitSeb commented 5 years ago

I thought PLIB used some glNewList, but I don't remember well. I'll do a GLES2 capture to see if I see some VBO created.

kas1e commented 5 years ago

Ah yeah, PLIB again. I forgot to check PLIB again and only checked the SuperTuxKart sources :)

kas1e commented 5 years ago

If it is of any help, I tried to refresh what we discussed before about VBO usage with quake3 and stuff, and there were two ideas:

--first way:

Create a VBO with vertex position, color, texcoords, normals, but non interleaved. And only update changed color / texcoords. Then something like glBufferSubData() can be used to upload just the changed attributes.

Probably that one is what you tried to do right now?
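
As a sketch of the "first way", the C code below computes where each attribute block would sit in one non-interleaved buffer, so that a changed color or texcoord block can be re-uploaded on its own. The struct, the function name and the attribute formats (3 floats for position, 4 bytes for color, 2 floats for texcoords) are assumptions for illustration, not the layout gl4es uses.

```c
#include <stddef.h>

/* Byte ranges of each attribute block inside one non-interleaved buffer. */
typedef struct {
    size_t pos_offset, pos_size;
    size_t color_offset, color_size;
    size_t tex_offset, tex_size;
    size_t total_size;
} planar_layout;

/* Assumed formats: position = 3 floats, color = 4 unsigned bytes,
   texcoord = 2 floats. Blocks are stored one after another. */
planar_layout make_planar_layout(size_t vertex_count) {
    planar_layout l;
    l.pos_offset = 0;
    l.pos_size = vertex_count * 3 * sizeof(float);
    l.color_offset = l.pos_offset + l.pos_size;
    l.color_size = vertex_count * 4;
    l.tex_offset = l.color_offset + l.color_size;
    l.tex_size = vertex_count * 2 * sizeof(float);
    l.total_size = l.tex_offset + l.tex_size;
    return l;
}
```

With such a layout, updating only the colors would be a single call of the form `glBufferSubData(GL_ARRAY_BUFFER, l.color_offset, l.color_size, colors)`, leaving the position block untouched.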

--second way:

Create a VBO with only the vertex positions, with colors, texcoords and normals kept out of that VBO. At the time you were unsure how standard this is (some vertex attributes in one VBO and some in another), and unsure how the various GLESv2 drivers would accept this kind of thing.

But then we found this link: https://www.khronos.org/opengl/wiki/Vertex_Specification_Best_Practices#Formatting_VBO_Data which says that "This is an entirely valid and normal way to use VBOs"

And also someone (it was Hans or Daniel), saying that remaining attributes would go into a separate VBO in this variation. Basically, you'd have one VBO for static data, and one VBO for the dynamic ones (with appropriate usage hints: GL_STATIC_DRAW and GL_STREAM_DRAW). This option would probably work best on OpenGL implementations where the driver can put GL_STREAM_DRAW buffers in GART space.
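
The static/dynamic split can be sketched in C as a per-attribute choice of usage hint. The struct and function names are hypothetical; the two enum values are the numeric values of GL_STATIC_DRAW and GL_STREAM_DRAW, redefined locally so the sketch compiles without GL headers.

```c
#include <stddef.h>

/* Local copies of the GL usage hint values, to keep the sketch standalone. */
enum { USAGE_STATIC_DRAW = 0x88E4, USAGE_STREAM_DRAW = 0x88E0 };

/* Hypothetical description of one vertex attribute. */
typedef struct {
    const char *name;
    int changes_between_draws;   /* 1 if the application rewrites it often */
} attrib_info;

/* Attributes that keep changing go to the stream buffer, the rest to the static one. */
int usage_for(const attrib_info *a) {
    return a->changes_between_draws ? USAGE_STREAM_DRAW : USAGE_STATIC_DRAW;
}

/* Count how many attributes would land in the stream (dynamic) VBO. */
size_t count_stream(const attrib_info *attribs, size_t n) {
    size_t streamed = 0;
    for (size_t i = 0; i < n; ++i) {
        if (usage_for(&attribs[i]) == USAGE_STREAM_DRAW) ++streamed;
    }
    return streamed;
}
```

In the quake3 case described earlier in the thread, positions would be the static attribute while colors and texcoords would go to the stream buffer.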

ptitSeb commented 5 years ago

For quake3, the current method should be "--second way". But I'll do a capture to be sure.

In the meantime, I did a capture of old STK, and I can see that everything (at least the game objects) is inside VBO now. The issue is that, on the starting line, there are 2002 draw calls. So yeah, 2k draw calls per frame, even with VBO, is still slow. Maybe add back the "printf" in src/fpe.c to check if VBO are used, to be sure your PLIB version does glNewList calls.

Also, I noticed a little room for optimization in VBO handling. I'll implement that (I don't think it will bring any speedup, but who knows...)

kas1e commented 5 years ago

"Maybe add back the 'printf' in src/fpe.c to check if VBO are used, to be sure your PLIB version do glNewList call."

You mean here?

void fpe_glDrawElements(GLenum mode, GLsizei count, GLenum type, const GLvoid *indices) {
    DBG(printf("fpe_glDrawElements(%s, %d, %s, %p), program=%d, instanceID=%u\n", PrintEnum(mode), count, PrintEnum(type), indices, glstate->glsl->program, glstate->instanceID);)
    LOAD_GLES2(glBindBuffer);
    void* scratch = NULL;
    realize_glenv(mode==GL_POINTS, 0, count, type, indices, &scratch);
    LOAD_GLES(glDrawElements);
    int use_vbo = 0;
    // use the real element buffer only if "indices" points inside its client-side copy
    if(glstate->vao->elements && glstate->vao->elements->real_buffer && indices>=glstate->vao->elements->data && indices<=(glstate->vao->elements->data+glstate->vao->elements->size)) {
        use_vbo = 1;

        printf("We USE VBO, yeah !\n");    // temporary trace added for this test

        gles_glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, glstate->vao->elements->real_buffer);
        // turn the client pointer into a byte offset inside the bound buffer
        indices = (GLvoid*)((uintptr_t)indices - (uintptr_t)(glstate->vao->elements->data));
    }
    gles_glDrawElements(mode, count, type, indices);
    if(scratch) free(scratch);
    if(use_vbo) gles_glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);    // unbind the element buffer again
}
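
The range check and pointer arithmetic in that function can be isolated into a small standalone C sketch. The `shadow_buffer` struct and `pointer_to_offset` are hypothetical stand-ins for the gl4es state shown above, kept free of GL types so the logic can be tried on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffer object that has a client-side copy. */
typedef struct {
    const unsigned char *data;   /* client-side copy of the buffer contents */
    size_t size;                 /* size of that copy in bytes */
    unsigned int real_buffer;    /* name of the real VBO, 0 if there is none */
} shadow_buffer;

/* If ptr lies inside the client-side copy and a real buffer exists, store the
   byte offset to pass to the draw call and return 1; otherwise return 0.
   The upper bound is inclusive, mirroring the check in the code above. */
int pointer_to_offset(const shadow_buffer *buf, const void *ptr, uintptr_t *offset) {
    if (buf == NULL || buf->real_buffer == 0) return 0;
    uintptr_t base = (uintptr_t)buf->data;
    uintptr_t p = (uintptr_t)ptr;
    if (p < base || p > base + buf->size) return 0;
    *offset = p - base;
    return 1;
}
```

When the function returns 1, the caller would bind `real_buffer` and pass the offset in place of the pointer; when it returns 0, the indices are treated as plain client memory, so no "We USE VBO" trace would be printed.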