ptitSeb / gl4es

GL4ES is an OpenGL 2.1/1.5 to GL ES 2.0/1.1 translation library, with support for Pandora, ODroid, OrangePI, CHIP, Raspberry PI, Android, Emscripten and AmigaOS4.
http://ptitseb.github.io/gl4es/
MIT License

quake3 on amigaos4 #52

Closed kas1e closed 6 years ago

kas1e commented 6 years ago

@ptitSeb I was able to build quake3 from your sources for both minigl and gl4es. In the minigl version everything looks and works correctly. In the gl4es version the game starts and the first animation plays fine (the one with ID shooting in), but then, where the "CD KEY, please enter your cd key, ACCEPT" screen should be shown, I get nothing, just ID still there. If I then press "enter" again to go to the menu, I can see that something starts to change visually, but it is all distorted and wrong. Have a look at the screenshot:

http://kas1e.mikendezign.com/aos4/gl4es/games/quake3/first_run.jpg

And here is the console output: http://kas1e.mikendezign.com/aos4/gl4es/games/quake3/first_run_console_output.txt

At the moment it is yesterday's version, with the workaround we made enabled and almost no debug enabled (only the fpe one, it seems).

So for now I will try to build the latest commit (where we have no workarounds), and will also enable full debug everywhere.

ptitSeb commented 6 years ago

Got my trace. Now I need to analyse it.

kas1e commented 6 years ago

It's not only slow with the mirror; the problem is just most visible there. The whole framerate is about 3-4 times slower everywhere, so it's not related to a particular effect but to something more general..

ptitSeb commented 6 years ago

I think I have an idea why it's slower: it seems to use a clip plane. Look at the vertex shader:

#version 100
precision mediump float;
precision mediump int;
uniform highp mat4 _gl4es_ModelViewMatrix;
uniform highp mat4 _gl4es_ModelViewProjectionMatrix;
attribute highp vec4 _gl4es_Vertex;
attribute lowp vec4 _gl4es_Color;
attribute highp vec4 _gl4es_MultiTexCoord0;
// FPE_Shader generated
varying vec4 Color;
uniform highp vec4 _gl4es_ClipPlane_0;
varying mediump float clippedvertex_0;
varying vec2 _gl4es_TexCoord_0;

void main() {
    vec4 vertex = _gl4es_ModelViewMatrix * _gl4es_Vertex;
    clippedvertex_0 = dot(vertex, _gl4es_ClipPlane_0);
    gl_Position = _gl4es_ModelViewProjectionMatrix * _gl4es_Vertex;
    Color = _gl4es_Color;
    _gl4es_TexCoord_0 = _gl4es_MultiTexCoord0.xy;
}

I know that the way I implemented clip planes is probably not the best way. I'll try to disable them to see if it improves things.
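For readers following along, here is a minimal C sketch of the per-vertex test the shader above encodes. The dot product is the same one written to clippedvertex_0. The fragment shader is not shown in this thread, so the assumption here is that a negative value means "clipped", which is the glClipPlane convention in desktop OpenGL. The names Vec4, clip_distance and is_clipped are illustrative only and are not gl4es identifiers.

```c
#include <stdbool.h>

/* Hypothetical 4-component vector, standing in for a GLSL vec4. */
typedef struct { float x, y, z, w; } Vec4;

/* Same dot product the vertex shader stores in clippedvertex_0:
 * eye-space vertex against the clip plane equation (a, b, c, d). */
static float clip_distance(Vec4 eye_vertex, Vec4 plane)
{
    return eye_vertex.x * plane.x + eye_vertex.y * plane.y +
           eye_vertex.z * plane.z + eye_vertex.w * plane.w;
}

/* Assumption: a point is kept when the distance is >= 0 and
 * clipped when it is negative (glClipPlane convention). */
static bool is_clipped(Vec4 eye_vertex, Vec4 plane)
{
    return clip_distance(eye_vertex, plane) < 0.0f;
}
```

With the plane (0, 0, 1, 0), a vertex at z = 1 is kept and a vertex at z = -1 is clipped. The cost of one extra varying and one dot product per vertex is small on its own, which fits the later finding that clip planes were not the main slowdown.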

ptitSeb commented 6 years ago

Yes, if I disable clip planes (line 219 of src/glx/hardext.c, change to hardext.maxplanes = 0;//6;) I get 22fps (like on GLES1.1). But the mirror doesn't render correctly.

kas1e commented 6 years ago

Will check now. If it is at least on par with GLES1.1 for you without them, maybe it will now be faster than the minigl version for me. 10 mins and we will know :)

kas1e commented 6 years ago

In my case, sadly, disabling clip planes makes almost no difference, just +4 fps. I ran a test on timedemo1/demo four, and:

minigl: 1260 frames 15.1 seconds, 83.2 fps
gl4es_no_clipplane: 1260 frames 56.6 seconds, 22.3 fps
gl4es_all_as_before: 1260 frames 58.2 seconds, 21.7 fps

About 4 times slower, while it should probably be 2-3 times faster :) At least in theory.
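The "about 4 times slower" figure can be checked directly from the frame counts and run times of the timedemo. The C sketch below only restates that arithmetic; the function names are made up for illustration.

```c
/* Average frames per second for a timedemo run. */
static double average_fps(int frames, double seconds)
{
    return (double)frames / seconds;
}

/* How many times slower run A is than run B, for the same frame count. */
static double slowdown(double seconds_a, double seconds_b)
{
    return seconds_a / seconds_b;
}
```

For example, slowdown(58.2, 15.1) compares the gl4es run with the minigl run and gives a factor between 3.8 and 3.9, and slowdown(58.2, 56.6) shows that disabling clip planes changed the run time by only about 3%.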

ptitSeb commented 6 years ago

Well, are you sure MiniGL is doing TnL in software?

Also, let's wait for some profiling from Daniel.

kas1e commented 6 years ago

Yes, 100%. And it also works through Warp3D (the one which OGLES2 uses). And even if it weren't, the GLES2 version shouldn't be slower, but at least the same or a little bit faster (because of shaders). But as minigl does TCL in software, the gl4es version should be 100% faster.

Let's wait and see what Daniel finds..

kas1e commented 6 years ago

While Daniel is checking it, I also got a note from Hans (the warp3d developer): "Maybe something is flushing the pipeline like crazy? That'll give a performance hit, because there's a limit on how many draw operations/command-queues can be submitted per second."

But that is probably not about gl4es, but about our ogles2 driver..

ptitSeb commented 6 years ago

The GLES2 trace I have done shows that when facing the mirror, there are around 550 draw commands. This seems reasonable.

Much GLES2 hardware also doesn't like having many draw commands, so gl4es tries to group them as much as it can.

kas1e commented 6 years ago

As the author of warp3d says: "5fps * 550 = 2750 draw calls/s. We can manage a lot more than that, so something must be getting in the way."

Also, if you say that gl4es tries to group them as much as it can, then it shouldn't be a "flushing the pipeline like crazy" issue :(

ptitSeb commented 6 years ago

What is even stranger is that the other games that already work use similar stuff anyway. Still, there must be either something OGLES2/Warp3D doesn't like in the shaders or something in how the data is fed to the driver. Just to note, gl4es doesn't use any VBO for now (I plan to try using them, but for now all VBO are emulated), and the arrays generated by glBegin(..) / glEnd() are not interleaved, they are separate arrays (I'll try to work on that also, I think it can help performance).
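To make the "separate arrays" versus "interleaved" distinction concrete, here is a small C sketch that packs three separate attribute arrays into one interleaved array. The struct layout and the function name are assumptions chosen for illustration; this is not how gl4es stores its data.

```c
#include <stddef.h>

/* Hypothetical interleaved layout: all attributes of one vertex sit together. */
typedef struct {
    float pos[3];
    float tex[2];
    unsigned char color[4];
} InterleavedVertex;

/* Packs separate position, texcoord and color arrays into one array
 * in which each vertex's attributes are contiguous in memory. */
static void interleave_arrays(const float *pos, const float *tex,
                              const unsigned char *color, size_t count,
                              InterleavedVertex *out)
{
    for (size_t i = 0; i < count; ++i) {
        for (size_t k = 0; k < 3; ++k) out[i].pos[k] = pos[3 * i + k];
        for (size_t k = 0; k < 2; ++k) out[i].tex[k] = tex[2 * i + k];
        for (size_t k = 0; k < 4; ++k) out[i].color[k] = color[4 * i + k];
    }
}
```

With an interleaved layout, each attribute pointer would use sizeof(InterleavedVertex) as its stride, so the GPU reads one contiguous record per vertex instead of jumping between three arrays. Whether that helps on a given driver is something to measure rather than assume.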

kas1e commented 6 years ago

Oh, emulated VBO :( Could that be the reason, maybe? Daniel says that other projects also use VBO a lot and all work fine, but he didn't know until now that gl4es emulates them in software.. IMHO that could surely be the reason?

kas1e commented 6 years ago

And the games in question which already work and which we tested are just 3: bloboats, letters fall and cadog. All of them are very small and can't show any problems with speed.. Quake3 is IMHO the first test which "makes things a little harder".

ptitSeb commented 6 years ago

Just to note, VBO are not used by Quake3, as in most OpenGL 1.x games. But maybe the OGLES2 driver expects all its data in VBO, yes. Using actual VBO in gl4es requires some work. It was not designed to use VBO in the first place, so I need to alter many critical places. Using real VBO is part of my TODO, as I expect some speed boost on some architectures (but not on the Pandora, according to some preliminary tests done with Doom3), but it's not a small change...

kas1e commented 6 years ago

Thanks for explaining, we'll see what Daniel says about it after he profiles it on our side.

kas1e commented 6 years ago

Btw, does doom3 also work over gl4es?

ptitSeb commented 6 years ago

No, regular Doom3 doesn't use GLSL and will not work on gl4es (but I have to try Dhewm3 https://github.com/dhewm/dhewm3 and the BFG edition, which I think supports GLSL).

I use the Dante project, that is a direct GLES2 port of Doom3: https://github.com/omcfadde/dante (slightly adapted to the Pandora...)

ptitSeb commented 6 years ago

Mmmm, when I analysed the number of draw calls, I used the regular Pandora version, so using glDrawElements(...), which is quite optimized by the idTech3 engine. But on AmigaOS/GLES for now, it's using the glBegin(...)/glEnd() code path, which I don't know well. I have to check whether something odd or broken is happening in that case.

kas1e commented 6 years ago

@ptitSeb I had a very HOT discussion with the ogles2 author, and .. to begin with, he profiled it a bit, and this is what he says:

The reason for Q3 being so slow is that the game does practically zero batching. ogles2 is flooded by glDraw-calls of practically always less than or equal to 10 triangles. If I artificially limit ogles2 to ignore any draw-calls with more than 10 triangles, then everything looks like before. Drawing a scene like that is the ultimate most inefficient way to do things and one of the big "don'ts" in terms of GL. ogles2 is not optimized for what Q3+gl4es deliver right now and I probably won't optimize it for that kind of stuff. Everything will be fine as soon as you start to feed it with something other than single triangles. So the obvious solution is: extend gl4es instead to collect the data of such small draw calls and then issue a bigger one.

When he says "game", he probably means "when it is compiled over gl4es". I think he didn't mean the quake3 code, as that surely should do things right?

When you say "I used the regular Pandora version", do you mean a non-gl4es version, just some regular one? I mean, if it uses glBegin/glEnd on amigaos/gles, it should probably do the same for every other gl4es port when they work over a GLES2 backend?

kas1e commented 6 years ago

@ptitSeb Daniel also explained a bit further, so I will just copy+paste his answer, hope he doesn't mind (it will just help to make things better):

Right now Q3/gl4es draws a scene in a way that's no good with ogles2/Nova. The latter like rather big amounts of triangles. That's what they are designed for. And this is how you get good performance from them.

Making hundreds or thousands of draw-calls with less than 10 triangles each is missing the point of those libs. And it was never a good idea. Apparently you're lucky and other ogles2 implementations on other hardware aren't hurt by that so much. And apparently you're lucky that MiniGL/Warp3D(SI) is of some help in the background.

The thing is, like said before, that this type of inefficient drawing is not what 99% of ogles2 programs do. That's why I'm absolutely not convinced that it makes sense to optimize ogles2 also for this niche task. IMHO something like that has to be implemented in the next higher level (where it's also most likely easier to do and where other systems also benefit from it). The next higher level in case of the constellation here would be gl4es.
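The approach Daniel suggests, collecting the data of small draw calls and issuing one bigger call, can be sketched in C roughly as follows. This is only an illustration under simple assumptions: positions only, one texture id as the whole render state, and a fixed-size buffer. The names Batch, batch_add and batch_flush are hypothetical and are not gl4es or ogles2 functions; the real draw call is replaced by a counter.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BATCH_MAX_VERTS 4096

/* Hypothetical accumulator for small GL_TRIANGLES draws. */
typedef struct {
    float verts[BATCH_MAX_VERTS * 3]; /* xyz per queued vertex */
    size_t count;                     /* vertices currently queued */
    unsigned texture;                 /* state shared by the queued vertices */
    size_t flushes;                   /* number of "real" draw calls issued */
} Batch;

/* Issues everything queued so far as one draw call. */
static void batch_flush(Batch *b)
{
    if (b->count == 0)
        return;
    /* A real implementation would issue a single draw call here,
     * e.g. glDrawArrays(GL_TRIANGLES, 0, count). This sketch only counts. */
    b->flushes++;
    b->count = 0;
}

/* Queues one small draw. Flushes first if the state differs or the
 * buffer would overflow. Returns false if the draw cannot be queued. */
static bool batch_add(Batch *b, unsigned texture, const float *xyz, size_t nverts)
{
    if (nverts > BATCH_MAX_VERTS)
        return false; /* too large for this sketch: caller must draw directly */
    if (b->count > 0 &&
        (texture != b->texture || b->count + nverts > BATCH_MAX_VERTS))
        batch_flush(b);
    b->texture = texture;
    memcpy(b->verts + b->count * 3, xyz, nverts * 3 * sizeof(float));
    b->count += nverts;
    return true;
}
```

In this sketch, one hundred single-triangle draws with the same texture end up as one flush instead of one hundred draw calls. Any state change between two draws forces a flush, which is why merging only works for draws that are compatible with each other.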

ptitSeb commented 6 years ago

By regular, I mean using gl4es and all extensions. So it uses glDrawElements(...) and the calls are batched (with 550 calls to draw the initial scene in front of the mirror).

But remember, on AmigaOS4 we have disabled that extension, and Q3 uses a glBegin(...)/glEnd() loop to draw. Now, GL4ES should try to batch these calls; there is some code to do just that. But I haven't checked on Q3 whether the "collapse" code is working. I'll check tonight on the Pandora: I'll disable the glDrawElements extension and do another GLES2 trace capture to see what is happening. If there are many small batches of 10 triangles, the Pandora will just run at 1fps, so I'll see it. I'll then try to see why the collapse code is not working (it should: batching draw calls is a sure source of fps!). In the meantime, you can try the env. variable LIBGL_BEGINEND=2 to make gl4es try harder to batch glBegin/glEnd calls. Maybe it will help?

(also, don't forget that once the fix for the vertex attribs is done, you can enable all extensions and have 550 calls per frame)
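As a rough illustration of how a library can pick up a setting like LIBGL_BEGINEND from the environment, here is a generic C sketch. It is not gl4es's actual parsing code; the helper names and the fallback behaviour are assumptions.

```c
#include <limits.h>
#include <stdlib.h>

/* Parses a decimal integer setting; returns fallback when the text
 * is missing, empty, not a whole number, or out of int range. */
static int parse_int_setting(const char *text, int fallback)
{
    if (text == NULL || *text == '\0')
        return fallback;
    char *end = NULL;
    long value = strtol(text, &end, 10);
    if (end == text || *end != '\0' || value < INT_MIN || value > INT_MAX)
        return fallback;
    return (int)value;
}

/* Reads an integer setting from the environment (hypothetical helper). */
static int env_int_setting(const char *name, int fallback)
{
    return parse_int_setting(getenv(name), fallback);
}
```

A call such as env_int_setting("LIBGL_BEGINEND", 1) would then return 2 when the variable is set to 2 before the program starts. The fallback of 1 is only a placeholder, since the default value is not stated in this thread.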

ptitSeb commented 6 years ago

Just to be clear: I do agree that making many small calls of a few triangles each instead of a single large call with many triangles is bad for performance. I'm well aware of that, and that's why I worked on gl4es to try to avoid it by batching as much as I can. Unfortunately, this kind of thing is not uncommon in games, but usually gl4es is able to batch reasonable chunks of triangles. The Warp3D hardware is more powerful than what the Pandora has, so there must be something wrong indeed, and we'll find out what it is.

kas1e commented 6 years ago

Ok, thanks for the help! In the meantime I will try LIBGL_BEGINEND=2.

But did I understand you right that once GL_EXT_compiled_vertex_array works, it will do the 550 calls as one call (i.e. batch them all), and not like now, when it does 550 small calls?

ptitSeb commented 6 years ago

Yes, with GL_EXT_compiled_vertex_array you will have the same rendering as I had yesterday on the Pandora. The idTech3 engine is pretty good at batching calls. It was not the case with idTech1 and 2 (because of the software rendering), but, again, I don't know the "glBegin/glEnd" path well, maybe it's fragmented. I'll check tonight.

But you will have 550 "large" calls. If I believe Daniel, the current Q3 renders with thousands of small calls, not hundreds of large ones. 550 calls per frame is a good value.

kas1e commented 6 years ago

Yeah, we can do much more than 550 calls per frame on ogles2/nova, you just can't throw hundreds or thousands of such micro-draw-calls per frame at ogles2/Nova AND expect it to deliver fast results ... :)

If it becomes 550 large calls which hold all the quake triangle data, that will probably boost performance a lot?

ptitSeb commented 6 years ago

I took a quick look at the code.

Main drawing function is here: https://github.com/ptitSeb/ioq3/blob/master/code/renderergl1/tr_shade.c#L177 there is some #ifdef HAVE_GLES that is the pure 1.1 GLES renderer. But for gl4es, it's a regular build so HAVE_GLES is not defined. As you can see, because qglLockArraysEXT is undefined (it comes with the extension), the code will call R_DrawStripElements( numIndexes, indexes, qglArrayElement ); This function does that (quoting code comments):

/*
==================
R_DrawElements
Optionally performs our own glDrawElements that looks for strip conditions
instead of using the single glDrawElements call that may be inefficient
without compiled vertex arrays.
==================
*/

If you look at the code of this function (start here: https://github.com/ptitSeb/ioq3/blob/master/code/renderergl1/tr_shade.c#L177 ) you'll see it tries to build TRIANGLE_STRIPs, so it makes more calls instead of one single glBegin/glEnd, as that was more optimised in the early days. I don't know how your AmigaOS miniGL handles this, but the only way to collapse those triangle strips is to turn them back into individual triangles... gl4es is supposed to do that, but maybe it doesn't for some reason (and so LIBGL_BEGINEND=2 will not help here).
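Turning a strip back into individual triangles is a standard index expansion, sketched below in C. This is a generic illustration, not the gl4es implementation, and the function name is made up. Every second triangle has its first two indices swapped so that all triangles keep the same winding order.

```c
#include <stddef.h>

/* Expands a triangle strip of n indices into (n - 2) separate triangles.
 * out must have room for 3 * (n - 2) indices.
 * Returns the number of indices written. */
static size_t strip_to_triangles(const unsigned short *strip, size_t n,
                                 unsigned short *out)
{
    size_t written = 0;
    if (n < 3)
        return 0;
    for (size_t i = 0; i + 2 < n; ++i) {
        if (i % 2 == 0) {
            out[written++] = strip[i];
            out[written++] = strip[i + 1];
        } else {
            /* Odd triangles in a strip are wound the other way round,
             * so swap the first two indices to keep a consistent winding. */
            out[written++] = strip[i + 1];
            out[written++] = strip[i];
        }
        out[written++] = strip[i + 2];
    }
    return written;
}
```

Once several strips are expanded this way, their triangles can be appended to one index list and drawn with a single GL_TRIANGLES call, which is what makes merging possible. The price is more indices per triangle than the strip form.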

kas1e commented 6 years ago

I think (well, not me, Daniel pointed me to it) that minigl (which works not over our warp3d which has shaders, but over some older warp3d) probably has "batching of calls" inside.

But we need to know how it reacts for you once you disable extensions. Though, your ogles2 driver may also have "batching" inside?

But as with the current gl4es build we have many calls with less than 10 triangles, the "batching" code in gl4es probably doesn't work when we disable extensions? But that is to be seen once you run quake3 without extensions on the Pandora.

Btw, is there any way I can check that the environment settings work at all? I.e. some simple test variable which, when set, makes it visible that gl4es changes things?

kas1e commented 6 years ago

Btw, the "550 calls" figure is from your Pandora check; how many we have, dunno. All Daniel says is: it's a lot, lot, lot of calls with less than 10 triangles each.

ptitSeb commented 6 years ago

Ok, I have done an analysis when using the glBegin/glEnd path: it's bad. It seems that gl4es doesn't collapse the blocks (but it should). So I have a bug in gl4es that leads to bad performance in this case. I have 3906 draw calls (instead of 550), and yeah, performance is terrible on the Pandora too.

I have to fix that, as the code for collapsing the call is there, so it's "just" a bug somewhere...

kas1e commented 6 years ago

Oh, that's a start! That probably gives you only 1 fps? :)

ptitSeb commented 6 years ago

Yeah, a few fps, 2 or 3, not sure... awful anyway, I haven't let it run for long.

kas1e commented 6 years ago

:) well, as i have 5-6fps in that case, we can expect to have values better than on minigl (22fps there).

Fixing batching with glBegin/glEnd will probably help all the other stuff too.

ptitSeb commented 6 years ago

Yep, batching helps a few games...

kas1e commented 6 years ago

I can help with debugging, but there is probably no need as you have seen it all on the Pandora.. But as I can compile it all fast, let me know if I can be of any use there :)

ptitSeb commented 6 years ago

I found something. I'll push something soon.

ptitSeb commented 6 years ago

Ok, I have pushed the change. I get my 10fps back when facing the mirror. You should note some significant improvements too.

ptitSeb commented 6 years ago

I have made a capture, and it's back to ~500 calls. There are still a few draw calls that are not merged but seem to be compatible, but that's much better than before.

ptitSeb commented 6 years ago

Ok, found the last small bug that made some commands not merge when they should.

Number of calls is now down to ~300. I hope you'll like it :p

kas1e commented 6 years ago

Tested with 500 calls: without clamp it was only 13 fps. Now I will check with 300 :)

kas1e commented 6 years ago

With 300 I have only 14 fps :( and that is without clamp in shaders. Will enable them back and see what I get.

kas1e commented 6 years ago

When I put clamp in shaders back, it gives about the same 13-14 fps. I.e. for the same timedemo1/demo four, I now have:

minigl: 1260 frames 15.1 seconds, 83.2 fps
gl4es: 1260 frames 39.1 seconds, 32.2 fps

Better than before (~22 fps), but still faaar away from minigl with its TCL in software.. We are moving step by step, but still.. uhm, I will probably need to upload a new quake3 binary to Daniel for another profiling now..

ptitSeb commented 6 years ago

Well, now you can ask Daniel again to do some profiling. Activating glDrawElements(...) will not reduce the number of calls at this stage, but it will reduce the cpu load / number of malloc()/free() done...

kas1e commented 6 years ago

Btw, when I enable GL extensions in the MiniGL version, it gives me 41-42 fps instead of 22 when I look at the mirror.

ptitSeb commented 6 years ago

I have done some quick cpu profiling on the Pandora. With the extension, it's clearly GPU limited when facing the mirror, but without the extension, I can see it's more CPU limited (and I have 10fps vs 6fps at the initial position). So there may be some stuff to optimize.

kas1e commented 6 years ago

I just did some checking; this is what I have now on aos4 with current minigl and gl4es/ogles2:

With GL extensions OFF:

MGL/SDL1: 50.2
MGL/SDL2: 54.9
GL4ES/SDL1: 31.8

With enabled GL extensions:

MGL/SDL1: 81.9
MGL/SDL2: 89.1
GL4ES/SDL1: 35.8

That clearly shows that extensions help A LOT... Which ones though, I do not know. The list of supported MiniGL extensions is not as big as in GL4ES, but on startup Quake3 reports only the same 3 extensions:

using GL_EXT_texture_env_add
using GL_ARB_multitexture
using GL_EXT_compiled_vertex_array

So, is it probably GL_EXT_compiled_vertex_array which gives that huge boost? As the other 2 in the GL4ES version only add 4 fps.

But the "no extensions" version also looks about 1.7 times slower.

kas1e commented 6 years ago

That is for timedemo1/demo four

ptitSeb commented 6 years ago

You cannot really trust the benchmark for ogles2/gl4es with GL_EXT_compiled_vertex_array as long as the bug with vertex attributes is there. You don't know what it is trying to draw, and how much drawing the garbage slows things down.

kas1e commented 6 years ago

Of course, I measured it all without GL_EXT_compiled_vertex_array; I have it disabled in hardext.c, so only the 2 other extensions are loaded.

But yes, that is to be seen once it is fixed.

The problem IMHO is the non-extensions version, which is still slower than minigl with software TCL.. But let's see what Daniel says. He will probably just wait for the vertexattrib fix first..

ptitSeb commented 6 years ago

I have pushed a last optimization on glArrayElement that also helps quake3 when the extension is off. I think gl4es now works correctly, so I will stop trying to get more fps on quake3 and the glBegin/glEnd code path for now.