kas1e closed this issue 6 years ago
@ptitSeb Btw, in our suggestion we have "glVertexAttrib with GL_UNSIGNED_BYTE ...", but doesn't gl4es convert everything to 32 bits anyway before sending it to the driver?
Well, yes, except for this one (and technically, it's only accepted when size=4, so the RGBA color fits in a 32-bit data space, but yeah, it's 8-bit data basically). GL_UNSIGNED_BYTE is not "endian-sensitive": the bytes are the same in big endian and little endian, so it should not be an issue, unless GL_UNSIGNED_BYTE is indeed not supported by the Warp driver because only 32-bit data is implemented. In that case, either gl4es or OGLES2 should take care of the data. I can probably force a workaround in gl4es to be sure no GL_UNSIGNED_BYTE is used (of course, that will convert some data, so it would be better if Warp accepted 8-bit data, speed-wise).
Now that I think of it, there is another case of non-32-bit data: the element indices used in glDrawElements can be GL_UNSIGNED_SHORT (so 16-bit), but this seems supported, or nothing would work.
Hans answered some time ago that:
Warp3d driver currently can't handle anything other than 32-bit datatypes.
The problem is that Southern Islands and newer GPUs are little-endian only, while our CPUs are big-endian. So the driver has to convert the endianness as the data is copied to the GPU. Right now it assumes that everything is 32-bit, and it probably returns an error if you try to use 8/16-bit vertex data.
Writing a system to perform the correct endianness swapping for the data is on the to-do list. It'll be a bitch to get right, because it's got to handle interleaved data, structures with different datatypes, etc. It's one of those things I wish I could get someone else to write...
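For illustration, the blanket "everything is 32-bit" swap described above could look roughly like this (a sketch only, not the actual Warp3D code; the function name is invented):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of the "assume everything is 32-bit" conversion Hans describes:
 * every 32-bit word is byte-swapped as the data is copied toward the GPU.
 * It also shows why 8-bit attributes break under it: four uint8 colour
 * components would be reversed as if they were one big-endian uint32. */
static void swap32_copy(uint32_t *dst, const uint32_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        uint32_t v = src[i];
        dst[i] = (v >> 24)
               | ((v >> 8)  & 0x0000FF00u)
               | ((v << 8)  & 0x00FF0000u)
               | (v << 24);
    }
}
```

This is exactly why interleaved data with mixed sizes needs the more sophisticated per-attribute handling discussed later in the thread.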
Is it of any help for us? :) Maybe we can try some simple test cases to see if 8-bit vertex data indeed behaves that way?
Well, I understand Hans' point of view about 16-bit data. But 8-bit values are the same in big endian and little endian, so adding support for them should be straightforward.
So, here are the questions to answer to understand what the next step should be.
I will ask them about it. Though, for now we can't be sure that our issue is because of that? I mean, our previous workaround was more or less "it is either normalization, or normalization with ubyte, or something of that sort". Can we somehow reduce the possible scenarios?
Well, the workaround I gave you earlier should cover 90% of the cases. But it will not cover the case of software using Shaders and VAs with GL_UNSIGNED_BYTE directly. Still, for Quake3 and most of the software you already tried, it should work to ensure no 8-bit data is used. Do you have issues with that workaround?
I'll work on a better workaround soon (I first need to add some infrastructure to better handle data conversion in VAs, for all platforms; then the AmigaOS case will be easy to add).
Plz wait a bit with working on it, we seem to have found something, need to clear it all up a bit.
sure, don't worry, I have plenty of other stuff to try out...
Good news! Seems we dealt with it on our side! There are some limitations in warp3d, and here is what Hans says:
So the "32-bit data only" statement isn't strictly true any more, and hasn't been for a while.
NOTE: DBOs are still 32-bit only.
This is where I got myself confused, because I thought the hardware would treat floats and ints differently. However, 32-bit ints and floats are handled the same way: they get passed on to the shader unchanged (i.e., int VAs must go to an int shader input). So, 32-bit int VAs have probably been working all along.
The latest beta also correctly sets the VA descriptor for 8 and 16-bit attributes, including whether it's normalized. So, it should work provided you restrict each VBO to having one data size only (8-bit data in one VBO, 16-bit data in another, etc.). I haven't tested that, though, because I'd completely forgotten that my endianness handler was a bit more sophisticated than "32-bits only." Let me know what happens if you try it...
So, after that, Daniel just added some code in ogles2 which internally converts every GL_UNSIGNED_BYTE VA for client memory usage. For safety he converts to float internally, so as not to rely on this new 32-bit integer support, because he prefers to do his own normalization and because that way it should also work with previous Nova versions.
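Roughly, that kind of CPU-side conversion amounts to the following (a hedged sketch, not Daniel's actual ogles2 code; the function name is invented):

```c
#include <stdint.h>
#include <stddef.h>

/* Expand a normalized GL_UNSIGNED_BYTE colour attribute (4 components per
 * vertex) into GL_FLOAT on the CPU, value / 255.0f, so only 32-bit data
 * ever reaches the driver. Illustration only. */
static void ubyte4_to_float4(float *dst, const uint8_t *src, size_t vertices)
{
    for (size_t i = 0; i < vertices * 4; i++)
        dst[i] = (float)src[i] / 255.0f;
}
```

The cost of this loop on every draw is the CPU overhead mentioned below when comparing the benchmark numbers.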
So, in the end I checked it all, and quake3 with extensions, as well as the irrlicht engine examples, work now, yeah!
I also checked quake3 with extensions enabled (so glDrawElements in use), to see the speed differences, and the results are:
Resolution   q3_minigl_sdl1   q3_minigl_sdl2   q3_gl4es_sdl1
640x480      90.8 fps         86 fps           74 fps
800x600      87.5 fps         83.1 fps         72.2 fps
1024x768     82.2 fps         76.9 fps         68.5 fps
1600x1200    67.5 fps         68.9 fps         60.2 fps
As you can see, q3 even with glDrawElements is still a little slower... But at least in 1600x1200 it's almost on par. Are there any possible ways to accelerate the glDrawElements() mode?
Well, good news indeed.
Until (if?) GL_UNSIGNED_BYTE is handled by the hardware, using the CPU to convert the data will introduce a slowdown. As you see, the higher the resolution (so more pressure on the GPU, less on the CPU), the smaller the speed difference, so the CPU time used for data conversion counts here. Also, minigl is, I guess, made with Quake3 in mind and is probably heavily optimized for this engine. I don't think it's completely fair to expect gl4es to be faster here (even once GL_UNSIGNED_BYTE is handled by hardware). What gl4es brings is more functions, and faster speed when using advanced OpenGL features (like TexGen or shaders); I don't think you'll see any speed advantage of gl4es with idTech3-based games, or simple games with low geometry and no complex OpenGL renderer. Remember Neverball is much faster with gl4es (and that one uses TexGen). You may also see benefits from gl4es with SeriousEngine, or maybe TORCS (or SpeedDreams too). Foobillard++ should also work better on gl4es (if it even works with minigl).
Checked the differences between our previous workaround in gl4es and Daniel's one done in ogles2: Daniel's one is faster in q3 by about 3 fps... Probably he just uses fewer memcpy routines (or they are smaller or something?). That probably also means that when it is done in hardware (in warp3d), it will give us another few fps...
Btw, does the initial VBO support you added to gl4es some time ago work? I mean, can we be sure it works at all? I just tried to enable it to test with q3 / the glDrawElements() path, and while I see in the shell output the words "LIBGL: VBO used (in a few cases)", I see no difference in q3 in terms of speed at all. I mean not a single percent, which makes me think it may not work?
Well, the piece of code for the GL_UNSIGNED_BYTE was just something fast to test, not optimized or anything... But yeah, once it's handled in hardware, that should give a few fps.
About VBOs: I'm not sure. It was a quick hack that I will probably remove at some point (and try to implement proper VBO handling). Even if it works, don't expect any speed boost, as the VBO is only created for 1 glDrawElements, so nothing useful here.
Probably we can close this issue for now... Thanks for the help!
Hey @kas1e, I was just wondering: did you release Neverball with gl4es, and if yes, what is the feedback?
I'm still waiting for the latest warp3d and ogles2 to be released to the public, as the fixes we got lately are all in private beta-test state, and users don't have them... So once the latest ogles2 and warp3d are released I can also release all the gl4es-based apps :)
Ah. Seems long. Do you know why the fixes haven't been released yet? Are they working on more fixes, or does it just take time?
It's just that the company which owns all this releases, every once in a while, some "enhancer pack" (like a service pack for WinXP), where they put all the stuff their devs work on (drivers, libs, devices, apps, tools, etc). So it usually takes some time. As far as I'm aware it should be released "very soon". But "very soon" can mean 2 weeks as well as a few months :)
Ok, I see. Thanks for the info. Let's wait...
Hi,
For now we have that workaround in ogles2.library by Daniel, but Hans wants to add the necessary conversion code to Warp3D itself (so it will be more correct, and maybe it will make things a bit faster).
That's what Daniel wrote about that workaround when he made it:
Now, Hans is trying to implement conversion (endian swap) code in Warp3D itself, and in the latest version he has added support for VBOs with mixed data sizes (e.g., 8, 16 and 32-bit vertex attributes in one VBO).
Though he still has issues with q3 (when we try to use ogles2.library without the workaround): the menu, etc., is all fine, but in the game itself we still have a mess. It's a different kind of mess than before this was implemented (should I make a video, to show what I mean and for better understanding?), but still the game doesn't render correctly.
After some debugging, the last mail I have from Hans, from some days ago, was:
Last time I checked, Q3 itself doesn't use VBOs, but the packing of data into VBOs must be happening elsewhere.
But I've finally figured out what's going on: something upstream of Warp3DNova is writing data into VBOs in a different layout to the one declared.
On a hunch I temporarily made it 32-bit endian swap uint8 data, and all the triangles were drawn correctly (but with the wrong colours). Q3 has VBOs with two layouts, e.g.:
W3DN_SI.library (6): VBO 0x5C399318 has data of mixed sizes (e.g., 16 and 32-bit values). Setting up conversion table.
W3DN_SI.library (8): Building endianness conversion table for 4 interleaved arrays
W3DN_SI.library (10): Endianness conv: offset: 0, count: 3, type: float32
W3DN_SI.library (10): Endianness conv: offset: 12, count: 4, type: uint8
W3DN_SI.library (10): Endianness conv: offset: 16, count: 2, type: float32
W3DN_SI.library (10): Endianness conv: offset: 24, count: 2, type: float32
W3DN_SI.library (10): Endianness conv series: offset: 0, convCount: 4, stride: 32, blockCount: 703, size: 22496
W3DN_SI.library (6): VBO 0x5C398518 has data of mixed sizes (e.g., 16 and 32-bit values). Setting up conversion table.
W3DN_SI.library (8): Building endianness conversion table for 3 interleaved arrays
W3DN_SI.library (10): Endianness conv: offset: 0, count: 3, type: float32
W3DN_SI.library (10): Endianness conv: offset: 12, count: 4, type: uint8
W3DN_SI.library (10): Endianness conv: offset: 16, count: 2, type: float32
W3DN_SI.library (10): Endianness conv series: offset: 0, convCount: 3, stride: 24, blockCount: 998, size: 23952
In other words: what's happening now is that Warp3DNova is told that a VBO has a particular layout, and then data with a different layout is copied into it. As a result, the endianness conversion is wrong.
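For illustration, a per-attribute conversion table like the one in Hans' log could be walked roughly like this (a sketch only; the struct and function names are invented, and the layout in the test mirrors the first VBO in the log: 3 x float32, 4 x uint8, 2 x float32, 2 x float32, stride 32):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* One entry of an endianness-conversion table, as in Hans' log. */
typedef struct {
    size_t offset;    /* byte offset of the attribute inside a vertex */
    size_t count;     /* number of elements in the attribute */
    size_t elem_size; /* 4 for float32 (swap), 1 for uint8 (leave alone) */
} ConvEntry;

static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Walk every vertex of the VBO and swap only the 32-bit attributes;
 * uint8 attributes are endian-neutral and must be left untouched.
 * If the declared layout doesn't match the data, the wrong bytes get
 * swapped, which is exactly the corruption described above. */
static void convert_vbo(uint8_t *data, size_t stride, size_t vertex_count,
                        const ConvEntry *table, size_t entries)
{
    for (size_t v = 0; v < vertex_count; v++) {
        uint8_t *vert = data + v * stride;
        for (size_t e = 0; e < entries; e++) {
            if (table[e].elem_size != 4)
                continue; /* single bytes: nothing to do */
            for (size_t i = 0; i < table[e].count; i++) {
                uint32_t w;
                memcpy(&w, vert + table[e].offset + i * 4, 4);
                w = bswap32(w);
                memcpy(vert + table[e].offset + i * 4, &w, 4);
            }
        }
    }
}
```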
I of course wrote back to Hans that if everything works fine on other platforms, then we can't blame gl4es at all; and since Daniel's workaround works fine in ogles2.library, it's for sure on our side (either warp3d still has some problems, or ogles2.library is doing something (or not doing something)).
But then I got no answer to it, and no answer from Daniel either. But Daniel got married a week ago, and Hans will tomorrow, so I can't expect them to answer fast :)
Anyway, what do you think about it? Maybe you have some ideas...
Thanks !
Mmm, I'll add some logging of the VBO stuff, with stride and some details, and make a run in quake3, to compare with what Hans has seen.
@kas1e: do you use LIBGL_USEVBO=1? If not, then gl4es makes no use of any VBO. If yes, then I have to check that code (but this function is not really useful and should not be used IMO).
I use only what gl4es uses by default, i.e. I set no special settings...
Yeah, so those VBOs are created by the OGLES2 driver... Nothing I can do at this point :( (also, I wonder if those VBOs are important, performance-wise).
Or they are some internal VBOs of warp3d... I wrote to Hans about it, still waiting for an answer from Daniel.
For the sake of tests I also tried to use LIBGL_USEVBO=1, but that makes no difference. Same problem, and not a single difference in FPS. But if I remember right, the code enabled by LIBGL_USEVBO was some fast hack for tests, and it only tries to use VBOs in a few cases... Though, while I can understand why it makes no difference in speed when no GL extensions are enabled (as the glBegin/glEnd route is used), I don't understand why, when we enable GL extensions (and so glDrawElements), it speeds up nothing, not even by 1 fps... Maybe it's not real VBO code, but somewhat emulated, so it makes no difference?
Yeah, the VBO code in gl4es is not triggered often, so it's pretty useless.
What do you call "enable gl extensions"? I mean, how do you do that: hacking the code or using r_primitives in the config file?
Just by enabling it in the menu of q3, setting "enable gl extensions" to "on", and restarting. But it's the same as r_primitives in the config file, yep.
In the console log, check what it is using for drawing; you can see individual glIndexArray or glInterleavedArray (I don't remember the exact wording) or a single glDrawElements (or is it glDrawArrays).
You mean to enable gl extensions in q3, and check what functions it uses when extensions are enabled?
If so, then when NO gl extensions are enabled (so it should use the pure glBegin/glEnd route) we have:
rendering primitives: multiple glDrawElements
compiled vertex arrays: disabled
Then, when I enable gl extensions in q3 (so it should use glDrawElements), we have:
rendering primitives: single glDrawElements
compiled vertex arrays: enabled
In theory, when it uses gl extensions (so, as per the console, "single glDrawElements"), it should probably use your VBO code when LIBGL_USEVBO=1?
Well, between multiple glDrawElements and single glDrawElements, don't expect much difference! Yeah, the single glDrawElements should trigger the VBO code of gl4es (probably, I have to check), but it will not make any real difference here.
Really, that compiled vertex arrays extension is not usable by gl4es (but I guess miniGL does use it). What it does is tell the opengl driver that the vertex data (and only the vertex data) are set and will not change between glLockArrays(...) and glUnlockArrays(), so an opengl driver that doesn't have hardware transform can transform the vertices once... But as gl4es uses hardware T&L (in shaders), it's just useless.
And quake3 makes changes to the other arrays (colors, texture UVs) in between, so I cannot really build anything stable...
At least when I do an fps benchmark with gl extensions enabled and disabled I get a good difference. Like, disabled gives about 45 fps, and enabled about 70.
oh really? that much? Strange, I wouldn't have expected going from "multiple glDrawElements" to "single glDrawElements" to be that different. I was more expecting that kind of difference between using glBegin/glEnd compared to glDrawElements...
Anyway, about the VBOs, I don't think they are coming from gl4es.
Something is wrong in our comparison :) I mean, when the console log of q3 says "multiple glDrawElements", that is when I disable gl_extensions, which, in turn, means it should be the pure glBegin/glEnd route. Why it writes "multiple glDrawElements" in the console log, I do not know, but that's when the glBegin/glEnd route is at work.
And then, when I enable gl extensions, so it uses glDrawElements, q3 writes "single glDrawElements" in its output.
But why it writes "multiple glDrawElements" when gl extensions are disabled (and the glBegin/glEnd route should be in use), I do not know.
If I remember correctly, there are actually 3 drawing paths in quake3 engine games:
1. glBegin(...) / glEnd()
2. glDrawElements(...) where it tries to make strips out of the triangles
3. glDrawElements(...) where it just draws the triangles "as-is"
You can control that with r_primitives (0, 1, 2) in the cfg. The gl_extension setting allows the use of any GL extension; one of them is glLockArrays, which makes r_primitives=2 the default (all this from memory, I haven't rechecked the code).
Ok... doing some more tests via the config file, so:
First I set all extensions to 0, and only play with r_primitives:
seta r_primitives "0": 45.3 fps, console says "multiple glDrawElements".
seta r_primitives "1": 45.3 fps, console says "multiple glDrawElements".
seta r_primitives "2": 56.1 fps, console says "single glDrawElements".
So, first, as can be seen, even for the glBegin/glEnd route it writes "multiple glDrawElements". Second, we can see that just the swap from "multiple" to "single" gives us 10+ fps (so that's when we swap from the glBegin/glEnd route to glDrawElements).
Then I tried to allow gl_extensions, but enabled only the vertex_buffer_object one:
seta r_primitives "0": 56.1 fps, console says "single glDrawElements".
seta r_primitives "1": 45.3 fps, console says "multiple glDrawElements".
seta r_primitives "2": 56.1 fps, console says "single glDrawElements".
What this all means is that compiled vertex arrays seem to do nothing useful! Strange!
Then for the sake of tests I enabled compressed textures, with r_primitives 2, and it gave me the same 56.1. Then I enabled the multitexture extension, and that gave a good boost: 68.3 fps! Then I tried to add texture_env_add, and that changed nothing.
So... what it means is that compiled vertex arrays do nothing! Really strange. Daniel was sure they would speed things up for us. Another strange thing is that multitexture gives about 10+ fps as well!
Daniel said all the time that once compiled vertex arrays work, they will give us a huge speed up. But it seems that all the speed up we have now is just from the multitexture extension and from swapping from the glBegin/glEnd route to glDrawElements.
Weird ..
And I did some more interesting tests to prove that the minigl version works faster only because the "compiled vertex array" extension works: I used r_primitives 2 (so single glDrawElements) and only enabled the multitexture extension; both versions, the gl4es one and the minigl one, gave me about 70 fps (gl4es 68, minigl 70). Then I enabled the compiled vertex arrays extension, and while the gl4es one gave me the same 70 fps, minigl gave 82.
Which means that this vertex arrays extension does nothing in gl4es, and that is what misled Daniel, Hans and me before, as we were sure that once we fixed the distortion mess, things would work a lot faster. But then, if it does nothing, it's no surprise that it gives no boost... The surprise is that we are almost on par with minigl anyway :)
Is it possible to make that extension work in gl4es and do something useful?
The glLockArrays extension, no, unfortunately. I already tried a few things, but it's really only good when you have to transform the vertices fully in software; I have not been able to do anything useful with it.
But again, there are many other engines that will give you better performance with gl4es than with miniGL (and also more functions / effects).
Yeah, q3 is too oldschool... Anyway, sorry for being dumb, but did I understand right that the extension called "compiled vertex array" in q3 is the glLockArrays thing?
Do you mind if I discuss it with Hans and Daniel, and if they have any ideas on how to implement it I can bring them to you, and maybe we can make something helpful? One head is good, but 3 are better :) If, of course, you want to spend any time on it...
Yep. Look at the official spec here: https://www.khronos.org/registry/OpenGL/extensions/EXT/EXT_compiled_vertex_array.txt
You can discuss it of course, and I'll be glad if you find an idea on how to use this extension for something useful!
The problem with this extension is that you can enable Vertex Arrays after the Locking... making this extension pretty unusable.
Look here for example: https://github.com/ptitSeb/ioq3/blob/master/code/renderergl1/tr_shade.c#L1279: you see the call to glLockArrays(...), and then the engine enables GL_TEXTURE_COORD_ARRAY and GL_COLOR_ARRAY. That means those 2 arrays are not Locked, but are still used for drawing (and the values of those 2 arrays will be changed between the Lock and Unlock)...
You mean "can't enable"? But then how does it work on non-shader implementations of other OpenGLs? Or do you mean it can't in gl4es, because of how gl4es is structured and works?
No, I mean CAN enable. The client software can enable Arrays that are not locked, and that makes the Lock/Unlock mechanism useless (for gl4es at least), because a drawing command depends partly on locked Arrays (like vertex coordinates) and partly on unlocked Arrays (like vertex colors or UVs).
Hi! At the moment I have got an answer from Hans only, and I dunno how helpful it can be, but this is what he brought up:
A few possible ideas:
Not sure I fully understand, but here is what I get:
Well, (1) I can probably try to implement, but that seems to be quite some work, and I'm unsure of the performance gain. Plus I don't know which vertex attributes (color, how many texcoords, normals) will be needed for drawing. For (2) I'm unsure how standard this thing is: some vertex attributes in one VBO and some in another VBO. While I can probably try to implement that, again, I'm unsure of the performance gain (as you still need to transfer colors and other VAs) and unsure how various GLESv2 drivers will accept this kind of thing.
As this optimisation is only for old engines, and those engines are probably running pretty well on most hardware already, I don't think it's worth the risk of slowing other stuff down, nor worth the added complexity in the code.
For (1): Yes. Something like glBufferSubData() can be used to upload just the changed attributes.
For (2): The remaining attributes would go into a separate VBO in this variation. Basically, you'd have one VBO for static data, and one VBO for the dynamic ones (with appropriate usage hints: GL_STATIC_DRAW and GL_STREAM_DRAW). This option would probably work best on OpenGL implementations where the driver can put GL_STREAM_DRAW buffers in GART space (which is on the to-do list for Nova).
This is an entirely valid and normal way to use VBOs, see this link:
https://www.khronos.org/opengl/wiki/Vertex_Specification_Best_Practices#Formatting_VBO_Data
As for slowing things down: Hans doubts it'll slow stuff down, because drivers for GLES2-level hardware use VBOs internally for the data anyway. With option #2, you're actually giving the driver the hints to optimize each VBO for the data being sent.
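The data side of option (2) could be sketched like this (the struct layout and function name are invented for the example; the actual glBufferData/glBufferSubData upload calls with GL_STATIC_DRAW and GL_STREAM_DRAW are omitted):

```c
#include <stdint.h>
#include <stddef.h>

/* Split interleaved vertices into a static stream (positions, UVs) and a
 * dynamic stream (colours). Each stream would then back its own VBO,
 * created with GL_STATIC_DRAW and GL_STREAM_DRAW respectively, so the
 * driver can place each buffer appropriately. Illustration only. */
typedef struct {
    float   pos[3];
    uint8_t rgba[4];
    float   uv[2];
} Vertex;

static void split_streams(const Vertex *in, size_t n,
                          float *static_pos_uv,  /* 5 floats per vertex */
                          uint8_t *dynamic_rgba) /* 4 bytes per vertex  */
{
    for (size_t i = 0; i < n; i++) {
        static_pos_uv[i * 5 + 0] = in[i].pos[0];
        static_pos_uv[i * 5 + 1] = in[i].pos[1];
        static_pos_uv[i * 5 + 2] = in[i].pos[2];
        static_pos_uv[i * 5 + 3] = in[i].uv[0];
        static_pos_uv[i * 5 + 4] = in[i].uv[1];
        for (size_t c = 0; c < 4; c++)
            dynamic_rgba[i * 4 + c] = in[i].rgba[c];
    }
}
```

With this split, only the small dynamic buffer needs re-uploading per frame, which is the point of the two usage hints.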
All in all, for the first tests we can make it an experimental feature enabled via an environment variable, for testing purposes...
Mmm, ok. I'll look at the separate VBOs then, to change the current (quite ineffective) VBO stuff I implemented some time ago, and see if this could be implemented. It should be easier to do (a VBO for all vertex attribs active at the time of glLock; the other vertex attribs will remain outside).
@kas1e: I have just pushed a change in gl4es: it will now try to use real VBOs to optimize glLockArrays(...). You may get some slight speedup in Quake3 engine games...
And I have pushed another change, with a slight change of strategy that could help efficiency. All this needs testing now (with Quake3, and any other game that uses glLockArrays(...)/glLockArraysEXT(...)).
Tested quake3, and sadly it's the other way around: it's 10 fps slower.
I.e. the version of gl4es from a week ago gives me 92 fps in 1024x768, but the latest version from today gives me 81 fps in the same 1024x768. I.e. 11 fps less.
Maybe some debug output was left somewhere or something of the kind?
Some time ago we found a "hardcore" issue on amigaos4, which seems to be either an ogles2 or a warp3d issue. But as our devs can't easily find the roots of that issue, that probably means we need some simpler test cases, so it can be analyzed better. Hope ptitSeb can help there too, as always :)
So, the issue is that in some apps (at this moment quake3 and the irrlicht engine), something unknown happens which leads to total distortion of the visuals. In quake3 it happens when we just enable Extensions (so not just the glBegin/glEnd route is used, but glDrawElements). In Irrlicht it happens by default (so probably it also uses something like quake3 does when extensions are enabled).
This is how it looks in quake3:
when menu should come: http://kas1e.mikendezign.com/aos4/gl4es/games/quake3/first_run.jpg
in game itself: http://kas1e.mikendezign.com/aos4/gl4es/games/quake3/ingame3.jpg
This is how it looks in irrlicht engine 1.8.4 with their simple "hello world":
http://kas1e.mikendezign.com/aos4/gl4es/irrlicht/irrlicht_nohack.jpg
Of course in all other apps glDrawElements / glDrawArrays and such work fine; at the moment it happens only in those 2 cases.
We have only come to the suggestion that maybe it's glVertexAttrib with 4 x GL_UNSIGNED_BYTE and Normalize TRUE that breaks things? It's used for the Colors (and it's converted to 4 x GL_FLOAT when using the glBegin()/glEnd() code path).
So we checked that theory by making this patch:
To test this theory, you can modify gl4es: in src/gl/gl.c, in the function glDrawElementsCommon, at line 1094, change
if (p->enabled) gles_glColorPointer(p->size, p->type, p->stride, p->pointer);
with
and at line 1152 (after the insertion), before if(buffered) { add
And that makes it work.
Then I sent all the info to Daniel (our ogles2 author), but nothing came of it. He says that from his side everything seems to work correctly. He analyzed everything he could think of, and it all looks fine.
Though, lately, he added some info:
"as far as I remember it was for sure no issue with the uchar normalization, AFAIR I ruled that out. It also doesn't make too much sense after all: the normalization can only affect non-float data (in the case of Q3 only colors), so the worst you would get if there was something wrong with the normalization would be wrong colors, but not wrong geometry / wrong tex-coords.
Distortions like that could be caused by wrong client-data (either completely invalid RAM or alignment issues or wrong stride etc.) or a wrong VBO setup (either caused by wrong client data / config or an ogles2-internal-bug) or a Nova-bug with certain VBO or shader setups."
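As a toy illustration of the "wrong stride" failure mode mentioned above (names and numbers invented): reading back positions with a stride different from the one the buffer was actually filled with garbles every vertex after the first, which produces exactly the kind of geometry distortion, rather than just wrong colors, described here:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Read the 3-float position of each vertex out of a raw buffer using the
 * declared stride. If the declared stride doesn't match the stride the
 * data was actually written with, everything past vertex 0 is read from
 * the wrong offsets. */
static void read_positions(const uint8_t *buf, size_t stride,
                           size_t vertex_count,
                           float *out /* 3 floats per vertex */)
{
    for (size_t v = 0; v < vertex_count; v++)
        memcpy(out + v * 3, buf + v * stride, 3 * sizeof(float));
}
```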
At this point, as I understand it, Daniel has done all he can having access to ogles2 only, and it seems it could be an issue with warp3d in the end... But as Hans works on warp3d not that fast, it seems we need some simpler test case for analysis which will produce the same effect.
Dunno, maybe just strip down the whole quake source, so it only shows the intro, then the menu, and exits? Then reduce it step by step... Or is it better to write some test case from scratch? But then we still don't know what exactly causes the issues...