ptitSeb / gl4es

GL4ES is an OpenGL 2.1/1.5 to GL ES 2.0/1.1 translation library, with support for Pandora, ODroid, OrangePI, CHIP, Raspberry PI, Android, Emscripten and AmigaOS4.
http://ptitseb.github.io/gl4es/
MIT License

Implement Precompile Shader Archive #117

Open ptitSeb opened 5 years ago

ptitSeb commented 5 years ago

When using the GLES2 backend, every Fixed Pipeline Function (so OpenGL 1.x) can lead to the creation of a new shader program. Some can take a bit of time to compile and link (like when a lot of lights are involved), giving some "hiccup" to a game. That long loading time can be seen when launching Foobillard++ or Neverball, for example.

Because the Fixed Pipeline Emulator always generates the same shader programs, those could be saved for later use: by creating an Archive containing previously built FPE programs, and using the GL_OES_get_program_binary extension to save / load the program binaries and avoid the compiling and linking part.

The PSA will be available only if the extension is present, and if it supports at least 1 format for binary programs.

On Linux, the Archive will be saved in the HOME folder, as a hidden file (named .gl4es.psa). On AmigaOS4, it will be in PROGDIR: as a hidden file (same name as on Linux).

TODO: Where to put the archive on Android. TODO: Where to put the archive on Emscripten.

kas1e commented 5 years ago

I also checked the readme, and it says:

void* aglGetProcAddress(const char* name);

Returns the function pointer for the OGLES2 function with the respective name.
Returns nullptr if no such function exists.

So that also looks pretty sane...

ptitSeb commented 5 years ago

Still, if it crashes when gl4es tries to use it, there is something wrong...

ptitSeb commented 5 years ago

My question is: is the glClear prototype in OGLES2 really glClear(GLbitfield mask), or something like glClear(something_t OGLES2, GLbitfield mask)?

kas1e commented 5 years ago

While Daniel seems to be asleep, I tried to check the includes, and this is what I found in 2 places:

#define glClear(mask) IOGLES2->glClear((mask))

and void APICALL (*glClear)(struct OGLES2IFace *Self, GLbitfield mask);

but I'm not sure that answers your question, of course... we need to wait for Daniel's answer.

kas1e commented 5 years ago

I may try to create a simple test program... something like a pure SDL2/GLES2 app which just calls glClear(), but not directly: through the pointer from aglGetProcAddress, etc.

ptitSeb commented 5 years ago

Ah yes, that: void APICALL (*glClear)(struct OGLES2IFace *Self, GLbitfield mask); defines the function pointer for glClear with 2 parameters!, the first one being OGLES2 itself. If aglGetProcAddress gives you that, then indeed it will crash, because OGLES2 will never get passed (it's not in the regular glClear(...) convention, and the program doesn't even have the OGLES2 structure itself).

ptitSeb commented 5 years ago

To test, define the function pointer, with something like

typedef void (*GLCLEARPTR)(GLbitfield mask);

(if you don't have GLbitfield, a simple int will be enough), then in your main, after the GLES2 context is created:

GLCLEARPTR my_glClear = (GLCLEARPTR)aglGetProcAddress("glClear");
if(my_glClear) {
 printf("Go!\n");
 my_glClear(GL_COLOR_BUFFER_BIT);
} else
 printf("glClear is NULL!\n");

kas1e commented 5 years ago

Do you mean commenting out just EX(glClear) in gl4es, and doing all that other stuff inside the program's body, not in gl4es?

ptitSeb commented 5 years ago

Nope, that's just for your sample program using SDL2 / pure GLES2.

kas1e commented 5 years ago

It says "Go!" and then crashes in glClear all the same...

ptitSeb commented 5 years ago

And now try this change: typedef void (*GLCLEARPTR)(struct OGLES2IFace *, GLbitfield); and

GLCLEARPTR my_glClear = (GLCLEARPTR)aglGetProcAddress("glClear");
if(my_glClear) {
 printf("Go!\n");
 my_glClear(IOGLES2, GL_COLOR_BUFFER_BIT);
} else
 printf("glClear is NULL!\n");

kas1e commented 5 years ago

It also says "Go!" now, but this time there's no crash! And I can exit the app without problems.

ptitSeb commented 5 years ago

So I think you have enough information to open a bug report with Daniel: for gl4es, I need the first form to work, not the second.

Now, if the first alternative (the function pointer to glClear(GLbitfield)) is not possible, I can think of a workaround, but that would be cumbersome, so I would prefer not to do it.

kas1e commented 5 years ago

Maybe it's just the way Amiga libraries work... I'm not sure he does anything special there, but I will ask...

kas1e commented 5 years ago

Probably that's also the reason we have those static Amiga functions...

kas1e commented 5 years ago

But isn't it possible to just typedef things somehow, so the first form is skipped and the second one is used? I'm just not sure what exactly I need to ask Daniel to do...

ptitSeb commented 5 years ago

You need to tell Daniel that the function pointer returned by aglGetProcAddress(...) for glClear(...), for example, is not for void glClear(GLbitfield) as it's supposed to be, but for void glClear(struct OGLES2IFace *, GLbitfield).

kas1e commented 5 years ago

Sent a mail to Daniel describing everything we did (with our tests and so on); fingers crossed that it can be fixed/changed. It may well be that this is just the way our Amiga libraries work. Because if the readme says "Returns the function pointer for the OGLES2 function with the respective name.", then either it's a bug on the implementation side, or it's just the way things work... Waiting for an answer from him; probably still asleep :)

kas1e commented 5 years ago

Ok got an answer:

Okay, I get it. He's using GetProcAddr, casts the returned value to e.g. PFNGLGETPROGRAMBINARYOESPROC - which unfortunately doesn't take into account the "hidden" Amiga interface pointer.

So the solution is that I change my glGetProcAddr to not return the "normal" function pointer; instead I have to wrap everything into a dummy function without the struct pointer parameter, which I have to store as a simple "global" (should be enough here).

Sure, he could work around it by casting it to the correct type and by invoking it in the right way. But yes, it is cumbersome, and doing it the way I outlined is the way to go for compatibility.

In other words, Daniel will deal with it; waiting for the new beta :)

ptitSeb commented 5 years ago

Ah good :)

kas1e commented 5 years ago

Got the fixed version: everything starts to work, .gl4es.psa is created, etc.

Though at the moment, with Neverball, it changes nothing in terms of first loading: starting the game without .gl4es.psa and starting it with .gl4es.psa are exactly the same speed :(

Then I tried fricking shark (the game most noticeably affected by the issue): also no change.

I mean, absolutely no change :) Not even a single bit, which is strange... Need to investigate a bit more.

I also noticed that Neverball loses 20 fps for some reason, but I'm not sure if it's the new ogles2 or the latest gl4es; need to recheck...

ptitSeb commented 5 years ago

On the Pandora, on Neverball and Foobillard++, there is a very noticeable change on the 2nd start, yes. There was a risk that the "SPIR-V" trick used for the ProgramBinary would not make things faster... That may indeed be the case here. Everything happens inside gl4es_useProgramBinary(...) if you want to check (you can put some printf logs with timestamps if you want to measure things).

Also, you lost 20fps in Neverball? Strange, I haven't noticed any loss of speed lately in my testing (and I tested the latest Neverball this weekend). But you can do a git bisect, or just take an old build to see if it was better before. I'm interested to know if I made some "slowing down" change.

kas1e commented 5 years ago

About Neverball losing speed: sorry, it was a false alarm; it's just that in the process of testing I had enabled VSYNC :)

Now to PSA:

Can you please point out where in the source the PSA is generated?

Daniel checked one of the generated files, and says that something is wrong (unless there's some weird file format): the first 8 bytes of every one of my binary programs seem to be repeated. That's certainly not coming from me. So if my binary's 8 starting bytes were to look like C0 0D A6 75 (4-byte header) 00 00 1E 67 (4 bytes: size of binary), then inside the gl4es psa file this sequence is not followed by the actual data; instead these 8 bytes are repeated, and THEN the data starts. But this is definitely not what I return.

The binary in question is our .gl4es.psa generated when running Neverball: http://kas1e.mikendezign.com/aos4/gl4es/PSA/gl4es_psa_neverball

ptitSeb commented 5 years ago

It's in multiple locations, but basically:

A new shader is added in memory here: https://github.com/ptitSeb/gl4es/blob/master/src/gl/fpe_cache.c#L262
A shader itself is read from GLES2 here: https://github.com/ptitSeb/gl4es/blob/master/src/gl/program.c#L685

And it's written to disk here: https://github.com/ptitSeb/gl4es/blob/master/src/gl/fpe_cache.c#L173

Also, as you can see, yeah, each shader will have the format code and the size, but also the "fpe" data, which is a long sequence of "random"-looking bytes (mostly 0 in many cases).

kas1e commented 5 years ago

Yeah, we found it yesterday as well. Daniel understands the format now, and the format is simply: header string, state, bin type, bin size, payload, repeat. The state is basically a number with many bits which uniquely identifies a generated shader program. Right?

Anyway, we also profiled the Neverball loading, and after that we can conclude that NOVA's CompileShader() function is crap and badly written.

I wrote a big post on our forum; maybe you will be interested in reading it:

Ok, now, some new problem/question.

In GL4ES, ptitSeb added precompiled shader archive support (PSA), which means that we can use precompiled binaries and not compile shaders anymore. Daniel added the necessary stuff to recent ogles2 builds to make it work, and then we ran into a problem.

First, to explain: a precompiled shader archive (PSA) means collecting all the compiled shaders as binaries in one place; the archive is then loaded into memory when the program runs, and the programs are used directly via glUseProgram(), with no need to compile them => no speed loss.

In our AmigaOS4 case, shader compilation and execution looks like this:

Ideally, PSA binaries should come from that last translation stage (after NOVA has machine code ready from the SPIR-V), but as Nova doesn't provide any public functions for us to save/get them, we are at least trying to get rid of ogles2's step: compiling GLSL to SPIR-V + patching.

So we did all this, only to find out that the problem is not sending the GLSL shader to ogles2, nor ogles2 doing the translation from GLSL to SPIR-V + patching (which is a heavy thing!), but NOVA's CompileShader() :(

I did 2 profiling tests:

  1. without PSA support (so all steps are there, including ogles2's GLSL to SPIR-V + patching):

http://kas1e.mikendezign.com/aos4/gl4es/PSA/NOpsa_neverball_profile.txt

As you can see, CompileShader() there takes almost the same amount of time, showing that most of the time is spent in NOVA's part. But that could all be misleading, of course, and so:

  2. the second result WITH PSA support, i.e. binaries placed in SPIR-V format on the disk and loaded into memory at startup; then ogles2 doesn't do any of the CompileShader() work itself (as can be seen from the next profiling file), only UseProgram(), which takes nothing:

http://kas1e.mikendezign.com/aos4/gl4es/PSA/psa_neverball_profile.txt

As can be seen, all the time is still eaten by Nova's CompileShader().

@Hans

Do you have any idea why the translation / patching etc. of the GLSL only takes a tiny fraction of the time of Nova's SPIR-V assembly step (by tiny fraction I mean only 5%)? I could understand it if it were the other way around, but not as it is now.

The much, much more complex GLSL translation and all the internal source-code patching in ogles2 only costs 200 ms there.

Is there anything that can be done about it?

Also, Nova's CompileShader() does not upload the stuff to the actual hardware (at least it should be that way?). It should really just be something like an assembler / low-level compiler with register mapping, which should work more or less fast (at least on the level of ogles2's GLSL->SPIR-V+patching compile).

So... if there is a real reason why Nova's CompileShader() has to be that slow and can't be optimized, could you make some public functions by which we can save the ready-to-use machine code as a binary and use it via something like UseProgram(), etc.? Yes, those binaries will be different on every machine and need to be generated on each one, but then only once, the first time the game is played.

kas1e commented 5 years ago

Of course, maybe we just don't understand the "heavy" part of NOVA's CompileShader(), but still, it shouldn't be that slow.

Do you know what the internals of CompileShader() are on the Pandora? I mean, if PSA changes things quite visibly for you, that means the Pandora's CompileShader() is not very fast either, and does more than just conversion to assembly?

ptitSeb commented 5 years ago

From my understanding, the heavy phase is not the "Compile" but the "Linking". That's why they created the glProgramBinary(...) function: the existing glShaderBinary(...) wasn't enough to save time, as it avoids only the "Compile" of the shader and not the link. So yes, on the Pandora, it's the Link phase that takes time, especially when there are a lot of uniforms in the program.

Also, by design, the compiling / linking of shader programs doesn't need to be fast; only the execution of the program needs to be fast. The compile and link are supposed to be done in the loading phase of a game, where performance is not an issue. And most games do that: loading shader programs at load time, like other game assets. Dynamically creating new shader programs in-game is not a common practice.

kas1e commented 5 years ago

Hi!

Anyway, it turned out that NOVA's shader compiler really is slow crap :) Hans says that he even has compiler optimisation DISABLED for it, because Boost shared pointers somehow lost track of their usage counters with optimization enabled, resulting in objects being destroyed while they were still in use.

So no one knows how good his code really is with those shared pointers (probably also not optimal), and the fact that a bug happens when optimisation is enabled means there is a real bug present which needs to be fixed; instead, he just disabled optimisation.

Anyway, I promised a little gift once you added PSA, so I sent $50 now, hoping to send another $50 later when Hans optimises things on his side, and maybe adds the necessary functions to save the binary as machine code... But this is what he says:

That's not so easy because more needs to be saved than just the raw shader code. There are various register settings that need to be preserved. So, I'd have to come up with another file format to save the binary. I'm totally swamped with other work, so I'm unlikely to have time for looking into this, sorry.

So... at least gl4es has PSA now, ogles2 has the necessary functions, and we fixed aglGetProcAddress along the way, which is a good thing too...

Thanks !

ptitSeb commented 5 years ago

I'm not surprised he needs to save more than just the RAW data. There is the result of the Link, I guess, with all the internal setup of the shader program (which registers are linked to which VA / Uniform, and the Varying setup between Vertex and Fragment shaders, for example). So yeah, a bit of work to organise that in a proper format...

And yes, gl4es has PSA now, and it's working well on the Pandora at least.

And thanks for the gift :)

Sisah2 commented 3 years ago

Hello, I'm trying to enable the precompiled shader archive on Android (OpenMW) and found that there is a check in hardext.c for GL_OES_get_program (result: extension not found). Shouldn't it be GL_OES_get_program_binary?

Also, the destructor is not called on exit; is that an OpenMW Android bug?

ptitSeb commented 3 years ago

Ah, maybe. But it does work with the other name too on my hardware. A double test should be done, I guess. If you change the test to "GL_OES_get_program_binary", does it work on your side?

ptitSeb commented 3 years ago

The destructor on Android: well, if the Constructor attribute is disabled on Android (which is the case, to avoid lockups on old Android versions), the constructor and destructor of the lib need to be called manually. So it's probably an issue on the OpenMW side (but the system is supposed to do the cleanup anyway).

Sisah2 commented 3 years ago

Then it says GL_OES_get_program_binary is supported, with 1 supported binary format. I think it will work; I just need to save them differently. I'll try to save the .psa when calling glHint with custom arguments. Hope it will work :)