opensciencemap / vtm

a vector-tile map library written in java - running on android, desktop and within the browser
GNU Lesser General Public License v3.0
238 stars 176 forks

Crash in native code #52

Open baldur opened 10 years ago

baldur commented 10 years ago

Hi, I'm just diving in to troubleshoot a crash that started happening after we updated to Android 4.4. The affected devices are Samsung Galaxy S4s. We have a Nexus running 4.4 which doesn't seem to be affected, or at least we have not experienced the problem there, nor did we with our S4 prior to the 4.4 update.

I figured I would raise the issue here in case someone already knows about it or has some insights. I will follow up as I progress in my search.

********** Crash dump: **********
Build fingerprint: 'samsung/jflteuc/jflteatt:4.4.2/KOT49H/I337UCUFNB1:user/release-keys'
pid: 19670, tid: 19693, name: Thread-8277  >>> com.mapzen <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 80e6c000 
Stack frame #00  pc 0002225c  /system/lib/libc.so (__memcpy_base+227)
Stack frame #01  pc 00071107  /system/vendor/lib/egl/libGLESv2_adreno.so (rb_memcpy+282)
Stack frame #02  pc 0007d4a1  /system/vendor/lib/egl/libGLESv2_adreno.so (rb_vbo_cache_buffer+320)
Stack frame #03  pc 000465a7  /system/vendor/lib/egl/libGLESv2_adreno.so (cache_vbo_attrib+298)
Stack frame #04  pc 0004962d  /system/vendor/lib/egl/libGLESv2_adreno.so
Stack frame #05  pc 00049da5  /system/vendor/lib/egl/libGLESv2_adreno.so (core_glDrawElementsInstancedXXX+140)
Stack frame #06  pc 00049fd7  /system/vendor/lib/egl/libGLESv2_adreno.so (core_glDrawElements+10)
Stack frame #07  pc 00039767  /system/vendor/lib/egl/libGLESv2_adreno.so (glDrawElements+28)
Stack frame #08  pc 00020bcc  /system/lib/libdvm.so (dvmPlatformInvoke+112)
Stack frame #09  pc 00051927  /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+398)
Stack frame #10  pc 0002a060  /system/lib/libdvm.so
Stack frame #11  pc 00031510  /system/lib/libdvm.so (dvmMterpStd(Thread*)+76)
Stack frame #12  pc 0002eba8  /system/lib/libdvm.so (dvmInterpret(Thread*, Method const*, JValue*)+184)
Stack frame #13  pc 00063e75  /system/lib/libdvm.so (dvmCallMethodV(Thread*, Method const*, Object*, bool, JValue*, std::__va_list)+336)
Stack frame #14  pc 00063e99  /system/lib/libdvm.so (dvmCallMethod(Thread*, Method const*, Object*, JValue*, ...)+20)
Stack frame #15  pc 00058b6b  /system/lib/libdvm.so
Stack frame #16  pc 0000d278  /system/lib/libc.so (__thread_entry+72)
Stack frame #17  pc 0000d410  /system/lib/libc.so (pthread_create+240)
hjanetzek commented 10 years ago

The same crash with an Adreno chipset and KitKat was reported earlier today (via mail). The user found that GL.glDrawElements(GL20.GL_LINES ... ) in ExtrusionRenderer triggers the problem. I haven't looked further into it yet, though googling 'adreno kitkat rb_memcpy' shows the issue also happens elsewhere.

baldur commented 10 years ago

Just dumping info here for what it's worth: https://gist.github.com/baldur/9652381

It's a bit strange: the map loads fine, but as soon as you start interacting with it, it will eventually crash, and it's almost always shortly after a GC run. I have tried fiddling with some of the code, including the line you mentioned, and I am not convinced it's the same issue. I've been unable to attach a gdb debugger due to various issues, http://developer.samsung.com/forum/thread/ndk-debugging-with-gdb/77/178834 being one of them.

I am sort of running out of ideas and if you have any advice on how I can help with debugging this further please let me know ... thankfully we do have devices still with 4.3 so we are not pressed for time but I can definitely spend a bit more time if you have ideas for what would be good to experiment with in order to identify the culprit.

hjanetzek commented 10 years ago

There were some reports for Unity providing similar traces, so I'm pretty sure it's a bug in the driver, or in Android memory management. One way to test whether it is caused by buffer data being garbage collected before it is moved to GL memory would be to comment out 'mUsedBuffers = releaseAll(mUsedBuffers);' in MapRenderer. If that's the case, one could keep references to the Buffer objects and not reuse them until the corresponding VBOs have been drawn once.
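
For illustration, a minimal sketch of the "keep references and don't reuse too early" idea. It assumes a hypothetical pool class (`DeferredBufferPool` and its method names are made up here and are not vtm's MapRenderer code): buffers handed out during one frame are held by a strong reference and only return to the free list after a further frame has ended.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pool; class and method names are illustrative, not vtm API. */
class DeferredBufferPool {
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
    private List<ByteBuffer> usedThisFrame = new ArrayList<>();
    private List<ByteBuffer> usedLastFrame = new ArrayList<>();

    /** Hands out a direct buffer with at least the requested capacity. */
    ByteBuffer obtain(int capacity) {
        ByteBuffer buf = free.poll();
        if (buf == null || buf.capacity() < capacity) {
            // Nothing suitable pooled: allocate a new one (a too-small one is dropped).
            buf = ByteBuffer.allocateDirect(capacity);
        }
        buf.clear();
        usedThisFrame.add(buf); // strong reference keeps the buffer alive
        return buf;
    }

    /** Call once per frame, after all draw calls for the frame were issued. */
    void endFrame() {
        // Buffers from the previous frame have been drawn at least once by now.
        free.addAll(usedLastFrame);
        usedLastFrame.clear();
        // Swap the lists: this frame's buffers wait one more frame before reuse.
        List<ByteBuffer> tmp = usedLastFrame;
        usedLastFrame = usedThisFrame;
        usedThisFrame = tmp;
    }

    /** Number of buffers currently available for reuse. */
    int freeCount() {
        return free.size();
    }
}
```

Whether one frame of delay is enough for a given driver is an assumption; the point is only that the Java side never drops or overwrites a buffer the driver might still read.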

hjanetzek commented 10 years ago

Could you try changing GL_DYNAMIC_DRAW in BufferObject to GL_STATIC_DRAW? It might use a different path in the driver and circumvent the problem.

baldur commented 10 years ago

Tried both of those to no avail ... everything points to a problem with the driver, as you mentioned. The next step I am planning is to root the device and reset some of the settings, as suggested here https://developer.qualcomm.com/forum/qdevnet-forums/mobile-technologies/mobile-gaming-graphics-optimization-adreno/26936. This mentions OpenGL 3, so I am not sure if it applies to our situation, but it's worth a shot.

The post does mention another workaround which I didn't quite understand but perhaps you understand what he means by:

I have found another thing to do that helps decrease the chance of it crashing, 
this is another terrible workaround but it "works." Every time I draw something with 
glDrawRangeElements, I insert a eglSwapBuffers. This has the downfall of absolutely 
murdering performance and introducing flickering, but again, it helps lessen the 
chance of crashing.
stleusc commented 10 years ago

I had just reported the same via email and then I found it here ;-) One user of my app reported the same issue, same device, etc. Any ideas here?

stleusc commented 10 years ago

Not sure if what they talk about here can be applied (it was about buffers...) http://www.tasharen.com/forum/index.php?topic=8415.msg42698#msg42698

hjanetzek commented 10 years ago

If you could check the ant traces file one might see whether the crash is always triggered by a rendering call from the same renderer. One reporter told me that it happens in ExtrusionRenderer but he didn't reply to confirm that it's the only place. In that case one could disable 3D buildings for blacklisted drivers...

baldur commented 10 years ago

Just for the record I can crash without the building layer added ... so I am not sure if that approach will suffice.

bcamper commented 10 years ago

Also blacklisting two of the most popular GPUs doesn't feel like a great permanent solution (though maybe a short-term band-aid).

hjanetzek commented 10 years ago

Could you send the crash details to qualcomm? - It seems one can get direct feedback on their forum with such issues: https://developer.qualcomm.com/forum/qdevnet-forums/mobile-technologies/mobile-gaming-graphics-optimization-adreno/27030

stleusc commented 10 years ago

Is there any way to use OpenGL ES 3 on these devices? I read that this fixed it in other apps.

hjanetzek commented 10 years ago

@stleusc where did you find that?

stleusc commented 10 years ago

I don't remember :-( Would it be hard to implement the change?

hjanetzek commented 10 years ago

I guess it wouldn't, if it were possible to enable gles3. For the driver it should make no difference, as gles2 is a strict subset of the gles3 API, but from what I've read about Adreno drivers[1] I wouldn't count on "should" :)

[1] https://dolphin-emu.org/blog/2013/09/26/dolphin-emulator-and-opengl-drivers-hall-fameshame/

stleusc commented 10 years ago

well according to this: http://developer.android.com/guide/topics/graphics/opengl.html you can check if gles3 is supported and if so, use it!

hjanetzek commented 10 years ago

might be worth a try, maybe it really switches the complete driver .so... In org.oscim.android.gl.GLView() add:

        setEGLContextFactory(new GLSurfaceView.EGLContextFactory() {
            // EGL attribute key for the requested client API version
            private static final int EGL_CONTEXT_CLIENT_VERSION = 0x3098;

            @Override
            public EGLContext createContext(EGL10 egl, EGLDisplay display, EGLConfig eglConfig) {
                // First try to get an OpenGL ES 3 context
                Log.w("", "creating OpenGL ES3 context");
                int[] attrib_list = { EGL_CONTEXT_CLIENT_VERSION, 3, EGL10.EGL_NONE };
                EGLContext context = egl.eglCreateContext(display, eglConfig,
                                                          EGL10.EGL_NO_CONTEXT, attrib_list);
                if (context != EGL10.EGL_NO_CONTEXT)
                    return context;

                // Fall back to OpenGL ES 2 when no ES 3 context is available
                Log.w("", "creating OpenGLES2 context");
                attrib_list[1] = 2;
                context = egl.eglCreateContext(display, eglConfig, EGL10.EGL_NO_CONTEXT,
                                               attrib_list);
                return context;
            }

            @Override
            public void destroyContext(EGL10 egl, EGLDisplay display, EGLContext context) {
                egl.eglDestroyContext(display, context);
            }
        });

        setEGLConfigChooser(new GlConfigChooser());
        // replaced by the context factory above
        //setEGLContextClientVersion(2);
baldur commented 10 years ago

I had problems compiling your code sample, but I set the clientVersion directly to 3 and also changed all the constants in AndroidGL to use GLES30 in lieu of GLES20, but I still get crashes.

libGLESv2 seems to suggest that the driver for v2 is still being used so I wonder if this is not enough to get it to use gles3. Do you know what I can call in the running app to verify that I have successfully set it to use gles3?

********** Crash dump: **********
Build fingerprint: 'samsung/jflteuc/jflteatt:4.4.2/KOT49H/I337UCUFNB1:user/release-keys'
pid: 1143, tid: 1254, name: Thread-10097  >>> com.mapzen <<<
signal 7 (SIGBUS), code 2 (BUS_ADRERR), fault addr 7efd4940
Stack frame #00  pc 0002225c  /system/lib/libc.so (__memcpy_base+227)
Stack frame #01  pc 00071107  /system/vendor/lib/egl/libGLESv2_adreno.so (rb_memcpy+282)
Stack frame #02  pc 0007d4a1  /system/vendor/lib/egl/libGLESv2_adreno.so (rb_vbo_cache_buffer+320)
Stack frame #03  pc 000465a7  /system/vendor/lib/egl/libGLESv2_adreno.so (cache_vbo_attrib+298)
Stack frame #04  pc 0004962d  /system/vendor/lib/egl/libGLESv2_adreno.so
Stack frame #05  pc 00049da5  /system/vendor/lib/egl/libGLESv2_adreno.so (core_glDrawElementsInstancedXXX+140)
Stack frame #06  pc 00049fd7  /system/vendor/lib/egl/libGLESv2_adreno.so (core_glDrawElements+10)
Stack frame #07  pc 00039767  /system/vendor/lib/egl/libGLESv2_adreno.so (glDrawElements+28)
Stack frame #08  pc 00020bcc  /system/lib/libdvm.so (dvmPlatformInvoke+112)
Stack frame #09  pc 00051927  /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+398)
Stack frame #10  pc 00000214  /dev/ashmem/dalvik-jit-code-cache (deleted)
hjanetzek commented 10 years ago

If there is no /system/vendor/lib/egl/libGLESv3_adreno.so it is probably the correct library. What was the problem with the code above? It's the recommended way to query the gl version at http://developer.android.com/guide/topics/graphics/opengl.html - when the first call to eglCreateContext does return a context then you have a gles3 context.
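
As a hedged sketch of another way to check this from the running app: on the GL thread, with a current context, `GLES20.glGetString(GLES20.GL_VERSION)` returns a string that the OpenGL ES specification says starts with `OpenGL ES <major>.<minor>`. The helper below only parses such a string; the class name `GlesVersion` is made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of vtm. */
class GlesVersion {
    // Matches the "OpenGL ES <major>.<minor>" prefix of a GL_VERSION string.
    private static final Pattern VERSION = Pattern.compile("OpenGL ES (\\d+)\\.(\\d+)");

    /** Returns the major version, or -1 if the string is null or has another format. */
    static int majorVersion(String glVersion) {
        if (glVersion == null) {
            return -1;
        }
        Matcher m = VERSION.matcher(glVersion);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }
}
```

In an app one would pass in the result of `glGetString` and log it; a return value of 3 would indicate that the context really is a gles3 one.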

Maybe we can figure out if one specific vtm renderer triggers the crash - There are not many uses of glDrawElements. Have you tried to turn off LabelLayer and BuildingLayer? Thinking about it I suspect LineTexLayer.Renderer.draw() is the one - just comment out the body to check. I could write a simpler version if that one makes trouble :)

hjanetzek commented 10 years ago
while (curLayer != null && curLayer.type == TEXLINE) 
   curLayer = curLayer.next;
return;

must remain in draw() though ...

baldur commented 10 years ago

I have previously tried pulling out both the building and label layers ... and now I tried emptying out the body of the draw method, and I am still seeing failures.

hjanetzek commented 10 years ago

So if there is no call to glDrawElements done by vtm anymore (all the other renderers use glDrawArrays), then glDrawElements can only be called by the Android UI or compositor, i.e. after a GL context switch. Still, I would like to find out which vtm renderer is involved: could you disable draw() in LineLayer and PolygonLayer the same way?

baldur commented 10 years ago

@hjanetzek we have made some progress here and have identified the culprit: https://github.com/opensciencemap/vtm/blob/master/vtm/src/org/oscim/renderer/elements/TextureLayer.java#L192

By commenting out that GL.glDrawElements call we have a running app that doesn't crash. We found this by looking at which shaders were affected; the app also runs when the main methods in this shader are made blank:

https://github.com/opensciencemap/vtm/blob/master/vtm/resources/assets/shaders/texture_layer.glsl

We don't know yet how to fix it but we wanted to give you an update to see if you had thoughts in the light of this discovery.

hjanetzek commented 10 years ago

When an attribute is not used in the shader it will be optimized out and GL.glGetAttribLocation will return an invalid handle (< 0). So glDrawElements will probably fail before even fetching data from the VBO (fail in the usual GL way: just show nothing). If you are sure that only the texture renderer is involved, i.e. the crash happens when the texture renderer is the only one active, one could try to use dynamic vertex arrays instead of the VBO: https://github.com/opensciencemap/vtm/commit/b729a5298e9599868d0b4f33245483ff63eaf01e
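
To make the first point concrete, here is a small sketch of guarding against an optimized-out attribute. The nested `Gl` interface is a hypothetical stand-in for the two GL entry points involved (it is not vtm's GL interface), so that the logic can be shown without a device.

```java
/** Hypothetical helper, not part of vtm. */
class AttribGuard {
    /** Minimal stand-in for the GL calls used here. */
    interface Gl {
        int glGetAttribLocation(int program, String name);
        void glEnableVertexAttribArray(int index);
    }

    /**
     * Enables the vertex attribute array only if the attribute is active in the
     * linked program. Returns false when the location is invalid (< 0), e.g.
     * because the shader compiler removed an unused attribute.
     */
    static boolean enableIfPresent(Gl gl, int program, String name) {
        int location = gl.glGetAttribLocation(program, name);
        if (location < 0) {
            return false;
        }
        gl.glEnableVertexAttribArray(location);
        return true;
    }
}
```

This is why blanking the shader's main methods hides the crash without proving where it comes from: with no active attributes, the draw call has nothing to fetch from the buffer.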

baldur commented 10 years ago

Awesome thanks so much, this patch appears to be working. Looks like the icons for pois are missing though.

hjanetzek commented 10 years ago

Good to hear that this works. I've added no-vbo option to SymbolLayer now: https://github.com/opensciencemap/vtm/commit/bdc63d8e91a551ef258b664ce7425ec20a29fc5b

Actually, I squashed the SymbolLayer change again and added a useVBO option to ElementLayers for putting vertex data into a separate buffer. This only works when ElementLayers contains only TextureLayers though.

baldur commented 10 years ago

Awesome ... poi's are back and the map appears to be running smoothly on affected devices. Thanks again for fixing this.

stleusc commented 10 years ago

I also gave the fix to my affected users. Report back is that the issue is gone!

Great work! Thanks....

hjanetzek commented 10 years ago

Merged with a check in MapView to enable the workaround for Samsung devices running Kitkat - If you have the exact models for the affected devices this test could be made more specific, but I guess devices running Kitkat are fast enough to have no measurable performance difference using no VBO in this case.

bcamper commented 10 years ago

Thanks! We know the S4 (Adreno 320) and S5 (Adreno 330) devices are affected - those are two of the most popular (maybe most?) Samsung devices in the US.

hjanetzek commented 10 years ago

I could reproduce the crash with an S5 now. It seems the problem is actually the use of glBufferSubData (which seems to reliably have issues with Adreno). Just disabling glBufferSubData makes it work for me. The crash probably shows up in the text renderer because its vertex data is replaced most frequently. So I guess a more appropriate fix would be https://github.com/opensciencemap/vtm/commit/9c1ae887ea29ad92d1836db02414c2382005187e
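
A generic sketch of what "disabling glBufferSubData" can look like, without claiming this is what the linked commit does: when a flag is set, the whole buffer store is re-specified with glBufferData instead of being updated in place. The `Gl` interface, the `VboUpload` class and the `noBufferSubData` parameter are hypothetical names for illustration; the two constants are the standard GL enum values.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper, not part of vtm. */
class VboUpload {
    /** Minimal stand-in for the two GL entry points involved. */
    interface Gl {
        void glBufferData(int target, int size, ByteBuffer data, int usage);
        void glBufferSubData(int target, int offset, int size, ByteBuffer data);
    }

    static final int GL_ARRAY_BUFFER = 0x8892;
    static final int GL_DYNAMIC_DRAW = 0x88E8;

    /**
     * Uploads 'size' bytes to the currently bound array buffer and returns the
     * size of the buffer store afterwards.
     */
    static int upload(Gl gl, ByteBuffer data, int size, int allocated,
                      boolean noBufferSubData) {
        if (noBufferSubData || size > allocated) {
            // Re-specify the whole store; the driver gets fresh storage.
            gl.glBufferData(GL_ARRAY_BUFFER, size, data, GL_DYNAMIC_DRAW);
            return size;
        }
        // In-place update of the existing store.
        gl.glBufferSubData(GL_ARRAY_BUFFER, 0, size, data);
        return allocated;
    }
}
```

The trade-off is that re-specifying the store may cost an allocation per update, which matters most for layers whose vertex data changes every few frames.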

baldur commented 10 years ago

@hjanetzek we found another device which has issues: HTC One (M8), 4.4.2, HTC Sense version 6.0

Here is a gist from the logcat if that's useful https://gist.github.com/baldur/9dd383bfba1b83bb9593

As before, we tried commenting out the GL.glDrawElements call in https://github.com/opensciencemap/vtm/blob/master/vtm/src/org/oscim/renderer/elements/TextureLayer.java which stops the crashing. We are happy to help troubleshoot this problem, so feel free to ask us for more details or to try things to sort this out.

hjanetzek commented 10 years ago

It seems to be the same problem. I wasn't pleased with the test for Samsung with KitKat anyway. I just found that one can get the vendor/renderer info via glGetString to disable use of glBufferSubData for these chips. Could you try https://github.com/opensciencemap/vtm/tree/testing-adreno
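
A sketch of such a check, under the assumption that the strings come from `glGetString(GL_VENDOR)` and `glGetString(GL_RENDERER)` on the GL thread. The class name `DriverQuirks` and the matching rule (renderer contains "adreno", case-insensitive) are illustrative guesses, not the code in the testing-adreno branch.

```java
import java.util.Locale;

/** Hypothetical helper, not part of vtm. */
class DriverQuirks {
    /**
     * Returns true if the renderer string looks like an Adreno GPU. The vendor
     * string is accepted for logging or for stricter rules, but is not needed
     * for this simple match.
     */
    static boolean isAdreno(String vendor, String renderer) {
        if (renderer == null) {
            return false;
        }
        return renderer.toLowerCase(Locale.ROOT).contains("adreno");
    }
}
```

Compared with a manufacturer and OS version test, this targets the GPU family directly, so it also covers devices like the HTC One (M8) mentioned above.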

baldur commented 10 years ago

awesome that did the trick ... thanks

Bezzu commented 9 years ago

Hi, I have had the same issue on a Samsung Galaxy S4 mini with Android 4.4.2 and I solved it by setting the variable "NO_BUFFER_SUB_DATA" in the file vtm/org/oscim/backend/GLAdapter.java to true.