navit-gps / navit

The open source (GPL v2) turn-by-turn navigation software for many OS
http://www.navit-project.org

Navit gets killed on systems with hardened malloc #1165

Open albrechtd opened 2 years ago

albrechtd commented 2 years ago

I use Navit 0.5.6 on a Pixel 4a running the current GrapheneOS (Android 12 based). Routing works fine, but after a while Navit “spontaneously” moves into the background, i.e. the start screen is displayed, and a dot on the Navit icon indicates that it's still running (no, I did not touch the device; it just sits in its mount). This happens randomly, after anywhere between roughly 1 and 20 minutes. Tapping the Navit icon brings it back to the foreground, and the route is recalculated. If I fail to tap the icon before the screen lock kicks in, I have to unlock the device. Needless to say, this behavior is somewhat dangerous while driving – the app should just stay in the foreground until I stop it manually… Is there an option to “force” Navit into the foreground? Any other ideas on how I could fix this issue?

jkoan commented 2 years ago

Could you provide a logcat? Or at least scan logcat for any on* calls to navit.

mvglasow commented 2 years ago

Odd that I never had that issue. Just last weekend, I drove from Chemnitz to Augustów (some 10 hours), had the phone in the mount with Navit active most of the time, and it stayed where it was. I have never observed the issue described, though I occasionally use Navit in the manner described, for car trips lasting several hours. I am running a fairly recent build (straight from Git) on LineageOS 18.1 (Android 11). Might be something introduced with Android 12 (with every new version, Google finds new ways to break existing apps), or something specific to certain Android distributions or devices.

In any case, if you can provide a logcat, that might be helpful.

albrechtd commented 2 years ago

Thanks a lot for your feedback! Unfortunately, the log starts only after the last use of Navit (last Monday). Just entering a route and leaving the device on the desk doesn't trigger the issue. I'll try again next time I use my car (not before Friday, though) and get back to you.

mvglasow commented 2 years ago

Maybe you can narrow the issue down further by examining the differences between using Navit in the car vs. on your desk: apart from the obvious difference that the car is moving – when you use the phone in your car, is it plugged into a charger? Does it have a Bluetooth connection to your car stereo? You can try eliminating these one by one to find out if one of them is the culprit: start Navit and calculate a route while parked, unplug the charger, disable Bluetooth – do any of these influence how Navit behaves?

albrechtd commented 2 years ago

Today I was able to reproduce the issue and capture the log (attached: navit-issue.zip): I started Navit at 11:30, and the issue happened at 11:43. I'm afraid the log is not very helpful, though:

11-20 11:43:02.322 14467 14467 F hardened_malloc: fatal allocator error: canary corrupted
11-20 11:43:02.322 14467 14467 F libc    : Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 14467 (itproject.navit), pid 14467 (itproject.navit)
11-20 11:43:02.388 14467 14467 F libc    : crash_dump helper failed to exec, or was killed
11-20 11:43:02.580  1433  5377 I ActivityManager: Process org.navitproject.navit (pid 14467) has died: fg  TOP 
11-20 11:43:02.582  1433  5377 W ActivityTaskManager: Force removing ActivityRecord{fa617a5 u0 org.navitproject.navit/.Navit t288}: app died, no saved state
11-20 11:43:02.605  1433  5377 W ActivityTaskManager: Can't find TaskDisplayArea to determine support for multi window. Task id=288 attached=false
11-20 11:43:02.605  1433  5377 W ActivityTaskManager: Can't find TaskDisplayArea to determine support for multi window. Task id=288 attached=false
11-20 11:43:02.606  1433 15231 D CompatibilityChangeReporter: Compat change id reported: 157929241; UID 1000; state: LOGGED
11-20 11:43:02.607  1433  2386 W WindowManager: Failed looking up window session=Session{4051771 14467:u0a10191} callers=com.android.server.wm.WindowManagerService.windowForClientLocked:5591 com.android.server.wm.WindowState$DeathRecipient.binderDied:3040 android.os.IBinder$DeathRecipient.binderDied:314 
11-20 11:43:02.608  1433  2386 I WindowManager: WIN DEATH: null
11-20 11:43:02.609  1433  1569 W ActivityManager: setHasOverlayUi called on unknown pid: 14467

Please note that I run Navit on GrapheneOS, which uses a hardened malloc implementation; that is apparently what emits the message above. Unfortunately, I'm not familiar with Android development, so it is unclear to me whether this message is just a false positive or points to a real memory issue. I also have no idea how to activate the “crash_dump helper”, which might produce a stack trace or something similar – any ideas, or should I ask the GrapheneOS community? The issue appears to be quite easy to reproduce on this platform, so please let me know if I can add more information.

mvglasow commented 2 years ago

So, the issue is not that Navit somehow moves to the background, we are looking at a crash here. More specifically, the OS killed Navit following a memory access violation. This also explains why I never encountered this issue: apparently LineageOS is less paranoid about memory corruption, so the same behavior would not cause any issues unless the corrupted memory is read from.

Looks like Navit is writing to some memory address that it never allocated, or has allocated in the past but already freed, or beyond the boundaries of what was allocated. I have run across (and fixed) a bunch of such issues in Navit and found Valgrind a very helpful tool for that. Often this kind of error will go unnoticed and the program will “accidentally” still act as expected, but it may fail unpredictably at any time.
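
To illustrate what a “canary corrupted” error can mean, here is a minimal sketch in C of an allocator that places a known value (the canary) directly behind each block and checks it later. This is not hardened_malloc's actual code: `toy_malloc`, `toy_canary_ok`, `toy_free` and the canary value are hypothetical names made up for this example, and it assumes the check happens when the block is freed. A write just past the end of the block overwrites the canary, and the process is aborted once the damage is noticed, which can be long after the faulty write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy allocator; NOT how hardened_malloc is implemented. */
#define TOY_CANARY UINT64_C(0xC0FFEE1234ABCDEF)

/* Header stored in front of the user data; the union keeps the user
 * pointer suitably aligned for any type. */
typedef union {
    size_t size;        /* usable bytes requested by the caller */
    max_align_t align;  /* padding only, never read */
} toy_header;

/* Allocate `size` usable bytes, followed directly by a canary value. */
void *toy_malloc(size_t size) {
    unsigned char *raw = malloc(sizeof(toy_header) + size + sizeof(uint64_t));
    if (raw == NULL)
        return NULL;
    ((toy_header *)raw)->size = size;
    uint64_t canary = TOY_CANARY;
    memcpy(raw + sizeof(toy_header) + size, &canary, sizeof canary);
    return raw + sizeof(toy_header);
}

/* Return 1 if the canary behind the block is intact, 0 if overwritten. */
int toy_canary_ok(const void *ptr) {
    assert(ptr != NULL);
    const unsigned char *user = ptr;
    const toy_header *h = (const toy_header *)(user - sizeof(toy_header));
    uint64_t canary;
    memcpy(&canary, user + h->size, sizeof canary);
    return canary == TOY_CANARY;
}

/* Free a block; abort the process if the canary was overwritten. */
void toy_free(void *ptr) {
    if (ptr == NULL)
        return;
    if (!toy_canary_ok(ptr))
        abort();  /* comparable to the SIGABRT in the log above */
    free((unsigned char *)ptr - sizeof(toy_header));
}
```

With a plain allocator that has no such check, the same one-byte overflow would usually go unnoticed unless it happened to damage data that is read later, which would fit the observation that the problem does not show up on other systems.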

Drawback: I have no idea on how to do remote debugging with Valgrind; therefore I was only able to sort out the errors which I managed to reproduce with a Linux build of Navit on my local machine. I have written up an instruction for doing that, which can be found at https://wiki.navit-project.org/index.php/Eclipse#Valgrind.

If you can reproduce the crash on Android, you could try building Navit for Linux and see if you can reproduce the same behavior there. If so, you can run Navit inside Valgrind and examine where the corruption occurs. Valgrind will only report corruption if you actually manage to reproduce the error situation – in this case it will show you what instruction triggered the error, and, if applicable, where the memory address was allocated and freed.

In order to be really sure, you would have to run Navit inside Valgrind on the Android device and reproduce the error. However, I have no idea if this is possible at all (the VM which Valgrind uses comes at quite a performance penalty, so it might not be feasible on a mobile device), or how to do that. If at all possible, we would not even need an Android distro with hardened malloc – any distro will do; without hardened malloc Navit won’t get killed but Valgrind would report where the error occurred.

If someone is familiar with remote debugging on Android (e.g. using gdb), this might also help if you can reproduce the bug. To get something useful out of gdb, we need Navit to actually crash (i.e. by running it on a system with hardened malloc). You should see debug symbols (i.e. function names) in the stack trace, unless a function was called as a callback (in which case figuring out the function becomes a bit of guesswork).

Sorry I can’t help any further here, but I hope these pointers are helpful for someone else to pick up from here.

albrechtd commented 2 years ago

Hi @mvglasow, thanks a lot for your detailed reply!

> So, the issue is not that Navit somehow moves to the background, we are looking at a crash here.

Yes. As the start screen still showed the “activity dot” on the Navit icon, I got the impression it was still running…

> apparently LineageOS is less paranoid about memory corruption, so the same behavior would not cause any issues unless the corrupted memory is read from.

…that's the reason why I use GrapheneOS :sunglasses:…

> Looks like Navit is writing to some memory address that it never allocated, or has allocated in the past but already freed, or beyond the boundaries of what was allocated. I have run across (and fixed) a bunch of such issues in Navit and found Valgrind a very helpful tool for that. Often this kind of error will go unnoticed and the program will “accidentally” still act as expected, but it may fail unpredictably at any time.

Yes, Valgrind is an excellent tool for tracking such issues, but keep in mind that it drastically changes the timing behaviour of the application. In this particular case, it might be more helpful (at least on a desktop Linux box) to LD_PRELOAD the hardened malloc lib (or link statically) and analyse the core dump – no idea if this is possible with Android.

It might also be helpful to run a static analysis tool against the source code; cppcheck (oss), PC-lint Plus (€€), Eclair (€€€) or LDRA (€€€) might be candidates (the latter three if someone has access to them, maybe at work). I blindly ran cppcheck against the current git, and it reported a bunch of errors. However, I didn't configure it thoroughly, it is known to produce false positives, and I don't know the Navit code, so I cannot judge whether these are real issues.

Glancing over the code, I found some places where the pointer is not set to NULL after a free() or g_free() call which is the usual safeguard against double-free errors (and would always segfault for read-after-free). Again, as I'm not familiar with the code at all, I have no idea if this might help.
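
The safeguard mentioned above could look roughly like the sketch below. `FREE_AND_NULL`, `struct item` and `item_clear` are hypothetical names for this example and are not taken from the Navit source; plain `free()` is used to keep the example self-contained. For code that uses `g_free()`, GLib offers `g_clear_pointer()` for the same purpose.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of Navit: free a pointer and reset it, so
 * a second call becomes a harmless free(NULL) and a later dereference of
 * this pointer fails immediately instead of touching freed memory. */
#define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

/* Illustrative structure that owns a heap-allocated string. */
struct item {
    char *label;
};

/* Release the resources held by an item; safe to call more than once. */
void item_clear(struct item *it) {
    assert(it != NULL);
    FREE_AND_NULL(it->label);
}
```

Note that this only resets the one pointer passed to the macro. Other copies of the same address elsewhere in the program are left dangling, so it narrows down such bugs rather than ruling them out.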

I also suspect that gps position changes are required to trigger this particular issue – just activating a route and leaving the device on the desk doesn't crash (afaict). Does something like a “gpsd simulator” exist, i.e. an application into which a, say, gpx file can be fed, and which outputs the position data, so it looks to Navit on a Linux desktop as if it would be used in a car? IMHO, this might simplify debugging a lot.

Just my € 0.01, though…

mvglasow commented 2 years ago

> Valgrind is an excellent tool for tracking such issues, but keep in mind that it drastically changes the timing behaviour

Been there, done that, see wiki article ;-)

> I found some places where the pointer is not set to NULL after a free() or g_free() call which is the usual safeguard against double-free errors (and would always segfault for read-after-free)

Why not fix that code (should be fairly trivial even if you’re not familiar with the code) and submit it as a pull request? If something stops working with that fix applied, that is a fairly strong indication that something is wrong. Refer to this issue in the PR and make it clear that the PR doesn’t break anything. If anything, it points out issues that have been there all along.

> Does something like a “gpsd simulator” exist

We have a bunch of vehicle plugins, and I believe some of them can be “abused” to read input from a file. IIRC we have vehicle plugins for GPX and NMEA (there are tools out there to convert GPX to NMEA, if needed, such as gpsbabel). Apart from that, there is the demo vehicle, which simulates a vehicle which follows the route, if one is set, and otherwise stays in one place. Any of these might work to simulate a moving vehicle on a Linux desktop.
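
If no converter is at hand, position data for such a test can also be generated directly. The following is a minimal sketch in C that formats a position as an NMEA GGA sentence including its checksum. The function names are hypothetical, the fix quality, satellite count, HDOP and altitude fields are fixed placeholder values, and whether Navit's NMEA handling accepts exactly this output has not been verified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* NMEA checksum: XOR of all characters between '$' and '*' (exclusive). */
unsigned char nmea_checksum(const char *s) {
    unsigned char sum = 0;
    if (*s == '$')
        s++;
    while (*s != '\0' && *s != '*')
        sum ^= (unsigned char)*s++;
    return sum;
}

/* Convert decimal degrees to the NMEA (d)ddmm.mmmm form (sign dropped).
 * Sketch only: minutes that round up to 60.0000 are not carried over. */
static void to_nmea_coord(double deg, int deg_digits, char *out, size_t n) {
    double a = deg < 0 ? -deg : deg;
    int d = (int)a;
    double minutes = (a - d) * 60.0;
    snprintf(out, n, "%0*d%07.4f", deg_digits, d, minutes);
}

/* Write one GGA sentence for the given UTC time and position into `out`.
 * Returns the value of snprintf (sentence length on success). */
int nmea_format_gga(char *out, size_t n, int hh, int mm, int ss,
                    double lat, double lon) {
    char la[32], lo[32], body[160];
    assert(out != NULL && n > 0);
    to_nmea_coord(lat, 2, la, sizeof la);
    to_nmea_coord(lon, 3, lo, sizeof lo);
    /* Placeholder values: fix quality 1, 8 satellites, HDOP 1.0, 0 m. */
    snprintf(body, sizeof body,
             "$GPGGA,%02d%02d%02d,%s,%c,%s,%c,1,08,1.0,0.0,M,0.0,M,,",
             hh, mm, ss, la, lat < 0 ? 'S' : 'N', lo, lon < 0 ? 'W' : 'E');
    return snprintf(out, n, "%s*%02X\r\n", body, nmea_checksum(body));
}
```

Calling `nmea_format_gga()` once per track point, with increasing timestamps, and writing the results to a file yields an input that an NMEA-capable vehicle source could replay.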