ExportWorker gets killed by Android 14 due to new excessive binder call tracker feature

chenxiaolong commented 9 months ago

On Android 14 (Google Pixel stock firmware), the scheduled exports seem to be crashing, even though exports done manually via the UI work fine. When ExportWorker tries it run, it fails during the message export phase with a binder error:

--------- beginning of main
2023-10-21 00:03:56.047852 -0400 I/hub.tmo1.sms_ie(12068): Using CollectorTypeCC GC.
2023-10-21 00:03:56.054832 -0400 E/hub.tmo1.sms_ie(12068): Not starting debugger since process cannot load the jdwp agent.
2023-10-21 00:03:56.071151 -0400 D/CompatibilityChangeReporter(12068): Compat change id reported: 171979766; UID 10302; state: ENABLED
2023-10-21 00:03:56.071324 -0400 D/CompatibilityChangeReporter(12068): Compat change id reported: 242716250; UID 10302; state: ENABLED
2023-10-21 00:03:56.079462 -0400 D/ApplicationLoaders(12068): Returning zygote-cached class loader: /system_ext/framework/androidx.window.extensions.jar
2023-10-21 00:03:56.079519 -0400 D/ApplicationLoaders(12068): Returning zygote-cached class loader: /system_ext/framework/androidx.window.sidecar.jar
2023-10-21 00:03:56.091642 -0400 D/nativeloader(12068): Configuring clns-6 for other apk /data/app/~~-0i8XvqMvE7-tBm7q61ghw==/com.github.tmo1.sms_ie-YdWZ4xiIg2BCTfZVOeQIKQ==/base.apk. target_sdk_version=34, uses_libraries=, library_path=/data/app/~~-0i8XvqMvE7-tBm7q61ghw==/com.github.tmo1.sms_ie-YdWZ4xiIg2BCTfZVOeQIKQ==/lib/arm64, permitted_path=/data:/mnt/expand:/data/user/0/com.github.tmo1.sms_ie
2023-10-21 00:03:56.104880 -0400 V/GraphicsEnvironment(12068): Currently set values for:
2023-10-21 00:03:56.105009 -0400 V/GraphicsEnvironment(12068):   angle_gl_driver_selection_pkgs=[]
2023-10-21 00:03:56.105040 -0400 V/GraphicsEnvironment(12068):   angle_gl_driver_selection_values=[]
2023-10-21 00:03:56.105567 -0400 V/GraphicsEnvironment(12068): ANGLE GameManagerService for com.github.tmo1.sms_ie: false
2023-10-21 00:03:56.105675 -0400 V/GraphicsEnvironment(12068): com.github.tmo1.sms_ie is not listed in per-application setting
2023-10-21 00:03:56.105793 -0400 V/GraphicsEnvironment(12068): Neither updatable production driver nor prerelease driver is supported.
2023-10-21 00:03:56.161398 -0400 D/CompatibilityChangeReporter(12068): Compat change id reported: 160794467; UID 10302; state: ENABLED
2023-10-21 00:03:56.175046 -0400 D/CompatibilityChangeReporter(12068): Compat change id reported: 263076149; UID 10302; state: ENABLED
2023-10-21 00:03:56.226278 -0400 I/WM-WorkerWrapper(12068): Work [ id=48895fe2-ca9c-4a5a-956b-9ab885fb34ea, tags={ com.github.tmo1.sms_ie.ExportWorker, export } ] was cancelled
2023-10-21 00:03:56.226278 -0400 I/WM-WorkerWrapper(12068): java.util.concurrent.CancellationException: Task was cancelled.
2023-10-21 00:03:56.226278 -0400 I/WM-WorkerWrapper(12068):     at y1.h.e(SourceFile:33)
2023-10-21 00:03:56.226278 -0400 I/WM-WorkerWrapper(12068):     at y1.h.get(Unknown Source:21)
2023-10-21 00:03:56.226278 -0400 I/WM-WorkerWrapper(12068):     at androidx.appcompat.widget.j.run(SourceFile:125)
2023-10-21 00:03:56.226278 -0400 I/WM-WorkerWrapper(12068):     at androidx.appcompat.widget.j.a(SourceFile:1)
2023-10-21 00:03:56.226278 -0400 I/WM-WorkerWrapper(12068):     at androidx.appcompat.widget.j.run(SourceFile:10)
2023-10-21 00:03:56.226278 -0400 I/WM-WorkerWrapper(12068):     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
2023-10-21 00:03:56.226278 -0400 I/WM-WorkerWrapper(12068):     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:644)
2023-10-21 00:03:56.226278 -0400 I/WM-WorkerWrapper(12068):     at java.lang.Thread.run(Thread.java:1012)
2023-10-21 00:03:56.248936 -0400 D/CompatibilityChangeReporter(12068): Compat change id reported: 194532703; UID 10302; state: ENABLED
2023-10-21 00:03:56.249372 -0400 D/CompatibilityChangeReporter(12068): Compat change id reported: 253665015; UID 10302; state: ENABLED
2023-10-21 00:03:56.295989 -0400 I/SMSIE   (12068): Beginning messages export ...
2023-10-21 00:03:56.380188 -0400 E/JavaBinder(12068): !!! FAILED BINDER TRANSACTION !!!  (parcel size = 352)
--------- beginning of system
2023-10-21 00:03:56.380548 -0400 I/ActivityThread(12068): Removing dead content provider:android.content.ContentProviderProxy@4cf8418
2023-10-21 00:03:56.380592 -0400 I/ActivityThread(12068): Removing dead content provider:android.content.ContentProviderProxy@4cf8418

Manually triggering with scheduled job with the following command also results in the same crash:

adb shell cmd jobscheduler run -f com.github.tmo1.sms_ie 8

When the crash occurs, it leaves behind a partially written zip file with just the local zip entry header for messages.ndjson:

❯ zipdetails -vv --walk messages-2023-10-21.zip

00000 00003 00004 50 4B 03 04 LOCAL HEADER #1       04034B50 (67324752)
00004 00004 00001 14          Extract Zip Spec      14 (20) '2.0'
00005 00005 00001 00          Extract OS            00 (0) 'MS-DOS'
00006 00007 00002 08 08       General Purpose Flag  0808 (2056)
                              [Bits 1-2]            0 'Normal Compression'
                              [Bit  3]              1 'Streamed'
                              [Bit 11]              1 'Language Encoding'
00008 00009 00002 08 00       Compression Method    0008 (8) 'Deflated'
0000A 0000D 00004 7C 00 55 57 Last Mod Time         5755007C (1465188476) 'Fri Oct 20 20:03:56 2023'
0000E 00011 00004 00 00 00 00 CRC                   00000000 (0)
00012 00015 00004 00 00 00 00 Compressed Length     00000000 (0)
00016 00019 00004 00 00 00 00 Uncompressed Length   00000000 (0)
0001A 0001B 00002 0F 00       Filename Length       000F (15)
0001C 0001D 00002 00 00       Extra Length          0000 (0)
0001E 0002C 0000F 6D 65 73 73 Filename              'messages.ndjson'
                  61 67 65 73
                  2E 6E 64 6A
                  73 6F 6E
#
# ERROR: Local Directory exists, but Central Directory not found
#
# Error Count: 1
#
# Done

I'm not sure why this is happening yet. I'll see if I can reproduce the problem this weekend with a debug build to get a better stack trace. (The logs above are from the precompiled 2.2.0 APK from the Github release page.)

chenxiaolong commented 9 months ago

Hmm, it looks like the app is getting killed for doing too many binder calls in the background:

10-21 16:09:13.031  1452  2008 D ActivityManager: freezing 30385 com.github.tmo1.sms_ie.debug
10-21 16:09:13.031  1452  2008 D ActivityManager: Kill app due to repeated failure to freeze binder: 30385 com.github.tmo1.sms_ie.debug
10-21 16:09:13.032  1452  1749 I ActivityManager: Killing 30385:com.github.tmo1.sms_ie.debug/u0a312 (adj 900): excessive binder traffic during cached
10-21 16:09:13.041  1452  1767 I libprocessgroup: Successfully killed process cgroup uid 10312 pid 30385 in 5ms

This behavior was introduced in Android 14 in commit https://android.googlesource.com/platform/frameworks/base.git/+/71d75c09b9a06732a6edb4d1488d2aa3eb779e14%5E%21/. I'll dig around and see if there's a way this can be worked around.

tmo1 commented 9 months ago

Hi,

Thank you for reporting this, and thank you very much for taking the time and making the effort to figure out the problem and create a (well-documented!) workaround. You're clearly a more advanced Android developer than I am :)

I would like to understand the problem better, though, before merging a workaround like your PR: what, exactly, is the app doing wrong? I thought that doing high volume and long running database queries and file I/O is exactly the sort of thing that background work is designed for. What is the "excessive binder traffic" that Android is complaining about? Perhaps if I understood the problem better, I could redesign the scheduled export code to be more compliant with what Android expects.

chenxiaolong commented 9 months ago

Sure thing!

I don't think sms-ie is doing anything incorrectly. The problem is mainly in these two calls:

https://github.com/tmo1/sms-ie/blob/c56fa0e7d65cf44b69e3698bd3fe9e38a81850a9/app/src/main/java/com/github/tmo1/sms_ie/ImportExportMessages.kt#L179-L181

https://github.com/tmo1/sms-ie/blob/c56fa0e7d65cf44b69e3698bd3fe9e38a81850a9/app/src/main/java/com/github/tmo1/sms_ie/ImportExportMessages.kt#L229-L233

Each of those queries involves inter-process communication between sms-ie and Android's MmsProvider component, which performs a binder transaction under the hood. While the second query could potentially optimized so that it's only done once (and mid is filtered client side), I don't think the same optimization can be done for content://mms/$msgId/addr because it's querying a different URI each time. There doesn't seem to be a content://mms/addr for querying everything at once: https://cs.android.com/android/platform/superproject/main/+/main:packages/providers/TelephonyProvider/src/com/android/providers/telephony/MmsProvider.java;l=1163;drc=c89bafa3f22a9dce56adb9b7fab2f9149d18c456.

In my case, I have 5000 or so MMS messages (well, RCS), so it ends up performing 10000+ queries. It takes around 4 minutes to export and Android usually kills the process about 1 minute 15 seconds in (~3k queries).

I thought that doing high volume and long running database queries and file I/O is exactly the sort of thing that background work is designed for.

That's unfortunately no longer true in recent Android versions. There are special exceptions to this though, like downloading files, but in the general case, without using a foreground service, it is very likely that long-running (multiple minute) operations will be killed.

tmo1 commented 9 months ago

Ah, I think I understand - thanks for the detailed explanation.

Rant:

I find Android development really aggravating. Everything is constantly changing, and unlike in Linux, where (with apologies to Perl) easy things are generally easy and hard things are usually at least possible, in Android it feels like easy things are generally hard and lots of things turn out to be impossible.

The sheer amount of rules and restrictions imposed by the system just feels developer hostile, and while I understand that the purpose is to protect users and provide them with good user experiences, the result seems to be an unfortunate lack of availability of quality apps for those same users. E.g.: on Linux, there are innumerable backup applications, ranging from simple and easy to use to complex and powerful, from CLI to GUI, while (due to Android design decisions) there's literally not one decent generally usable backup app for normal non-rooted Android devices (which is in part why SMS I/E is needed in the first place).

To be fair, I'm just an amateur, hobbyist developer, and perhaps this is not an entirely fair and accurate assessment of the state of things on Android, but this is what the situation feels like to me

End rant.

Thank you again very much for your work on this, though. I'll do some testing, and if everything looks good, I'll merge the PR. (You mention not having a telephony-capable Android <= 13 device for testing - did you try testing on a pre-14 AVD?)

chenxiaolong commented 9 months ago

I 100% agree with your rant and feel the same way. Usable Linux phones can't come soon enough.

(You mention not having a telephony-capable Android <= 13 device for testing - did you try testing on a pre-14 AVD?)

I wasn't aware they had the telephony components. I'll give that a try and report back.

chenxiaolong commented 9 months ago

I tested on a handful of versions via AVD:

Android 4.4 (API 19)
- sms-ie's minimum sdk version.
- Scheduled exports aren't supported, but I found a crash when opening the settings screen and pushed a fix for it in my branch: https://github.com/tmo1/sms-ie/pull/131/commits/c4aa39c7c1bab07ba93eef4519640598b2d7c303.
Android 8.0 (API 26)
- Picked some arbitrary older Android version with no foreground service restrictions.
- Always runs as foreground service, regardless of battery optimization setting.
- App is unlikely to be killed.
Android 11 (API 30)
- Last Android version before restrictions for how foreground services can be started were introduced.
- Always runs as foreground service, regardless of battery optimization setting.
- App is unlikely to be killed.
Android 12 (API 31)
- First Android version to start restricting foreground services.
- Runs as background service by default. Runs as foreground service if battery optimizations are disabled.
- App is unlikely to be killed either way.
Android 13 (API 33)
- Last Android version before binder restrictions for background services were introduced.
- Runs as background service by default. Runs as foreground service if battery optimizations are disabled.
- App is unlikely to be killed either way.
Android 14 (API 34)
- First Android version to start restricting background binder calls.
- Runs as background service by default. Runs as foreground service if battery optimizations are disabled.
- App is likely to be killed when exporting many MMS messages via the background service. Runs reliably via the foreground service.

tmo1 commented 8 months ago

I found a crash when opening the settings screen and pushed a fix for it in my branch: https://github.com/tmo1/sms-ie/commit/c4aa39c7c1bab07ba93eef4519640598b2d7c303.

Thanks! I'm not sure why I didn't hit that crash when I tested the API 19 code, but I did reproduce it now, and confirmed that your fix fixes it.

App is likely to be killed when exporting many MMS messages via the background service. Runs reliably via the foreground service.

Thank you again very much for all the work, testing, and documenting you've done on this!

I've reviewed the code in your patch, and I think I understand what it's doing, but before I merge it, I'm trying to reproduce the crash you're seeing with the old code (and the patched code with battery optimizations disabled) and confirm that the fix works, but so far I haven't been able to reproduce the crash. I just did a scheduled export of 15400 MMS messages (on a Pixel 7 API 34 AVD) and it worked fine. Do you have any ideas as to how I can trigger the crash?

chenxiaolong commented 8 months ago

No problem at all!

Do you have any ideas as to how I can trigger the crash?

I forgot to mention this in my last post. I can reproduce the crash reliably, even after a factory reset, on a Pixel 8 Pro nearly every time, but for whatever reason, never in the AVD. In both cases, I restored from the same backup (~5000 MMS messages) and triggered a scheduled export.

I'm not sure if Android's CachedAppOptimizer is disabled in the emulator or if there's something that prevents it from running and killing apps. I know certain actions will disable it temporarily, like adb shell pm dump <package>. There's a CachedAppOptimizer.dump() function that would print out a lot of helpful statistics about what's going on, but I can't find a good way to call it without doing something hacky with root access.

tmo1 commented 8 months ago

Do you have any ideas as to how I can trigger the crash?

I realized that when I tested, I left the app in the foreground, so that's likely why the killing / freezing behavior did not occur :|

I did a bunch more tests with the app removed from the foreground, on API 33 and 34 (all on AVDs), and I observed the following:

I never encountered the Kill app due to repeated failure to freeze binder and Killing ... : excessive binder traffic during cached messages that you encountered - perhaps, as you suggested, the emulator behaves differently from physical devices in this regard?
Instead, in my testing, when the scheduled export ran as a background service, it was often / usually killed by ActivityManager with a different message, along the lines of: Killing 12531:com.github.tmo1.sms_ie/u0a191 (adj 700): excessive cpu 169000 during 300075 dur=309031 limit=25. This occurred with both API 33 and 34. At least some of the time, this occurred after about ten minutes, so I assume that this is due to Android's limitations on background services not specified as "long running" - I saw this and this, although I can't find precise documentation of Android's service killing rules.
Your foreground service patch seemed to work consistently in my testing as well, although I haven't tested extensively. When running as a foreground service, I observed, after the first ten minutes and every five minutes or so after that, log entries like Letting 82e10f7 #u0a191/36 com.github.tmo1.sms_ie/androidx.work.impl.background.systemjob.SystemJobService continue to run past min execution time, which supports the hypothesis that without foreground execution, Android is killing the service even in the absence of binder issues.

So it looks like your patch is definitely necessary for the reliable running of large scheduled exports, whether to avoid the binder issue or the "excessive cpu" / long-running service problem, but I wish I had a better understanding of all this.

Another, likely unrelated question: when doing all this testing, I noticed that the notifications that the app is supposed to show upon completion of a scheduled export run are not being displayed (in 33 or 34). This used to work, and I'm not sure when / why it no longer does. Did you notice the completion notifications in your testing?

chenxiaolong commented 8 months ago

Thanks for the additional testing!

Another, likely unrelated question: when doing all this testing, I noticed that the notifications that the app is supposed to show upon completion of a scheduled export run are not being displayed (in 33 or 34). This used to work, and I'm not sure when / why it no longer does. Did you notice the completion notifications in your testing?

This is because sms-ie currently doesn't use the POST_NOTIFICATIONS permission, which must be declared in the manifest and also requested at runtime as of API 33. Maybe sms-ie could prompt for that permission when enabling scheduled exports?

tmo1 commented 8 months ago

Maybe sms-ie could prompt for that permission when enabling scheduled exports?

Sounds right - I'll try it.

tmo1 commented 8 months ago

Maybe sms-ie could prompt for that permission when enabling scheduled exports?

Done - thanks!

tmo1 / sms-ie

ExportWorker gets killed by Android 14 due to new excessive binder call tracker feature #129