rust-mobile / android-activity

Glue for building Rust applications on Android with NativeActivity or GameActivity
251 stars 49 forks source link

Fix deadlock on activity onDestroy #94

Closed sagebind closed 1 year ago

sagebind commented 1 year ago

Fix a deadlock that occurs when an activity is destroyed without process termination, such as when an activity is destroyed and recreated due to a configuration change.

The deadlock occurs because notify_destroyed blocks until destroyed is set to true. This only occurs when WaitableNativeActivityState is dropped, but the WaitableNativeActivityState instance is the very thing being used to await for destruction, resulting in a deadlock.

This doesn't really make much sense, so instead, wait for the Rust-side main thread to terminate to ensure cleanup can proceed.

ardocrat commented 1 year ago

For me its happening when Android Service is running.

rib commented 1 year ago

Ah, right, oops!

Your fix generally looked good to me but one notable thing was that it looked like it could theoretically enable a different deadlock if the the spawned thread for android_main returned immediately then the wait for running to get set could block the Java main thread indefinitely.

I replaced running with a tristate enum to avoid that possibility.

Although it's redundant I've currently kept the destroyed field which is now set immediately when onDestroyed is called from the java main thread.

It would be great if you're able to double check that this updated PR works for you @sagebind and @ardocrat

sagebind commented 1 year ago

@rib Good catch, thanks. Your patch does look better, and I just tested it and it works just as well in my app. I wasn't totally confident in my implementation since I'm not familiar with the codebase (and new to Android in general), which is why I started with a draft PR.

rib commented 1 year ago

Thanks for testing @sagebind

It'd be good to make a patch release with this fix, and to do that before landing the GameActivity 2.0.2 update

ardocrat commented 1 year ago

Sadly not fixed at my case.

At logs I can see deadlock happened at pthread_cond_wait(&android_app->cond, &android_app->mutex); of android_app_free function from android_native_app_glue.c file and no logs from Rust glue.rs file.

So as temporary dirty fix I am just killing my process after some time to relaunch application later without problems :)

image

rib commented 1 year ago

@ardocrat ah okey, if you're seeing a reference to android_native_app_glue.c that implies you're using the game-activity backend which wouldn't be affected by the fix here currently which only touches the native-activity backend.

I guess there must be a similar bug in the game-activity backend.

rib commented 1 year ago

@ardocrat I can't currently see an equivalent bug in the game-activity backend.

In the game-activity backend it blocks in android_app_free (where you see the deadlock) waiting for the native (Rust) thread to respond to onDestroy with:

    pthread_mutex_lock(&android_app->mutex);
    android_app_write_cmd(android_app, APP_CMD_DESTROY);
    while (!android_app->destroyed) {
        pthread_cond_wait(&android_app->cond, &android_app->mutex);
    }
    pthread_mutex_unlock(&android_app->mutex);

and the thing that sets android_app->destroyed is the android_app_destroy function which should get called by android_app_entry immediately after _rust_glue_entry (i.e. after the app's android_main returns):

static void* android_app_entry(void* param) {
    // <snip>

    _rust_glue_entry(android_app); // (calls android_main)

    android_app_destroy(android_app); // <- should set app->destroyed = 1 to unblock android_app_free
    return NULL;
}
static void android_app_destroy(struct android_app* android_app) {
    LOGV("android_app_destroy!");
    free_saved_state(android_app);
    pthread_mutex_lock(&android_app->mutex);

    AConfiguration_delete(android_app->config);
    android_app->destroyed = 1;
    pthread_cond_broadcast(&android_app->cond);
    pthread_mutex_unlock(&android_app->mutex);
    // Can't touch android_app object after this.
}

Are you maybe able to add a bit of logging to confirm that your applications android_main is returning?

ardocrat commented 1 year ago

Thank you @rib, I am using eframe, I forgot to call frame.close() to unblock android_main, but another problem happened here, eframe.close() method is leading to call of Activity's onPause() method and its impossible to call finish() after this, cause Activity is at background, moreover its blocking app thread for 5 seconds and if I am opening app between moment of exit and this timeout (5 seconds) I am getting ANR popup to force stop the app. At logs I am seeing:

18:07:29.004  D  ************** mainWorkCallback *********
18:07:29.004  D  mainWorkCallback: cmd=1
18:07:29.004  V  android_app_destroy!
18:07:29.031  D  onPause_native
18:07:29.031  V  Pause: 0x722f923370
18:07:34.222  I  Thread[6,tid=8884,WaitingInMainSignalCatcherLoop,Thread*=0x72af9196f0,peer=0x137c0110,"Signal Catcher"]: reacting to signal 3
18:07:34.222  I  
18:07:34.742  I  Wrote stack traces to tombstoned
---------------------------- PROCESS ENDED (8874) for package com.example.android ----------------------------

I guess this problem is related to eframe's close() method, it is pausing Activity by calling this method at same time on app destroy.

rib commented 1 year ago

ah, curious.

The android lifecycle model (https://developer.android.com/guide/components/activities/activity-lifecycle) will go through onPause and onStop before reaching onDestroy.

From your log I see that it looks like android_app_destroy is called before onPause_native

The android_app isn't really in a usable state after android_app_destroy.

The comment at the end of android_app_destroy actually says // Can't touch android_app object after this though since it doesn't ever technically actually free() the android_app we should at least still be able to check app->destroyed

In the NativeActivity backend we have try_with_waitable_activity_ref utility that generalizes upgrading a weak reference for the state that may get dropped when the app exits so any Activity notifications afterwards become safe noops.

It looks like the GameActivity glue is missing these kinds of checks to consider that the app may have already destroyed when an Activity native method (like onPause_native) gets called.

We can hopefully address this with a check for app->destroyed in GameActivity.cpp:onPause_native, and should probably have similar checks for other native callbacks too.

ardocrat commented 1 year ago

@rib At this log I tried to manually finish it before calling frame.close(). if I am not calling Activity::finish() manually, this call is coming after android_main, but leads to same problem, cause Activity can not be finished at background. eframe puts Activity at pause state after calling close().

rib commented 1 year ago

looking at the eframe implementation for close() it seems to just set a flag that results in Winit setting control_flow::Exit which should cause the winit event loop to exit.

Changing exactly where you finish() won't necessarily make much difference since GameActivity_finish just sends an async message to the java main thread.

My guess at the moment is that calling eframe.close() is working as expected and resulting in your android_main then returning normally which will result in calling GameActivity_finish (when sends an async message to main thread) and then android_app_destroy gets calls which leaves the android_app in an inconsistent state.

The async finish message can then trigger onPause -> onStop -> onDestroy lifecycle events but by the time onPause runs then android_app is already in an inconsistent state, and (more importantly) the native thread for android_main has exit.

because onPause_native doesn't recognize that the app has already been destroyed it is then going to try and send a message from the java main thread to your rust/native thread (which won't work because the Rust thread has already exit).

I think actually looking at this a little closer we can't address this directly in onPause_native because the GameActivity/NativeCode class doesn't actually know anything about the implementation details of android_app in android_native_app_glue.c (it just calls the callbacks provided).

I think the right fix here is to probably make android_app->destroyed be a value we read with atomic ops (or consistently read with the android_app->mutex held) and then each android_app Java callback (like onPause) should start with a check of android_app->destroyed and immediately return if its set.

It does give me some renewed motivation to think about starting to port GameActivity to Rust though. This same issue almost certainly used to exist in the old NativeActivity backend too when it was written in C.

rib commented 1 year ago

@ardocrat I wonder if you could try out this local patch to android_app_set_activity_state:

 static void android_app_set_activity_state(struct android_app* android_app,
                                            int8_t cmd) {
     pthread_mutex_lock(&android_app->mutex);
-    android_app_write_cmd(android_app, cmd);
-    while (android_app->activityState != cmd) {
-        pthread_cond_wait(&android_app->cond, &android_app->mutex);
+    if (!android_app->destroyed) {
+        android_app_write_cmd(android_app, cmd);
+        while (android_app->activityState != cmd) {
+            pthread_cond_wait(&android_app->cond, &android_app->mutex);
+        }
     }
     pthread_mutex_unlock(&android_app->mutex);
 }

that doesn't account for all callbacks but should make onPause and onStop gracefully return if the app has already exit.

You might also need this change for android_app_free to make onDestroy also act gracefully:

 static void android_app_free(struct android_app* android_app) {
     pthread_mutex_lock(&android_app->mutex);
+
+    if (!android_app->destroyed) {
+        pthread_mutex_unlock(&android_app->mutex);
+        return;
+    }
+
     android_app_write_cmd(android_app, APP_CMD_DESTROY);
     while (!android_app->destroyed) {
         pthread_cond_wait(&android_app->cond, &android_app->mutex);
rib commented 1 year ago

I can also aim to push a branch that you can try out with those fixes and more for the other callbacks, but just in the middle of some other android-activity changes atm and don't want to fully switch context just yet.

It's probably also worth noting that there's this PR that's looking to update the GameActivity backend and I'll also look at porting these changes to that branch if they help: https://github.com/rust-mobile/android-activity/pull/88

ardocrat commented 1 year ago

It moved further, now I can see onSurfaceDestroyed_native and NativeWindowDestroyed events at logs, but sadly only onPause was called from Activity lifecycle and it was not finished, so when I am opening back the app, I can see only black screen, without ANR (Application Not Responding) popup now.

image

rib commented 1 year ago

ah, okey, yeah I though you might get away with testing by just handling android_app_set_activity_state but yeah, android_app_set_window is another api that's called by one of the java callbacks that needs a guard. I made the changes locally but will try and share a branch soon.

For android_app_set window I added:

@@ -308,6 +308,10 @@ static void android_app_set_window(struct android_app* android_app,
                                    ANativeWindow* window) {
     LOGV("android_app_set_window called");
     pthread_mutex_lock(&android_app->mutex);
+    if (!android_app->destroyed) {
+        pthread_mutex_unlock(&android_app->mutex);
+        return;
+    }
     if (android_app->pendingWindow != NULL) {
         android_app_write_cmd(android_app, APP_CMD_TERM_WINDOW);
     }
rib commented 1 year ago

@ardocrat I've just created this issue to track the game-activity deadlock: https://github.com/rust-mobile/android-activity/issues/97

I'll treat it as a separate issue for now, so we can merge this fix as is first.