beginOauthFlow - Lockwise iOS Crashes

kaylagalway commented 4 years ago

There has been a large uptick in crash reports in the new version of Lockwise iOS (1.7.0) that seem to lead back to the beginOathFlow function. Within this update, the app was updated from Xcode 10 to Xcode 11.2.1. The application services framework was updated from v0.32.1 to v0.42.3.

The relevant portion of the crash logs are here:


CrashReporter Key:   5d58b0f26a009d004644ffd32b63bbe0af388ee9
Hardware Model:      iPhone11,8
Process:             Firefox Lockbox [23773]
Path:                /private/var/containers/Bundle/Application/7DB9E195-BD96-4D1B-8C51-49708761CBC6/Firefox Lockbox.app/Firefox Lockbox
Identifier:          org.mozilla.ios.Lockbox
Version:             3902 (1.7.0)
AppStoreTools:       11B48b
AppVariant:          1:iPhone11,8:13
Code Type:           ARM-64 (Native)
Role:                Foreground
Parent Process:      launchd [1]
Coalition:           org.mozilla.ios.Lockbox [1016]
Date/Time:           2019-11-25 14:58:48.1907 +0100
Launch Time:         2019-11-25 14:58:46.7381 +0100
OS Version:          iPhone OS 13.2.3 (17B111)
Release Type:        User
Baseband Version:    2.02.12
Report Version:      104
Exception Type:  EXC_BAD_ACCESS (SIGSEGV)
Exception Subtype: KERN_INVALID_ADDRESS at 0x0000000000000000
VM Region Info: 0 is not in any region.  Bytes before following region: 4339351552
      REGION TYPE                      START - END             [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      UNUSED SPACE AT START
--->  
      __TEXT                 0000000102a54000-0000000102a58000 [   16K] r-x/r-x SM=COW  ...refox Lockbox
Termination Signal: Segmentation fault: 11
Termination Reason: Namespace SIGNAL, Code 0xb
Terminating Process: exc handler [23773]
Triggered by Thread:  0
Thread 0 name:
Thread 0 Crashed:
0   WebKit                          0x0000000191fe2a58 WTF::Detail::CallableWrapper<WebKit::WebsiteDataStore::fetchDataAndApply(WTF::OptionSet<WebKit::W... + 48 (Function.h:79)
1   JavaScriptCore                  0x00000001996b13a8 WTF::RunLoop::performWork() + 356 (Function.h:79)
2   JavaScriptCore                  0x00000001996b13a8 WTF::RunLoop::performWork() + 356 (Function.h:79)
3   JavaScriptCore                  0x00000001996b1624 WTF::RunLoop::performWork(void*) + 40 (RunLoopCF.cpp:38)
4   CoreFoundation                  0x000000018a50724c __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 28 (CFRunLoop.c:1922)
5   CoreFoundation                  0x000000018a5071a0 __CFRunLoopDoSource0 + 84 (CFRunLoop.c:1956)
6   CoreFoundation                  0x000000018a50690c __CFRunLoopDoSources0 + 184 (CFRunLoop.c:1992)
7   CoreFoundation                  0x000000018a5017d8 __CFRunLoopRun + 1068 (CFRunLoop.c:2882)
8   CoreFoundation                  0x000000018a501084 CFRunLoopRunSpecific + 480 (CFRunLoop.c:3192)
9   GraphicsServices                0x000000019474f534 GSEventRunModal + 108 (GSEvent.c:2246)
10  UIKitCore                       0x000000018e671670 UIApplicationMain + 1940 (UIApplication.m:4758)
11  Firefox Lockbox                 0x0000000102a58504 main + 68 (AppDelegate.swift:12)
12  libdyld.dylib                   0x000000018a380e18 start + 4
Thread 1 name:
Thread 1:
0   libsystem_kernel.dylib          0x000000018a376ccc __psynch_cvwait + 8
1   libsystem_pthread.dylib         0x000000018a2994fc _pthread_cond_wait + 680 (pthread_cond.c:591)
2   Firefox Lockbox                 0x0000000102daff30 std::sys::unix::condvar::Condvar::wait_timeout::h08dabe209171db44 + 220 (condvar.rs:157)
3   Firefox Lockbox                 0x0000000102dafbe0 std::thread::park_timeout::h582ababe4c2f3f4c + 356 (condvar.rs:51)
4   Firefox Lockbox                 0x0000000102c312f0 viaduct::backend::send::hab9fc07cb658defb + 9952 (wait.rs:122)
5   Firefox Lockbox                 0x0000000102bb6774 fxa_client::config::Config::remote_config::h3a19bf2f71470689 + 544 (lib.rs:83)
6   Firefox Lockbox                 0x0000000102b93ba8 std::panicking::try::do_call::h26bddc28d50c3df3 + 340 (config.rs:185)
7   Firefox Lockbox                 0x0000000102db4768 __rust_maybe_catch_panic + 36 (lib.rs:80)
8   Firefox Lockbox                 0x0000000102b9a00c fxa_begin_oauth_flow + 188 (panicking.rs:275)
9   Firefox Lockbox                 0x0000000102aef020 closure #1 in FirefoxAccount.beginOAuthFlow(scopes:completionHandler:) + 604 (FirefoxAccount.swift:208)
10  Firefox Lockbox                 0x0000000102a76138 thunk for @escaping @callee_guaranteed () -> () + 28 (<compiler-generated>:0)
11  libdispatch.dylib               0x000000018a230b7c _dispatch_call_block_and_release + 32 (init.c:1408)
12  libdispatch.dylib               0x000000018a231fd8 _dispatch_client_callout + 20 (object.m:495)
13  libdispatch.dylib               0x000000018a238450 _dispatch_lane_serial_drain + 568 (inline_internal.h:2484)
14  libdispatch.dylib               0x000000018a238e7c _dispatch_lane_invoke + 368 (inline_internal.h:1851)
15  libdispatch.dylib               0x000000018a241f20 _dispatch_workloop_worker_thread + 580 (queue.c:6445)
16  libsystem_pthread.dylib         0x000000018a297a94 _pthread_wqthread + 280 (pthread.c:2323)
17  libsystem_pthread.dylib         0x000000018a29dc7c start_wqthread + 8```

┆Issue is synchronized with this [Jira Story](https://jira.mozilla.com/browse/SYNC-445)
┆Sprint: SYNC - end 2019-12-06

kaylagalway commented 4 years ago

Entire example crash log is below (CLICK TO EXPAND)

rfk commented 4 years ago

4 Firefox Lockbox 0x0000000102c312f0 viaduct::backend::send::hab9fc07cb658defb + 9952 (wait.rs:122) 5 Firefox Lockbox 0x0000000102bb6774 fxa_client::config::Config::remote_config::h3a19bf2f71470689 + 544 (lib.rs:83)

This is us trying to load the fxa client config over the network.

rfk commented 4 years ago

@eoger is this something you might be able to help us dig in to?

rfk commented 4 years ago

As a side note, I'm surprised by the number of threads here that seem to be parked inside rust code; perhaps this is a default threadpool being managed by the tokio executor, and we could consider reducing its size given we don't expect to be doing lots of concurrent I/O from rust.

rfk commented 4 years ago

In the meeting today @kaylagalway mentioned that when compiling the lockwise-ios app, it emits a swift warning about nullability on some code in the app-services framework. If it's easy to catch, could you please link it in this bug for reference?

rfk commented 4 years ago

One odd thing I noticed here is that Thread1 is blocked calling beginOAuthFlow:

9 Firefox Lockbox 0x0000000102aef020 closure #1 in FirefoxAccount.beginOAuthFlow(scopes:completionHandler:) + 604 (FirefoxAccount.swift:208)

While Thread3 is initializing the FxA rust component as a result of clearing the store:

10 Firefox Lockbox 0x0000000102ab0944 AccountStore.initFxa() + 432 (AccountStore.swift:123) 11 Firefox Lockbox 0x0000000102ab01cc AccountStore.clear() + 1180 (AccountStore.swift:157)

I'm surprised to see both of these in flight at the same time; does it mean the app is trying to create a new instance of FirefoxAccount at the same time as it's operating on an existing one? (Such an attempt shouldn't crash, it just seems unusual and might be a clue here).

mhammond commented 4 years ago

Also Thread3 seems to be blocked waiting for the FirefoxAccount queue to yield so it can initialize:

1-7 appear to be waiting for the queue to become available 8 libswiftDispatch.dylib 0x00000001bf6a58f4 $sSo17OS_dispatch_queueC8DispatchE4sync7executexxyKXE_tKlF + 136 (Queue.swift:355) 9 Firefox Lockbox 0x0000000102aec030 FirefoxAccount.__allocating_init(config:) + 156 (FirefoxAccount.swift:68)

whereas the thread 1 stack frame you referenced is an observable for the login URL changing - and that thread is already executing on the queue. Best I can tell, there's one shared queue regardless of how many FirefoxAccount objects exist - so it may be that an "old" FirefoxAccount object is in the process of dieing? If we look further down Thread3 we see:

29 Firefox Lockbox 0x0000000102aa68a8 BaseDataStore.pushError(_:) + 880 (Dispatcher.swift:24) 30 Firefox Lockbox 0x0000000102aa865c closure #1 in BaseDataStore.reset() + 468 (BaseDataStore.swift:408)

So an exception has been caught resetting the store - and pushError in BaseDataStore.kt will dispatch an AccountAction.clear action - so the above sounds plausible.

That itself might be a clue to reproducing though - is an auth error or similar needed?

That said though, and contrary to what I said in the meeting, none of those rust threads look like they are crashing - the might be racing, but there's no evidence that in itself is causing the crash.

mhammond commented 4 years ago

That itself might be a clue to reproducing though - is an auth error or similar needed?

Another possibility is that rather than an auth error, it's just a normal oauth-token expiry? There's an issue in the android version when they expire, and that's difficult to reproduce as a dev because they live a (relatively) long time.

rfk commented 4 years ago

I'm also copying garvan's comment from slack, to put all the context in one place:

The odd thing to me in that trace is the main-thread is running WebKit code! I was sure that only happened in worker threads.

eoger commented 4 years ago

It is not quite clear what's happening here, but I wonder if https://github.com/mozilla/application-services/issues/1326 might help?

thomcc commented 4 years ago

KERN_INVALID_ADDRESS at 0x0000000000000000 means that this is a null pointer dereference. The rust code is parked / waiting on a condition variable at this time, and webkit is calling out to native code as shown by __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__, Or I assume that's what that means.

rfk commented 4 years ago

The odd thing to me in that trace is the main-thread is running WebKit code! I was sure that only happened in worker threads.

I guess that this line:

0 WebKit 0x0000000191fe2a58 WTF::Detail::CallableWrapper<WebKit::WebsiteDataStore::fetchDataAndApply(WTF::OptionSet<WebKit::W... + 48 (Function.h:79)

Is the code in AccountStore.clear where it tries to wipe all the localStorage and cookie data from the webview used in the login flow.

This makes me wonder if you've got multiple concurrent attempts at tearing down this state, which are stepping on each other's toes.

Another possibility is that rather than an auth error, it's just a normal oauth-token expiry? There's an issue in the android version when they expire, and that's difficult to reproduce as a dev because they live a (relatively) long time.

This is a good option to try to reproduce. You may be able to simulate a token expiry by going into https://accounts.firefox.com/settings and revoking the Lockwise app from there, which destroys its active OAuth tokens.

rfk commented 4 years ago

@kaylagalway as discussed in slack, I made a branch of application-services that sets an artificially low oauth token duration, which can hopefully help ease testing of how to app responds to expired oauth tokens:

https://github.com/mozilla/application-services/compare/short-duration-tokens?expand=1

If we're able to incorporate this into a build of Lockwise for iOS, then it will need to fetch a new FxA OAuth token every 30 seconds. Combined with e.g. backgrounding or restarting the app, I'm hoping this may help us reproduce the crashy behaviour here.

kaylagalway commented 4 years ago

I used a build created from the branch @rfk posted above and was able to use the faster token expiration to reproduce the issue. A threading issue resulting in Webkit code being called off the main thread was causing our crash. It was on our end, not yours, but it was triggered by a chain of events after token expiration. This ticket can be closed and I will make the necessary changes on my end to resolve the issue.

Thank you all for your help!

rfk commented 4 years ago

I used a build created from the branch @rfk posted above and was able to use the faster token expiration to reproduce the issue.

@kaylagalway would it be worth us exposing some sort of API to let the app set the token expiration time, to make it easier to test/debug things like this in future?

kaylagalway commented 4 years ago

@rfk Yeah, that would actually be great!

rfk commented 4 years ago

👍 filed as https://github.com/mozilla/application-services/issues/2334

mozilla / application-services

beginOauthFlow - Lockwise iOS Crashes #2279