Closed kaylagalway closed 4 years ago
4 Firefox Lockbox 0x0000000102c312f0 viaduct::backend::send::hab9fc07cb658defb + 9952 (wait.rs:122)
5 Firefox Lockbox 0x0000000102bb6774 fxa_client::config::Config::remote_config::h3a19bf2f71470689 + 544 (lib.rs:83)
This is us trying to load the fxa client config over the network.
@eoger is this something you might be able to help us dig in to?
As a side note, I'm surprised by the number of threads here that seem to be parked inside rust code; perhaps this is a default threadpool being managed by the tokio executor, and we could consider reducing its size given we don't expect to be doing lots of concurrent I/O from rust.
In the meeting today @kaylagalway mentioned that when compiling the lockwise-ios app, it emits a swift warning about nullability on some code in the app-services framework. If it's easy to catch, could you please link it in this bug for reference?
One odd thing I noticed here is that Thread1 is blocked calling beginOAuthFlow
:
9 Firefox Lockbox 0x0000000102aef020 closure #1 in FirefoxAccount.beginOAuthFlow(scopes:completionHandler:) + 604 (FirefoxAccount.swift:208)
While Thread3 is initializing the FxA rust component as a result of clearing the store:
10 Firefox Lockbox 0x0000000102ab0944 AccountStore.initFxa() + 432 (AccountStore.swift:123) 11 Firefox Lockbox 0x0000000102ab01cc AccountStore.clear() + 1180 (AccountStore.swift:157)
I'm surprised to see both of these in flight at the same time; does it mean the app is trying to create a new instance of FirefoxAccount
at the same time as it's operating on an existing one? (Such an attempt shouldn't crash, it just seems unusual and might be a clue here).
Also Thread3 seems to be blocked waiting for the FirefoxAccount
queue to yield so it can initialize:
1-7 appear to be waiting for the queue to become available 8 libswiftDispatch.dylib 0x00000001bf6a58f4 $sSo17OS_dispatch_queueC8DispatchE4sync7executexxyKXE_tKlF + 136 (Queue.swift:355) 9 Firefox Lockbox 0x0000000102aec030 FirefoxAccount.__allocating_init(config:) + 156 (FirefoxAccount.swift:68)
whereas the thread 1 stack frame you referenced is an observable for the login URL changing - and that thread is already executing on the queue. Best I can tell, there's one shared queue regardless of how many FirefoxAccount
objects exist - so it may be that an "old" FirefoxAccount
object is in the process of dieing? If we look further down Thread3 we see:
29 Firefox Lockbox 0x0000000102aa68a8 BaseDataStore.pushError(_:) + 880 (Dispatcher.swift:24) 30 Firefox Lockbox 0x0000000102aa865c closure #1 in BaseDataStore.reset() + 468 (BaseDataStore.swift:408)
So an exception has been caught resetting the store - and pushError
in BaseDataStore.kt will dispatch an AccountAction.clear
action - so the above sounds plausible.
That itself might be a clue to reproducing though - is an auth error or similar needed?
That said though, and contrary to what I said in the meeting, none of those rust threads look like they are crashing - the might be racing, but there's no evidence that in itself is causing the crash.
That itself might be a clue to reproducing though - is an auth error or similar needed?
Another possibility is that rather than an auth error, it's just a normal oauth-token expiry? There's an issue in the android version when they expire, and that's difficult to reproduce as a dev because they live a (relatively) long time.
I'm also copying garvan's comment from slack, to put all the context in one place:
The odd thing to me in that trace is the main-thread is running WebKit code! I was sure that only happened in worker threads.
It is not quite clear what's happening here, but I wonder if https://github.com/mozilla/application-services/issues/1326 might help?
KERN_INVALID_ADDRESS at 0x0000000000000000
means that this is a null pointer dereference. The rust code is parked / waiting on a condition variable at this time, and webkit is calling out to native code as shown by __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__
, Or I assume that's what that means.
The odd thing to me in that trace is the main-thread is running WebKit code! I was sure that only happened in worker threads.
I guess that this line:
0 WebKit 0x0000000191fe2a58 WTF::Detail::CallableWrapper<WebKit::WebsiteDataStore::fetchDataAndApply(WTF::OptionSet<WebKit::W... + 48 (Function.h:79)
Is the code in AccountStore.clear
where it tries to wipe all the localStorage and cookie data from the webview used in the login flow.
This makes me wonder if you've got multiple concurrent attempts at tearing down this state, which are stepping on each other's toes.
Another possibility is that rather than an auth error, it's just a normal oauth-token expiry? There's an issue in the android version when they expire, and that's difficult to reproduce as a dev because they live a (relatively) long time.
This is a good option to try to reproduce. You may be able to simulate a token expiry by going into https://accounts.firefox.com/settings and revoking the Lockwise app from there, which destroys its active OAuth tokens.
@kaylagalway as discussed in slack, I made a branch of application-services that sets an artificially low oauth token duration, which can hopefully help ease testing of how to app responds to expired oauth tokens:
https://github.com/mozilla/application-services/compare/short-duration-tokens?expand=1
If we're able to incorporate this into a build of Lockwise for iOS, then it will need to fetch a new FxA OAuth token every 30 seconds. Combined with e.g. backgrounding or restarting the app, I'm hoping this may help us reproduce the crashy behaviour here.
I used a build created from the branch @rfk posted above and was able to use the faster token expiration to reproduce the issue. A threading issue resulting in Webkit code being called off the main thread was causing our crash. It was on our end, not yours, but it was triggered by a chain of events after token expiration. This ticket can be closed and I will make the necessary changes on my end to resolve the issue.
Thank you all for your help!
I used a build created from the branch @rfk posted above and was able to use the faster token expiration to reproduce the issue.
@kaylagalway would it be worth us exposing some sort of API to let the app set the token expiration time, to make it easier to test/debug things like this in future?
@rfk Yeah, that would actually be great!
There has been a large uptick in crash reports in the new version of Lockwise iOS (1.7.0) that seem to lead back to the beginOathFlow function. Within this update, the app was updated from Xcode 10 to Xcode 11.2.1. The application services framework was updated from v0.32.1 to v0.42.3.
The relevant portion of the crash logs are here: