mixpanel / mixpanel-swift

Official iOS (Swift) Tracking Library for Mixpanel Analytics
https://mixpanel.com
Apache License 2.0
434 stars 234 forks source link

Async dispatch inside of certain lifecycle delegate callbacks will result in unexpected behavior/crashes #481

Closed RamblinWreck77 closed 3 years ago

RamblinWreck77 commented 3 years ago

Hey team, I've been chasing a "Task had still not ended" crash on-background for a while I believe I have finally tracked down the cause.

In MixpanelInstance.swift lines 428 -> 490 we have 3 app lifecycle-related delegate callbacks that either crash or cause other issues:

1) applicationDidEnterBackground

beginBackgroundTask needs to be called as soon as we detect a didEnterBackground. Per it's documentation:

Call this method as early as possible before starting your task, preferably before your app actually enters the background. The method requests the task assertion for your app asynchronously. If you call this method shortly before your app is due to be suspended, there’s a chance that the system might suspend your app before that task assertion is granted. For example, **don’t call this method at the very end of your applicationDidEnterBackground(_:) method and expect your app to continue running**. If the system is unable to grant the task assertion, it calls your expiration handler.

I'd recommend calling it first and moving everything else in this function below it

when applicationDidEnterBackground is called on main, and the following code paths are async:

trackingQueue is a utility QoS serial queue. The app is shutting down/going to background, and when main hits this function we're basically queuing up some work on a low-priority background thread that will (almost certainly) never execute before we're terminated. We're also starting a background task synchronously on main and killing it async on that same low-priority thread. I've addressed similar issues in my code base by adding an "immediate" flag on functions (defaulting to false) that allows sync-dispatch under the hood. It's a bit more work but it's nice to have the option.

Either way, I'd recommend

A) Remove line 451's async dispatch to end the task entirely, since the termination condition on another queue (while main is running flush/archive?) is non-deterministic

and

B) move the endBackgroundTask to inside beginBackgroundTask's expiration handler. iOS should always call this expiration handler only once. This way on every background we start a task, queue up some work, have anywhere from 20-120 seconds to let it run, and when iOS calls back to let us know it's time to die we synchronously (very important!) mark the task as ended within that block's context to prevent iOS from deciding our entire process must be killed. I've also read iOS does some shady behind-the-scenes logic on your background behavior, and getting killed for not ending tasks properly will penalize your app. iOS will then grant you less time (or none at all!) for future tasks.

2) applicationWillEnterForeground

here we have an async dispatch to trackingQueue to end the background task. A problem I see is taskId is an unprotected var being changed on main in some functions and utility queue (not-main) in others. This could introduce subtle race/corruption issues if main is changing taskId when/if trackingQueue also touches self.taskId to make changes. In my experience multi-thread r/w on unprotected vars in swift creates a "hot potato" that later crashes with a nonsense stack trace in unrelated code. Best to completely avoid!

My recommendation here is to treat taskId as a main-thread-only var, and only access it synchronously inside these delegate callbacks (which will always be on main afaik!). Once we can force Swift 5.5, the new @MainActor actor isolation feature could be useful here!

3) applicationWillTerminate

This is, IMO, the largest problem. When iOS calls this function you're about to get killed, and when the function's context exits that's it.

EVERYTHING in this function needs to be synchronously called, otherwise you can be sure it will not reliably execute. I'd be willing to be that async dispatch to trackingQueue for archive() has a near-100% failure rate. Could this cause data/event losses on terminate? While we're at it, we should also synchronously end any background tasks (if they exist) in here.

If you're interested, I could put together a PR against 3.0.0.beta with these changes to discuss further. I just wanted to raise these issues and get some feedback before starting to see if there's anything I've missed!

zihejia commented 3 years ago

hi, our one and only @RamblinWreck77 , great to see you again! As always, we've been super grateful for your contribution to this repo! Thank you so much for such a deep analysis.

RamblinWreck77 commented 3 years ago

@zihejia Sounds good! I'll close this then :) Have a good one!