launchdarkly / android-client-sdk

LaunchDarkly Client-side SDK for Android
Other
45 stars 23 forks source link

OutOfMemoryException and hundreds of launch darkly threads #234

Closed hoda0013 closed 7 months ago

hoda0013 commented 7 months ago

Describe the bug We have been seeing hundreds of crashes over the past 3 months due to an OutOfMemoryException. The crashes are reported via Crashlytics which gives us a stacktrace with the crash and shows the other threads in the process and their stacktraces. What I'm seeing is that what this crash occurs there will be almost 300 other threads that are all created by the LaunchDarkly SDK. It seems like the number of threads created is probably what is causing the memory issue.

To reproduce There doesn't seem to be a way to reproduce this that I have found. What I do know is that this only seems to be happening on devices that are running on a forked version of Android (PayDroid) made by a company called Pax. These are "payment terminals" used to process credit card transactions. We haven't seen any crashes on any other types of Android device.

Expected behavior The LaunchDarkly SDK has some way of preventing so many threads from being created or can clean them up when they are finished running. If I view the crash logs for unrelated crashes, I'll see 2 threads with a the stacktrace I'm seeing not hundreds.

Logs Here's the stacktrace from the crash:

Fatal Exception: java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Try again at java.lang.Thread.nativeCreate(Thread.java) at java.lang.Thread.start(Thread.java:730) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:941) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1348) at com.launchdarkly.sdk.android.StreamUpdateProcessor.stop(StreamUpdateProcessor.java) at com.launchdarkly.sdk.android.ConnectivityManager.updateDataSource(ConnectivityManager.java:118) at com.launchdarkly.sdk.android.ConnectivityManager.lambda$new$0(ConnectivityManager.java:9) at com.launchdarkly.sdk.android.AndroidPlatformState$ConnectivityReceiver.onReceive(AndroidPlatformState.java:57) at android.app.LoadedApk$ReceiverDispatcher$Args.run(LoadedApk.java:1122) at android.os.Handler.handleCallback(Handler.java:751) at android.os.Handler.dispatchMessage(Handler.java:95) at android.os.Looper.loop(Looper.java:154) at android.app.ActivityThread.main(ActivityThread.java:6119) at java.lang.reflect.Method.invoke(Method.java) at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:969) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:859)

Here's a stacktrace from one of the other threads created by LaunchDarkly. There are 291 other threads that have the seemingly the exact same stacktrace:

Thread-146: at java.lang.Object.wait(Object.java) at java.lang.Thread.parkFor$(Thread.java:2127) at sun.misc.Unsafe.park(Unsafe.java:325) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:161) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2035) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:413) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1058) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1118) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:607) at com.launchdarkly.sdk.android.BackgroundThreadExecutor$PriorityThreadFactory.lambda$newThread$0(BackgroundThreadExecutor.java:23) at java.lang.Thread.run(Thread.java:761)

SDK version 4.2.3

Language version, developer tools Kotlin 1.8.10

OS/platform Android 7.1.1 (technical Pax PayDroid 7.1.1) and Android 5.1.1 (technically Pax PayDroid 5.1.1)

Additional context Add any other context about the problem here.

tanderson-ld commented 7 months ago

Hi @hoda0013, thank you for reporting this. We released 5.0.3 to fix a threading issue we were seeing only on Pax devices. The issue was ultimately a race condition in threading, but it didn't lead to leaking threads. Perhaps this issue is only exposed by the scheduler / executor / worker implementations on Pax. I will investigate and get back to you.

Could you provide a rough estimate of how many sessions of your app are affected as a percentage? That will help me determine how thoroughly I need to dig when trying to reproduce it.

Do you mind providing more details about the type of application this is happening in? This is helpful background info for the team so we can keep track of different use cases.

tanderson-ld commented 7 months ago

Could you include a snippet of your config and init code? Those would use LDConfig, LDContext, and LDClient. Based on a quick look through of the code, the current theory is that there are many StreamingDataSources being constructed. It creates one and only one when the LDClient init(...) is called and streaming is enabled. Is it possible that the LDClient init(...) is being called many times on these platforms? The last customer that used a Pax platform worked on a launcher app and saw some interesting update / reboot behavior. Perhaps on this Android platform, the app is running its restart routine multiple times without a process kill resulting in left over objects?

hoda0013 commented 7 months ago

@tanderson-ld

Thanks for the quick response!

Unfortunately I don't have session data or very fine grained user data. The best I can tell you right now is that we've seen this crash 453 times in the past 90 days over 165 unique users. But I don't know how many installs we have on Pax devices yet.

The application this is being used on is a financial app used for taking card payments.

To answer your question about how LD is being configured and initialized:

We have an Activity called LoginActivity which is configured as our Launcher activity. When LoginActivity is launched it creates an instance of LoginViewModel. The first thing that LoginViewModel does is try to see if the user is already logged in, but as part of that check we initialize an instance of LaunchDarkly like this:

private fun initializeLaunchDarklyBlocking(userConfig: LaunchDarklyUserConfig): LDClient? {

        val launchDarklyConfig = buildLDConfig()
        val launchDarklyUser = buildLDUser(userConfig)

        LDClient.init(application, launchDarklyConfig, launchDarklyUser).get()
        return LDClient.get()
    }

 private fun buildLDConfig(): LDConfig {

        return LDConfig.Builder()
            .mobileKey(PRODUCTION_MOBILE_KEY)
            .secondaryMobileKeys(buildSecondaryMobileKeyMap())
            .events(
                Components.sendEvents().allAttributesPrivate(true)
            )
            .disableBackgroundUpdating(true)
            .build()
    }

 private fun buildLDUser(userConfig: LaunchDarklyUserConfig): LDUser {

        if (userConfig.userKey.isNullOrEmpty()) {

            return LDUser
                .Builder("anonymous")
                .anonymous(true)
                .build()
        }

        return LDUser
            .Builder(userConfig.userKey)
            .email(userConfig.userEmail)
            .custom(
                "role",
                userConfig.userRole
            )
            .custom(
                "company_id",
                userConfig.companyId
            )
            .custom(
                "reseller_id",
                userConfig.resellerId
            )
            .custom(
                "version",
                userConfig.applicationVersionName
            )
            .build()
    }

Now, if the user is not logged in, then the user will enter their credentials and submit them, upon doing that, the Launch Darkly initialization code above is run again. So if a user has never used our app before, it looks like we definitely call our LD initialization code at least twice (once on startup and once per login attempt after that).

Once that user is logged in we access the LD instance by calling LDClient.get().

If the user logs out, then our LD init code will run again when LoginActivity loads and again if the user attempts another login.

Does the initialization sequence I described above sounds problematic or like it should work fine?

Thanks again!

tanderson-ld commented 7 months ago

You shouldn't need to call init(...) more than once per application process lifecycle. The SDK does however protect against that and log a warning if that happens: "LDClient.init() was called more than once! returning primary instance.". Do you ever see that log?

It is as if init is being called multiple times, but the internal flag that detects multiple calls is not persisting in memory, and so it then creates another streaming data source with threads allocated to it. Have you ever seen any other weird issues with other SDKs or your own state tracking variables? This seems like a process launch / restoration memory issue at play.

To handle the log-in case, you should use identify, but that is probably not the root cause of the issue you are seeing. https://docs.launchdarkly.com/sdk/features/identify#android

hoda0013 commented 7 months ago

@tanderson-ld

I apologize, we actually have 2 instances of OutOfMemoryExceptions that I'm seeing in Crashlytics. The stacktrace that I posted originally that included the line StreamUpdateProcessor.stop was for an OOM event with far fewer events (34 events in the past 90 days). The stacktrace for the crash I meant to post about (with 453 crash events) is more generic:

Fatal Exception: java.lang.OutOfMemoryError: pthread_create (1040KB stack) failed: Try again
       at java.lang.Thread.nativeCreate(Thread.java)
       at java.lang.Thread.start(Thread.java:733)
       at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:970)
       at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1038)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1180)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:636)
       at java.lang.Thread.run(Thread.java:764)

However, in both OOM cases, if I look at the other threads I see the same thing where there are hundreds of threads with the stacktrace:

Thread-460: at java.lang.Object.wait(Object.java) at java.lang.Thread.parkFor$(Thread.java:2127) at sun.misc.Unsafe.park(Unsafe.java:325) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:161) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2035) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:413) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1058) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1118) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:607) at com.launchdarkly.sdk.android.BackgroundThreadExecutor$PriorityThreadFactory.lambda$newThread$0(BackgroundThreadExecutor.java:23) at java.lang.Thread.run(Thread.java:761)

Sorry for the confusion but I was re-reading your reply about StreamingDataSources and recognized that I'd made an error.

hoda0013 commented 7 months ago

@tanderson-ld

I checked our logs for the "LDClient.init() was called more than once! returning primary instance." but did not find any instances. I also think that I was wrong about calling the init code multiple times. It looks like we do have logic that avoids initializing LD more than once while in our LoginActivity.

There's one thing I did notice yesterday while debugging this. This was on a non-pax device, but I attached Android Studio's profiler tool and looked at the CPU / threads. I noticed that if I toggled airplane mode on/off for the device I'd see threads created "e.g. Thread-32, Thread-33" and some okhttp-eventsource threads as well.

Since I was seeing BackgroundThreadExecutor in all of the stacktraces I looked for that class and noticed that it is used to create a fixed thread pool of 2 threads and is (I think) only being created / referenced in StreamingDataSource. Then I looked for places where StreamingDataSource is created. I found that ConnectivityManager has a function called updateDataSource(...) that looks like it builds a new instance of a DataSource when it's called and there are some conditionals that might try to stop an old data source if it exists prior to the new one being created.

Is ConnectivityManager potentially a place where many StreamingDataSource instances could be created (in turn creating instances of BackgroundThreadExecutor) and somehow not clean them up?

tanderson-ld commented 7 months ago

I think you're right. Earlier I looked for the constructor usage, but missed that data source factory build call in updateDataSource. I'll try those steps to reproduce it and if I can, work on a fix. This all fits with why this bug would be rarely seen, it needs a pretty long lived app and cellular connectivity coming and going.

hoda0013 commented 7 months ago

@tanderson-ld You had asked if we are seeing weird issues with any other SDK or with our own state variables. Not that I know.

There is one really odd part of the codebase that does deal with the app's process though. Each of our Activities overrides the onTrimMemory callback and if it sees a level of ComponentCallbacks2.TRIM_MEMORY_COMPLETE or ComponentCallbacks2.TRIM_MEMORY_RUNNING_CRITICAL it runs the following code:

moveTaskToBack(true)
finishAndRemoveTask()
android.os.Process.killProcess(android.os.Process.myPid())
exitProcess(1)

I believe the reason this was added was to try to kill the app if a low memory condition was detected b/c a different weird error would occur if the system killed the process while the app was in the background and the user tried to resume the app. I wonder if this could somehow contribute to this issue. I definitely see this event occurring often based on our logs and do see it in some of the Crashlytics logs with the OOM exception.

I also dug up some other logs and noticed that in one of them where we saw the OOM occur, I'm seeing that our network connectivity listener (A broadcast receiver) is seeing the network change like every 2 seconds for a half hour and then this crash occurs. I wonder if that kind of unexpected thrashing might cause the LD ConnectivityManager some issues.

tanderson-ld commented 7 months ago

I've been able to reproduce it and I am working on a fix. We plan to back port it into 4.X.

tanderson-ld commented 7 months ago

PR is under review internally and should have a release out by EOD Monday.

hoda0013 commented 7 months ago

@tanderson-ld Wow! When I filed this last night I was not expecting it to move this fast. Thanks so much!

Just out of curiosity, what ended up being the root cause and fix?

tanderson-ld commented 7 months ago

It is live in version 5.0.4, I am working on the release for 4.x and will respond when that is out.

The root cause was an executor that wasn't getting cleaned up correctly, you were spot on and thank you for your help.

tanderson-ld commented 7 months ago

Hi again @hoda0013, 4.3.1 was just released and should be available on repos shortly. Thank you for your patience.