launchdarkly / android-client-sdk

LaunchDarkly Client-side SDK for Android
Other
46 stars 23 forks source link

NullPointerExeption thrown whenever we archive a flag #278

Open pal-tgiannakopoulos opened 2 months ago

pal-tgiannakopoulos commented 2 months ago

Is this a support request? No

Describe the bug

      Fatal Exception: java.lang.NullPointerException: Attempt to invoke virtual method 'int java.lang.Object.hashCode()' on a null object reference
   at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
   at com.launchdarkly.sdk.android.ContextDataManager.notifyFlagListeners(ContextDataManager.java:317)
   at com.launchdarkly.sdk.android.ContextDataManager.initDataInternal(ContextDataManager.java:179)
   at com.launchdarkly.sdk.android.ContextDataManager.switchToContext(ContextDataManager.java:100)
   at com.launchdarkly.sdk.android.LDClient.identifyInternal(LDClient.java:427)
   at com.launchdarkly.sdk.android.LDClient.identifyInstances(LDClient.java:451)
   at com.launchdarkly.sdk.android.LDClient.identify(LDClient.java:399)

We are assuming that the following code inside ContextDataManager.java is causing a null element to end up in the updatedFlagKeys collections which is used as a parameter to notifyFlagListeners method.

        for (Flag newFlag: newData.values()) {
            Flag oldFlag = oldData.getFlag(newFlag.getKey());
            if (oldFlag == null || !oldFlag.getValue().equals(newFlag.getValue())) {
                // if the flag is new or the value has changed, notify.  This logic can be run if
                // the context changes, which can result in an evaluation change even if the version
                // of the flag stays the same.  You will notice this logic slightly differs from
                // upsert.  Upsert should only be calling to listeners if the value has changed,
                // but we left upsert alone out of fear of that we'd uncover bugs in customer code
                // if we added conditionals in upsert
                updatedFlagKeys.add(newFlag.getKey());
            }
        }

To reproduce We haven't managed to reproduce this on demand but we are seeing this crash whenever we are archiving a flag. In our app we allow logging out, which causes a call to identify with an anonymous context. Our assumption based on user flow logs is that this action is causing the crash to happen probably after a flag is archived.

Expected behavior To not crash

Logs No other logs related other than the crashlog in the description

SDK version Android SDK 5.3.0

Language version, developer tools Kotlin 1.9.24 Android compile SDK version 34 Android target SDK version 34 Gradle plugin version 8.5.0 Gradle version 8.8

OS/platform Android regardless of version Our app runs from Android 8 to Android 14 All of them seem to have the same problem

Additional context Add any other context about the problem here.

tanderson-ld commented 2 months ago

Hi @pal-tgiannakopoulos, thank you for reporting this issue. Logging internally as SDK-573. We will investigate.

tanderson-ld commented 2 months ago

@pal-tgiannakopoulos, do you have a count of occurrences from crash reporting tooling? What is that as an absolute number and as a percentage of prod customers?

tanderson-ld commented 2 months ago

I did a preliminary investigation. I agree with your conclusion that a null flag key is ending up in that updatedFlagKeys collection, but after having traced the code, I don't see how that is happening.

Are you using polling or streaming? If you haven't specified in the SDK config, the default is streaming.

The SDK uses the flag key that is received from the server and we replace the local representation with a deletedItemPlaceholder, but the key is never nulled intentionally. Here is what is received by the SDK when archiving a flag.

event:delete
data:{"key":"test-archive","version":101}
pal-tgiannakopoulos commented 2 months ago

hi @tanderson-ld Thanks for taking a look at this

The crashes we are getting are not a lot. Each time we have archived a flag we got around 30-40 crashes. We do have around 25K users daily, but the app does not promote logging out, so the combination of having a user that has the app loaded when the archiving happens and that user then tries to logout is very minimal, so we believe it explains the low crash numbers but also the sudden spike that happens.

We don't explicitly set polling or streaming, so I guess we use streaming

As for the actual problem, I don't think your problem is with what the server is sending, but with what is stored when that happens. I see the line EnvironmentData storedData = getStoredData(currentContext); in ContextDataManager.switchToContext and its result is sent as the newData in the initDataInternal. What is "newData" then? It comes from the DataStore. When the update is received what is stored in the datastore? Is there a chance you have "deleted" the payload from the datastore and leaving nothing in, but the entry still exists? That could end up in a null key

tanderson-ld commented 1 month ago

HI again @pal-tgiannakopoulos , I did some more testing with various combinations of init, identify, and archive at various stages and was unable to reproduce the issue. I also double checked the storage layer and it looks to be operating as expected.

Below is serialized flag data for the flag I was testing with.

Before archive:

"StringVarFlag":{"flagVersion":21,"key":"StringVarFlag","trackEvents":false,"value":"theRuleMatchedLoggedIn","variation":1,"version":116}

After archive but before restart:

"StringVarFlag":{"deleted":true,"key":"StringVarFlag","version":117}

The next app launch, the deleted flag gets wiped out by the incoming payload which does not contain the flag at all.

tanderson-ld commented 1 month ago

Do you by chance trigger any routines on flag updates being received?

Another thought: I need to verify, but the flag is only marked as deleted client side in the SDK if the phone was connected and streamed the DELETE update that happens when the flag is archived. What do you think are the odds that ~30-40 customers are connected when the flag is archived and then choose to logout? 30/25k is 0.12% which would be the product of the probability(being connected) X probability (logging out).

pal-tgiannakopoulos commented 1 month ago

hi @tanderson-ld,

yes you are correct about the probabilities. This is my assumption as well, that's probably why we don't see hundreds or thousants of crashes.

And yes I trigger updates on flag updates being received.

I am exposing flags as Flows for my app to know about. These are typically collected from our viewModels. The Flow is per feature flag. I use a callbackFlow to wrap the logic of getting updates from LaunchDarkly where I am calling the registerFeatureFlagListener and keeping the coroutine alive with await for as long as the Flow is being collected.

simplified relevant code:

fun getFlagValue(flag: String, defaultValue: Boolean): Flow<Boolean> = callbackFlow {

        val listener = FeatureFlagChangeListener {
            if (isActive) {
                require(it == flag) { "Expected flag key(${flag}), but received update for flag key($it)" }
                trySend(client.boolVariation(flag, defaultValue))
            }
        }
        client.registerFeatureFlagListener(flag, listener)

        send(client.boolVariation(flag, defaultValue))

        awaitClose {
            client.unregisterFeatureFlagListener(flag, listener)
        }
    }

Please note, while I am observing only flags my app knows and cares about, I have seen the crashes happening when archiving flags that are not relevant to the app, so I don't know how much the above logic would be responsible for triggering the crash. This method also does not appear in the crashlog. The crash on my code derives from another Flow collection where the LoginState is observed and the client.identify(context) is being called as a result.

pal-tgiannakopoulos commented 1 month ago

@tanderson-ld Can you please try adding .evaluationReasons(true) in your config? I am using it on mine. Is there a chance this is what is causing the issue for me and you cannot replicate without it?

tanderson-ld commented 1 month ago

Just tried that too and still not able to reproduce the issue. Have you been able to reproduce it locally?

pal-tgiannakopoulos commented 1 month ago

No I haven't tried. I cannot pollute our project with test flags, I will need to create a new Launch Darkly app and re-implement. I will update you if I find time to do this, currently there's limited resource I can spend on this.

pal-tgiannakopoulos commented 2 weeks ago

I spent several hours yesterday to replicate this in a controlled environment and I couldn't, so sadly I cannot provide anything else other than the crashlogs in the issue description.