mixpanel / mixpanel-swift

Official iOS (Swift) Tracking Library for Mixpanel Analytics
https://mixpanel.com
Apache License 2.0
434 stars 234 forks source link

[3.0.0] Rare crash inside Track.swift #495

Closed RamblinWreck77 closed 2 years ago

RamblinWreck77 commented 2 years ago

Hey team, congrats on 3.0.0! I am however still seeing an (admittedly rare) crash in prod. My automated crash reporting seems to think it's in line 34 of Track.swift. It also seems to happen extremely early in the app boot cycle, always within the first 500ms or so.

Looking at the code, I was surprised to see that ev is still force-unwraped. IIRC I had a PR where that was changed to a crash-safe if-let implementation. @zihejia Would you mind explaining the rationale behind keeping var ev force unwrapped?

Current Code:

        var ev = event
        if ev == nil || ev!.isEmpty {
            Logger.info(message: "mixpanel track called with empty event parameter. using 'mp_event'")
            ev = "mp_event"
        }

My Suggestion:

        var ev = "mp_event"

        if let hasEv = event {
            ev = hasEv
        } else {
           Logger.info(message: "mixpanel track called with empty event parameter. using 'mp_event'")
        }

This removes any need for force unwrapping later. Also, while I can't prove this, I don't trust the force unwrap in the current code's if statement to be safe. I know that for if (A || B), A should always be checked first, but I wouldn't put it past Apple/the swift compiler optimizations in release mode to occasionally check B before A. When B is a run-time crash if A wasn't checked first I feel that's just asking for trouble.

This may or may not be the cause of the rare crash, but IMO it's worth changing either way. Thanks and have a good one!

zihejia commented 2 years ago

hi @RamblinWreck77 , my bad for missing your changes and I will include it in my next PR. But I suspect the crash could be caused by something else. Do you have a full crashlog?

RamblinWreck77 commented 2 years ago

Here's a few of the the stack traces we have:

Screen Shot 2022-01-11 at 10 18 04 AM Screen Shot 2022-01-11 at 10 16 34 AM Screen Shot 2022-01-11 at 10 19 05 AM
RamblinWreck77 commented 2 years ago

Seems like they're all happening when the next task in an async serial queue is being processed, maybe an incomplete capture/force unwrap on a weak object?

Might also be lock related, I know you guys use locks inside async queues and unfortunately I don't have a ton of experience with those in iOS (I try to avoid them entirely).

zihejia commented 2 years ago

hi @RamblinWreck77 , looks like when the Track is dequeued from the async queue for execution, some data(i.e. looks like an array in the last example above) inside the properties dictionary is no longer accessible due to it possibly being released before that happens.

How frequent is the crash? When you track event with properties, for the properties dictionary, does it also contain references to some other mutable collections?

RamblinWreck77 commented 2 years ago

It's relatively rare, but it is also currently our # 1 and 2 most frequent crash so if we could knock it out it would help metrics. Currently impacts 0.16% of our DAU.

As for properties, no nothing is mutable. We have an event manager with a static dispatch function, so that a view controller can say something like "viewed page" with "page name" = "some page". That static func calls an async dispatch to a utility queue, with an immutable/captured copy of the event name/props. When that async queue is processed Mixpanel.mainInstance().track(event: ..., properties: ...) is called.

I wrapped our calls to mixpanel inside an async utility QoS queue so that there was no way main could be blocked by our event code.

That said, I feel like once the event name/props are "handed off" to mixpanel code everything should be "safe"... so that's weird. trackingQueue.async() inside track() captures the event and properties inputs so again we should be safe here... Very odd.

RamblinWreck77 commented 2 years ago

Also noticed this inside MixpanelInstance.swift / flush()

trackingQueue.async { [weak self, completion] in
            guard let self = self else {
                return
            }

            if let shouldFlush = self.delegate?.mixpanelWillFlush(self), !shouldFlush {
                return
            }

            let eventQueue = self.mixpanelPersistence.loadEntitiesInBatch(type: self.persistenceTypeFromFlushType(.events))
            let peopleQueue = self.mixpanelPersistence.loadEntitiesInBatch(type: self.persistenceTypeFromFlushType(.people))
            let groupsQueue = self.mixpanelPersistence.loadEntitiesInBatch(type: self.persistenceTypeFromFlushType(.groups))

            self.networkQueue.async { [weak self, completion] in
                guard let self = self else {
                    return
                }
                self.flushQueue(eventQueue, type: .events)
                self.flushQueue(peopleQueue, type: .people)
                self.flushQueue(groupsQueue, type: .groups)

                if let completion = completion {
                    DispatchQueue.main.async(execute: completion)
                }
            }
        }

Shouldn't networkQueue capture eventQueue/peopleQueue/groupsQueue explicitly? Those appear to be intended as snapshots of the pending data to flush so I think capturing them explicitly makes sense. Probably not a huge deal, just a best practice.

RamblinWreck77 commented 2 years ago

I saw the force unwrap changes went in with 3.1.0. I'll try and get a prod release out soon to see if that resolves the issue!

zihejia commented 2 years ago

thanks @RamblinWreck77 !

Aliaksandr-A commented 2 years ago

Same issue after upgrade from 2.x to 3.1.0:

Crashed: com.mixpanel.f15cc2d359522a6b1415fe97b156443c.tracking)
0  libswiftCore.dylib             0x383c88 _swift_release_dealloc + 24
1  MyApp                          0x5aef40 closure #1 in MixpanelInstance.track(event:properties:) + 948 (MixpanelInstance.swift:948)
2  MyAppp                          0x5a7b50 thunk for @escaping @callee_guaranteed () -> () + 4369431376 (<compiler-generated>:4369431376)
3  libdispatch.dylib              0x2a84 _dispatch_call_block_and_release + 32
4  libdispatch.dylib              0x481c _dispatch_client_callout + 20
5  libdispatch.dylib              0xc004 _dispatch_lane_serial_drain + 620
6  libdispatch.dylib              0xcc00 _dispatch_lane_invoke + 404
7  libdispatch.dylib              0x174bc _dispatch_workloop_worker_thread + 764
8  libsystem_pthread.dylib        0x37a4 _pthread_wqthread + 276
9  libsystem_pthread.dylib        0xa74c start_wqthread + 8
RamblinWreck77 commented 2 years ago

@zihejia No joy, I can confirm this issue remains in 3.1.2

RamblinWreck77 commented 2 years ago

@zihejia Sorry for the double-ping, but I have a few questions about this function starting at line 934 in MixpanelInstance.swift:

    open func track(event: String?, properties: Properties? = nil) {
        if hasOptedOutTracking() {
            return
        }

        let epochInterval = Date().timeIntervalSince1970

        trackingQueue.async { [weak self, event, properties, epochInterval] in
            guard let self = self else {
                return
            }
            let mixpanelIdentity = MixpanelIdentity.init(distinctID: self.distinctId,
                                  peopleDistinctID: nil,
                                  anonymousId: self.anonymousId,
                                  userId: self.userId,
                                  alias: nil,
                                  hadPersistedDistinctId: self.hadPersistedDistinctId)
            self.timedEvents = self.trackInstance.track(event: event,
                                                        properties: properties,
                                                        timedEvents: self.timedEvents,
                                                        superProperties: self.superProperties,
                                                        mixpanelIdentity: mixpanelIdentity,
                                                        epochInterval: epochInterval)
        }

        if MixpanelInstance.isiOSAppExtension() {
            flush()
        }
    }

So when the customer's app calls track(), it captures the time (epochInterval) at track time, and it goes into an async queue. Identity (mixpanelIdentity) is then resolved at dispatch time, which of course is whenever trackingQueue.async() gets around to executing.

My questions are:

1) What happens if the users identity, such as userId, is null at track time but defined at dispatch time? Wouldn't that cause the customers data to incorrectly have a userId when it was really null at track time?

2) Is the behavior in 1) intentional? It would seem to me that the user's identity at track time is the customer's intent when dispatching an event to mixpanel, not whatever their identity is when mixpanel's .utility QoS trackingQueue gets around to broadcasting the data to mixpanel servers

3) If the above behavior is unintentional, what do you think about the following change:

   open func track(event: String?, properties: Properties? = nil) {
        if hasOptedOutTracking() {
            return
        }

        // Capture these values at track-time, async dispatch
        let epochInterval = Date().timeIntervalSince1970
        let mixpanelIdentity = MixpanelIdentity.init(distinctID: distinctId,
                                                     peopleDistinctID: nil,
                                                     anonymousId: anonymousId,
                                                     userId: userId,
                                                     alias: nil,
                                                     hadPersistedDistinctId: hadPersistedDistinctId)
        let timedEventsSnapshot = timedEvents
        let superPropertiesSnapshot = superProperties

        trackingQueue.async { [weak self, event, properties, epochInterval, mixpanelIdentity, timedEventsSnapshot, superPropertiesSnapshot] in
            guard let self = self else {
                return
            }

            self.timedEvents = self.trackInstance.track(event: event,
                                                        properties: properties,
                                                        timedEvents: timedEventsSnapshot,
                                                        superProperties: superPropertiesSnapshot,
                                                        mixpanelIdentity: mixpanelIdentity,
                                                        epochInterval: epochInterval)
        }

        if MixpanelInstance.isiOSAppExtension() {
            flush()
        }
    }

I think this would guarantee that things like identity/super props at track time are correctly sent to mixpanel at dispatch time

zihejia commented 2 years ago

hi @RamblinWreck77 , 1) is intentional and you are right, the goal is to make things like identity/super props at track time are correctly sent to Mixpanel at dispatch time. One way to guarantee that is to make sure all changes related to mixpanelIdentity, timedEvents, and super properties are protected by the same trackingQueue. If you move them out of the trackingQueue, it will not be the case anymore.

rickvdl commented 2 years ago

Hi, we're also experiencing crashes in production with a similar stack trace, see screenshots for reference. It's dropping our crash free rate ~2% per day, so it's happening quite often for us.

Is there a workaround available / in the works? Can we downgrade to a stable version that doesn't have this issue? We've removed Mixpanel from our app for now.

Thanks in advance

Screenshot 2022-02-05 at 11 36 15 Screenshot 2022-02-05 at 11 36 03 Screenshot 2022-02-05 at 11 35 53
zihejia commented 2 years ago

hi @rickvdl , thanks for bringing it up. Which SDK version are you using? and have you started to experience this crash from a particular version?

rickvdl commented 2 years ago

Hi @zihejia , we were using 3.1.0 I just noticed the 3.1.3 release, but couldn't find anything in the change log referring to a possible fix. Is it possible that this issue was resolved?

zihejia commented 2 years ago

hi @rickvdl , we are still looking into it and the issue has not been resolved yet. But we highly recommend you upgrade to 3.1.3 because 3.1.0 has a known issue for ingesting any event or property containing %, &.

rickvdl commented 2 years ago

Thanks @zihejia as far as I can see we're not sending these in any values nor sending user input that could contain this, so I'll hold off putting the SDK back in for now. Thanks for the update, hope it can be resolved soon

zihejia commented 2 years ago

hi @rickvdl , it seems the crash happens when trying to access the properties in Track, did you happen to have any mutable data inside the properties? Also, does this crash only occur on certain iOS versions?

rickvdl commented 2 years ago

@zihejia As far as I can see we're only sending value types. Does the logic support sending nil values? It looks like the crash is happening on both iOS 15 as well as 14.

zihejia commented 2 years ago

hi @RamblinWreck77 @rickvdl . We have released v3.1.4. It has several critical fixes to prevent race conditions. Would you like to try it and see how it goes?

rickvdl commented 2 years ago

Thanks for the quick release @zihejia ! We'll give it a try and let you know as soon as we have some data :)

ifabijanovic commented 2 years ago

Hello, reporting back from #515 - after reverting back to 2.8.2 we no longer see those crashes. How would we go about verifying that 3.1.4 resolves those issues, as we were unable to reproduce the original crashes? I cannot test this in production 😃

zihejia commented 2 years ago

hi @ifabijanovic , that's great news. 3.1.4 has a similar approach to fix race conditions as 2.8.2 does, so 3.1.4 should be good but no rush if you just reverted.

rickvdl commented 2 years ago

@zihejia 3.1.4 is in use by ~60% of our daily active users and so far we haven't seen any crashes, will report back after full release, but so far it appears to be fixed for us.

rickvdl commented 2 years ago

@zihejia After fully releasing an app version with SDK version 3.1.4 we no longer see the crash. From my side this issue can be closed. Thanks!

zihejia commented 2 years ago

hi @RamblinWreck77 , do you still have crashes? Can this issue be closed? Thanks!

S2Ler commented 2 years ago

Same issue after upgrade from 2.x to 3.1.0:


Crashed: com.mixpanel.f15cc2d359522a6b1415fe97b156443c.tracking)

0  libswiftCore.dylib             0x383c88 _swift_release_dealloc + 24

1  MyApp                          0x5aef40 closure #1 in MixpanelInstance.track(event:properties:) + 948 (MixpanelInstance.swift:948)

2  MyAppp                          0x5a7b50 thunk for @escaping @callee_guaranteed () -> () + 4369431376 (<compiler-generated>:4369431376)

3  libdispatch.dylib              0x2a84 _dispatch_call_block_and_release + 32

4  libdispatch.dylib              0x481c _dispatch_client_callout + 20

5  libdispatch.dylib              0xc004 _dispatch_lane_serial_drain + 620

6  libdispatch.dylib              0xcc00 _dispatch_lane_invoke + 404

7  libdispatch.dylib              0x174bc _dispatch_workloop_worker_thread + 764

8  libsystem_pthread.dylib        0x37a4 _pthread_wqthread + 276

9  libsystem_pthread.dylib        0xa74c start_wqthread + 8

Update from me: no longer crashes.

RamblinWreck77 commented 2 years ago

We haven't shipped to prod in a while/since this fix went live in the mixpanel lib. We should be shipping a release within a week and I can report back then!

zihejia commented 2 years ago

@RamblinWreck77, no worries, looking forward to it!

RamblinWreck77 commented 2 years ago

Looks fixed to me! Closing this :)