mixpanel / mixpanel-swift

Official iOS (Swift) Tracking Library for Mixpanel Analytics
https://mixpanel.com
Apache License 2.0

Potential iOS 15 crash related to Mixpanel #502

Closed igled7 closed 2 years ago

igled7 commented 2 years ago

Hi,

We started to see a high number of crashes yesterday (13 Jan). They come from different app versions, and we never had that crash before.

This is a fragment of the stack trace of the thread where the app is crashing:

Crashed: com.apple.network.connections
0  libquic.dylib                  0x69938 qlog_abort_internal
1  libquic.dylib                  0x69924 qlog_abort_internal
2  libquic.dylib                  0x639e4 quic_frame_write_PADDING
3  libquic.dylib                  0x9c6dc _quic_packet_builder_assemble
4  libquic.dylib                  0x2490 quic_packet_builder_assemble
5  libquic.dylib                  0x366bc quic_assemble_and_encrypt

It seems that whenever this crash happens, the Mixpanel SDK (v2.8.3) is performing some work on its network thread. All the crash stack traces we have follow the same pattern:

com.mixpanel.74dc35f5daf1026de3ef7dff4d7ea18e.network)
0  libsystem_kernel.dylib         0x1540 semaphore_wait_trap + 8
1  libdispatch.dylib              0x4bf0 _dispatch_sema4_wait + 28
2  libdispatch.dylib              0x52a8 _dispatch_semaphore_wait_slow + 132
3  libswiftDispatch.dylib         0x1994 OS_dispatch_semaphore.wait(wallTimeout:) + 24
4  Mixpanel                       0x4d294 Flush.flushQueueInBatches(_:type:) + 156 (Flush.swift:156)
5  Mixpanel                       0x4c38c Flush.flushEventsQueue(_:automaticEventsEnabled:) + 89 (Flush.swift:89)
6  Mixpanel                       0x23038 closure #1 in closure #1 in MixpanelInstance.flush(completion:) + 1144 (MixpanelInstance.swift:1144)

Have you made any backend changes recently that could have caused this?
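
For readers unfamiliar with the second trace: frames 0 to 4 show the Mixpanel network queue parked in a semaphore wait, which is what a thread looks like while it blocks on an asynchronous request that the system networking stack is servicing elsewhere. The sketch below illustrates that general pattern only. It is not the SDK's actual `Flush` implementation, and `ResultBox`, `sendBatch`, and `flushInBatches` are hypothetical names invented for this illustration.

```swift
import Dispatch

// Hypothetical box for passing a result out of the completion handler.
// Access is ordered by the semaphore, hence @unchecked Sendable.
final class ResultBox: @unchecked Sendable {
    var success = false
}

// Hypothetical stand-in for an asynchronous network request.
func sendBatch(_ batch: [String], completion: @escaping @Sendable (Bool) -> Void) {
    DispatchQueue.global().async {
        completion(true) // pretend the server accepted the batch
    }
}

// Sends `events` in batches, blocking the calling thread until each
// batch's completion handler fires. Returns the number of events sent.
func flushInBatches(_ events: [String], batchSize: Int) -> Int {
    precondition(batchSize > 0, "batchSize must be positive")
    var sent = 0
    var start = 0
    while start < events.count {
        let end = min(start + batchSize, events.count)
        let batch = Array(events[start..<end])
        let semaphore = DispatchSemaphore(value: 0)
        let result = ResultBox()
        sendBatch(batch) { ok in
            result.success = ok
            semaphore.signal()
        }
        // The caller parks here (semaphore_wait_trap in a stack trace)
        // while another thread performs the actual I/O.
        _ = semaphore.wait(wallTimeout: .distantFuture)
        guard result.success else { break }
        sent += batch.count
        start = end
    }
    return sent
}
```

In this pattern the blocked thread is only waiting, so a crash on `com.apple.network.connections` during the wait points at the code servicing the request rather than at the caller.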

jaredmixpanel commented 2 years ago

@igled7 v3.1.0 includes a lot of improvements, including a complete overhaul of the flushing mechanisms. Can you try upgrading to the latest version to see if the issue persists? If it does, please add that stack trace and we'll investigate.

igled7 commented 2 years ago

@jaredmixpanel we are planning to upgrade to the latest version of Mixpanel, but I don't think it will help. The crashes seem to be related to a known bug in Apple's implementation of QUIC introduced in iOS 15. They said that it was fixed in iOS 15.2, and this seems to be the case as we don't see the crash happening in that version.
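
When triaging crash reports against this hypothesis, it can help to bucket them by OS version. The sketch below encodes the range described in this comment (introduced in iOS 15, fixed in iOS 15.2). That range is taken from this thread and is an assumption, not something confirmed here by Apple, and `isInAffectedRange` is a hypothetical helper name.

```swift
import Foundation

// Assumption taken from this thread: the libquic crash affects iOS 15.0
// and 15.1, and no longer occurs from iOS 15.2 onward.
func isInAffectedRange(_ version: OperatingSystemVersion) -> Bool {
    return version.majorVersion == 15 && version.minorVersion < 2
}
```

On a device, `ProcessInfo.processInfo.operatingSystemVersion` could be passed in; for crash reports, the version would come from the report metadata.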

We are trying to understand why this crash started to happen. Have you upgraded your servers to support HTTP/3 this week? I can see that Mixpanel supports HTTP/3.

bolshedvorsky commented 2 years ago

@jaredmixpanel We are seeing the same crashes in our app as well. The stack trace for the crash points to networking:

Crashed: com.apple.network.connections
0  libquic.dylib                  0x69938 qlog_abort_internal + 272
1  libquic.dylib                  0x69924 qlog_abort_internal + 252
2  libquic.dylib                  0x639e4 quic_frame_write_PADDING + 640
3  libquic.dylib                  0x9c6dc _quic_packet_builder_assemble + 2048
4  libquic.dylib                  0x2490 quic_packet_builder_assemble + 124
5  libquic.dylib                  0x366bc quic_assemble_and_encrypt + 260
6  libquic.dylib                  0x37a04 __quic_send_frames_for_key_state_block_invoke.106 + 1016
7  libnetwork.dylib               0x603310 nw_protocol_data_access_buffer + 1160
8  libquic.dylib                  0x1c8c8 __quic_send_frames_for_key_state_block_invoke + 200
9  libnetwork.dylib               0xb9d4 nw_protocol_service_requested_outbound_data + 360
10 libnetwork.dylib               0x5e99fc nw_protocol_request_outbound_data + 128
11 libquic.dylib                  0x22cb8 quic_send_frames_for_key_state + 1376
...

Interestingly, Mixpanel is the only one making a network call at the same time:

com.mixpanel.3bfa18ec20196c56b5726c1d0af33dfa.network)
0  libsystem_kernel.dylib         0x1540 semaphore_wait_trap + 8
1  libdispatch.dylib              0x4bf0 _dispatch_sema4_wait + 28
2  libdispatch.dylib              0x52a8 _dispatch_semaphore_wait_slow + 132
3  libswiftDispatch.dylib         0x1994 OS_dispatch_semaphore.wait(wallTimeout:) + 24
4  App                       0xaed790 Flush.flushQueueInBatches(_:type:) + 156 (Flush.swift:156)
5  App                       0xaec888 Flush.flushEventsQueue(_:automaticEventsEnabled:) + 89 (Flush.swift:89)
6  App                       0xb1d344 closure #1 in closure #1 in MixpanelInstance.flush(completion:) + 1188 (MixpanelInstance.swift:1188)
7  App                       0xb0e5f8 thunk for @escaping @callee_guaranteed () -> () + 3156540 (<compiler-generated>:3156540)
8  libdispatch.dylib              0x2914 _dispatch_call_block_and_release + 32
9  libdispatch.dylib              0x4660 _dispatch_client_callout + 20
10 libdispatch.dylib              0xbde4 _dispatch_lane_serial_drain + 672
11 libdispatch.dylib              0xc958 _dispatch_lane_invoke + 392
12 libdispatch.dylib              0x171a8 _dispatch_workloop_worker_thread + 656
13 libsystem_pthread.dylib        0x10f4 _pthread_wqthread + 288
14 libsystem_pthread.dylib        0xe94 start_wqthread + 8

jaredmixpanel commented 2 years ago

@bolshedvorsky what version of our SDK are you using?

igled7 commented 2 years ago

@jaredmixpanel you can find more info here. Apparently, some people started to see a similar pattern recently.

As I mentioned before, this seems to be related to Apple's buggy HTTP/3 implementation. I think you could stop this crash from happening if you disabled HTTP/3 on api-eu.mixpanel.com (at least for iOS clients).

Edit: I managed to contact one of the people that reported a lot of crashes in the official Apple forum and they are using Mixpanel as well...

bolshedvorsky commented 2 years ago

@jaredmixpanel Our production builds are using 2.x.x versions. We plan to move to 3.x.x, but that requires an app update and a rollout to our entire user base.

quintonpryce commented 2 years ago

I'm having this exact issue as well.

zihejia commented 2 years ago

Hi @igled7 @bolshedvorsky @quintonpryce, are you able to reproduce it locally? What is the crash rate? We are having a hard time reproducing it using 3.1.0. We would be more comfortable disabling QUIC if we had a deterministic way to reproduce this problem, so that we could compare the behavior before and after the change.
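
Even without a deterministic crash reproduction, one before/after signal is the protocol a test client actually negotiates with the endpoint. The sketch below uses `URLSessionTaskMetrics` to log it. `isHTTP3` and `ProtocolLogger` are hypothetical names, and the assumption that HTTP/3 is reported as `h3` (or `h3-NN` for draft versions) should be checked against real output.

```swift
import Foundation

// Returns true if an ALPN-style protocol name denotes HTTP/3.
// Assumption: HTTP/3 appears as "h3", draft versions as "h3-29" and similar.
func isHTTP3(_ protocolName: String?) -> Bool {
    guard let name = protocolName?.lowercased() else { return false }
    return name == "h3" || name.hasPrefix("h3-")
}

#if canImport(Darwin)
// Logs the negotiated protocol for every transaction of every task
// that runs on a URLSession using this object as its delegate.
final class ProtocolLogger: NSObject, URLSessionTaskDelegate {
    func urlSession(_ session: URLSession,
                    task: URLSessionTask,
                    didFinishCollecting metrics: URLSessionTaskMetrics) {
        for transaction in metrics.transactionMetrics {
            let name = transaction.networkProtocolName
            print(transaction.request.url?.host ?? "?",
                  name ?? "unknown",
                  isHTTP3(name) ? "(HTTP/3)" : "")
        }
    }
}
#endif
```

A session created with `URLSession(configuration: .default, delegate: ProtocolLogger(), delegateQueue: nil)` would log the protocol for requests made through that session only. It does not observe requests issued by the SDK's own session.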

igled7 commented 2 years ago

Hi @zihejia,

This crash started to happen on the 12th of Jan at 8 PM GMT. Did your infrastructure team make any changes around that period?

If you disable HTTP/3 today, I will be able to report tomorrow whether it fixes the current situation. Given the high number of crashes that many people are seeing (for @quintonpryce it's the number 1 crash as well), I think it is worth a try.

zihejia commented 2 years ago

Hi @igled7, we are using GCP, and it does seem that Google silently made QUIC traffic changes for our GLB around the time you mentioned. We will disable it and let you know.

zihejia commented 2 years ago

Hi @igled7, we have disabled it, but the change usually takes a little while to take full effect. You can keep an eye on the crash reports from now on.

igled7 commented 2 years ago

Thanks @zihejia. I'll report back tomorrow.

bolshedvorsky commented 2 years ago

Thanks @zihejia. I checked our logs, and it looks like your recent change made this crash go away. We had the same issue: the app suddenly started crashing on Jan 12th. We had a few crashes on the 19th and no crashes on the 20th.

igled7 commented 2 years ago

Hi @zihejia, it seems that the crashes have stopped for us as well.

zihejia commented 2 years ago

I'm closing this issue for now. Sorry for the inconvenience.