microsoft / cpp_client_telemetry

1DS C++ SDK
Apache License 2.0

ResumeTransmission stuck on lock #1077

Open thomasameisel opened 1 year ago

thomasameisel commented 1 year ago

Describe your environment. Describe any aspect of your environment relevant to the problem, including your SDK version, platform, OS version, etc. If you're reporting a problem with a specific version of a library in this repo, please check whether the problem has been fixed on the main branch.

iOS platform, SDK version 3.6.187

Steps to reproduce. Describe exactly how to reproduce the error. Include a code sample if applicable.

Call ODWLogManager.ResumeTransmission. The issue was reported as happening on boot, but it is unclear if that is necessary.

What is the expected behavior? What did you expect to see?

ResumeTransmission executes successfully.

What is the actual behavior? What did you see instead?

ResumeTransmission waits for a lock to be released until the app is killed as non-responsive.

Additional context. Add any other context about the problem here.

Stack trace:

[Attached image: AC24C470-6473-4D8C-8A96-18E7BC0D03C9]

lalitb commented 1 year ago

@thomasameisel Do you have the stack trace for all other threads at the time when Thread1 was waiting for the lock ?

thomasameisel commented 1 year ago

@lalitb we don't have the stack trace for the other threads unfortunately

lalitb commented 1 year ago

Thanks @thomasameisel. The stack traces of the other threads would have given more insight into whether there is a deadlock, or whether another thread has invoked a LogManager operation that is taking too long.

nishchith-cp commented 1 year ago

@lalitb We are facing a similar lock issue with pauseTransmission.

Thread 41 triggered +[ODWLogManager pauseTransmission] and is waiting on the 1DS lock, probably for a long time.

[Attached image: Screenshot 2023-02-15 at 11 41 39 AM]

Attaching the crash reports here.

report-2517258755120939999-2c2491df-77e7-4a01-9e0b-15b3ee6faef7.txt TeamSpaceApp 2-9-23, 1-24 PM.txt

When the PauseTransmission request is dispatched, it acquires the lock, waits for HTTP request cancellation, and never releases the lock.

Thread 38 name:   Dispatch queue: eventDispatchQueue
Thread 38:
0   libsystem_kernel.dylib                 0x1c8cccdfc swtch_pri + 8
1   libsystem_pthread.dylib                0x1d943673c cthread_yield + 32
2   TeamSpaceApp                           0x10838b5f8 Microsoft::Applications::Events::HttpClientManager::cancelAllRequests() + 44
3   TeamSpaceApp                           0x1083def24 std::__1::__function::__func<Microsoft::Applications::Events::TelemetrySystem::TelemetrySystem(Microsoft::Applications::Events::ILogManager&, Microsoft::Applications::Events::IRuntimeConfig&, Microsoft::Applications::Events::IOfflineStorage&, Microsoft::Applications::Events::IHttpClient&, Microsoft::Applications::Events::ITaskDispatcher&, Microsoft::Applications::Events::IBandwidthController*, Microsoft::Applications::Events::LogSessionDataProvider&)::$_2, std::__1::allocator<Microsoft::Applications::Events::TelemetrySystem::TelemetrySystem(Microsoft::Applications::Events::ILogManager&, Microsoft::Applications::Events::IRuntimeConfig&, Microsoft::Applications::Events::IOfflineStorage&, Microsoft::Applications::Events::IHttpClient&, Microsoft::Applications::Events::ITaskDispatcher&, Microsoft::Applications::Events::IBandwidthController*, Microsoft::Applications::Events::LogSessionDataProvider&)::$_2>, bool ()>::operator()() + 60
4   TeamSpaceApp                           0x1083a5afc Microsoft::Applications::Events::LogManagerImpl::PauseTransmission() + 128
5   TeamSpaceApp                           0x1083bb80c Microsoft::Applications::Events::LogManagerBase<Microsoft::Applications::Events::ModuleLogConfiguration>::PauseTransmission() + 84
6   TeamSpaceApp                           0x1083bb718 +[ODWLogManager pauseTransmission] + 20
7   TeamSpaceApp                           0x10a1d9294 TSOneDSTelemetryLogManager.pauseTransmission() + 256
8   TeamSpaceApp                           0x10a1d9348 @objc TSOneDSTelemetryLogManager.pauseTransmission() + 36
9   TeamSpaceApp                           0x1091e16a4 __46-[AXPInstrumentationManager pauseTransmission]_block_invoke + 136
10  TeamSpaceApp                           0x10c7bf56c 0x102b08000 + 164328812
11  libdispatch.dylib                      0x1927cf460 _dispatch_call_block_and_release + 32
12  libdispatch.dylib                      0x1927d0f88 _dispatch_client_callout + 20
13  libdispatch.dylib                      0x1927d8640 _dispatch_lane_serial_drain + 672
14  libdispatch.dylib                      0x1927d918c _dispatch_lane_invoke + 384
15  libdispatch.dylib                      0x1927e3e10 _dispatch_workloop_worker_thread + 652
16  libsystem_pthread.dylib                0x1d9430df8 _pthread_wqthread + 288
17  libsystem_pthread.dylib                0x1d9430b98 start_wqthread + 8
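
Frames 0 to 2 above (`swtch_pri`, `cthread_yield`, `HttpClientManager::cancelAllRequests()`) are consistent with a yield-based wait loop that only exits once every outstanding request has reported completion. The following is a minimal C++ sketch of that shape, not the SDK's actual code: `RequestTracker` and all of its members are hypothetical names introduced for illustration. It contrasts an unbounded wait, which never returns if a completion callback is lost, with a bounded wait that gives up after a deadline.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a component that tracks in-flight HTTP requests.
// This is NOT the SDK's HttpClientManager; it only illustrates the wait shape.
class RequestTracker {
public:
    void onRequestStarted() { ++m_inFlight; }

    void onRequestFinished() {
        assert(m_inFlight > 0);  // a completion must match an earlier start
        --m_inFlight;
    }

    // Unbounded variant: spins (yielding the CPU) until the count reaches zero.
    // If a completion is never reported, this never returns.
    void waitUntilDrained() const {
        while (m_inFlight.load() > 0) {
            std::this_thread::yield();
        }
    }

    // Bounded variant: same loop, but stops at the deadline.
    // Returns true if everything drained, false if the timeout expired first.
    bool waitUntilDrainedFor(std::chrono::milliseconds timeout) const {
        const auto deadline = std::chrono::steady_clock::now() + timeout;
        while (m_inFlight.load() > 0) {
            if (std::chrono::steady_clock::now() >= deadline) {
                return false;
            }
            std::this_thread::yield();
        }
        return true;
    }

private:
    std::atomic<int> m_inFlight{0};
};
```

With one request started and never finished, `waitUntilDrainedFor(std::chrono::milliseconds(50))` returns `false`, whereas `waitUntilDrained()` would spin until the process is killed. Whether a timeout is acceptable for the SDK depends on what its callers assume about cancelled requests, which this thread does not establish.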

nishchith-cp commented 1 year ago

@lalitb Any updates on this? Could you please prioritize this? Let me know if you need anything else. Here is another crash log: TeamSpaceApp 3-1-23, 1-44 PM.txt

nishchith-cp commented 1 year ago

@lalitb Any updates on this? We are hitting this quite often. Could you please look into it?

lalitb commented 1 year ago

@nishchith-cp - Is it possible to get the stack traces of all the other threads, not just the thread that crashed with the timeout? There is a deadlock scenario between threads, so that data would be helpful.

nishchith-cp commented 1 year ago

I already attached the crash log in the previous comment: TeamSpaceApp.2-9-23.1-24.PM (1).txt

nishchith-cp commented 1 year ago

@lalitb Could you share an update on the same?

thomasameisel commented 7 months ago

@lalitb here's a crash log with the PauseTransmission issue - report-2517068873866699999-59e56560-7cb3-4c83-9343-b9e8ff905328 (1).txt

From the call stack, I noticed that the PauseTransmission function synchronously waits for the HTTP requests to complete. Why does it need to wait for these network requests? While it waits for the requests to complete, PauseTransmission also keeps holding the m_lock mutex, which makes other functions (e.g. GetLogger) appear to hang because they are waiting on that same mutex.
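
The effect described above can be reproduced in isolation. The sketch below is a simplified model under stated assumptions, not the SDK's implementation: `MiniManager`, `pauseHoldingLock`, `pauseReleasingLock`, `isPaused`, and `timeConcurrentRead` are hypothetical names, and a `sleep_for` stands in for the wait on HTTP request cancellation. It compares a pause that performs the slow wait while holding the mutex with one that updates state under the mutex and waits only after releasing it.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical miniature of a manager guarded by a single mutex.
// It is NOT the SDK's LogManagerImpl; it only models the locking pattern.
class MiniManager {
public:
    // Pattern A: the slow wait runs while m_lock is still held.
    void pauseHoldingLock(std::chrono::milliseconds slowWait) {
        std::lock_guard<std::mutex> guard(m_lock);
        m_paused = true;
        m_waiting = true;                        // signal: slow wait has begun
        std::this_thread::sleep_for(slowWait);   // stands in for the HTTP wait
    }

    // Pattern B: update state under the lock, then wait after releasing it.
    void pauseReleasingLock(std::chrono::milliseconds slowWait) {
        {
            std::lock_guard<std::mutex> guard(m_lock);
            m_paused = true;
        }                                        // m_lock released here
        m_waiting = true;
        std::this_thread::sleep_for(slowWait);
    }

    // Stand-in for any other call (such as GetLogger) that needs the mutex.
    bool isPaused() {
        std::lock_guard<std::mutex> guard(m_lock);
        return m_paused;
    }

    bool isWaiting() const { return m_waiting.load(); }

private:
    std::mutex m_lock;
    bool m_paused = false;
    std::atomic<bool> m_waiting{false};
};

// Runs `pause` on a worker thread and measures how long a concurrent
// isPaused() call takes once the worker has entered its slow wait.
template <typename PauseFn>
std::chrono::milliseconds timeConcurrentRead(MiniManager& mgr, PauseFn pause) {
    std::thread worker([&mgr, &pause] { pause(mgr); });
    while (!mgr.isWaiting()) {
        std::this_thread::yield();
    }
    const auto t0 = std::chrono::steady_clock::now();
    const bool paused = mgr.isPaused();
    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - t0);
    worker.join();
    assert(paused);
    return elapsed;
}
```

Calling `timeConcurrentRead` with `pauseHoldingLock` and a 300 ms wait makes the reader block for roughly the whole wait, while the same call with `pauseReleasingLock` returns almost immediately. Whether the SDK can safely release `m_lock` before waiting depends on what else that lock protects during a pause, which is a question for the maintainers.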