Open maxgolov opened 1 year ago
@maxgolov Is there any workaround for this issue? We explicitly FlushAndTeardown on shutdown but have noticed that in short-lived sessions our events are not being uploaded reliably. I see from #1190 that FlushAndTeardown (and a CFG_INT_MAX_TEARDOWN_TIME) should guarantee that, but that there's still a bug here preventing that.
@vidia - please try a few options I mentioned regarding roll-back to older baseline.
Problem Description
It seems like a change from 2021 is affecting the
FlushAndTeardown
sequence: https://github.com/microsoft/cpp_client_telemetry/pull/829Steps to reproduce.
maxTeardownUploadTimeInSec
to 2-3-5 seconds.FlushAndTeardown
method spinning aimlessly and not actually uploading.What is the expected behavior?
Expected behavior is that the 2-3 seconds are spent on trying to upload pending requests. Once there are zero events left, app should unblock and exit. Ideally with
evt_stats
event turned off, it should be relatively fast exit. Why need to turnevt_stats
off - it's an oversubscribed event that'd generate 401/403, thus it's a waste. And should be removed out of the equation.What is the actual behavior?
SDK spins and doesn't upload anything at the end of the run.
Additional context.
I suspect these pieces of code related to
PauseGuard
are causing the problem:https://github.com/microsoft/cpp_client_telemetry/blob/0177277d1dc194a7703ca87d40587b6148608b45/lib/tpm/TransmissionPolicyManager.cpp#L105
Because activity is paused here at the beginning of the
FlushAndTeardown
method, without actually waiting for events cache to be fully drained: https://github.com/microsoft/cpp_client_telemetry/blob/0177277d1dc194a7703ca87d40587b6148608b45/lib/api/LogManagerImpl.cpp#L380Instead of pausing - it's supposed to wait for the amount of time configured. Obviously
FlushAndTeardown
can't be called on mobile platforms on the main thread. It has to be called on the background thread. Calling this on foreground thread may cause ANR, and app getting killed.The way it stands, it seems like short-lived apps and processes would experience "data loss", because their events get stuck and not attempted to be uploaded at the end of the short-lived app run.