microsoft / cpp_client_telemetry

1DS C++ SDK
Apache License 2.0
86 stars 49 forks source link

Pause Manager causes events not being uploaded on FlushAndTeardown #1120

Open maxgolov opened 1 year ago

maxgolov commented 1 year ago

Problem Description

It seems like a change from 2021 is affecting the FlushAndTeardown sequence: https://github.com/microsoft/cpp_client_telemetry/pull/829

Steps to reproduce.

What is the expected behavior?

Expected behavior is that the 2-3 seconds are spent on trying to upload pending requests. Once there are zero events left, app should unblock and exit. Ideally with evt_stats event turned off, it should be relatively fast exit. Why need to turn evt_stats off - it's an oversubscribed event that'd generate 401/403, thus it's a waste. And should be removed out of the equation.

What is the actual behavior?

SDK spins and doesn't upload anything at the end of the run.

Additional context.

I suspect these pieces of code related to PauseGuard are causing the problem:

https://github.com/microsoft/cpp_client_telemetry/blob/0177277d1dc194a7703ca87d40587b6148608b45/lib/tpm/TransmissionPolicyManager.cpp#L105

Because activity is paused here at the beginning of the FlushAndTeardown method, without actually waiting for events cache to be fully drained: https://github.com/microsoft/cpp_client_telemetry/blob/0177277d1dc194a7703ca87d40587b6148608b45/lib/api/LogManagerImpl.cpp#L380

Instead of pausing - it's supposed to wait for the amount of time configured. Obviously FlushAndTeardown can't be called on mobile platforms on the main thread. It has to be called on the background thread. Calling this on foreground thread may cause ANR, and app getting killed.

The way it stands, it seems like short-lived apps and processes would experience "data loss", because their events get stuck and not attempted to be uploaded at the end of the short-lived app run.

vidia commented 1 week ago

@maxgolov Is there any workaround for this issue? We explicitly FlushAndTeardown on shutdown but have noticed that in short-lived sessions our events are not being uploaded reliably. I see from #1190 that FlushAndTeardown (and a CFG_INT_MAX_TEARDOWN_TIME) should guarantee that, but that there's still a bug here preventing that.

maxgolov commented 1 week ago

@vidia - please try a few options I mentioned regarding roll-back to older baseline.