microsoft / cpp_client_telemetry

1DS C++ SDK
Apache License 2.0
85 stars 48 forks source link

[Android] Allow calling LogManager.flushAndTeardown() on main thread #585

Closed theta3 closed 3 years ago

theta3 commented 3 years ago

Before the app crashes, we need to make sure all the events are flushed. We are getting this error from the SDK:

    08-24 11:05:08.297  6605  6605 W System.err: java.lang.IllegalStateException: Cannot access database on the main thread since it may potentially lock the UI for a long period of time.
    08-24 11:05:08.297  6605  6605 W System.err:    at androidx.room.RoomDatabase.assertNotMainThread(SourceFile:267)
    08-24 11:05:08.297  6605  6605 W System.err:    at androidx.room.RoomDatabase.beginTransaction(SourceFile:351)
    08-24 11:05:08.297  6605  6605 W System.err:    at com.microsoft.applications.events.StorageRecordDao_Impl.insertRecords(SourceFile:108)
    08-24 11:05:08.297  6605  6605 W System.err:    at com.microsoft.applications.events.OfflineRoom.storeRecords(SourceFile:125)
    08-24 11:05:08.297  6605  6605 W System.err:    at com.microsoft.applications.events.OfflineRoom.storeFromBuffersIds(SourceFile:185)
    08-24 11:05:08.297  6605  6605 W System.err:    at com.microsoft.applications.events.OfflineRoom.storeFromBuffers(SourceFile:191)
    08-24 11:05:08.297  6605  6605 W System.err:    at com.microsoft.applications.events.LogManager.nativeFlushAndTeardown(Native Method)
    08-24 11:05:08.297  6605  6605 W System.err:    at com.microsoft.applications.events.LogManager.flushAndTeardown(SourceFile:439)
    08-24 11:05:08.297  6605  6605 W System.err:    at com.acompli.acompli.providers.OTLogger.flushAndTeardown(SourceFile:145)
    08-24 11:05:08.297  6605  6605 W System.err:    at com.acompli.acompli.providers.MainAriaEventLogger.flushAndTeardown(SourceFile:433)
    08-24 11:05:08.297  6605  6605 W System.err:    at com.acompli.acompli.util.AriaAnalyticsProvider.sendCrashEvent(SourceFile:3486)
    08-24 11:05:08.297  6605  6605 W System.err:    at com.acompli.acompli.util.AnalyticsCrashReporter.uncaughtException(SourceFile:65)
    08-24 11:05:08.297  6605  6605 W System.err:    at com.microsoft.office.outlook.crashreport.DateRecordingUncaughtExceptionHandler.uncaughtException(SourceFile:72)
    08-24 11:05:08.297  6605  6605 W System.err:    at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1068)
    08-24 11:05:08.297  6605  6605 W System.err:    at java.lang.ThreadGroup.uncaughtException(ThreadGroup.java:1063)
    08-24 11:05:08.297  6605  6605 W System.err:    at java.lang.Thread.dispatchUncaughtException(Thread.java:1955)
    08-24 11:05:08.301  6605  6605 E libc++abi: terminating with uncaught exception of type std::runtime_error: call store

According to feature dev: "we need to be able to flush on the main thread, because we're trying to do shutdown/cleanup; which happens on main thread by definition."

Maybe we can configure Room to ignore main thread check?

maxgolov commented 3 years ago

This is fundamental Android OS design intent. You are not supposed to run long-running operations on the main (UX) thread, as this may result in ANR (Application Not Responding), and the OS may potentially kill your app for that. FlushAndTeardown is potentially a long-running synchronous operation, as you may configure how long it takes via ILogConfiguration, i.e. practically any amount of time. But anything longer than 5 seconds may be considered ANR. See here... Thus, you cannot invoke this operation on the main UI thread by definition.

https://medium.com/@kreynaldi04/android-fundamentals-07-1-asynctask-e07766f891d6

@larvacea - please share your opinion on this bug.

maxgolov commented 3 years ago

@theta3 - you could either address it in the app using either of the above techniques mentioned above (e.g. AsyncTask), OR implement the following shutdown sequence in your code:

    LogManager.pauseTransmission();
    LogManager.flush();

Note that flush() is asynchronous and does not provide any timing guarantees. You may need to confirm with Martin as to how to ensure that the async task / executor spawned by flush() is running to completion before the app is terminated.

I believe the bug should be closed as Design Intent, since this is Android Design Intent to not let running longer-running activities on the main / UI thread.

larvacea commented 3 years ago

The Android paradigm is very clear on how your application should behave. This is how it is described in the documentation for android.app.Application onTerminate():

This method is for use in emulated process environments. It will never be called on a production Android device, where processes are removed by simply killing them; no user code (including this callback) is executed when doing so.

That's pretty clear. Do not ever in any way write any code that calls FlushAndTeardown() for any purpose whatsoever except in tests. Do not bother to implement a "kill this app" UI element, nor should you attempt to implement any form of orderly shutdown in production.

That seems straightforward and unambiguous. I realize that you may well want to have graceful exit on other platforms, but you should definitely not have a "quit" command or other graceful-exit on Android.

Let me know if you have any questions.

theta3 commented 3 years ago

Currently in Outlook Android, when the app crash, we call one last method which calls .flushAndTeardown() to make sure all the telemetry is sent.

The feature dev explicitly asked that "during a crash we need to flush telemetry, so we get the telemetry about the crash persisted for send on restart. If we put crash handling into a background thread it does not have guarantee of completion before the main thread (which is crashing) exits."

I understand the philosophy but shouldn't the SDK leave it up to the clients to decide here?

maxgolov commented 3 years ago

I understand the philosophy but shouldn't the SDK leave it up to the clients to decide here?

You do not want to bluntly ignore the Android platform architecture. You cannot run longer running tasks (such as telemetry upload) on the main thread. You may pause and flush (as shown above), which guarantees the data gets uploaded when (and if) your app starts again.

...we get the telemetry about the crash persisted for send on restart

Do this:

    LogManager.pauseTransmission();
    LogManager.flush();

Note: we can absolutely not give the users ability to invoke us on the main thread. What happens if you do --- IF it takes longer than 5 seconds to flush and upload, the crash (ANR) is reported against our 1DS C++ SDK in Google Play store crash reports! We absolutely must discourage the bad coding practices. I'm marking this bug as invalid and closing it.

maxgolov commented 3 years ago

Invalid bug. Android OS does not allow running long running tasks, such as Rooms DB access on the main thread.

See: https://developer.android.com/topic/performance/vitals/anr

jbflanigan commented 3 years ago

@maxgolov @larvacea - totally understand the concerns here. The problem we are trying to solve is data loss during an application crash. We observed this problem with the Aria SDK in the past and resolve the problem by calling their version of flushAndTeardown() from an UncaughtExceptionHandler (https://docs.oracle.com/javase/7/docs/api/java/lang/Thread.UncaughtExceptionHandler.html)

Is there a solution available in 1DS SDK? It was mentioned in an earlier comment that LogManager.flush() is asynchronous, therefore the app cannot know when events have been successfully flushed to persistence after calling it. Will we suffer from data loss if the application does not wait long enough for flush() to complete it's asynchronous business before exiting? Or is there a guarantee of persistence which we can rely on?

theta3 commented 3 years ago

Correct me if I'm wrong but it looks like Android LogManager.flush() called into WrapperLogManager::Flush() which calls LogManagerImpl::Flush() which calls OfflineStorageHandler::Flush(). It calls m_offlineStorageDisk->StoreRecords(records) aka OfflineStorage_Room::StoreRecords(StorageRecordVector& records) which calls up the Java layer auto storeId = env->GetMethodID(room_class, "storeFromBuffers", storeSignature);

Which is the method that crashes in flushAndTeardown() because of calling from the main thread

maxgolov commented 3 years ago

I am not sure what guarantees you have when your app is in a bad state. If you have some sort of Executor that typically handles your non-UI tasks, you'd probably want to invoke the Flush from there. There is a piece of logic in OfflineStorageHandler.cpp that performs async-Flush, but that appears to be done only when the memory storage is exceeded:

    PAL::scheduleTask(&m_taskDispatcher, 0, this, &OfflineStorageHandler::Flush);

This is scheduled in native code thread pool (single worker thread we have for scheduling jobs). When the app is dead, and/or in a bad state, about to die - most likely all native threads are gonna be killed at once once it terminates, not gonna wait for thread completion.

You need to decide on a global crash handler at Java layer, that displays some sort of a spinning wheel or something, then waits for your AsyncTask or Executor thread pool (managed in your code) to complete running the Flush() method. SDK is not handling this - since C++ SDK already has its own background tasks pool, that is unaware what is about to happen with the higher-level Java app, if it's about to be killed or not.

You need to handle the Flush method in AsyncTask , bound to the global exception in your own app code at higher level, to schedule a separate non-UI Java thread for that or use either of the already running non-UI threads.

jbflanigan commented 3 years ago

wait, now it sounds like flush() is synchronous. Is that correct? If so, yeah this is easy for the app to deal with.

maxgolov commented 3 years ago

It appears like Flush() is synchronous. My apologies - since it was a-synchronous in Aria v1, and in 1DS it is sync.

jbflanigan commented 3 years ago

perfect, thanks for the clarification @maxgolov!