Closed andsel closed 1 year ago
Hi @hylkevds please may I ask your review on this?
If I understand the changes correctly, when an exception happens in a task, this will store the exception and then end the SessionEventLoop that had the exception. Then, when the server exits, it will re-throw the exception.
So between the exception happening and the server being shut down, all of the Sessions that are assigned to the event loop that had the exception will hang?
That's both good and bad :) On one hand, trying to continue when (for instance) the queue store is corrupt is a recipe for disaster, but on the other hand, we have to make sure we only throw exceptions for things that are really fatal.
And when something is fatal, like the queue store being corrupted, maybe we should shut down the entire server?
Maybe we should make two types of exception? Fatal ones that shut down the server to prevent further corruption, and non-fatal ones that just kill the session that had the exception?
This requirement come out to me during debugging of a failed test, where the expectation didn't match.
The problem was that an error happened in the SessionEventLoop
and wans't logged because the log, in this case, is at info
level.
So I thought that also logging at WARN or ERROR without creating an impact on the execution wouldn't help to be aware of which error happened and where it happened. So if for any reason, a SessionEventLoop dies in tests, the test should fail.
This is contrary to what Netty event loop does, where it prints the error and continue.
But on the case of session event loop, if something gone bad, it's useless to continue and just reporting on the logs the error.
I agree that just continuing like nothing happened is a bad idea. The question is, is just killing the affected SessionEventLoop
enough? I think, if an SessionEventLoop
is killed, the entire server should exit.
I think, if an SessionEventLoop is killed, the entire server should exit.
Yes maybe this proposal is more sensible. I think that the exception handler of every SessionEventLoop
should have a reference to the Server
and command an emergency shutdown, without consuming any session command queue, and avoiding to send DISCONNECT
messages to all the live sessions.
WDYT?
Yes, I agree. We'll also need a hook, so that an embedding application can catch the state and decide what to do.
Hi @hylkevds I've pushed some changes, extending the interception framework to notify the library's users of errors in event loops.
When a session loop the exception is simply logged and lost, so the error status is not reported to the broker. This PR interrupt the session loop on an exception, saving the error in a local variable.
When the
PostOffice
terminates it re-throw the exception that was reached during session loop execution, while in the meantime it notifies the user implemented APIInterceptHandler.onSessionLoopError
. The firing of the newly added event typeInterceptExceptionMessage
containing the error stacktrace, is done asychronously from the launching thread (the session loop thread that terminating because of the error).What it does?
SessionEventLoop
to be itself a thread (and not aRunnable
task)uncatchedExceptionHandler
in the newly`SessionEventLoop
thread to scatter the notification.