Bubble up session loop's exceptions

moquette-io / moquette

Java MQTT lightweight broker

http://moquette-io.github.io/moquette/

Apache License 2.0

2.27k stars 814 forks source link

Bubble up session loop's exceptions #736

Closed andsel closed 1 year ago

andsel commented 1 year ago

When a session loop the exception is simply logged and lost, so the error status is not reported to the broker. This PR interrupt the session loop on an exception, saving the error in a local variable.

When the PostOffice terminates it re-throw the exception that was reached during session loop execution, while in the meantime it notifies the user implemented API InterceptHandler.onSessionLoopError. The firing of the newly added event type InterceptExceptionMessage containing the error stacktrace, is done asychronously from the launching thread (the session loop thread that terminating because of the error).

What it does?

create a new interception hook in the Moquette interception framework.
modify the SessionEventLoop to be itself a thread (and not a Runnable task)
set an uncatchedExceptionHandler in the newly `SessionEventLoop thread to scatter the notification.
collect the error, if any, for each SessionEventLoop and log on termination of PostOffice.

andsel commented 1 year ago

Hi @hylkevds please may I ask your review on this?

hylkevds commented 1 year ago

If I understand the changes correctly, when an exception happens in a task, this will store the exception and then end the SessionEventLoop that had the exception. Then, when the server exits, it will re-throw the exception.

So between the exception happening and the server being shut down, all of the Sessions that are assigned to the event loop that had the exception will hang?

That's both good and bad :) On one hand, trying to continue when (for instance) the queue store is corrupt is a recipe for disaster, but on the other hand, we have to make sure we only throw exceptions for things that are really fatal.

And when something is fatal, like the queue store being corrupted, maybe we should shut down the entire server?

Maybe we should make two types of exception? Fatal ones that shut down the server to prevent further corruption, and non-fatal ones that just kill the session that had the exception?

andsel commented 1 year ago

This requirement come out to me during debugging of a failed test, where the expectation didn't match. The problem was that an error happened in the SessionEventLoop and wans't logged because the log, in this case, is at info level. So I thought that also logging at WARN or ERROR without creating an impact on the execution wouldn't help to be aware of which error happened and where it happened. So if for any reason, a SessionEventLoop dies in tests, the test should fail. This is contrary to what Netty event loop does, where it prints the error and continue. But on the case of session event loop, if something gone bad, it's useless to continue and just reporting on the logs the error.

hylkevds commented 1 year ago

I agree that just continuing like nothing happened is a bad idea. The question is, is just killing the affected SessionEventLoop enough? I think, if an SessionEventLoop is killed, the entire server should exit.

andsel commented 1 year ago

I think, if an SessionEventLoop is killed, the entire server should exit.

Yes maybe this proposal is more sensible. I think that the exception handler of every SessionEventLoop should have a reference to the Server and command an emergency shutdown, without consuming any session command queue, and avoiding to send DISCONNECT messages to all the live sessions. WDYT?

hylkevds commented 1 year ago

Yes, I agree. We'll also need a hook, so that an embedding application can catch the state and decide what to do.

andsel commented 1 year ago

Hi @hylkevds I've pushed some changes, extending the interception framework to notify the library's users of errors in event loops.