Open TimoGlastra opened 2 years ago
Ideas:
Suggesting option retry multiple times is a good initial step, because:
If maintainers are happy with this suggestion I'm willing to work on this(so so you can assign me)
Makes sense @morrieinmaas! Some notes
a. Fail to acquire throws an error -> catch it and retry (only needs a counter and try/catch
We should probably add a custom error that extends the aries framework error so we can catch the specific error type. I think it now just throws an aries framework error.
Can also include a 'send problem report' back after all retries fail.
That'll be tricky as we need a session to decrypt/encrypt the message. So we would need to send an unencrypted message, or we would need to look at a way to encrypt the message in anoncrypt (so anonymous encryption), but I still think we need a session for that.
But otherwise I'm happy with this suggestion and feel free to work on it. We should however look where to add this logic. If we only want to do retries for inbound message we should do it in the MessageReceiver.receiveMessage
method, but that doesn't feel like the right place to me. This is more of a tenant module thing, so it would fit better in the TenantAgentContextProvider
or the TenantSessionCoordinator
, but those methods are also used for the getTenantAgent
calls, which defeats the purpose of this change. We could just increase the timeout and we have the same effect (5x 1 second or 1 x 5 seconds doesn't really matter).
So then the question arises, what is the benefit of re-trying if we can just increase the timeout? I think retry logic is often in place when integrating with systems you can't control, but in this case we do control the system. Having a longer timeout for inbound messages over getting tenant agents would achieve the same in this case.
Thoughts?
@TimoGlastra hmm, interesting thoughts. I'll have to have a more detialed look at this before I can give a (more) qualified feedback.
The tenant module has a limit on max concurrent open sessions, and a timeout to acquire a session. If the max has been reached and a sessions doesn't get released within the timeout (either because there's a queue or the sessions take long) an error will be thrown and the session creation will be aborted.
This is 'fine' when you want to get a tenant agent, as it will throw an error and your session will be aborted. But when receiving a message via inbound transport and we fail to acquire a session, the message will be lost.
The question is, what should we do in this case? Should we add priority to inbound messages? Should we create a queue and try again later? Should we store them and process them later? The end goal here should be to minimize the change of message loss.
Thoughts?