openwallet-foundation / credo-ts

Typescript framework for building decentralized identity and verifiable credential solutions
https://credo.js.org
Apache License 2.0
262 stars 199 forks source link

What to do if session couldn't be acquired for incoming message #947

Open TimoGlastra opened 2 years ago

TimoGlastra commented 2 years ago

The tenant module has a limit on max concurrent open sessions, and a timeout to acquire a session. If the max has been reached and a sessions doesn't get released within the timeout (either because there's a queue or the sessions take long) an error will be thrown and the session creation will be aborted.

This is 'fine' when you want to get a tenant agent, as it will throw an error and your session will be aborted. But when receiving a message via inbound transport and we fail to acquire a session, the message will be lost.

The question is, what should we do in this case? Should we add priority to inbound messages? Should we create a queue and try again later? Should we store them and process them later? The end goal here should be to minimize the change of message loss.

Thoughts?

TimoGlastra commented 2 years ago

Ideas:

morrieinmaas commented 2 years ago

Suggesting option retry multiple times is a good initial step, because:

  1. Cheap and straight forward to implement a. Fail to acquire throws an error -> catch it and retry (only needs a counter and try/catch
  2. Can also include a 'send problem report' back after all retries fail.
  3. keeps logic simple and within what a framework should do as opposed to adding a queue or db or other infrastructure
  4. Since magnitude and frequency of occurrence are somewhat unknown/unmeasured this would possibly addresses the issue sufficiently for time being so we atm don't over-engineer a solution to a non-problem.

If maintainers are happy with this suggestion I'm willing to work on this(so so you can assign me)

TimoGlastra commented 2 years ago

Makes sense @morrieinmaas! Some notes

a. Fail to acquire throws an error -> catch it and retry (only needs a counter and try/catch

We should probably add a custom error that extends the aries framework error so we can catch the specific error type. I think it now just throws an aries framework error.

Can also include a 'send problem report' back after all retries fail.

That'll be tricky as we need a session to decrypt/encrypt the message. So we would need to send an unencrypted message, or we would need to look at a way to encrypt the message in anoncrypt (so anonymous encryption), but I still think we need a session for that.


But otherwise I'm happy with this suggestion and feel free to work on it. We should however look where to add this logic. If we only want to do retries for inbound message we should do it in the MessageReceiver.receiveMessage method, but that doesn't feel like the right place to me. This is more of a tenant module thing, so it would fit better in the TenantAgentContextProvider or the TenantSessionCoordinator, but those methods are also used for the getTenantAgent calls, which defeats the purpose of this change. We could just increase the timeout and we have the same effect (5x 1 second or 1 x 5 seconds doesn't really matter).

So then the question arises, what is the benefit of re-trying if we can just increase the timeout? I think retry logic is often in place when integrating with systems you can't control, but in this case we do control the system. Having a longer timeout for inbound messages over getting tenant agents would achieve the same in this case.

Thoughts?

morrieinmaas commented 2 years ago

@TimoGlastra hmm, interesting thoughts. I'll have to have a more detialed look at this before I can give a (more) qualified feedback.