openwallet-foundation / didcomm-mediator-service


Define a mechanism to monitor, track and notify Connection Timeouts from a deployed Mediator #75

Closed swcurran closed 1 year ago

swcurran commented 1 year ago

We are seeing connection timeouts in Aries mobile wallets with a (more or less) stock Aries Mediator Service mediator. We need a way to be aware of these errors on the mediator side so that we know when and how often they are occurring, and so that a notification can go to the team that has deployed the mediator. This task is to figure out how to add monitoring to a deployment of the aries-mediator-service.

Suggested steps:

The logging info below is a possibility. We'd have to see what a "normal" websocket closure (including the mobile device turning off) looks like to ensure we aren't looking at false positives.

March 17th 2023, 14:59:22.548   aries-mediator-agent    2023-03-17 21:59:22,548 aries_cloudagent.transport.inbound.ws ERROR Unexpected Websocket message type received: WSMsgType.CLOSED: None, None
March 17th 2023, 14:56:39.975   aries-mediator-agent    2023-03-17 21:56:39,975 aries_cloudagent.transport.inbound.ws ERROR Unexpected Websocket message type received: WSMsgType.CLOSED: None, None
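
As a rough illustration of the kind of monitoring this could lead to, here is a minimal sketch that counts these `WSMsgType.CLOSED` errors per minute from raw log lines. The line format is taken from the two samples above; the function names and the alert threshold are hypothetical, and nothing here separates a "normal" closure from a real timeout, so the counts would still need to be checked for false positives.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the ACA-Py part of the log lines quoted above (format assumed
# from those two samples only).
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r"aries_cloudagent\.transport\.inbound\.ws ERROR "
    r"Unexpected Websocket message type received: WSMsgType\.CLOSED"
)

SAMPLE_LINES = [
    "March 17th 2023, 14:59:22.548   aries-mediator-agent    2023-03-17 21:59:22,548 "
    "aries_cloudagent.transport.inbound.ws ERROR Unexpected Websocket message type "
    "received: WSMsgType.CLOSED: None, None",
    "March 17th 2023, 14:56:39.975   aries-mediator-agent    2023-03-17 21:56:39,975 "
    "aries_cloudagent.transport.inbound.ws ERROR Unexpected Websocket message type "
    "received: WSMsgType.CLOSED: None, None",
]


def count_closed_errors(lines):
    """Count WSMsgType.CLOSED errors per one-minute bucket, using timestamps as logged."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not one of the errors we are tracking
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        counts[ts.replace(second=0)] += 1
    return counts


def buckets_over_threshold(counts, threshold):
    """Return the minute buckets whose count exceeds a (hypothetical) alert threshold."""
    return sorted(bucket for bucket, n in counts.items() if n > threshold)


if __name__ == "__main__":
    counts = count_closed_errors(SAMPLE_LINES)
    for bucket in buckets_over_threshold(counts, threshold=0):
        print(bucket.isoformat(), counts[bucket])
```

In a real deployment the counting would more likely live in the log aggregation tool than in a script; the sketch is only meant to show what "when and how often" could look like for this one message.
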

swcurran commented 1 year ago

@usingtechnology -- note this issue as you see what is happening on the ACA-Py Mediator side.

swcurran commented 1 year ago

Assigning this to @usingtechnology and @WadeBarnes after a question from @jleach about the status of this issue. In the research being done into the mediator behaviour, have we done enough to be able to detect on the mediator side when an error in establishing a connection (either to the mediator itself, or to another agent) occurs?

Note that the answer to this might be a “no, not possible”, and we close this accordingly.

Thanks!

WadeBarnes commented 1 year ago

The error messages listed above are a common occurrence.

For example: image

There are a lot of "Error" messages (noise) around (what seems to be) regular web socket traffic. Therefore I think the first thing that needs to happen is a review of the logging associated with the traffic to determine what a normal web socket connection lifecycle should look like and ensure the events are logged appropriately. At the same time we could review the timeout settings and determine what settings would be considered reasonable. The current web socket timing settings are ACAPY_WS_HEARTBEAT_INTERVAL=15 and ACAPY_WS_TIMEOUT_INTERVAL=60 in all environments, based on the recommendations here: https://github.com/hyperledger/aries-cloudagent-python/issues/2157#issuecomment-1468197480
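
For the settings review, one simple way to relate the two values is to ask how many heartbeats fit inside the timeout window. The sketch below assumes the heartbeat interval is the ping period in seconds and the timeout interval is the idle window in seconds before the socket is closed; the helper name is hypothetical, and what ratio counts as "reasonable" is exactly the open question.

```python
import os


def heartbeats_per_timeout(env=None):
    """Return (heartbeat, timeout, heartbeats that fit in one timeout window).

    Reads the two settings named in this thread from a mapping, defaulting
    to the process environment. Both values are assumed to be in seconds.
    """
    env = os.environ if env is None else env
    heartbeat = int(env["ACAPY_WS_HEARTBEAT_INTERVAL"])
    timeout = int(env["ACAPY_WS_TIMEOUT_INTERVAL"])
    if heartbeat <= 0 or timeout <= 0:
        raise ValueError("both intervals must be positive")
    return heartbeat, timeout, timeout // heartbeat


if __name__ == "__main__":
    # Values currently used in all environments, per the comment above.
    current = {
        "ACAPY_WS_HEARTBEAT_INTERVAL": "15",
        "ACAPY_WS_TIMEOUT_INTERVAL": "60",
    }
    print(heartbeats_per_timeout(current))
```

With the current values, four heartbeat periods fit in one timeout window.
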

Related issue:

WadeBarnes commented 1 year ago

Some thoughts on this ...

jleach commented 1 year ago

@swcurran Do you think this should be transferred into ACA-Py as an action item for @WadeBarnes' comments above (review params and logging)? Once that's done, we can close it. It feels a little amorphous in that it's hard to tease out what specific changes need to take place beyond this.

swcurran commented 1 year ago

From the sounds of it, I think this request should be pushed to the BC Gov deployment repo for the mediator, and we work on the types of solutions @WadeBarnes mentions above that work in the BC Gov context. As we find useful things, updating either or both of ACA-Py and this repo is appropriate, as documentation or code (if that makes sense).

I’m going to close this issue here — feel free to reopen if needed.