Mobile agent not receiving credentials after reconnecting to the mediator

ericvergnaud commented 10 months ago

Hi there,

we are using aries-framework-javascript in 2 instances:

as a mobile agent, using react-native
as a mediator, using node in Docker Credentials themselves are issued using aca-py. Everything works fine when the mobile app is active.

As you're probably aware of, iOS inhibits network connections when an app goes to the background. Our understanding is that the mediator would store the messages received when the mobile app is unreachable, and would send them once the mobile app reconnects. We're seeing a consistent scenario where credentials transmitted to the mediator through messages of type "did:sov:BzCbsNYhMrjHiqZDTUASHg;spec/routing/1.0/forward" are never delivered to the mobile agent. When the mobile agent reconnects, it sends a "https://didcomm.org/messagepickup/2.0/status-request" message to the mediator, which responds with a "https://didcomm.org/messagepickup/2.0/status" where where message_count = 0...

Since ensuring consistent async delivery of messages is the primary purpose of the mediator, it seems we are missing something, but what ?

We're using 0.4.0 on mobile side and 0.4.1 on mediator side (running the provided Docker). Worth noting, our mediator is not connected to a persistent database.

Help would be appreciated.

Thanks

TimoGlastra commented 10 months ago

If your mediator is not connected to a persistent database, you'd lose all connections / messages when it restarts. But that doesn't seem to be the problem here.

Have you looked at the logs whether the messages are being queued or sent directly to the device?

In theory when using pickup v2, the messages sent by the mediator should be acked by the client, and then they'll be removed: https://github.com/hyperledger/aries-rfcs/tree/main/features/0685-pickup-v2#messages-received

@genaris is it correct this is how this is implemented?

If you can provide logs of both sides, and analyze them we can probably figure out a bit more on what the underlying issue is here.

ericvergnaud commented 10 months ago

Hi, thanks for your kind proposal to help. Attached are 2 log files.

In the mobile log, you'll notice that the invitation (around 12:26:02) goes well. At 12:26:18, the mobile app goes to the background, as part of that we close the 2 outbound transports. At 12:28:18, the mobile app becomes active again, as part of that we reopen the 2 outbound transports.

In the mediator log, you'll notice that in the meantime, the mediator receives 3 'forward' messages, at 2023-09-06T10:26:07.541Z, 2023-09-06T10:26:22.418Z and 2023-09-06T10:26:26.049Z. The logs don't say whether these were forwarded, stored or queued, but since I'm not seeing any "Send outbound message" for these in the mediator log, and not seeing any corresponding "Received message with type" in the mobile.log, I'm assuming they are 'stored' somewhere...

I hope this helps.

Eric

mobile.csv mediator.csv

genaris commented 10 months ago

@TimoGlastra yes, it works like that but in this case it seems to me that the problem is related to the way WebSockets are being managed.

Looking at the logs it seems to me that the, at the moment of processing the first forward message (2023-09-06T10:26:07.541Z) mediator still sees the WebSocket as opened, so it sends it through the existing WebSocketTransportSession. A few seconds later (2023-09-06T10:26:09.996Z) it receives a WebSocket closed event.

So when the mediator sends the Forward message to that 'phantom' WebSocket, it thinks everything is fine and considers it has delivered the message (no need to queue it anywhere). However, the app won't receive any message and will not have any way of retrieving it, making the message effectively lost (which is certainly terrible).

Why it does think a WebSocket is opened when it is not? It can be caused to a number of reasons that I've identified based on the experience I had with mobile apps and AFJ in both sides:

Client shuts down unexpectedly, meaning that it won't close the socket gracefully (typical case when mobile phone has poor connectivity): if no specific heartbeat mechanism is done at the app level, it depends on the OS/low-level layers to let the app know if the socket is effectively closed
Client shuts down gracefully, but a message has been received during the shut down process (this could be what happened here, given the short time frame between the Forward and the WebSocket close): WebSocket was opened, low level layer library has received it and probably notified the app but it was already deciding to shutdown so it ignored it
Connection is established but the message could not be delivered:
- React Native event emitter is not sending the message event (this happens to me quite often when refreshing the app or switching background/foreground, mostly in development mode). I got better results by using the onmessage callback instead of the event listener as done in the default WsOutboundTransporter
- In Node.JS OS doesn't emit any error when ws.send() failed to deliver the message

In practice, most of these issues can be solved or reduced by implementing at mediator level a heartbeat mechanism where it does a ws.ping every n seconds (we found 3 to be a good number). This implies some updates to WsInboundTransport.

But I think that at some point it will be needed in Live mode to stop using regular "Forward" messages (which don't require any acknowledgement) and use Delivery instead, which as @TimoGlastra mentioned do require confirmation. See the discussion here: https://github.com/hyperledger/aries-rfcs/issues/760

I'm testing a mechanism like that, which for the moment is a bit tricky to 'standardize' but I think can be useful to add to AFJ in the next major version (as it will need to break the very dated MessageRepository API).

ericvergnaud commented 10 months ago

Hi,

thanks for that.

I did notice that the first message occurred during a gray area. But what about the 2 other ones ?

Re a switch from "Forward" messages (which don't require any acknowledgement) to Delivery, is that a change within the mediator itself ? or does it require a change in the way other components (such as mobile or aca-py) talk to the mediator ?

I noticed that "forward" messages have no identified sender, so I'm assuming the sender is the mediator itself, is that a valid assumption ?

ericvergnaud commented 10 months ago

Hi again,

FWIW, I'm posting another set of logs, with no gray area.

In this session, you'll notice the following:

at 11:52:23.571Z, the mediator receives a forward message the afj logs are somewhat a bit too lean, but that is likely the 'did:sov:BzCbsNYhMrjHiqZDTUASHg;spec/connections/1.0/response' message received by mobile at 11:52:23.755Z

at 11:53:02, mobile transports are shutdown the app remains in the background for roughly 3 minutes at 11:56:34, mobile transports are restarted

in the meantime, the mediator receives 6 forward messages: at 11:53:07.270Z at 11:53:08.264Z at 11:53:08.495Z at 11:53:14.897Z at 11:53:20.079Z at 11:53:20.451Z all these are received by the mediator after the WS is closed at 11:53:02, and before it is reopened at 11:56:34

But none of these make it to mobile. When mobile reconnects and sends a request-status message at 11:56:33.750Z, received at 11:56:34.786Z, the response sent at 11:56:34.844Z says message_count: 0. This can only result from either of:

messages beings sent despite the WS socket being closed, even long after the WS is closed
messages being stored (without logs ?) but not retrieved by status-request
some other reason that I cannot think of ?

Needless to say, this is very concerning.

mediator.log mobile.log

ericvergnaud commented 10 months ago

Alternately, it seems from https://github.com/hyperledger/aries-rfcs/tree/main/features/0685-pickup-v2#messages-received that forwarding messages will only be attempted if live mode is turned on.

Would it make sense to turn live mode off from mobile before switching to background ?

How can we do that ?

genaris commented 10 months ago

This can only result from either of:

messages beings sent despite the WS socket being closed, even long after the WS is closed

messages being stored (without logs ?) but not retrieved by status-request

I think it is the first one, because when mediator queues a message it does print a Queue packed message for connection ... legend in the debug log. And when it forwards a message for a connected session succesfully, it currently does it so silently (it would be nice to be more verbose on this).

So IMHO the mediator agent considers there is an opened WebSocket for the mobile agent and that's why it forwards the messages. It does not receive any error from lower level layers so it does not throw any error.

Based on the mediator log from the time you mentioned the mobile agent transports were stopped and the app went background, I see (leaving only relevant parts):

2023-09-07T11:53:01.844Z Socket closed. 2023-09-07T11:53:02.713Z Socket connected. 2023-09-07T11:53:02.936Z Received message with type 'https://didcomm.org/messagepickup/2.0/status-request', 2023-09-07T11:53:02.964Z Storing session for inbound message '09ec1f67-50c1-4765-8c42-c9efc0c0b24f' 2023-09-07T11:53:02.965Z Send outbound message { '@type': 'https://didcomm.org/messagepickup/2.0/status', message_count: 0 } 2023-09-07T11:53:02.968Z Pack outbound message https://didcomm.org/messagepickup/2.0/status 2023-09-07T11:53:07.289Z Received message with type 'did:sov:BzCbsNYhMrjHiqZDTUASHg;spec/routing/1.0/forward' 2023-09-07T11:53:08.277Z Received message with type 'did:sov:BzCbsNYhMrjHiqZDTUASHg;spec/routing/1.0/forward', 2023-09-07T11:53:08.500Z Received message with type 'did:sov:BzCbsNYhMrjHiqZDTUASHg;spec/routing/1.0/forward', 2023-09-07T11:53:14.904Z Received message with type 'did:sov:BzCbsNYhMrjHiqZDTUASHg;spec/routing/1.0/forward', 2023-09-07T11:53:20.092Z Received message with type 'did:sov:BzCbsNYhMrjHiqZDTUASHg;spec/routing/1.0/forward', 2023-09-07T11:53:20.454Z Received message with type 'did:sov:BzCbsNYhMrjHiqZDTUASHg;spec/routing/1.0/forward', 2023-09-07T11:53:34.417Z Socket closed. 2023-09-07T11:53:34.419Z Socket closed. 2023-09-07T11:56:34.616Z Socket connected.

And also the mobile log at that particular time (ordered backwards):

"2023-09-07T11:53:02.875Z",,"""com.getpocketdev""","Successfully connected to WebSocket wss://dev.getpocket.io/aries_js" "2023-09-07T11:53:02.745Z",,"""com.getpocketdev""","Stopping HTTP outbound transport" "2023-09-07T11:53:02.744Z",,"""com.getpocketdev""","Stopping WS outbound transport" "2023-09-07T11:53:02.085Z",,"""com.getpocketdev""","Connecting to WebSocket wss://dev.getpocket.io/aries_js" "2023-09-07T11:53:02.084Z",,"""com.getpocketdev""","Sending outbound message to endpoint 'wss://dev.getpocket.io/aries_js' over WebSocket transport." "2023-09-07T11:53:02.079Z",,"""com.getpocketdev""","Pack outbound message https://didcomm.org/messagepickup/2.0/status-request" "2023-09-07T11:53:02.079Z",,"""com.getpocketdev""","Sending outbound message to service:" "2023-09-07T11:53:02.073Z",,"""com.getpocketdev""","resolving didUrl did:peer:1zQmWSZezhx42Bc6FT5Nzo7DgZAdy7M4Zyzv4D66PUzubW3x" "2023-09-07T11:53:02.072Z",,"""com.getpocketdev""","Retrieved 2 services for message to connection '13f38ad1-6ae3-4976-b5cd-2a5dba4f222f'(Aries Framework JavaScript Mediator)'" "2023-09-07T11:53:02.066Z",,"""com.getpocketdev""","resolving didUrl did:peer:1zQmYPrUCb1HJ9vup3AY2xKFvF2qWsZbZpTDVQPLnpohKfgH" "2023-09-07T11:53:02.066Z",,"""com.getpocketdev""","Resolving services for connection theirDid did:peer:1zQmYPrUCb1HJ9vup3AY2xKFvF2qWsZbZpTDVQPLnpohKfgH." "2023-09-07T11:53:02.066Z",,"""com.getpocketdev""","Retrieving services for connection '13f38ad1-6ae3-4976-b5cd-2a5dba4f222f' (Aries Framework JavaScript Mediator)" "2023-09-07T11:53:02.066Z",,"""com.getpocketdev""","Send outbound message" "2023-09-07T11:53:02.053Z",,"""com.getpocketdev""","Websocket connection to mediator with connectionId '13f38ad1-6ae3-4976-b5cd-2a5dba4f222f' is closed, attempting to reconnect..." "2023-09-07T11:53:01.951Z",,"""com.getpocketdev""","WebSocket closing to wss://dev.getpocket.io/aries_js" "2023-09-07T11:53:01.786Z",,"""com.getpocketdev""","Stopping HTTP outbound transport" "2023-09-07T11:53:01.786Z",,"""com.getpocketdev""","Stopping WS outbound transport"

It seems to me that the mobile agent tried to re-connect to the mediator as soon as you stopped the transports for the first time (11:53:01.786Z), and it didn't actually closed the socket when stopped the transports the second time and it went background (you can see a WebSocket closing to... legend the first time, but the last log we have from the mobile is Successfully connected to WebSocket).

This makes me think that the resulting WebSocket was not closed gracefully, so the mediator considered it to be opened until 11:53:34 (around 30 seconds after), when it received two WebSocket closed events. We don't have further information, but maybe those correspond to both the mobile agent and the agent you are using to send messages to it.

It might be also possible for the WebSocket to be actually opened and the connection between mobile device and iOS to be responsive. To make sure about that, as I suggested previously, I think we can use a heartbeat mechanism in the WsInboundTransporter to send pings periodically. That will give us more details about what's going on and eventually fix the issue as well. Here you can see a custom implementation that you can use in your mediator as inbound transporter: WsInboundTransport.txt

ericvergnaud commented 10 months ago

Thanks for the analysis May I ask why WsOutboundTransport and HttpOutboundTransport close() methods are marked async when there is no need for it ? Also, wouldn't it be more effective to call socket.terminate() which doesn't suffer from the undeterministic behaviour of socket.close() ? I'm happy to try that out if you think it's worth it I see ping messages flowing so it looks like there is already some heartbeating going on

genaris commented 10 months ago

Also, wouldn't it be more effective to call socket.terminate() which doesn't suffer from the undeterministic behaviour of socket.close() ?

I don't see terminate() available in react-native. Is it? But I'm curious about the statement that close is undeterministic. AFAIK the main difference with terminate is that the latter will drop the connection immediately without executing the closing handshake. Would that be more reliable in this case? At first sight, it seems to me that it would be worst.

I see ping messages flowing so it looks like there is already some heartbeating going on

Not sure if you are referring to the same thing as I do: the ping messages I'm talking about are the ones at the WebSocket protocol level. In your logs I can only see 'Trust Ping' messages exchanged, which is an special type of DIDComm message. In this case it is used as an acknowledge for Status and also for connection establishment, but not periodically sent under the hood to check the link.

ericvergnaud commented 10 months ago

Thanks for this. By undeterministic I mean that we don't know, at call time, whether or not it succeeded. And clearly the mediator isn't told about it. Maybe because it is marked async, so the underlying close() doesn't happen in a meaningful time. But maybe the transport should reject outgoing messages when it's closing, instead of reopening the socket. On the mediator side, I'm not convinced that a heartbeat will prevent message loss. Imho it would just improve the odds that the socket is live. How about awaiting a technical ACK sent over the http channel in response to the DIDComm message ? It doesn't affect the protocol itself but would provide transport level guarantees of delivery. What do you think ?

ericvergnaud commented 10 months ago

I also notice on mediator side, that in WsInboundTransport, there is no error handler installed when the socket is connected (line 34), and that send() doesn't check for errors either (line 96). Is this intended ? Would it make sense to check for errors on send() and raise the corresponding error if it failed ? (happy to try that out myself)

genaris commented 10 months ago

On the mediator side, I'm not convinced that a heartbeat will prevent message loss. Imho it would just improve the odds that the socket is live. How about awaiting a technical ACK sent over the http channel in response to the DIDComm message ? It doesn't affect the protocol itself but would provide transport level guarantees of delivery. What do you think ?

The heartbeat mechanism will not prevent all messages losses but certainly can help to minimize this kind of issues. That said, in my opinion we should require application-level ACKs so we won't depend on any issue that may arise at transport layer.

We can either leverage DIDComm 'please-ack' decorator (and ack message) or use pickup protocol's Delivery message that embeds its own way of doing acknowledge. This will need some effort because we'll need to handle a sort of message queue for the 'Live Mode' (which, BTW, is not currently fully compliant with what's stated in pickup protocol 2.0) so the mediator can track and use to retry deliveries.

I also notice on mediator side, that in WsInboundTransport, there is no error handler installed when the socket is connected (line 34), and that send() doesn't check for errors either (line 96). Is this intended ? Would it make sense to check for errors on send() and raise the corresponding error if it failed ? (happy to try that out myself)

I don't think this is intended. Actually it is possible that the callback was not available or working like that at the time it was written (recently we've upgraded ws module to 8.x, which introduced some important changes over 7.x, but did not modify this class).

I didn't try it by myself yet but I think it will be a good idea if you can test it in a custom inbound transporter. Hopefully it will catch an error and help us to progress on the fix needed to make mediator side more robust.

ericvergnaud commented 10 months ago

Noted, I'll try that out and keep you posted

ericvergnaud commented 10 months ago

Hi there, thanks again for your support in this matter.

After experimenting (outside aries) with iOS and react-native, and improving logs on both sides, we've narrowed the disaster scenario as follows:

mobile agent connects to mediator, opens some web sockets
when the mobile app is active, messages sent by the mediator are processed by the mobile agent as expected
when the mobile app is inactive, messages are queued by iOS until the app becomes active again (proven by our experiments)
during inactivity, middleware such as ngninx or AWS ELB close the socket, due to socket inactivity for longer than the configured timeout (typically 60s)
when the app becomes active again, the messages queued by iOS are irremediably lost, because the socket was closed

This is disastrous for fire-and-forget protocols, such as Forward messages.

The 'solution' proposed above in WsInboundTransport.txt consists in keeping the websocket alive by sending pings at regular intervals. This may be viable for server-2-server communications (where the number of peers would be small) but is definitely not for mobile agents, because the server would end up keeping alive thousands of websockets for inactive mobile devices (until the user actually exits the mobile app).

For now, we have implemented a workaround, which is to, rather than send a message blindly, send a ping first, and wait for the corresponding pong before the sending the message itself. We'll see how it works, but we already know it is not bullet-proof because the mobile app could very well become inactive between sending the pong and receiving the message.

By nature web sockets are unreliable so I believe that using fire-and-forget protocols over websocket is likely to frequently fail. I am impatient to see how #76 evolves, and when it becomes available in aries-framework-javascript ?

swcurran commented 10 months ago

Just reading this last note — not the whole stream. Indicio has implemented SocketDock that I think addresses the questions raised here. Perhaps it can help. Quick overview — SocketDock is a component to handle web sockets in front of a scalable component (say, Aries) that can then just use HTTPS to send messages reliably.

genaris commented 10 months ago

Thank you @ericvergnaud , your findings are certainly very, very valuable. It is true that the solution of ping-pongs work well in certain scenarios (like agents going offline) but not the one you mention about apps going background and then foreground in iOS.

Regarding this, I have a question: even if you stop the WsOutboundTransport before the app goes background iOS keeps the WebSocket opened? So there is no way to actually forcing it to close in this platform?

By nature web sockets are unreliable so I believe that using fire-and-forget protocols over websocket is likely to frequently fail. I am impatient to see how #76 evolves, and when it becomes available in aries-framework-javascript ?

100% agree (especially the one about being impatient 😝). I'm working on it and hopefully will be integrated in the next major release (0.5.0).

ericvergnaud commented 10 months ago

@genaris we've tried "stop the WsOutboundTransport before the app goes background". It didn't work well because the underlying socket wasn't getting closed deterministically (remember me mentioning that earlier ?). It could be because transport.close is async, so its execution is queued in the event loop, or it could be that a new socket was opened by another thread (this isn't single JS thread V8!).

ericvergnaud commented 10 months ago

@swcurran thanks for that. Looking into it

genaris commented 10 months ago

It didn't work well because the underlying socket wasn't getting closed deterministically (remember me mentioning that earlier ?). It could be because transport.close is async, so its execution is queued in the event loop, or it could be that a new socket was opened by another thread (this isn't single JS thread V8!).

I don't think the async is a problem, as long as you are calling it like await transport.stop(). In my opinion, based on the mobile logs you shared, when you stop the WsOutboundTransporter, the MediationRecipientModule detects it has disconnected from the mediator so it attempts to re-connect with it, opening a new socket and exchange a some messages (like status request and trust ping) just before the app goes background.

I know this is does not make sense, as when we stop a transporter one would understand that no new sessions may be opened, but for the moment it is working like that.

A way to get around this issue can be to call agent.mediationRecipient.stopMessagePickup() before stopping the outbound transporters. Then you'll need to call agent.mediationRecipient.initiateMessagePickup() when the app goes back to foreground.

ericvergnaud commented 10 months ago

@swcurran at first glance in seems the product is meant to fill a gap for backends without WebSocket support, but they don't say anything about WebSocket transport reliability... Again I suspect the only viable solution is to rely on acknowledgments at the aries protocol level.

ericvergnaud commented 10 months ago

@genaris thanks for this, we'll try it out

TimoGlastra commented 10 months ago

Again I suspect the only viable solution is to rely on acknowledgments at the aries protocol level.

Yes, I think this is how we should solve this. This also makes it agnostic of the transport being used, and means it won't be impacted by any transport level issues. We should probably:

in the receiver side keep a lot of all messages that have been received and processed. If a message is received that was already processed we discard it and don't process it again
On the server side we need to be able to mark messages as 'sent', but don't remove them until they have been acknowledged. If the message hasn't been acknowledged in x time, or until a new session is opened the next time or until a new status request is received, it should be re-sent to the client.

Both sides make sure in this case that messages are not lost, but also not processed multiple times.

ericvergnaud commented 10 months ago

@TimoGlastra indeed the receiver needs to guarantee idempotence i.e. the same message received twice should only be processed once. IMHO this should be managed by the framework.

ericvergnaud commented 10 months ago

@genaris whilst trying out your suggestion we've also tried ping-pong-ing the receiver before sending the message. For those interested, here is our code change in WsInboundTransport.ts:

  public async send(agentContext: AgentContext, encryptedMessage: EncryptedMessage): Promise<void> {
    if (this.socket.readyState !== WebSocket.OPEN) {
      throw new AriesFrameworkError(`${this.type} transport session has been closed.`)
    }
    // websockets get closed by infrastructure after a given timeout (typically 60s)
    // this is expected and desirable, otherwise the number of opened web sockets could become unmanageable
    // but when a mobile app becomes inactive, it stops processing websocket messages until it becomes active again
    // as a result, messages sent whilst the app is inactive are irremediably lost when the websocket is closed
    // in order to minimize the risk of message loss, we do a ping/pong and only send the message when the pong is received
    let success = false;
    let timeoutId: any | null = null;
    const delay = (ms: number, val: any) => new Promise( (resolve) => { timeoutId = setTimeout( resolve, ms ) })
    this.socket.once("pong", () => {
      this.socket.send(JSON.stringify(encryptedMessage), (error? : Error | undefined) => {
        if (error != undefined) {
          const message = `${this.type} send message failed.`
          this.logger.debug(message + " Error: " + error)
          throw new AriesFrameworkError(message, { cause: error })
        } else {
          this.logger.debug(`${this.type} sent message successfully.`)
          success = true;
          clearTimeout(timeoutId)
        }
      })
    })
    this.socket.ping("ping")
    await delay(10000, () => success = false)
    if(!success) {
        const message = `${this.type} send message timed out.`
        this.logger.debug(message)
        throw new AriesFrameworkError(message)
    }
  }

To avoid confusion, I HATE this code :-) But it works i.e. when the mobile app is in the background, the pong is never received, so the error is bubbled up and the message gets queued (corresponding logs in PR #1573 ), and they are sent later when the app becomes active again. So it's a temporary workaround that might be useful to readers, but certainly NOT a fix !!!

dheerajbudhiraja commented 8 months ago

Any reliable solution we got on this?

For Workaround, we did stored connection request message in connection itself, same for credential offer message as retry mechanism.

Whenever new invitation is scanned, we are checking if there is some connection id for that invitation and connection is not in response_received, we are retrieving that connection request message and sending again to issuer.

Same thing is handled on issuer side if connection is not in acked state, resend response back.

There were some changes done in connection request, connection trust ping, credential offer accept flows for this.

Can I raise PR for these changes?

openwallet-foundation / credo-ts

Mobile agent not receiving credentials after reconnecting to the mediator #1570