Resubscribe to resources if the WebSocket connection is closed unexpectedly

jholleran commented 4 years ago

In any environment, WebSocket (WS) connections can close unexpectedly for various reasons. In some deployments, connections are also closed when there is no recent traffic. For example, an AWS Elastic Load Balancer in front of a server will close idle connections (the default is 60 seconds, but it can be increased to a maximum of 4000 seconds). Nginx will also close idle connections by default.

Clients then need to reconnect and resubscribe to the resources they are interested in. It would be great if the specification described how a client should recover in situations like this.

For example, on a WS close event the client could attempt to reconnect to the WS endpoint a number of times, with an exponentially increasing back-off (1, 2, 4, 8, 16, 32 seconds) between attempts. Once the connection is restored, it sends the subscribe message again for each resource it is interested in.
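A rough sketch of that reconnect loop, to make the suggestion concrete. Everything here is an assumption for illustration: `connect` is a hypothetical callable that returns a `send(message)` function on success or `None` on failure, the `sub <resource>` message format follows the current draft as quoted in this thread, and this version waits before every attempt, including the first.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def backoff_delays(base: float = 1.0, factor: float = 2.0, attempts: int = 6) -> Iterator[float]:
    """Yield the wait before each attempt: 1, 2, 4, 8, 16, 32 seconds by default."""
    delay = base
    for _ in range(attempts):
        yield delay
        delay *= factor


def reconnect_and_resubscribe(
    connect: Callable[[], Optional[Callable[[str], None]]],
    resources: Iterable[str],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry with exponential back-off; on success, resubscribe to every resource.

    `connect` is hypothetical: a real client would wrap its WebSocket library here.
    Returns True if the connection was restored, False if all attempts failed.
    """
    for delay in backoff_delays():
        sleep(delay)
        send = connect()
        if send is not None:
            for resource in resources:
                send(f"sub {resource}")
            return True
    return False
```

Passing `sleep` as a parameter is only there so the loop can be exercised without actually waiting; a real client would probably also add jitter and reset the back-off after a successful connection.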

Guidance on how to keep WS connections open would also be valuable. Making sure the client and server send heartbeat messages would prevent connections from going idle. See:

https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers#Pings_and_Pongs_The_Heartbeat_of_WebSockets

https://tools.ietf.org/html/rfc6455#section-5.5.2
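As a minimal sketch of the heartbeat idea, the timing decision could look like the following. The class name, the `safety_factor` margin, and the use of plain numeric timestamps are all assumptions for illustration; the 60-second default mirrors the load balancer idle timeout mentioned above. Actually sending the Ping frame is left to whatever WebSocket library the client uses.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatTimer:
    """Decides when a ping is due so the connection never looks idle to a proxy."""

    idle_timeout: float = 60.0   # idle limit of the intermediary, in seconds
    safety_factor: float = 0.5   # assumed margin: ping at half the idle limit
    last_traffic: float = 0.0    # timestamp of the last message sent or received

    @property
    def interval(self) -> float:
        return self.idle_timeout * self.safety_factor

    def record_traffic(self, now: float) -> None:
        """Call whenever any frame is sent or received."""
        self.last_traffic = now

    def ping_due(self, now: float) -> bool:
        """True once the connection has been quiet for a full interval."""
        return now - self.last_traffic >= self.interval
```

A client loop would call `ping_due(now)` periodically and, when it returns `True`, send a Ping and call `record_traffic(now)`.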

All of this would help reduce the number of notifications missed by a client application.

RubenVerborgh commented 4 years ago

Agreed with the idea of closing it; no need to leave things open.

However, this also brings up a problem in the design of the current protocol. Currently, a client would

sub /foo/bar

and receive

pub /foo/bar

upon change.

However, if the socket connection is lost, the following timeline could happen:

t=0 WebSocket connection is made
t=1 sub /foo/bar
t=61 socket closed
t=62 /foo/bar is updated
t=63 reconnection
t=64 sub /foo/bar

So the client was never notified of the update that happened at t=62, because it was disconnected right before. This would mean that either:

the client should avoid losing the connection by sending keepalive messages;
the client needs to assume all resources can have changed when losing connection;
we need to design the protocol differently.

The first cannot cover all cases (involuntary loss), the second is undesired because of bandwidth, which leaves us with the third.
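To illustrate, the timeline above can be replayed against a toy in-memory model. `Server` and `Client` here are hypothetical stand-ins, not the real protocol implementation; the only behaviour assumed is that a closed socket loses its subscriptions and that updates are pushed only to current subscribers.

```python
from typing import Dict, List, Set


class Client:
    def __init__(self) -> None:
        self.received: List[str] = []


class Server:
    """Toy stand-in for the notification server."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, Set[Client]] = {}

    def sub(self, client: Client, resource: str) -> None:
        self.subscribers.setdefault(resource, set()).add(client)

    def drop(self, client: Client) -> None:
        # A closed socket takes all of its subscriptions with it.
        for clients in self.subscribers.values():
            clients.discard(client)

    def update(self, resource: str) -> None:
        # Only clients subscribed at this moment are notified.
        for client in self.subscribers.get(resource, set()):
            client.received.append(f"pub {resource}")


server, client = Server(), Client()
server.sub(client, "/foo/bar")   # t=1  sub /foo/bar
server.drop(client)              # t=61 socket closed
server.update("/foo/bar")        # t=62 /foo/bar is updated
server.sub(client, "/foo/bar")   # t=64 sub /foo/bar after reconnecting
missed = list(client.received)   # snapshot: nothing was delivered for t=62
server.update("/foo/bar")        # only later updates reach the client
```

After the replay, `missed` is empty even though the resource changed at t=62, which is exactly the gap described.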

I've never made it a secret that I am highly suspicious of the current draft protocol, because it was never properly designed and vetted, so this is yet another argument.

A possible fix for a future protocol could be to include an eTag-like value, for example:

update /foo/bar etag:1e646636-50ad-45c8-b67c-c17f5691d215

However, rather than patching up the draft protocol, I think we should properly design one.

csarven commented 4 years ago

gklyne commented 4 years ago

@RubenVerborgh Agree about the race condition issue.

As it stands, as an application designer, I'd assume your second option.

I assume your concern here is bandwidth used for re-reading the resources. I guess one could use existing HTTP etag mechanisms to avoid unnecessary reloads of the resource data, but that still requires additional round-trips for the extra GETs.

A simple option would be for every subscription to provoke an immediate pub response (which I think would help to keep client-side code flows simpler), but means more bandwidth used for maybe-unneeded notifications.

Maybe both SUB and PUB messages could include an etag-like value (or hash): if the server's tag has changed from the one provided in a SUB, it can send an immediate PUB with the new tag? For backwards compatibility, if PUB doesn't include a tag value, then fall back to the "simple" option above?
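A minimal sketch of how the server side of that might behave, under stated assumptions: `TaggedServer` is hypothetical, the `etag:` suffix borrows the syntax from the example earlier in this thread, and a SUB without a tag is treated as a legacy client that gets the "simple" immediate untagged PUB.

```python
from typing import Dict, Optional


class TaggedServer:
    """Toy server keeping one opaque tag per resource (for illustration only)."""

    def __init__(self) -> None:
        self.tags: Dict[str, str] = {}

    def update(self, resource: str, tag: str) -> None:
        self.tags[resource] = tag

    def sub(self, resource: str, client_tag: Optional[str] = None) -> Optional[str]:
        """Return the immediate PUB to send, or None if no PUB is needed."""
        current = self.tags.get(resource)
        if current is None:
            return None  # unknown resource: nothing to publish in this sketch
        if client_tag is None:
            # Assumed legacy client: always publish, without a tag.
            return f"pub {resource}"
        if client_tag != current:
            # The resource changed while the client was away.
            return f"pub {resource} etag:{current}"
        return None  # client is already up to date
```

With this shape, a client that reconnects and sends its last known tag learns about a missed update straight away, without an extra GET per resource.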

Questions to consider about such a scheme:

in "backward compatibility" mode, include a tag value anyway with PUB messages (not so compatible - some clients may choke on unexpected extra data)?
align tag value with etag value associated with the corresponding resource?

(Something I've found in past work with applications that use real-time messaging/events is that it is important to ensure that multiple notifications of the same event are effectively idempotent. You can't avoid the possibility of either missing a notification or getting a notification twice. So I design to send possibly-superfluous notifications, and have the receiver check whether a notification actually needs further action.)
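On the receiving side, that idempotency check might look something like this. It assumes notifications carry a tag, which the current draft does not provide, and `IdempotentReceiver` and its `reloads` counter are hypothetical names standing in for whatever work the application does on a change.

```python
from typing import Dict


class IdempotentReceiver:
    """Acts on a notification only if it reports a tag not seen before."""

    def __init__(self) -> None:
        self.seen: Dict[str, str] = {}  # resource -> last tag acted on
        self.reloads = 0                # stand-in for re-fetching the resource

    def on_notification(self, resource: str, tag: str) -> bool:
        """Return True if the notification triggered work, False if superfluous."""
        if self.seen.get(resource) == tag:
            return False  # duplicate or redundant notification: ignore
        self.seen[resource] = tag
        self.reloads += 1
        return True
```

Delivering the same notification twice then costs one dictionary lookup rather than a second reload.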

solid / notifications

Resubscribe to resources if the WebSocket connection is closed unexpectedly #7