taoensso / sente

Realtime web comms library for Clojure/Script
https://www.taoensso.com/sente
Eclipse Public License 1.0
1.73k stars 193 forks source link

On reconnect, dangling requests made with the old connection are waiting till timeout #413

Closed p-himik closed 1 year ago

p-himik commented 1 year ago

Sente 1.17.0.

To reproduce with Aleph:

  1. Open a client-server connection with default parameters
  2. Send a message larger than 64 kB from the client to the server and specify some timeout and a logging callback function
  3. Notice that on the server, there's an exception:
    WARN  aleph.http.server [aleph-netty-server-event-pool-4-2]
    error in websocket handler
    io.netty.handler.codec.http.websocketx.CorruptedWebSocketFrameException: Max frame length of 65536 has been exceeded.
    at io.netty.handler.codec.http.websocketx.WebSocket08FrameDecoder.protocolViolation(WebSocket08FrameDecoder.java:424)
    at io.netty.handler.codec.http.websocketx.WebSocket08FrameDecoder.decode(WebSocket08FrameDecoder.java:286)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:487)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:385)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:833)

Expected behavior:

The callback function is called immediately with :chsk/closed.

Actual behavior:

The callback is called only after the timeout with :chsk/timeout.

ptaoussanis commented 1 year ago

@p-himik Hi Eugene, thanks for this!

I'm not too familiar with Aleph, and haven't used its adapter myself - but there's some mention of this max-frame-size in the commit message here:

[Aleph adapter] Add support for websocket-connection options The use-case is e.g. if you want to send larger WS messages. The max-frame-payload is 65536 bytes and max-frame-size is 1048576 bytes by default. It seems, the disassembly/ reassembly of frames isn't supported in some scenarios but there is still need for larger data transfer over WS. Therefore the easiest solution is to just increase the max-frame-payload size. Aleph currently has 7 options to configure more appropriate settings in less common circumstances.

Hope this helps?

p-himik commented 1 year ago

Right, that's what I did - increased max-frame-payload to satisfy my current needs.

But the issue is not about this particular exception but rather about the fact that the "send a message and get the connection closed by the server" workflow results in a timeout instead of an immediate error.

ptaoussanis commented 1 year ago

@p-himik So just to clarify:

  1. This server exception does not occur when you increase max-frame-payload.
  2. Your point is that when max-frame-payload is too low and the server exception does occur, you'd want the client's request to immediately return :chsk/timeout.

Am I understanding that correctly?

If that's the case, could you explain a little what the motivation is exactly? Are you expecting to see this error in production once max-frame-payload is tuned?

As I mentioned, I'm not too familiar with Aleph or its adapter. I guess Aleph might not be closing the requesting connection when such an error occurs? If that's the case, I guess it should be possible for the Aleph adapter to catch Aleph server exceptions like this one and explicitly close the requesting connection.

Though it's not obvious to me that that'd necessarily always be desirable assuming the connection is otherwise still healthy? This may relate to your title though, which I didn't understand-

On reconnect, dangling requests made with the old connection are waiting till timeout

It's not completely obvious to me from your issue text what exactly this issue title means. Could you please rephrase, and/or explain in a little more detail exactly what's going on?

Thanks!

p-himik commented 1 year ago
  1. That's correct
  2. Almost - :chsk/closed, as the OP states in "Expected behavior". Because it seems that the server closes the connection - there's a new WS request in the Network tab of my browser's devtools

To demonstrate that the problem is not related to Aleph, I created an example based on http-kit: https://github.com/p-himik/sente-timeout-error Well, attempted to create, best to my abilities. I'm still not sure it's exactly the same because every server library has its own notion of channels and streams which can be used differently.

ptaoussanis commented 1 year ago

Okay, so let me confirm my current understanding then:

What does the issue title mean?:

On reconnect, dangling requests made with the old connection are waiting till timeout

Where does reconnection and dangling requests come into this?

(Sorry, I'm struggling to understand a little- just want to make sure we're on the same page).

p-himik commented 1 year ago

This issue isn't about Aleph, or the specific "Max frame length" error you mentioned. We can ignore this.

Yes.

Instead, any time any Sente server throws an exception for any request, you'd expect the client to get a :chsk/closed event rather than a :chsk/timeout event? Or only when the server specifically closes the connection in response to an exception?

Yes, except that instead "throws an exception for any request" it should be "closes a connection as a result of any request". It just so happens that that's what Aleph does in the case of a message exceeding max frame payload - catches an exception from Netty, reports it, and closes the connection. Some other server might close a connection for some other reason that might have nothing to do with exceptions.

What does the issue title mean?

It could probably be worded better. "When the underlying server closes the connection, any dangling requests made with Sente wait till timeout" or something like that. I'm not that good at titles.

ptaoussanis commented 1 year ago

Okay, great- then clear so far ๐Ÿ‘

And to confirm: what's your motivation for specifically wanting a :chsk/closed rather than :chsk/timeout return value? Is your concern the return value, or the time needed to wait for the timeout to trigger?

Is there a specific use case you're having trouble with?

p-himik commented 1 year ago

The main concern is being able to show the user a reasonable error message.

A delayed timeout error, as is the case right now, suggests that a user should try again - even though no amount of attempts will fix the issue since some code changes are needed. An immediate timeout error is confusing - after all, the request was made just now, so a timeout is the last thing one would expect. A :chsk/closed error would let me show something like "The connection was unexpectedly closed during the request - please try again and, if the error persists, contact the administrator." Which, I think, as reasonable as it could be in this case.

ptaoussanis commented 1 year ago

Okay, gotcha. This context is helpful in understanding what your objective is ๐Ÿ‘

So it's some time ago that I wrote Sente, and I don't have the chance to look through the code right now - but my recollection is that reliably detecting a closed connection from the client's side can be difficult, and may not always possible. Different servers and clients (esp. browsers) may have different behaviour.

So :chsk/timeout is used as a reliable but high-level (non-specific) client-side indication that "something went wrong". Literally: "the request timed out before getting a response".

No semantics or meaning beyond that are implied. Reasons for a :chsk/timeout may include temporary network issues, permanent network issues, temporary server issues, permanent server issues, etc.

A timeout by itself doesn't imply that a retry will/not be helpful.

[...] even though no amount of attempts will fix the issue since some code changes are needed.

Note that due to auto reconnects, polling, etc. - even if the trouble was caused by a server-side issue, it could have been a temporary server-side issue - or it could be an issue with only a subset of servers in a distributed set, etc. I.e. an immediate retry may still be successful even for a server-side issue.

My recommendation here is to try return something meaningful from the server when possible. I.e. don't rely on the connection layer to convey application-level (user-message) semantics.

For example, try ensure that your server handlers catch exceptions and return a meaningful error to the client that can drive a specific informative error message to the user. This can often be conveniently done with a middleware-like wrapper. A convention I often use is for every chsk endpoint to always return a map of form {:okay <payload>} or {:error <payload>}.

Then any non-200 or :chsk/timeout is essentially a general unexpected error, that may/not go away after retrying.

There's some nuances to this if you have resources that you also want to provide to clients outside of your control, but that's a larger topic.

And note that this approach of course depends on the server to not automatically close the connection, which may depend on the kind of error you're encountering. A trapped application-level exception should generally not close the connection. The particular error you saw here with Aleph might not be easily trapped at the application level or adapter level, not sure.

Hope this helps?

p-himik commented 1 year ago

my recollection is that reliably detecting a closed connection from the client's side can be difficult, and may not always possible

In this particular case, Sente definitely knows that the connection was dropped - because it immediately reconnects. That's why my reproduction code has that js/Number.MAX_SAFE_INTEGER as a timeout in there, so that a reconnect doesn't make it harder to understand what's happening. That's why it still seems to me that Sente can handle it in a better way - if it knows that there's been a reconnection, all pending client->server requests could be answered right away with :chsk/closed because, well, the original connection has been closed.

My recommendation here is to try return something meaningful from the server when possible.

When the connection layer fails before a value is returned, how can I return something meaningful? My server code doesn't even know that there's been a failure in the case with max frame payload. It doesn't even know there's been a request in the first place.

try ensure that your server handlers catch exceptions and return a meaningful error to the client that can drive a specific informative error message to the user

Aleph catches those exceptions and logs them - they don't make it outside of Aleph. Of course, it could be reasoned that it's a flaw in Aleph - I don't know enough of the background to argue. But it's a possibility that goes beyond web servers, as I demonstrated with http-kit. It might even be caused by a browser itself - IIRC, at least in some versions of Chrome messages > 10 MB used to drop the connection. Not sure what the state of affairs here right now.

A trapped application-level exception should generally not close the connection.

Absolutely - everything that can be caught, I do catch and handle, without dropping the connection.

ptaoussanis commented 1 year ago

Thanks for the info ๐Ÿ‘ Can reply in more detail, but want to check back on the main objective before we go much deeper:

The main concern is being able to show the user a reasonable error message.

So we've got two possible cases:

  1. [Current case] Sente callback handlers called with :chsk/timeout on any issue with getting reply.
  2. [Alternative] Sente callback handlers called with :chsk/closed when server closed request, or :chsk/timeout otherwise.

Let's just assume for now that (2.) can be done reliably.

For case (1.) the typical user message would be something like "ERROR: failed to get reply from server. Please check your connection, and/or try again later. If this error persists, please [do x].".

Could you give me an example of the improved user message/s you'd expect to display in case (2.)?

p-himik commented 1 year ago

An immediate connection drop is almost certainly a symptom of there being something wrong with the server. On the other hand, a timeout is usually a symptom of the server being overloaded, at least in my experience.

Even if a user doesn't report anything, errors of both kinds will make it into the logs, so eventually I'll see them. However, by myself I can't rely on :chsk/closed being a bad thing - it's only bad if a user sees it when it's not expected. But I can rely on a plethora of :chsk/timeout being a bad thing - it's an actionable error, so I don't need any user input to notice and tackle it.

With that in mind, a user-level message that differentiates between :chsk/closed and :chsk/timeout is helpful because then a user would have a clear signal that something definitely should be reported. A timeout doesn't really need to reported, so the error message should be different IMO.

ptaoussanis commented 1 year ago

An immediate connection drop is almost certainly a symptom of there being something wrong with the server. On the other hand, a timeout is usually a symptom of the server being overloaded, at least in my experience.

I'm not sure this is accurate though, and is why I was asking for an example of user message/s you'd expect to present.

Neither :chsk/closed nor :chsk/timeout reliably imply anything concrete about the underlying cause. Both can occur for various reasons. So it's not obvious to me what kind of user message you'd be able to present for :chsk/closed that would be an improvement over the one I mentioned earlier for :chsk/timeout.

To summarise my current understanding: this discussion is essentially a proposal to call request handlers with :chsk/closed in some circumstances rather than :chsk/timeout.

As I understand it, the proposal's value depends on these assertions:

  1. Distinguishing between :chsk/timeout and :chsk/closed is useful if it's possible to reliably distinguish.
  2. It's possible to reliably distinguish.
  3. [Secondary] The possible usefulness would justify the costs involved (in development time and complexity).

I certainly may be wrong, but my initial/hunch assessment is that each one of these assertions seems unlikely to be true.

Assertion 1: distinguishing between :chsk/timeout and :chsk/closed is useful if it's possible to reliably distinguish

This assertion could be easily established with a good example (existence proof), though I'm not confident that we have one yet.

a user-level message that differentiates between :chsk/closed and :chsk/timeout is helpful because then a user would have a clear signal that something definitely should be reported

But what is the clear signal that you'd be reporting for :chsk/closed vs :chsk/timeout? Both of these would have a variety of different possible causes.

(Note: as it is now, :chsk/closed actually only has a single clear cause since it's reserved for the case of trying to send a request on a known-closed connection.)

Assertion 2: it's possible to reliably distinguish

In this particular case, Sente definitely knows that the connection was dropped - because it immediately reconnects. [...] That's why it still seems to me that Sente can handle it in a better way

Note that the first point doesn't necessarily imply the second in general.

Like I mentioned, my concern isn't that you can't detect this closure in this case with your current server and browser. But reliably detecting connection closures may depend on at least the server, adapter, client (esp. browser), proxies, connection mode (e.g. Ajax vs WebSocket), etc. (Disclaimer: again, from my recollection). In the worst case, some of this behaviour may even change over time?

Let's say Sente could provide you with an immediate :chsk/closed event in this case, but not reliably in all cases. Would you even want to code your UI to depend on the behaviour?

The benefit of :chsk/timeout I was pointing out earlier, is that it's completely reliable: it doesn't depend on the server, on proxies, or on the client environment. You know how it behaves in all environments, you know what it means, and you know it's reliable. It's not highly specific, but it is highly reliable and consistent.

Assertion 3: the possible usefulness would justify the costs involved (in development time and complexity)

Ultimately this'd need to be the hurdle to overcome if putting work into this.


Conclusion

To be clear: I'm definitely not opposed to any changes if you or anyone else wants to propose some! Just suggesting that we try be clear on on:

And reminding again that I may be mistaken, so welcome corrections if anything I've said is incorrect. Hope that seems reasonable / makes sense!

Cheers :-)

ptaoussanis commented 1 year ago

Closing due to inactivity as part of issue triage. Please feel free to re-open if this issue is still relevant, thanks!