phoenixframework / phoenix_live_view

Rich, real-time user experiences with server-rendered HTML
https://hex.pm/packages/phoenix_live_view
MIT License

LiveView + Bandit doesn't handle server restarts gracefully #2421

Closed derekkraan closed 1 year ago

derekkraan commented 1 year ago

Environment

Elixir 1.14.1 (compiled with Erlang/OTP 25)

Actual behavior

I am testing bandit (0.6.7; thousand_island 0.5.16) with phoenix.

When my server restarts (during a deploy), the websocket connection is closed with a status 1001.

The websocket connection is not re-established.

Expected behavior

I am expecting the websocket connection to be automatically re-established.

Analysis (FWIW)

I suspect this bit to be the main culprit. Status code 1001 can mean both "server going down" and "browser navigating away". Right now LiveView assumes the second scenario, but with Bandit we are now also getting 1001 for "server going down", which is causing issues.

I'm not sure of a potential solution here. Reconnect after a timeout, to give leaving browsers a chance to actually go? Always reconnect? It's not entirely clear to me at which point a browser stops running the page's JS when clicking away.
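The "reconnect after a timeout" idea could be sketched like this (a sketch only; `scheduleReload`, the delay value, and the return values are hypothetical illustrations, not LiveView API):

```javascript
// Sketch only: delay the reload on close code 1001 so a browser that is
// genuinely navigating away has time to tear the page down before we try
// to reconnect. If the server restarted instead, the timer fires and we
// reconnect after a short grace period. Names and the delay value are
// assumptions, not LiveView API.
const RELOAD_DELAY_MS = 500

function scheduleReload(event, reload){
  if(!event){ return "ignored" }
  if(event.code === 1000){ reload(); return "reloaded" }
  if(event.code === 1001){
    // A page that is actually going away is torn down before this fires.
    setTimeout(reload, RELOAD_DELAY_MS)
    return "scheduled"
  }
  return "ignored"
}
```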

Edit for posterity: The WebSocket Protocol RFC

josevalim commented 1 year ago

Can you please file this issue in bandit instead? Perhaps it’s something they would change on their side.

derekkraan commented 1 year ago

Will do. Thanks.

derekkraan commented 1 year ago

@josevalim I did some further digging; the lines causing this issue in live_socket.js were added in Dec 2022 in this commit.

Changing the block to look like this resolved the issue on my end:

this.socket.onClose(event => {
  // failsafe reload on normal closure (1000) or going away (1001) if we still have a main LV
  if(event && event.code === 1000 && this.main){ return this.reloadWithJitter(this.main) }
  if(event && event.code === 1001 && this.main){ return this.reloadWithJitter(this.main) }
})

I could not detect any spurious attempts to reconnect a websocket when navigating away from the page, in either FF or Chrome. This seems consistent with the standard, https://websockets.spec.whatwg.org/#garbage-collection, which states that if the document is going away, new websocket connection attempts will either be failed or result in a status code 1001. Neither attempt will reach the server, if I am interpreting this correctly.

I'm trying to figure out why this code was changed to work the way it does.

Right now I'm waiting to see if @mtrudel wants to change it in bandit, but I think there is also a case to be made for changing the code in LiveView to the above.

josevalim commented 1 year ago

@derekkraan unfortunately we saw issues in Firefox where clicking links to navigate away would not terminate Websockets in a similar mechanism to Chrome/Safari.

LiveView already has to handle compatibility across all different browsers, it is way too much complexity to ask it to also handle compatibility across different webservers. If Bandit believes 1001 is the best return for them, then perhaps it could be made configurable.

mtrudel commented 1 year ago

Thanks for the overnight analysis everyone! I'll be updating Bandit's behaviour to match Cowboy's in a few minutes.

As something of an aside @josevalim, you'll recall close code behaviour was one of the things that we'd identified as being a hole in the specification when we were working up WebSock (see the later parts of https://github.com/phoenixframework/phoenix/pull/4973 and https://github.com/phoenixframework/phoenix/pull/5063 for prior discussion). I've got some upcoming work planned to land better support for closing connection in WebSock here: https://github.com/phoenixframework/websock_adapter/issues/2 (the work is going to be strictly additive as always, but will provide for more explicit shutdown semantics via WebSock). That may alleviate some of the cross platform concerns around Phoenix support for multiple servers (though it wouldn't have solved this particular case).

I agree that LiveView / Phoenix should not ever have to know or care about which server it's running on; our abstractions have failed if that's the case.

mtrudel commented 1 year ago

@derekkraan unfortunately we saw issues in Firefox where clicking links to navigate away would not terminate Websockets in a similar mechanism to Chrome/Safari.

I've also seen some real oddball connection-level behaviour from Firefox. It's definitely cut from very different cloth than other clients, especially around its WebSocket implementation. An important test candidate, for sure.

mtrudel commented 1 year ago

Fixed in Bandit 0.6.8; get it while it's hot!

mtrudel commented 1 year ago

Far be it from me to suggest anything in Javascript (I barely know how to webpack my polyfilled hooks or transpile a redux shadow), but I've noticed that it's really common behaviour in WebSocket implementations to consider a range of result codes as being 'successful' (e.g. Autobahn does this, as does every client I've looked at). It may be worth doing so in the case that @derekkraan suggested above:

Changing the block to look like this resolved the issue on my end:

this.socket.onClose(event => {
  // failsafe reload on normal closure (1000) or going away (1001) if we still have a main LV
  if(event && event.code === 1000 && this.main){ return this.reloadWithJitter(this.main) }
  if(event && event.code === 1001 && this.main){ return this.reloadWithJitter(this.main) }
})

I spent quite a bit of time trying to thread the needle on this when I was getting Bandit green against Autobahn, and the best I could come up with is to consider anything matching code in 0..999 or code in 1004..1006 or code in 1012..2999 as 'unexpected' and everything else as 'expected'.
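Those ranges can be written as a small predicate (a sketch; the function name is hypothetical, and the ranges are taken verbatim from the description above):

```javascript
// Hypothetical helper mirroring the ranges above: a close code is
// 'unexpected' when it falls in 0..999, 1004..1006, or 1012..2999;
// everything else (including 1000 and 1001) counts as 'expected'.
function unexpectedCloseCode(code){
  return (code >= 0 && code <= 999) ||
         (code >= 1004 && code <= 1006) ||
         (code >= 1012 && code <= 2999)
}
```

Under this rule a client would treat 1000 and 1001 alike as expected closures, which lines up with the change proposed earlier in the thread.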

Just some food for thought.

kevinlang commented 1 year ago

Caddy will also send 1001 when closing the connection during, e.g., a config reload, which causes the same problem of LiveView not gracefully reconnecting for us.

https://github.com/caddyserver/caddy/blob/90798f3eea6b7a219a9bb55f680919c0504263e3/modules/caddyhttp/reverseproxy/streaming.go#L254

I think we will need to change the JS here to the block mentioned above, or similar.

Otherwise my only other courses of action are to switch to a reverse proxy that hopefully returns 1000 in this scenario, or to fork Caddy itself and change it to return 1000.

Edit: I created #2544 to specifically track the issue of Caddy and LiveView playing together.

DianaOlympos commented 1 year ago

Note that it may be worth filing a bug on Firefox to ask them for the thinking behind their behaviour.

This would take time to solve, but after a careful reading of the RFC I think this is an oversight in the RFC. An erratum to that language would be good, but... well.

The fundamental problem is that "server going away" is badly defined and quite different from "navigating away". In particular, on a "server crash" the server may not really be going away, if you think of the server as everything behind the load balancer termination: you can still reconnect to it.

This would be in line with the way RFCs use the "going away" terminology for HTTP.

See https://www.rfc-editor.org/rfc/rfc6455#section-7.4.1

  1001

  1001 indicates that an endpoint is "going away", such as a server
  going down or a browser having navigated away from a page.

So yeah, in theory servers should basically never send 1001.

mtrudel commented 1 year ago

It's a gap in the RFC for sure. When I changed this recently in Bandit I basically hewed to existing practice in the ecosystem, which was to return a 1000. In my heart of hearts I think 1001 is probably the more correct option though.

I'm pretty sure the only way forward is to follow Postel's law as I'd suggested above, and have the client consider a spectrum of return codes as being 'normal'.