taoensso / sente

Realtime web comms library for Clojure/Script
https://www.taoensso.com/sente
Eclipse Public License 1.0
1.73k stars 193 forks source link

Intermittent issue receiving server events on client when switching cellular network #417

Closed Naomarik closed 1 year ago

Naomarik commented 1 year ago

I've been unreliably able to replicate an issue that causes client to no longer receive messages from server until page is refreshed.

This has occurred to me intermittently when my iPhone network goes from switching states between:

This has been intermittent for me in that I can get client to stop receiving server messages several times consecutively, but it will also have multiple rounds of working fine.

This problem also occurs on my PC when tethered to iPhone for internet and switching between 4G/5G.

I've also tried switching states on my PC from ethernet <-> wifi but that has worked reliably.

Other info:

Naomarik commented 1 year ago

Some more information from a recently broken connection after swapping between 4G/5G many, many times.

Adding a watch to chsk-state atom to conj state over time ordered by oldest first:

[{:type :ws,
  :open? true,
  :ever-opened? true,
  :csrf-token
  "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
  :uid 17592186045486,
  :handshake-data nil,
  :first-open? true,
  :time 1676982835857}
 {:first-open? false,
  :uid 17592186045486,
  :time 1676982960860,
  :type :ws,
  :handshake-data nil,
  :ever-opened? true,
  :open? false,
  :csrf-token
  "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
  :last-close {:udt 1676982960860, :reason :unexpected},
  :last-ws-close
  {:udt 1676982960860,
   :ev #object[CloseEvent [object CloseEvent]],
   :clean? false,
   :code 1006,
   :reason ""}}
 {:first-open? false,
  :uid 17592186045486,
  :time 1676982960865,
  :udt-next-reconnect 1676982962437,
  :type :ws,
  :handshake-data nil,
  :ever-opened? true,
  :open? false,
  :csrf-token
  "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
  :last-close {:udt 1676982960860, :reason :unexpected},
  :last-ws-close
  {:udt 1676982960860,
   :ev #object[CloseEvent [object CloseEvent]],
   :clean? false,
   :code 1006,
   :reason ""}}
 {:first-open? false,
  :uid 17592186045486,
  :time 1676982965376,
  :type :ws,
  :handshake-data nil,
  :ever-opened? true,
  :open? true,
  :csrf-token
  "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
  :last-close {:udt 1676982960860, :reason :unexpected},
  :last-ws-close
  {:udt 1676982960860,
   :ev #object[CloseEvent [object CloseEvent]],
   :clean? false, 
   :code 1006, 
   :reason ""}}]

Here is the series of events from monitoring :chsk/state

[{:ch-recv #object[CW [object Object]],
  :send-fn #object[g],
  :state
  #object[cljs.core.Atom {:val {:first-open? false, :uid 17592186045486, :type :ws, :handshake-data nil, :ever-opened? true, :open? true, :csrf-token "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW", :last-close {:udt 1676982960860, :reason :unexpected}, :last-ws-close {:udt 1676982960860, :ev #object[CloseEvent [object CloseEvent]], :clean? false, :code 1006, :reason ""}}}],
  :event
  [:chsk/state
   [{:type :auto,
     :open? false,
     :ever-opened? false,
     :csrf-token
     "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW"}
    {:type :ws,
     :open? true,
     :ever-opened? true,
     :csrf-token
     "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
     :uid 17592186045486,
     :handshake-data nil,
     :first-open? true}]],
  :id :chsk/state,
  :?data
  [{:type :auto,
    :open? false,
    :ever-opened? false,
    :csrf-token
    "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW"}
   {:type :ws,
    :open? true,
    :ever-opened? true,
    :csrf-token
    "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
    :uid 17592186045486,
    :handshake-data nil,
    :first-open? true}]}
 {:ch-recv #object[CW [object Object]],
  :send-fn #object[g],
  :state
  #object[cljs.core.Atom {:val {:first-open? false, :uid 17592186045486, :type :ws, :handshake-data nil, :ever-opened? true, :open? true, :csrf-token "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW", :last-close {:udt 1676982960860, :reason :unexpected}, :last-ws-close {:udt 1676982960860, :ev #object[CloseEvent [object CloseEvent]], :clean? false, :code 1006, :reason ""}}}],
  :event
  [:chsk/state
   [{:type :ws,
     :open? true,
     :ever-opened? true,
     :csrf-token
     "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
     :uid 17592186045486,
     :handshake-data nil,
     :first-open? true}
    {:first-open? false,
     :uid 17592186045486,
     :type :ws,
     :handshake-data nil,
     :ever-opened? true,
     :open? false,
     :csrf-token
     "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
     :last-close {:udt 1676982960860, :reason :unexpected},
     :last-ws-close
     {:udt 1676982960860,
      :ev #object[CloseEvent [object CloseEvent]],
      :clean? false,
      :code 1006,
      :reason ""}}]],
  :id :chsk/state,
  :?data
  [{:type :ws,
    :open? true,
    :ever-opened? true,
    :csrf-token
    "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
    :uid 17592186045486,
    :handshake-data nil,
    :first-open? true}
   {:first-open? false,
    :uid 17592186045486,
    :type :ws,
    :handshake-data nil,
    :ever-opened? true,
    :open? false,
    :csrf-token
    "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
    :last-close {:udt 1676982960860, :reason :unexpected},
    :last-ws-close
    {:udt 1676982960860,
     :ev #object[CloseEvent [object CloseEvent]],
     :clean? false,
     :code 1006,
     :reason ""}}]}
 {:ch-recv #object[CW [object Object]],
  :send-fn #object[g],
  :state
  #object[cljs.core.Atom {:val {:first-open? false, :uid 17592186045486, :type :ws, :handshake-data nil, :ever-opened? true, :open? true, :csrf-token "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW", :last-close {:udt 1676982960860, :reason :unexpected}, :last-ws-close {:udt 1676982960860, :ev #object[CloseEvent [object CloseEvent]], :clean? false, :code 1006, :reason ""}}}],
  :event
  [:chsk/state
   [{:first-open? false,
     :uid 17592186045486,
     :type :ws,
     :handshake-data nil,
     :ever-opened? true,
     :open? false,
     :csrf-token
     "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
     :last-close {:udt 1676982960860, :reason :unexpected},
     :last-ws-close
     {:udt 1676982960860,
      :ev #object[CloseEvent [object CloseEvent]],
      :clean? false,
      :code 1006,
      :reason ""}}
    {:first-open? false,
     :uid 17592186045486,
     :udt-next-reconnect 1676982962437,
     :type :ws,
     :handshake-data nil,
     :ever-opened? true,
     :open? false,
     :csrf-token
     "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
     :last-close {:udt 1676982960860, :reason :unexpected},
     :last-ws-close
     {:udt 1676982960860,
      :ev #object[CloseEvent [object CloseEvent]],
      :clean? false,
      :code 1006,
      :reason ""}}]],
  :id :chsk/state,
  :?data
  [{:first-open? false,
    :uid 17592186045486,
    :type :ws,
    :handshake-data nil,
    :ever-opened? true,
    :open? false,
    :csrf-token
    "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
    :last-close {:udt 1676982960860, :reason :unexpected},
    :last-ws-close
    {:udt 1676982960860,
     :ev #object[CloseEvent [object CloseEvent]],
     :clean? false,
     :code 1006,
     :reason ""}}
   {:first-open? false,
    :uid 17592186045486,
    :udt-next-reconnect 1676982962437,
    :type :ws,
    :handshake-data nil,
    :ever-opened? true,
    :open? false,
    :csrf-token
    "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
    :last-close {:udt 1676982960860, :reason :unexpected},
    :last-ws-close
    {:udt 1676982960860,
     :ev #object[CloseEvent [object CloseEvent]],
     :clean? false,
     :code 1006,
     :reason ""}}]}
 {:ch-recv #object[CW [object Object]],
  :send-fn #object[g],
  :state
  #object[cljs.core.Atom {:val {:first-open? false, :uid 17592186045486, :type :ws, :handshake-data nil, :ever-opened? true, :open? true, :csrf-token "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW", :last-close {:udt 1676982960860, :reason :unexpected}, :last-ws-close {:udt 1676982960860, :ev #object[CloseEvent [object CloseEvent]], :clean? false, :code 1006, :reason ""}}}],
  :event
  [:chsk/state
   [{:first-open? false,
     :uid 17592186045486,
     :udt-next-reconnect 1676982962437,
     :type :ws,
     :handshake-data nil,
     :ever-opened? true,
     :open? false,
     :csrf-token
     "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
     :last-close {:udt 1676982960860, :reason :unexpected},
     :last-ws-close
     {:udt 1676982960860,
      :ev #object[CloseEvent [object CloseEvent]],
      :clean? false,
      :code 1006,
      :reason ""}}
    {:first-open? false,
     :uid 17592186045486,
     :type :ws,
     :handshake-data nil,
     :ever-opened? true,
     :open? true,
     :csrf-token
     "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
     :last-close {:udt 1676982960860, :reason :unexpected},
     :last-ws-close
     {:udt 1676982960860,
      :ev #object[CloseEvent [object CloseEvent]],
      :clean? false,
      :code 1006,
      :reason ""}}]],
  :id :chsk/state,
  :?data
  [{:first-open? false,
    :uid 17592186045486,
    :udt-next-reconnect 1676982962437,
    :type :ws,
    :handshake-data nil,
    :ever-opened? true,
    :open? false,
    :csrf-token
    "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
    :last-close {:udt 1676982960860, :reason :unexpected},
    :last-ws-close
    {:udt 1676982960860,
     :ev #object[CloseEvent [object CloseEvent]],
     :clean? false,
     :code 1006,
     :reason ""}}
   {:first-open? false,
    :uid 17592186045486,
    :type :ws,
    :handshake-data nil,
    :ever-opened? true,
    :open? true,
    :csrf-token
    "rvHxShhbjWaa4NGg1gQPTXHwyteF2d5Qvxx/UaH4kSODiyH5/D5iJ0XBzLUnPwuUucf6mHgf0maYM+mW",
    :last-close {:udt 1676982960860, :reason :unexpected},
    :last-ws-close
    {:udt 1676982960860,
     :ev #object[CloseEvent [object CloseEvent]],
     :clean? false, 
     :code 1006, 
     :reason ""}}]}]
ptaoussanis commented 1 year ago

@Naomarik Hi Omar, thanks for the detailed info!

Haven't had the opportunity to look at this closely yet, but would like to confirm my understanding of what you've shared so long.

Is the following an accurate description of an example of the behaviour you're seeing?:

  1. Sente is initially working fine. Messages from client->server and server->client are received without issue.

  2. This working state persists until the client device (iPhone) switches networks (e.g. Wifi->Cellular).

  3. At that point it may happen that: 3a. Messages from client->server stop being received. 3b. Messages from server->client stop being received.

  4. This broken state persists indefinitely, until the page is refreshed on the client.

My questions:

There's at least a few different things that could be going on, and the above info should help lead us to the right general area.

Naomarik commented 1 year ago

1-4 is accurate except for client->server DOES continue working.

Once in broken state

Some additional notes

So to summarize in particular what seems to be broken is the server's send-fn made by make-channel-socket! when choosing a uid to send to. I have already tested both setting an actual user's ID from database and leaving it to the default client id.

Naomarik commented 1 year ago

Quick update: I exposed the conns_ atom to a GET request and found this. The 3 clients that have nils all do not receive messages from server. The 4th one is a new tab and receiving messages correctly.

 {:ws
  {17592186046201
   {"e4bbe8b8-82f6-4371-bb7f-65486bd1e40d" "[nil 1677102871278]",
    "53b5c7f9-6249-4157-b661-a515ee1f15dc" "[nil 1677102836257]",
    "2ceff709-7269-4349-bd0f-326763d9a013" "[nil 1677102878342]",
    "5166004f-87cc-499b-a0dd-229561a54277"
    "[#object[io.undertow.websockets.core.protocol.version13.WebSocket13Channel 0x65d0947c \"WebSocket13Channel peer /127.0.0.1:55998 local /127.0.0.1:4000[ No Receiver [] -- [] -- []\"] 1677102883344]"}},
  :ajax {}}
ptaoussanis commented 1 year ago

Thanks a lot for the extra info, busy going through your info from here and Telegram.

A work-in-progress summary so far (will keep updated):

  1. Sente initially working fine. client->server, server->client calls work fine.

  2. Working state persists until the client device (iPhone) switches networks (e.g. Wifi->Cellular).

  3. At that point it MAY (sometimes, not always) enter a broken state.

  4. Broken state persists indefinitely, until the page is refreshed on the client.

While in broken state:

Naomarik commented 1 year ago

So this is happening easily by just switching IP.

Connection on first IP. Notice websocket peer address is 49342

2023-02-23 17:43:24,796 [XNIO-3 task-2] INFO  naomarik.proj.web.routes.ws - [:ip-addr 2001:8f8:1d66:c751:4405:3f94:70fb:*7d8]
2023-02-23 17:43:24,798 [XNIO-3 task-2] INFO  proj.sentee - [:debug/on-open 17592186046209 fb891a6e-bf82-4c16-a94d-87e34ee36440 #object[io.undertow.websockets.core.protocol.version13.WebSocket13Channel 0x31892166 WebSocket13Channel peer /[0:0:0:0:0:0:0:1]:**49342** local /[0:0:0:0:0:0:0:1]:4000[ No Receiver [] -- [] -- []]]

Browser tab is still open, computer's internet switches from phone tethering -> home wifi.

Connection on second IP. Notice websocket peer address is 54406

2023-02-23 17:44:39,424 [XNIO-3 task-2] INFO  naomarik.proj.web.routes.ws - [:ip-addr *6.17.244.29]
2023-02-23 17:44:39,427 [XNIO-3 task-2] INFO  proj.sentee - [:debug/on-open 17592186046209 fb891a6e-bf82-4c16-a94d-87e34ee36440 #object[io.undertow.websockets.core.protocol.version13.WebSocket13Channel 0x2a6f60fc WebSocket13Channel peer /127.0.0.1:54406 local /127.0.0.1:4000[ No Receiver [] -- [] -- []]]

Disconnection from client from IP address a bit over a minute later, peer address from first IP 49342

2023-02-23 17:45:44,224 [async-dispatch-1] INFO  proj.sentee - [:DISCONNECTING [fb891a6e-bf82-4c16-a94d-87e34ee36440 #object[io.undertow.websockets.core.protocol.version13.WebSocket13Channel 0x31892166 WebSocket13Channel peer /[0:0:0:0:0:0:0:1]:**49342** local 0.0.0.0/0.0.0.0:4000[ No Receiver [] -- [] -- []]]]

This nils the sch in the [sch udt] vector from the conns_ atom and server->client no longer works. Client->server works fine.

ptaoussanis commented 1 year ago

Excellent summary and debugging work by @Naomarik:

Client connects to server with a unique client ID. That is used as the primary key for that specific browser tab to identify it. The websocket channel (or sch in sente code) and udt are the key's value, so we have {<client-id [<websocket channel (port assigned to first IP)> udt]}.

The websocket channel maps onto a port on the server of which 65,535 are available. So the websocket channel has a distinct port assigned to each connection. When the IP address changes suddenly, there is no immediate on-close event that gets sent. Instead it times out after about 60 seconds.

At some point in the future, client sente reconnects with the SAME CLIENT ID as before, but with a new ip address. Now we have {<client-id [<websocket channel (port assigned to second IP)> udt]}.

When timeout occurs, the client ID gets dissoc'd, removing the reconnected websocket channel and any hope for the server to initiate contact to its beloved client, but vice versa works fine. Later on, upd-conn! occurs to update the udt in the first arity, so there is a {<client-id [nil <udt>]}.


I'll prepare a fix 👍

ptaoussanis commented 1 year ago

Will be fixed in forthcoming v1.18 👍 Leaving issue open until the fix is confirmed.

ptaoussanis commented 1 year ago

Fix has been confirmed by reporter, closing 👍