taoensso / sente

Realtime web comms library for Clojure/Script
https://www.taoensso.com/sente
Eclipse Public License 1.0
1.73k stars 193 forks source link

Some dead websocket connections are never cleaned up #431

Closed krajj7 closed 11 months ago

krajj7 commented 12 months ago

After upgrading from sente 1.17.0 to 1.18.1 I noticed that my app is accumulating dead websocket connections in the connected-uids atom that never go away, even after trying to close them by sending [:chsk/close]. One case when this seems to happen is when the browser is killed non-gracefully, ie. kill -9.

This can be simulated by using the example project in the repo (I am assuming Linux and using Chromium as the browser):

1) git checkout https://github.com/ptaoussanis/sente.git; cd sente/example-project; lein cljsbuild once 2) (for my JVM I have to add the dependency [com.google.guava/guava "25.1-jre"] in project.clj or it doesn't work) 3) lein repl 4) run (-main) 5) Navigate your browser to the page printed in the REPL ("Web server is running at ...") 6) Make sure it made a websocket connection (I haven't tested AJAX) 7) Log in with some user-id on the page 8) Running (:ws @connected-uids) in the REPL should show that user-id 9) Kill the browser non-gracefully by running killall -KILL chromium 10) Try to disconnect the client: (chsk-send! <uid> [:chsk/close]) 11) Wait 30 seconds and check (:ws @connected-uids)

With the example project from version 1.17.0 the client disappears from (:ws @connected-uids) after only a few seconds. With version 1.18.1 the client seems to stay indefinitely.

I realize that using :chsk/close is undocumented, as far as I understand, unresponsive clients should eventually be disconnected automatically, but this never seems to happen with 1.18.1 (when the browser is killed non-gracefully as above). With 1.17.0 automatic disconnects mostly worked but I found some cases where an explicit close was needed, which is why I'm using it.

ptaoussanis commented 12 months ago

@krajj7 Hi Jan,

I'm having difficulty reproducing this so far. I'm consistently seeing (:ws @connected-uids) displaying the expected #{} a couple seconds after the forced kill.

Are you 100% sure that you're running Sente 1.18.1 and not 1.18.0? Have you tried running lein clean before your build?

You can check the sente/sente-version var to make sure you don't have a stale build artifact, etc.

I realize that using :chsk/close is undocumented

Correct, but it should still work fine.

as far as I understand, unresponsive clients should eventually be disconnected automatically

Correct. The server has a loop that regularly tries to ping the client if the connection has been idle. That ping attempt should trigger a connection close if it fails.

If you've confirmed that you're really running 1.18.1 and are still seeing the issue - did you make any other changes to the reference example? E.g. did you change the default web server from http-kit, etc.?

Thanks!

krajj7 commented 12 months ago

Hi, thank you for looking into this.

I re-tried today and found that I also cannot reliably reproduce with the given instructions, I am sorry about that. I can still trigger stuck connections with a few minutes of trying, but frustratingly, I haven't been able to find a reliable trigger.

I am 100% sure I'm using 1.18.1 and I didn't make any relevant changes in the example project. The only changes I made was the added guava dependency and commenting out auto-opening of the browser, since (.browse (java.awt.Desktop/getDesktop) (java.net.URI. uri)) causes an UnsupportedOperationException on my machine.

I was most successful triggering the problem with these steps (in the example project):

1) Start the example project server, open the web page, make sure it's using a websocket connection 2) Log in as "a" 3) Freeze the browser by running killall -STOP chromium 3) Spam the browser with a couple messages from the REPL: (doseq [x (range 250)] (chsk-send! "a" [:test/test])) 4) Un-freeze the browser by doing killall -CONT chromium 5) Quickly run (chsk-send! "a" [:chsk/close]) in the REPL while the queued-up messages are processing. The browser eventually reconnects, but sometimes this new connection becomes "immortal". 6) Close the browser or log in with a different UID 7) After a few seconds check (:ws @connected-uids). If the problem was triggered "a" will remain there forever.

The timing of step 6 seems to be critical. If it's done too late or too early the problem doesn't manifest. The message spam doesn't always seem to be neccessary, I'm not sure if it matters or not.

I'm sorry about these unreliable instructions. If possible please give it 5 minutes and let me know if you got a stuck connection. If not I will try to dive deeper and see if I can find the cause or something more helpful.

krajj7 commented 12 months ago

I'll also add that besides the example project and chromium, I also observed this problem in my larger app which uses http-kit, the connections were non-localhost and it happens with Firefox as well.

ptaoussanis commented 12 months ago

@krajj7 Hi Jan, thanks for the additional info. It's helpful to know if it's intermittent and/or seems to involve some kind of timing issue.

That indicates a different kind of problem than if it's consistently reproducible.

Will investigate further and come back to you 👍

ptaoussanis commented 12 months ago

@krajj7 Really struggling to reproduce this on my end. And have gone over the relevant code a few times now without spotting any obvious explanation so far.

I'm sorry to prod on this, but would you please humour me and just explicitly confirm that you've run lein clean and confirmed that you're seeing [1 18 1] when you check sente/version?

1.18.0 had a known issue with connected-uids_ not being properly cleared after disconnects, and I just know from experience that it's super easy to unexpectedly end up with stale artifacts.

I need to get some sleep, but will continue digging first thing in the morning - I might be missing something obvious atm.

Apologies for all the trouble on this, and appreciate your assistance debugging!

Cheers

ptaoussanis commented 12 months ago

Quick update: while I still haven't been able to successfully reproduce on my end yet, I have a couple ideas for some of the logic that could be made more robust. Will prepare a debug build with some improvements and extra logging and come back to you.

krajj7 commented 12 months ago

I'm sorry to hear you weren't able to reproduce it. My repro instructions must still be missing something crucial, but I don't know what it is.

I can confirm I get [1 18 1] when I run sente/sente-version. I don't have 1.18.0 on my system at all. I managed to reproduce this after doing "lein clean" and also from a completely fresh checkout of the repository.

I'll try to look "under the hood" a bit more and see if I can discover something.

ptaoussanis commented 12 months ago

I can confirm I get [1 18 1] when I run sente/sente-version. I don't have 1.18.0 on my system at all. I managed to reproduce this after doing "lein clean" and also from a completely fresh checkout of the repository.

Thanks Jan 🙏

I'll try to look "under the hood" a bit more and see if I can discover something.

I think I've spotted a possible cause of the trouble you're seeing, trying to confirm - will come back to you shortly.

ptaoussanis commented 12 months ago

When you get an opportunity, would you please try [com.taoensso/sente "1.18.2-SNAPSHOT"], now on Clojars?

The dev reference project uses this version.

My hope is that the problem should be fixed here, but if it's not - could you please share the server-side logging output from that reference project?

Thanks so much for all your time+effort on this! Really sorry about all the trouble.

krajj7 commented 12 months ago

Thanks so much for all your time+effort on this! Really sorry about all the trouble.

No problem at all, I'm glad to assist.

I tried "1.18.2-SNAPSHOT" and managed to trigger the issue.

The log is here: https://github.com/ptaoussanis/sente/files/12002682/bug.log (the problematic session starts at 2023-07-10T13:06:11.290Z, I accidentally included the previous one as well)

This is roughly what I did:

A few things to note:

Hope this helps. I'll be happy to do more experiments if needed.

ptaoussanis commented 12 months ago

Hope this helps. I'll be happy to do more experiments if needed.

That did help, thanks! And much appreciated 🙏

Would you please try again with the current (updated) [com.taoensso/sente "1.18.2-SNAPSHOT"]?

krajj7 commented 12 months ago

The new example project does (set-min-log-level! :report) which inhibits logging (I couldn't see what port the server used to open the page), so I changed that to (set-min-log-level! :trace)

Anyway the issue seems to be much easier to trigger now. I get stuck connections just by re-logging in with different user-ids, without doing anything else.

I made two logs, the first is really short: https://github.com/ptaoussanis/sente/files/12004962/1.18.1-2_test1.log

1) opened example page, logged in as "a" 2) froze browser 3) sent (chsk-send! "a" [:chsk/close]) 4) noticed connection not going away 5) killed browser 6) connection "a" is still there after a few minutes

Second log: https://github.com/ptaoussanis/sente/files/12004969/1.18.1-2_test2.log

1) opened example page, logged in as "1" 2) logged in as "2" 3) logged in as "3" 4) logged in as "4" 5) logged in as "5" 6) closed browser 7) there are two stuck connections

ptaoussanis commented 12 months ago

Finally managed to see this on my end! For some reason took dozens of attempts on my system, still don't know why we've had such different experiences with this.

Anyway, it looks like the lingering connections were caused by the connection's :on-close event somehow firing before handshake. I didn't anticipate that as a possibility, so the connection logic was getting into an unexpected state and showing up as "immortal" as you put it.

I'll get the logic reworked and push another update.

Will then ask for your patience to test once again if possible, since it's been so difficult for me to actually generate the necessary conditions on my system.

ptaoussanis commented 12 months ago

Okay, [com.taoensso/sente "1.18.2-alpha2"] is up on Clojars and includes a significant reworking of the connection management code.

Using this version, I seem to be unable to produce any lingering dead connections on my system. But since I also struggled with the previous version, it would be great to get confirmation from you if possible.

I've added some additional relevant debugging tools to the reference example on master, incl.:

Hopefully everything works on your end. As soon as you can get me a confirmation, I'll cut a stable point release.

krajj7 commented 11 months ago

Tried 1.18.2-alpha2, made sure to lein clean etc., but unfortunately I am still seeing the problem. I got some websocket connections stuck as well as an AJAX one.

Here is a log that ends with two stuck websocket connections: 1.18.2-alpha2_ws.log

Here is a log that ends with one AJAX connection stuck: 1.18.2-alpha2_ajax.log

ptaoussanis commented 11 months ago

Wow, that's so crazy ^^

Will go through your logs in a moment.

In the meantime- is there anything about your system that's unusual? It would be so helpful if I could reproduce this better on my end.

What browser and OS are you on? Are you running any browser plugins or other software that could be doing network conditioning? Are you changing any options or software in the reference example?

You mentioned earlier that you need to add Guava for your JVM to work - what JVM is that?

To clarify: this behaviour definitely shouldn't happen in any environment, it's a bug in Sente if this can occur. I just feel bad repeatedly wasting your time with updates that don't actually solve the problem.

Might be helpful if I try replicate your environment if possible.

krajj7 commented 11 months ago

I don't think there's anything unusual about my setup. I'm not changing anything in the project except for logging (added file logger, :trace by default) and the guava lib. I don't use any unusual networking software or plugins.

HW: Lenovo Yoga Slim 7 15ITL05 laptop OS: Debian testing JVM: Java 17.0.8-ea OpenJDK 64-Bit Server VM (Debian packaged) Browsers where I reproduced the problem (with the newest alpha):

I don't mind continuing testing new versions as needed. I can see that it would be much better if you could reproduce the issue yourself, but I don't know what could help with that.

ptaoussanis commented 11 months ago

👍 And just on the off-chance that it's relevant, why is Guava necessary in your environment?

krajj7 commented 11 months ago

I think it's some dependency conflict. The server runs without the guava dep, but when I try to compile the javascript I get this error:

Execution error (NoSuchMethodError) at com.google.javascript.jscomp.deps.ModuleLoader/createRootPaths (ModuleLoader.java:257).
'java.util.stream.Collector com.google.common.collect.ImmutableSortedSet.toImmutableSortedSet(java.util.Comparator)'

By default [com.google.guava/guava "20.0"] is used, adding an explicit [com.google.guava/guava "25.1-jre"] dependency makes it work.

ptaoussanis commented 11 months ago

So to confirm: if you just clone the Sente repo without making any changes, go to the example project's path - and execute lein start - you see the NoSuchMethodError above?

Edit: if so, can you share the contents of your ~/.lein/profiles.clj?

krajj7 commented 11 months ago

So to confirm: if you just clone the Sente repo without making any changes, go to the example project's path - and execute lein start - you see the NoSuchMethodError above?

Yes.

Edit: if so, can you share the contents of your ~/.lein/profiles.clj?

I do have a bunch of things in ~/.lein/profiles.clj:

{:repl {:plugins [ [cider/cider-nrepl "0.31.0"]
                  [lein-cljfmt "0.9.2"] ] }
 :user
 {:plugins [[lein-kibit "0.1.8"]
            [lein-cloverage "1.2.4"]
            [jonase/eastwood "1.4.0"]
            [lein-localrepo "0.5.4"]
            [lein-ancient "0.7.0"]
            [lein-nvd "2.0.0"] ]
   :dependencies [[inspector-jay "0.3"]
                  [cljfmt "0.9.2"]
                  [fipp "0.6.26"] ; cljfmt
                  [com.cemerick/pomegranate "1.1.0"]
                  [org.clojure/tools.namespace "1.4.4"]
                  [criterium "0.4.6"] ]
    :repl-options {:timeout 5184000 }}}
ptaoussanis commented 11 months ago

Hi Jan- thanks for the additional info. Indeed, nothing sticks out as obviously relevant 👍

I've just pushed [com.taoensso/sente "1.18.2-alpha3"] to Clojars.

This includes a rewrite of the last code that I believe could have been causing problems.

Could you please try this and let me know? 🙏

I've also updated the reference example again to include some additional debugging tools - notably a toggle to simulate unreliable but unbroken connections.

krajj7 commented 11 months ago

Hi Peter,

So far I haven't been able to trigger the issue with 1.18.2-alpha3, so that seems promising :+1:

I'm going to try a longer experiment to confirm. I deployed the new alpha to my app where I originally noticed the problem and will script periodic browser freezes/restarts to happen during the night. That should test a lot of reconnects.

So I'll report tomorrow, unless I notice any problem sooner.

ptaoussanis commented 11 months ago

I'm going to try a longer experiment to confirm. [...] So I'll report tomorrow, unless I notice any problem sooner.

Excellent, thank you so much! I'll wait on word from you then to push 1.18.2.

krajj7 commented 11 months ago

I'm happy to report that my experiment didn't reveal any problems :+1:

I tested re/connections to my app as well as the example project at the same time, more than 1800 sessions in total with random user-ids and none of them got stuck.

Thank you for working on this.

ptaoussanis commented 11 months ago

That's great news, thanks so much for all your patience and assistance with debugging this!

ptaoussanis commented 11 months ago

[com.taoensso/sente "1.19.0"] is now up on Clojars, release notes here 👍

Note that because of the amount of code touched, I decided in hindsight to be cautious and release this work as part of a bigger version bump (1.19 instead of 1.18.2).

Closing for now, but please feel free to re-open if you do encounter any more related trouble.

Cheers!