taoensso / sente

Realtime web comms library for Clojure/Script
https://www.taoensso.com/sente
Eclipse Public License 1.0

Scalability issues / handling many simultaneous users? #265

Closed luposlip closed 7 years ago

luposlip commented 8 years ago

Hi!

I'm using Sente in a startup, where everything has gone just fine in our very small test scenarios with 5-6 people. Yesterday we had a demonstration, and [UPDATE] the websocket stopped responding to incoming events at around 15-20 simultaneous connections.

Admittedly I haven't done much to limit the amount of work done for each socket event (such as sending back an :ack and spawning the work onto another thread, etc.). But then again - it's only 20 users - and all the events cause at most ~50ms of work (most in the 50-500ns range).

I don't care much (at this point) about user wait time. It doesn't matter if the users have to wait a couple of seconds before they get the returned data from the socket. What worries me is that the socket simply ceases to work. There's nothing to see in any log file. The client browser sends its frames to the server, but the socket just acts as a black hole.

The server in general works fine though, because the async (AJAX) login/logoff, user activation and other AJAX based stuff works perfectly.

The server uses:

                 [org.clojure/clojure "1.9.0-alpha10"]
                 [org.clojure/core.async "0.2.385"]
                 [http-kit "2.2.0"]
                 [ring/ring-core "1.5.0"]
                 [ring/ring-devel "1.5.0"]
                 [ring/ring-defaults "0.2.1"]
                 [com.taoensso/encore "2.80.1"]
                 [com.taoensso/timbre "4.7.4"]
                 [com.taoensso/sente "1.10.0"]

The client uses:

                 [org.clojure/clojure "1.9.0-alpha10"]
                 [org.clojure/clojurescript "1.9.227"]
                 [org.clojure/core.async "0.2.385"]
                 [com.taoensso/encore "2.67.1"]
                 [cljs-ajax "0.5.8"]
                 [com.taoensso/timbre "4.7.2" :exclusions [com.taoensso/encore]]
                 [com.taoensso/sente "1.10.0"]
                 [reagent "0.6.0-rc"]
                 [re-frame "0.8.0"]
                 [secretary "1.2.3"]

I'm not saying this is a bug; it could easily be me using the socket in the wrong manner.

But has anyone seen this in their own setup, got any ideas on what's going wrong, or at least on how I can proceed with my investigation?

Thanks!

Best, Henrik

ptaoussanis commented 8 years ago

Hi Henrik,

Can't diagnose without more details - but I can say that a decent, properly configured server shouldn't have any problem handling at least hundreds or thousands of concurrent connections without any special effort.

Just have my hands very full atm (in the middle of a product launch, sorry!), but can try to take a closer look next week/end some time if you can provide more details, e.g.:

Would try to start with some of these. Otherwise, like I say, I can try to assist next week some time (the more info you can provide, the more likely I'll be able to help quickly). I'm also available for support hire if you need hands-on help more urgently.

Good luck! Cheers :-)

ptaoussanis commented 8 years ago

Oh, one last comment since I just noticed: I've experienced some random issues with the Clojure 1.9 alphas in the past - it might be worth downgrading to a stable release there, if you can, just to rule that out.

luposlip commented 8 years ago

Thanks @ptaoussanis, I'll downgrade to Clojure 1.8. In the meantime I've removed all uses of custom transit readers/writers and replaced them with the defaults, just to make sure I haven't overlooked anything in those.

Here are answers to your questions:

I'll get back to you if the issue persists, hope your launch is a huge success! :)

ptaoussanis commented 8 years ago

AWS ECS

You mean EC2? What hardware? Would note that EC2 performance can be shockingly bad on the smaller instances. Either way wouldn't really expect maxing out at 20 conns if properly configured, though it's been a long time since I've used EC2 myself.

No detectable websocket errors at all,

Would recommend further debugging here to find out what's actually happening. Is the server receiving the messages? Dropping them? Is a proxy dropping them? Is the TCP layer dropping them? Somewhere in your stack there should be relevant info.

No, haven't tried to downgrade to AJAX. Would like that not to be necessary

To clarify: this'd be as a debugging step, to rule out issues that extend beyond WebSockets specifically. Not recommending that you don't use WebSockets.

I've looked through my channel-socket handler, can't find any apparent issues. Most of what the handler does is returning cached data, so it takes almost no time. Even if it did, I'd expect the websocket just to queue events up, and eventually respond to them?

Would recommend profiling+logging. Note that the queue might not be indefinite. Can't recall off-hand, but it's probably a dropping queue by default.

You can also avoid the default chsk router fns and just write your own small one for better inspection/control.
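
For illustration only, a minimal hand-rolled loop might look something like this (a sketch, not Sente's code; `ch-recv` is the receive channel returned by make-channel-socket-server!, the rest is illustrative):

(require '[clojure.core.async :refer [go-loop <!]])

;; Read each event-msg off `ch-recv`, log/inspect as needed, then hand it to
;; your handler fn. Stops when `ch-recv` is closed.
(defn start-my-chsk-router! [ch-recv handler]
  (go-loop []
    (when-let [event-msg (<! ch-recv)]
      (try
        (handler event-msg)
        (catch Throwable t
          (println "Error handling event" (:id event-msg) t)))
      (recur))))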

I'll get back to you if the issue persists, hope your launch is a huge success! :)

Thanks! Will try follow up to any other info you provide as I can. Cheers!

luposlip commented 8 years ago

AWS ECS - EC2 Container Service. Docker. The hardware is 2x t2.small. I have the same AWS ECS setup in another startup where we use more old-school AJAX for most stuff, and only a (Sente) websocket for minor stuff. There have been no issues at all with many connected users.

I'll see if I can enable more debugging between and on the load balancer and the EC2 instances.

The queue, is that in core.async?

When the websocket "died" yesterday, we couldn't reconnect to it at all. Even if we closed our client completely, reconnected, logged on (via AJAX) and established a new websocket, it still didn't react to incoming events at all. Some throttling seemed to be going on, I just can't figure out where (and I'm not sure that's it either).

If the issue persists, I'll definitely try to write my own chsk-router.

Thanks again! :)

danielcompton commented 8 years ago

@luposlip We experienced roughly what you are describing, and in our case it was because we were running out of threads to run our go blocks on, and were getting deadlocked. Are you doing any blocking I/O from any Sente event messages?

luposlip commented 8 years ago

@danielcompton - even with just around 15-20 active users?

Yes, I have some limited blocking IO. But most of it returns cached data and takes very little time.

Did you solve the problem by simply returning an ACK and then returning the result (if any) asynchronously?

danielcompton commented 8 years ago

What we found was that the underlying database driver we were using was creating 2 go blocks per database connection, one to send, and one to receive. The receiving go block was doing a blocking I/O call to receive data, so once we got above 40ish db connections, we would run out of go threads. Everything would run fine, and then at some point it would all seem to hang. Profiling with VisualVM or YourKit may help you identify if that's the case (if all of the core.async threads are blocked).
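
To make the failure mode concrete, here's a tiny sketch of how blocking calls inside go blocks can starve core.async's fixed dispatch pool (8 threads by default), which is what makes everything seem to hang:

(require '[clojure.core.async :refer [go]])

;; DON'T do this: 8 go blocks, each making a blocking call (simulated here
;; with Thread/sleep), occupy the entire default dispatch pool...
(dotimes [_ 8]
  (go (Thread/sleep 60000)))

;; ...so this innocent go block (or Sente's router go-loop) won't run until
;; one of the sleeps returns:
(go (println "finally got a dispatch thread"))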

luposlip commented 8 years ago

Well, it most certainly could be the same problem. I'm using Datomic for database stuff; not sure how that works in that regard, but I will investigate. Also, VisualVM/YourKit are new to me - I'll check them out. Thanks for the input!

arichiardi commented 8 years ago

Here I replace and instrument both the go and thread thread pools; it can get complicated to see what is stuck. This is also a very good read: https://stuartsierra.com/2015/05/27/clojure-uncaught-exceptions
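
As a starting point for that kind of instrumentation, here's the core idea from the linked article - install a default uncaught-exception handler so exceptions thrown on pool threads (incl. core.async's) aren't silently lost (this sketch uses Timbre since it's already in this stack):

(require '[taoensso.timbre :as timbre])

;; Log uncaught exceptions from any thread instead of losing them silently:
(Thread/setDefaultUncaughtExceptionHandler
  (reify Thread$UncaughtExceptionHandler
    (uncaughtException [_ thread ex]
      (timbre/error ex "Uncaught exception on thread:" (.getName thread)))))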

ptaoussanis commented 8 years ago

Just skimmed this, may have misunderstood - but wanted to confirm quickly that everyone understands you're not supposed to do blocking IO work in a go block? This includes the default channel socket router, which is a single go loop. That'll starve the thread pool.

arichiardi commented 8 years ago

Agree and confirm: IO should go in a thread block, whose pool is unbounded, whereas the go pool is not. Just to reiterate on this: if a go block does not have an available thread, it will block until one is free.
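
A small sketch of the distinction (do-blocking-io is hypothetical): blocking work belongs on clojure.core.async/thread (unbounded, cached pool; returns a channel), while go blocks should only park:

(require '[clojure.core.async :refer [go <! thread]])

(defn handle-event! [event-msg]
  (go
    ;; `thread` runs the blocking call on its own thread and returns a channel;
    ;; parking on it with `<!` keeps the small go dispatch pool free.
    (let [result (<! (thread (do-blocking-io event-msg)))]
      (println "done:" result))))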

luposlip commented 8 years ago

While it was clear to me not to do anything that takes a "long time", it was not clear to me that it mustn't be blocking at all. I read this before I posted this message: https://github.com/ptaoussanis/sente/issues/227

And to me it seemed more like a general recommendation than something I should be aware of when having a small demo of 15-20 users.

Most of my calls are non-blocking, but a few use Datomic's transact, which is blocking. Datomic has an async alternative called transact-async; I'll immediately change to using that instead of the blocking version, and run through all my handlers once more to return a simple acknowledgement from each of them and do the real work in a different thread.
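
As a sketch of that change - assuming the Datomic peer API, where d/transact blocks until the transaction completes while d/transact-async returns a future immediately (conn and the event keys here are illustrative):

(require '[datomic.api :as d])

(defn handle-save! [{:keys [?data ?reply-fn]}]
  ;; Acknowledge immediately so the router's go loop isn't held up...
  (when ?reply-fn (?reply-fn {:status :ack}))
  ;; ...then do the real (potentially slow) work on another thread:
  (future
    @(d/transact-async conn (:tx-data ?data)))) ; deref here to surface tx errors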

I think the need for processing everything in a different thread could be clearer in the documentation - at least I missed the point. I started using Sente for small loads a long time ago (version 0.8.2), and haven't read the documentation from cover to cover since then. Perhaps it has been added to the documentation since.

Maybe Sente could have a choice of a serial and parallel router out of the box?

Thanks for all your input so far!

ptaoussanis commented 8 years ago

it was not clear to me that it mustn't be blocking at all.

As with most things, there's not a hard prescription applicable to all cases. Some minor blocking usu. isn't a problem in practice for trivial loads like yours. But yes, in general, you'll want to avoid IO in go blocks unless you understand the ramifications. That's nothing to do with Sente in particular, just a function of how core.async works.

In any case, blocking IO can't (shouldn't) be a complete explanation for the problem that you're seeing if your blocking times average 500ns and never exceed 50ms. 20 concurrent clients is nothing.

First step here really needs to be some basic debugging to start ruling out possible causes, otherwise any ideas are at best conjecture.

Cheers :-)

ptaoussanis commented 8 years ago

Quick heads-up that I may need to sign-off this issue for a while, but will try check back periodically. Best of luck!

luposlip commented 8 years ago

Thanks Peter, will let you know how I progress!

ptaoussanis commented 8 years ago

Oh, just looked into clarifying the start-server-chsk-router! docstring and see that it's already pretty good:

"Creates a go-loop to call `(event-msg-handler <server-event-msg>)` and
log any errors. Returns a `(fn stop! [])`.

For performance, you'll likely want your `event-msg-handler` fn to be
non-blocking (at least for slow handling operations). Clojure offers
a rich variety of tools here including futures, agents, core.async,
etc.

Advanced users may also prefer to write their own loop against `ch-recv`."

May I ask what part of that was unclear? (Sincere question: I might be missing something). Or was it just easy to miss the docstring?

Thanks!

ptaoussanis commented 8 years ago

Have tried to further clarify the docstring:

"Creates a simple go-loop to call `(event-msg-handler <server-event-msg>)`
and log any errors. Returns a `(fn stop! [])`.

Nb performance note: since your `event-msg-handler` fn will be executed
within a simple go block, you'll want this fn to be ~non-blocking
(you'll especially want to avoid blocking IO) to avoid starving the
core.async thread pool under load.

To avoid blocking, Clojure offers a rich variety of tools incl. futures,
agents, core.async, etc. The correct tool/s to use will depend on your
application (and on the particular request/s), so this isn't something the
router can/should handle for you automatically.

Note that advanced users may also prefer to just write their own loop
against `ch-recv`."

luposlip commented 8 years ago

Agreed, it's clear. So in my case it may just be because I missed the docstring. Tomorrow I'll hopefully know if removing the blocking transactions was all I needed.


luposlip commented 8 years ago

It's clearer now BTW. For me it seems important to stress that there should be no blocking calls at all, to avoid strange future surprises - especially if blocking is the only thing going wrong in my case. I'll let you know (I'll create tests simulating lots of simultaneously connected users).


danielcompton commented 8 years ago

Blocking calls on their own shouldn't be enough to cause a deadlock if the blocking calls eventually succeed or fail; you'll just have low concurrency. To get a full deadlock you need to consume all of the core.async thread pool with blocking operations that don't return (e.g. a blocking read on a socket).

ptaoussanis commented 8 years ago

Blocking calls on their own shouldn't be enough to cause a deadlock if the blocking calls eventually succeed or fail; you'll just have low concurrency.

Well the math on the quoted times definitely wouldn't add up as a complete explanation. Mean 500ns handling times still gives ~2m reqs/sec. Anyway, next step will be doing some concrete debugging otherwise efforts are likely just being wasted. Really shouldn't be hard to rule out most things in a few minutes at the REPL.

Since the threading seems to be a common cause of confusion though, will just give a quick concrete example here of some options:


;; An arbitrary msg handler, let's assume it often blocks
;; You'll want to avoid giving this directly to your chsk router
(defn my-blocking-msg-handler [event-msg]
  (let [{:keys [id ?data event ring-req ?reply-fn]} event-msg]
    ;; Do stuff, then maybe call `?reply-fn` when ready
    ))

;; An easy, naive way to guarantee that this function won't block the handling queue
;; You could give this to your chsk router
(defn my-non-blocking-msg-handler [event-msg]
  (future (my-blocking-msg-handler event-msg)))

;; Or, if you want to throttle the thread count
;; (assumes taoensso.encore is already on the classpath and loaded):
(def my-future-pool "Use max 4 threads" (taoensso.encore/future-pool 4))
(defn my-non-blocking-msg-handler [event-msg]
  ;; Nb `future-pool` takes a fn, so pass a thunk:
  (my-future-pool #(my-blocking-msg-handler event-msg)))

The usual recommended pattern for Sente is to have a single handler multimethod. So then you can just wrap a single fn (the multimethod fn) in this way and get automatic threading for all method implementations.
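
As a concrete sketch of that pattern (the event ids and fn names are illustrative, not part of Sente's API):

;; One multimethod dispatching on the event id...
(defmulti -event-msg-handler :id)

(defmethod -event-msg-handler :default [{:keys [id]}]
  (println "Unhandled event:" id))

(defmethod -event-msg-handler :my-app/some-event [{:keys [?data ?reply-fn]}]
  ;; Potentially slow/blocking work here
  (when ?reply-fn (?reply-fn {:ok true})))

;; ...wrapped once at the top level so every method impl runs off the router's go loop:
(defn event-msg-handler [event-msg]
  (future (-event-msg-handler event-msg)))

;; (sente/start-server-chsk-router! ch-recv event-msg-handler)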

So why doesn't the router just do this automatically? Because automatically running every request in a future naively assumes two things:

  1. That futures are the best tool for the job (sometimes they aren't).
  2. That all requests cost the same (they rarely do).

For example, let's say you have two common API end points:

  1. Fetch the current news bulletin, and
  2. Search the database for a user-provided text string

Let's say (1) is cached and returns in 2ns. Going through a future (and incurring the overhead there) would be wasteful. Not the end of the world, but something to keep in mind.

In contrast, let's say that (2) cannot be cached and may take ~300ms mean time to return. That'd be a clear candidate for a future, agent, etc. Running this request directly in the handler go loop would start causing problems at sufficient load.

Anyway, if the choice for whatever reason is between not threading any requests and threading all requests, I'd suggest threading all requests with something like one of the simple wrappers above.

If you want more sophisticated threading control for larger production applications, then instead of wrapping your top-level multimethod fn - you instead allow the individual method implementations to use their own request-appropriate form/s of threading (again: futures, agents, core.async, etc.)
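
To make that concrete, a per-method sketch against the two hypothetical endpoints above - here the top-level fn is not wrapped, and each method impl picks its own threading (cached-bulletin and run-search are illustrative):

;; (1) Cached, ~instant: fine to answer directly on the router's go loop
(defmethod -event-msg-handler :my-app/news-bulletin [{:keys [?reply-fn]}]
  (when ?reply-fn (?reply-fn @cached-bulletin)))

;; (2) Uncacheable, ~300ms: push onto its own thread
(defmethod -event-msg-handler :my-app/search [{:keys [?data ?reply-fn]}]
  (future
    (when ?reply-fn (?reply-fn (run-search (:query ?data))))))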

Does any of that make sense / help?

My apologies if some of this wasn't as clear as it could have been, appreciate being pinged about it so that improvements can be made. For example, I'd be up for just including an optional router flag to enable automatic future-based threading if folks would find that convenient/helpful?

luposlip commented 8 years ago

Well, bugger... Now the issue happened again. I can log in/out of my app (because it uses old-school AJAX). All Sente handlers are off the main thread in around 0.2ms because I've moved them into different threads via a simple (future (do-the-stuff ...)).

[UPDATE] Just for the record, I've made the (future (do-the-stuff ...)) optimization for every single event (rather than just putting a simple future around everything), which makes it as easy as possible to test the handlers separately (without actually having events passed through the router) and to make individual per-event optimizations.

But Sente doesn't "reply" at all :(

So for now - what is the suggestion? To write my own router, and debug via that?

ptaoussanis commented 8 years ago

Hi Henrik, I've provided suggestions above (which are unchanged).

Beyond that won't have any further suggestions without the debug information I've requested, sorry.

luposlip commented 8 years ago

Just for clarification - a fixed-size thread pool (via encore/future-pool) wouldn't block, even if more than 4 threads are requested? Instead it would simply wait for a thread in the pool to become available, meaning more wait time - but this would ensure the server doesn't run out of resources, right?

ptaoussanis commented 8 years ago

encore/future-pool is currently just a semaphore for the standard Clojure thread pool. Future calls will block when more than the specified number of threads are already occupied. The blocking behaviour is described in the docstring.
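
A quick REPL sketch of that blocking behaviour (assumes taoensso.encore is loaded):

(require '[taoensso.encore :as enc])

(def fp (enc/future-pool 2))

(fp #(Thread/sleep 5000)) ; returns immediately; 1 of 2 slots used
(fp #(Thread/sleep 5000)) ; returns immediately; 2 of 2 slots used
(fp #(println "hi"))      ; blocks the *calling* thread until a slot frees up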

luposlip commented 8 years ago

OK thanks, didn't see that. It seems like a bad choice then. Have enabled tracing for Sente on the server, will check the logs next time the issue happens.

ptaoussanis commented 7 years ago

Closing for now due to inactivity, please feel free to reopen with the requested debug info if you'd like further assistance.

Cheers :-)

luposlip commented 6 years ago

Just a comment, if anyone else is experiencing similar issues.

I solved my immediate issue by setting :simple-auto-threading? true in the call to sente/start-chsk-router!.
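
For anyone after the same fix, a minimal sketch of that call (the option is available in later Sente versions; ch-recv and event-msg-handler are your own):

(sente/start-chsk-router! ch-recv event-msg-handler
  {:simple-auto-threading? true}) ; runs each handler call on its own thread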

Later I figured out that a library I was using also uses core.async deep down. So the use of this library sometimes made the Sente router stop, because it apparently shares the same core.async thread pool.

Luckily I could force this library to not use core.async, and since then I haven't had any issues.

Don't hesitate to write to me if you want to know more about my experiences.