us_loop_on_sample_load and better multithreading

ghost commented 6 years ago

Various threading features are required. Reuse port, master listener & slave worker, etc.

victorstewart commented 5 years ago

A forking take on parallelism might be a better idea on Linux? I believe it's more performant, children vs threads. Certainly MUCH simpler memory model and no coordination between "threads" to be concerned with.

ghost commented 5 years ago

The lib is only threaded per-loop so essentially it's the same as you talk about. No data is shared, you run individual us_loop isolated to one thread each. Sharing data between threads is always slow and is not the purpose here.

ghost commented 5 years ago

The only thing you want to (somewhat) share between the loops is the listener, some sort of common entry point for connections. Some kind of master/slave setup.

victorstewart commented 5 years ago

I've been flirting with the idea of jumping ship to uWebSockets, but would need a "worker-per-logical-core" to do so. I was thinking of simply writing a fork-per-logical-core loop before initializing uWS::SSLApp, and setting SO_REUSEPORT. Seems like the least friction way. Do you have any warnings though?
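A minimal sketch of the fork-per-logical-core idea, assuming Linux (SO_REUSEPORT needs kernel 3.9 or later) and plain POSIX sockets instead of the uWS API. `open_reuseport_listener`, `bound_port` and `fork_workers` are hypothetical helper names made up for illustration; the listener binds to loopback only to keep the example self-contained.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>

// Hypothetical helper: open a TCP listener on 127.0.0.1:port with SO_REUSEPORT
// set before bind, so several processes can listen on the same port.
// Returns the listening fd, or -1 on failure. Port 0 picks an ephemeral port.
int open_reuseport_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    int on = 1;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) != 0 ||
        bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        listen(fd, 128) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper: report which port a listener ended up on (0 on error).
uint16_t bound_port(int fd) {
    sockaddr_in addr{};
    socklen_t len = sizeof(addr);
    if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) return 0;
    return ntohs(addr.sin_port);
}

// Hypothetical helper: fork until `workers` processes exist in total.
// Every process gets its own index back: 0 in the parent, 1..workers-1 in
// the children. A child returns right away, so it never forks again.
int fork_workers(int workers) {
    assert(workers >= 1);
    for (int i = 1; i < workers; ++i) {
        pid_t pid = fork();
        if (pid == 0) return i;  // child: stop forking, go do the work
        // pid < 0 means fork failed; carry on with fewer workers
    }
    return 0;
}
```

Each process would call `fork_workers(n)` first and then open its own listener on the shared port (this is the point where a worker would initialize its own uWS::SSLApp). The kernel then spreads incoming connections over the listeners.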

(P.S. now that the HTTP implementation is complete, you should consider submitting it to the techempower benchmarks ;) )

ghost commented 5 years ago

Techempower are incompetent morons. And that's putting it lightly.

You already have this support in 0.14

victorstewart commented 5 years ago

haha. think you might be my computer soulmate.

I see what you mean with multithreaded_echo.cpp

victorstewart commented 5 years ago

This cleared up for me how essentially there's 0 performance difference between threads and processes inside the kernel:

https://stackoverflow.com/questions/807506/threads-vs-processes-in-linux

ghost commented 5 years ago

I don't think it makes any difference if you have a fixed number of threads or processes basically acting as a wrapper of a CPU core. But I've always preferred threads because they allow sharing without syscalls as the messenger. Also, threads are standardized in the language; processes are not.

ghost commented 5 years ago

A simple way that many use and was available in v0.14 is something like us_socket_transfer(us_socket , us_loop ) kind of deal:

Very simple, you listen and accept from one thread like usual and you get your usual on_open event on the main thread, then you us_socket_transfer and it goes over to the slave thread. Very simple interface, allows you to select how to load balance things.

That would be step 1; the most simple way.

Forking may become necessary later on to play well with JavaScript environments that don't have threads, where you also need to support forking kinds of solutions.

ghost commented 5 years ago

Lwan works like that, they listen and accept on one thread and then just eventfd-epoll_wait kind of transfers that FD to another thread via a fast fifo-queue. That's simple.
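A rough sketch of that kind of hand-off, assuming Linux: a mutex-protected FIFO of accepted FDs plus an eventfd that the worker thread would register in its own epoll set. `FdHandoff` is a hypothetical name, and this is not Lwan's or uSockets' actual code.

```cpp
#include <sys/eventfd.h>
#include <unistd.h>
#include <cassert>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical hand-off queue: the accepting thread pushes FDs, the worker
// thread sees wake_fd() become readable in epoll_wait and calls drain().
class FdHandoff {
public:
    FdHandoff() : wake_fd_(eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC)) {}
    ~FdHandoff() { if (wake_fd_ >= 0) close(wake_fd_); }
    FdHandoff(const FdHandoff&) = delete;
    FdHandoff& operator=(const FdHandoff&) = delete;

    // The descriptor the worker adds to its epoll set (EPOLLIN).
    int wake_fd() const { return wake_fd_; }

    // Called on the accepting thread: queue the FD, then bump the eventfd
    // counter so the worker's epoll_wait returns.
    void push(int fd) {
        assert(wake_fd_ >= 0);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.push_back(fd);
        }
        uint64_t one = 1;
        ssize_t n = write(wake_fd_, &one, sizeof(one));
        (void)n;  // a failed wake-up only delays the hand-off until the next push
    }

    // Called on the worker thread: reset the eventfd counter and take every
    // queued FD in FIFO order. Returns an empty vector if nothing is queued.
    std::vector<int> drain() {
        uint64_t count = 0;
        ssize_t n = read(wake_fd_, &count, sizeof(count));  // EAGAIN when idle
        (void)n;
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<int> out(pending_.begin(), pending_.end());
        pending_.clear();
        return out;
    }

private:
    int wake_fd_;
    std::mutex mutex_;
    std::deque<int> pending_;
};
```

One queue per worker keeps contention low: the lock is only held for a push or a swap-out, never while a socket is being served.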

ghost commented 5 years ago

You might want something like on_opening event where you get to early transfer the FD to another thread before any SSL work is done, essentially called immediately after accept. Then you get on_open called on the slave thread. Simple.

ghost commented 5 years ago

us_socket_context_on_distribute:

returns the us_socket_context where on_open should happen
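To illustrate what such a hook could decide, here is a small sketch that picks the context with the fewest open connections. `WorkerContext` and `distribute` are hypothetical stand-ins, not the real us_socket_context type or a real uSockets signature, and the least-connections policy is only one possible choice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one per-thread socket context.
struct WorkerContext {
    int id;
    std::size_t open_connections;
};

// Hypothetical distribute hook: would be called right after accept and
// returns the context on which on_open should then run.
// Policy assumed here: fewest open connections wins, first one on a tie.
WorkerContext* distribute(std::vector<WorkerContext>& contexts) {
    assert(!contexts.empty());
    WorkerContext* best = &contexts.front();
    for (WorkerContext& c : contexts) {
        if (c.open_connections < best->open_connections) best = &c;
    }
    return best;
}
```

With real threads the counters would have to be atomics or be read under a lock; the plain fields here only keep the selection logic visible.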

ghost commented 3 years ago

It already works well enough. And if anything, all you need is a way to stop listening on the context with the most connections. That's all really.

And this can be done from any thread, so any thread can stop polling for accept for some other thread. In short - no need for a master.

ghost commented 3 years ago

If you have 4 CPUs then the 3 with the fewest connections should poll for accept. Whenever you accept, check if you are the one with the most connections and in that case, stop polling for accept and start polling for accept on the one that previously did not poll for accept. It's that simple, and you can set the granularity to 50 so that the switching does not trigger all the time.

And if you have 16 CPUs then maybe it is enough to only poll for accept on maybe 8 cores. Same rule, only with a limit.
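The switching rule could look roughly like this. `LoopState` and `maybe_hand_over_accept` are hypothetical names, and reading "granularity 50" as "only switch once we lead the idle loop by at least 50 connections" is an assumption on my part, not something the rule above spells out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-loop bookkeeping.
struct LoopState {
    std::size_t connections;
    bool polling_accept;
};

// Would be called by loop `self` after an accept (or close). If `self` polls
// for accept, holds the most connections of all loops, and leads the least
// loaded non-polling loop by at least `granularity`, accept polling moves to
// that loop. Returns the index of the loop that took over, or -1 if nothing
// changed.
int maybe_hand_over_accept(std::vector<LoopState>& loops, std::size_t self,
                           std::size_t granularity = 50) {
    assert(self < loops.size());
    if (!loops[self].polling_accept) return -1;

    int idle = -1;  // least loaded loop that is not polling for accept
    for (std::size_t i = 0; i < loops.size(); ++i) {
        if (loops[i].connections > loops[self].connections) return -1;  // not the busiest
        if (!loops[i].polling_accept &&
            (idle < 0 ||
             loops[i].connections < loops[static_cast<std::size_t>(idle)].connections)) {
            idle = static_cast<int>(i);
        }
    }
    if (idle < 0) return -1;  // everyone is already polling

    LoopState& next = loops[static_cast<std::size_t>(idle)];
    if (loops[self].connections < next.connections + granularity) return -1;

    loops[self].polling_accept = false;
    next.polling_accept = true;
    return idle;
}
```

The "only poll on 8 of 16 cores" limit falls out of the initial state: start with 8 loops polling, and every hand-over keeps that count constant.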

ghost commented 3 years ago

And this switching should simply trigger on every accept and every close. That's it

ghost commented 3 years ago

Fixing this should also enable this support on macOS (and maybe Windows). When we have this enforcement all platforms should work with this.

ghost commented 3 years ago

I forgot libuv is made by ogres and is not thread safe even though the underlying kernel is 🤦

So the rule has to be triggered by every thread, on its own 4-second timer:

1. Wait for the 4-second timeout.
2. Lock a common list shared by all threads.
3. If we are holding more of the metric than the other threads AND there is at least one other poller, WE stop polling for accept.
4. If we are holding less of the metric than the other threads, WE start polling for accept.
5. Unlock the list.
6. Goto 1.
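One tick of that rule, sketched under a few assumptions: "more than the other threads" is read as strictly more than every other thread (likewise for "less"), the metric comes from an overridable callback, and the 4-second timer itself is left to the event loop. `SharedLoad` and `rebalance_tick` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

// Hypothetical list shared by all threads, guarded by one mutex.
struct SharedLoad {
    std::mutex mutex;
    std::vector<double> metric;  // latest sample per thread
    std::vector<bool> polling;   // does thread i poll for accept?
};

// Metric callback: number of connections by default, but could return
// memory usage, CPU time, and so on.
using SampleFn = std::function<double()>;

// Body of the per-thread timer callback for thread `self`.
void rebalance_tick(SharedLoad& shared, std::size_t self, const SampleFn& sample) {
    std::lock_guard<std::mutex> lock(shared.mutex);  // lock the common list
    assert(self < shared.metric.size() && shared.metric.size() == shared.polling.size());
    shared.metric[self] = sample();

    bool highest = true, lowest = true, other_poller = false;
    for (std::size_t i = 0; i < shared.metric.size(); ++i) {
        if (i == self) continue;
        if (shared.metric[i] >= shared.metric[self]) highest = false;
        if (shared.metric[i] <= shared.metric[self]) lowest = false;
        if (shared.polling[i]) other_poller = true;
    }

    if (highest && other_poller) {
        shared.polling[self] = false;  // busiest, and someone else can accept
    } else if (lowest) {
        shared.polling[self] = true;   // least loaded: take new connections
    }
}  // the lock_guard unlocks the list here
```

Because each thread only ever flips its own flag, no thread has to touch another thread's loop, which sidesteps the libuv thread-safety problem mentioned above.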

With this, the metric can be the number of connections OR any metric returned by a callback (such as memory usage, CPU-time usage, etc). This callback will be implemented with the number of connections as the default metric, but it should be possible to override.

This way load balancing is "re-routed" every 4 seconds, which is not too often to cause extra load, while not too long to be irrelevant. It is just right.

ghost commented 3 years ago

Really, you can just make it:

us_loop_on_sample_load as a callback that returns the time from CLOCK_THREAD_CPUTIME_ID, so you can sample how much CPU time the thread has used since the last timeout. Then you have load balancing that is based on actual per-thread CPU-time usage, which is probably the most accurate.
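A small sketch of such a sampler on POSIX, using `clock_gettime` with `CLOCK_THREAD_CPUTIME_ID`. `thread_cpu_ns` and `LoadSampler` are hypothetical names; how the result would be wired into us_loop_on_sample_load is not shown because that callback's signature is not defined in this thread.

```cpp
#include <time.h>
#include <cassert>
#include <cstdint>

// CPU time consumed so far by the calling thread, in nanoseconds
// (-1 if the clock is unavailable).
inline int64_t thread_cpu_ns() {
    timespec ts{};
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0) return -1;
    return static_cast<int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

// Hypothetical sampler: create and use it on the loop's own thread, and call
// sample() from that thread's timer callback.
class LoadSampler {
public:
    // CPU nanoseconds this thread has used since the previous call
    // (or since construction, for the first call).
    int64_t sample() {
        int64_t now = thread_cpu_ns();
        int64_t delta = now - last_ns_;
        assert(delta >= 0);  // per-thread CPU time does not run backwards
        last_ns_ = now;
        return delta;
    }

private:
    int64_t last_ns_ = thread_cpu_ns();
};
```

Dividing the returned value by the timer interval (4 seconds in the rule above) gives the fraction of one core the loop kept busy, which is directly comparable across threads.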

uNetworking / uSockets

us_loop_on_sample_load and better multithreading #17