Open ghost opened 6 years ago
A forking take on parallelism might be a better idea on Linux? I believe it's more performant, children vs threads. Certainly MUCH simpler memory model and no coordination between "threads" to be concerned with.
The lib is only threaded per-loop so essentially it's the same as you talk about. No data is shared, you run individual us_loop isolated to one thread each. Sharing data between threads is always slow and is not the purpose here.
The only thing you want to (somewhat) share between the loops is the listener, some sort of common entry point for connections. Some kind of master/slave set up.
I've been flirting with the idea of jumping ship to uWebSockets, but would need a "worker-per-logical-core" to do so. I was thinking of simply writing a fork-per-logical-core loop before initializing uWS::SSLApp, and setting SO_REUSEPORT. Seems like the least friction way. Do you have any warnings though?
(P.S. now that the HTTP implementation is complete, you should consider submitting it to the techempower benchmarks ;) )
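For reference, the fork-per-logical-core idea above could be sketched roughly like this in plain C on Linux. This is only a sketch under assumptions: `make_reuseport_listener`, `fork_workers` and the `run_worker` callback are hypothetical names, not part of uWebSockets or uSockets, and in practice the library would create the listener itself inside each child. The sketch binds to loopback only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a TCP listener on 127.0.0.1:port with SO_REUSEPORT set, so that
 * several processes can each bind their own socket to the same port
 * (Linux 3.9 or later). Returns the listening fd, or -1 on error. */
int make_reuseport_listener(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    int one = 1;
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    /* SO_REUSEPORT must be set before bind() on every socket sharing the port. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one) < 0 ||
        bind(fd, (struct sockaddr *) &addr, sizeof addr) < 0 ||
        listen(fd, 512) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Fork one child per online logical core. Each child runs `run_worker`
 * (hypothetical: this is where the app and its event loop would be created,
 * after the fork) and exits with its return value.
 * Returns the number of children started. */
int fork_workers(int (*run_worker)(int worker_index)) {
    assert(run_worker != NULL);
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (cores < 1) cores = 1;

    int started = 0;
    for (long i = 0; i < cores; i++) {
        pid_t pid = fork();
        if (pid < 0) break;                        /* fork failed, stop early */
        if (pid == 0) _exit(run_worker((int) i));  /* child never returns here */
        started++;
    }
    return started;
}
```

The parent would normally `waitpid` on the children and restart any that die; that part is left out here.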
Techempower are incompetent morons. And that's putting it lightly.
You already have this support in 0.14
haha. think you might be my computer soulmate.
I see what you mean with multithreaded_echo.cpp
This cleared up for me how there's essentially zero performance difference between threads and processes inside the kernel:
https://stackoverflow.com/questions/807506/threads-vs-processes-in-linux
I don't think it makes any difference whether you have a fixed number of threads or processes, each basically acting as a wrapper around a CPU core. But I've always preferred threads because they allow sharing without syscalls as the messenger. Also, threads are standardized in the language; processes are not.
A simple approach that many use, and that was available in v0.14, is something like us_socket_transfer(us_socket , us_loop ):
Very simple: you listen and accept on one thread as usual and get your usual on_open event on the main thread, then you call us_socket_transfer and the socket moves over to the slave thread. It's a very simple interface that lets you choose how to load balance things.
That would be step 1; the most simple way.
Forking can become necessary later on, to play well with JavaScript environments that don't have threads and therefore need a fork-based solution as well.
Lwan works like that: it listens and accepts on one thread and then transfers that FD to another thread via a fast FIFO queue, with an eventfd waking the other thread's epoll_wait. That's simple.
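A rough sketch of that kind of handoff, assuming Linux and pthreads. This is not Lwan's or uSockets' actual code: `struct fd_handoff` and its functions are hypothetical names, and a mutex-protected ring buffer stands in for the "fast FIFO queue". The accepting thread pushes the accepted fd, and the worker registers `wake_fd` in its own epoll set and drains the queue when it becomes readable.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define HANDOFF_CAPACITY 1024 /* power of two, so unsigned wraparound is safe */

struct fd_handoff {
    pthread_mutex_t lock;
    int fds[HANDOFF_CAPACITY];
    unsigned head, tail; /* head: next to pop, tail: next to push */
    int wake_fd;         /* eventfd, readable while fds are queued */
};

int fd_handoff_init(struct fd_handoff *q) {
    assert(q != NULL);
    q->head = q->tail = 0;
    q->wake_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (q->wake_fd < 0) return -1;
    return pthread_mutex_init(&q->lock, NULL) == 0 ? 0 : -1;
}

/* Wake the worker: adds 1 to the eventfd counter, making it readable. */
static void fd_handoff_signal(struct fd_handoff *q) {
    uint64_t one = 1;
    ssize_t n = write(q->wake_fd, &one, sizeof one);
    (void) n; /* counter overflow is the only realistic failure; ignored here */
}

/* Called on the accepting thread. Returns 0, or -1 if the queue is full. */
int fd_handoff_push(struct fd_handoff *q, int fd) {
    int rc = -1;
    pthread_mutex_lock(&q->lock);
    if (q->tail - q->head < HANDOFF_CAPACITY) {
        q->fds[q->tail++ % HANDOFF_CAPACITY] = fd;
        rc = 0;
    }
    pthread_mutex_unlock(&q->lock);
    if (rc == 0) fd_handoff_signal(q);
    return rc;
}

/* Called on the worker thread once epoll reports wake_fd readable.
 * Pops up to `max` fds into `out` and returns how many were popped. */
int fd_handoff_drain(struct fd_handoff *q, int *out, int max) {
    uint64_t counter;
    ssize_t n = read(q->wake_fd, &counter, sizeof counter); /* resets the counter */
    (void) n;

    int count = 0;
    pthread_mutex_lock(&q->lock);
    while (count < max && q->head != q->tail)
        out[count++] = q->fds[q->head++ % HANDOFF_CAPACITY];
    int leftover = q->head != q->tail;
    pthread_mutex_unlock(&q->lock);

    if (leftover) fd_handoff_signal(q); /* re-arm so the rest is picked up */
    return count;
}
```

One queue per worker keeps contention low, since only the acceptor and that one worker ever touch it.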
You might want something like an on_opening event, called essentially immediately after accept, where you get to transfer the FD to another thread early, before any SSL work is done. Then you get on_open called on the slave thread. Simple.
us_socket_context_on_distribute:
returns the us_socket_context where on_open should happen
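To make the shape of such a callback concrete, here is a sketch with stand-in types. Everything in it is hypothetical: `struct worker_context` is not the real us_socket_context, and the `distribute_cb` signature is only a guess at what "returns the context where on_open should happen" could look like. Two example policies are shown, round robin and fewest connections.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a per-thread socket context; NOT the real uSockets struct. */
struct worker_context {
    int id;
    int open_connections;
};

/* Hypothetical callback shape: given the candidate contexts, return the one
 * whose thread should run on_open for the newly accepted socket. */
typedef struct worker_context *(*distribute_cb)(struct worker_context *contexts,
                                                size_t count, void *user);

/* Example policy: round robin. `user` points to a size_t cursor. */
struct worker_context *distribute_round_robin(struct worker_context *contexts,
                                              size_t count, void *user) {
    assert(count > 0 && user != NULL);
    size_t *cursor = user;
    struct worker_context *chosen = &contexts[*cursor % count];
    *cursor = (*cursor + 1) % count;
    return chosen;
}

/* Example policy: the context with the fewest open connections
 * (the first one wins on ties). `user` is unused. */
struct worker_context *distribute_least_connections(struct worker_context *contexts,
                                                    size_t count, void *user) {
    assert(count > 0);
    (void) user;
    struct worker_context *best = &contexts[0];
    for (size_t i = 1; i < count; i++)
        if (contexts[i].open_connections < best->open_connections)
            best = &contexts[i];
    return best;
}
```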
It already works well enough. And if anything, all you need is a way to stop listening on the context with the most connections. That's all really.
And this can be done from any thread, so any thread can stop polling for accept for some other thread. In short - no need for a master.
If you have 4 CPUs then the 3 with the fewest connections should poll for accept. Whenever you accept, check whether you are the one with the most connections; if so, stop polling for accept and start polling for accept on the one that previously did not. It's that simple, and you can set the granularity to 50 so that the switching does not trigger all the time.
And if you have 16 CPUs then maybe it is enough to only poll for accept on maybe 8 cores. Same rule, only with a limit.
And this switching should simply trigger on every accept and every close. That's it.
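The rule above can be sketched as a small pure function. The names are hypothetical, and two details are my assumptions rather than anything stated in the thread: "granularity" is read as the minimum lead in connections the busiest accepting loop needs over the idle loop before they swap, and "most connections" is checked among the loops currently accepting. No locking is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct loop_state {
    int connections;
    bool accepting; /* true while this loop polls the listen socket */
};

/* Start with the first `acceptors` loops polling for accept,
 * e.g. 3 of 4, or 8 of 16 when a limit is wanted. */
void balance_init(struct loop_state *loops, size_t count, size_t acceptors) {
    assert(acceptors <= count);
    for (size_t i = 0; i < count; i++) {
        loops[i].connections = 0;
        loops[i].accepting = i < acceptors;
    }
}

/* Run after every accept or close on loop `self`. If `self` is the busiest
 * accepting loop and leads the least loaded non-accepting loop by at least
 * `granularity` connections, the two swap roles.
 * Returns the index of the loop that took over, or -1 if nothing changed. */
int balance_on_change(struct loop_state *loops, size_t count, size_t self,
                      int granularity) {
    if (!loops[self].accepting) return -1;

    int idle = -1;
    for (size_t i = 0; i < count; i++) {
        if (i == self) continue;
        if (loops[i].accepting) {
            /* Some other acceptor is busier, so it is not our turn to stop. */
            if (loops[i].connections > loops[self].connections) return -1;
        } else if (idle < 0 ||
                   loops[i].connections < loops[idle].connections) {
            idle = (int) i;
        }
    }
    if (idle < 0) return -1; /* every loop is already accepting */
    if (loops[self].connections - loops[idle].connections < granularity)
        return -1;           /* lead too small: avoid flapping */

    loops[self].accepting = false;
    loops[idle].accepting = true;
    return idle;
}
```

The caller would translate the flag changes into actually adding or removing the listen socket from the affected loops' poll sets.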
Fixing this should also enable this support on macOS (and maybe Windows). When we have this enforcement all platforms should work with this.
I forgot libuv is made by ogres and is not thread safe even though the underlying kernel is 🤦
So the rule has to be triggered by each thread, on its own 4-second timer.
With this, the metric can be the number of connections or any metric returned by a callback (such as memory usage, CPU-time usage, etc.). This callback will be implemented with the number of connections as the default metric, but it should be possible to override.
This way load balancing is "re-routed" every 4 seconds: not so often that it causes extra load, and not so rarely that it becomes irrelevant. It is just right.
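A sketch of what an overridable metric could look like. All names here are hypothetical stand-ins rather than uSockets API: each loop carries a `sample_load` function pointer that defaults to the connection count, and whatever runs on the 4-second timer compares the sampled values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-loop bookkeeping; not the real us_loop. */
struct loop_load {
    long connections;
    void *user; /* opaque data for a custom sampler */
    /* Returns the current load of this loop; higher means busier. */
    long (*sample_load)(const struct loop_load *self);
};

/* Default metric: the number of open connections. */
static long default_sample_load(const struct loop_load *self) {
    return self->connections;
}

/* Example override: report an application-maintained value (for instance
 * memory usage in KiB) that `user` points to. */
long sample_user_value(const struct loop_load *self) {
    return *(const long *) self->user;
}

void loop_load_init(struct loop_load *l) {
    l->connections = 0;
    l->user = NULL;
    l->sample_load = default_sample_load;
}

/* Install a custom metric; passing NULL restores the default. */
void loop_load_set_sampler(struct loop_load *l,
                           long (*cb)(const struct loop_load *)) {
    l->sample_load = cb ? cb : default_sample_load;
}

/* Meant to be called from the periodic timer: returns the index of the
 * loop reporting the highest load (the first one wins on ties). */
size_t busiest_loop(const struct loop_load *loops, size_t count) {
    assert(count > 0);
    size_t best = 0;
    long best_load = loops[0].sample_load(&loops[0]);
    for (size_t i = 1; i < count; i++) {
        long load = loops[i].sample_load(&loops[i]);
        if (load > best_load) {
            best_load = load;
            best = i;
        }
    }
    return best;
}
```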
Really, you can just make it:
us_loop_on_sample_load as a callback that returns the time from CLOCK_THREAD_CPUTIME_ID, sampled as the difference since the last timeout. Then you have load balancing based on actual per-thread CPU-time usage, which is probably the most accurate.
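Sampling per-thread CPU time like that could look roughly as follows. `clock_gettime` with `CLOCK_THREAD_CPUTIME_ID` is standard POSIX; `struct cpu_sampler` and its functions are hypothetical names for this sketch. Each thread must sample its own clock, since the clock ID always refers to the calling thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

struct cpu_sampler {
    uint64_t last_ns; /* thread CPU time at the previous sample */
};

/* CPU time consumed so far by the calling thread, in nanoseconds.
 * Returns 0 if the clock is unavailable. */
static uint64_t thread_cpu_ns(void) {
    struct timespec ts;
    if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0) return 0;
    return (uint64_t) ts.tv_sec * 1000000000ull + (uint64_t) ts.tv_nsec;
}

/* Call once on the thread that owns the loop. */
void cpu_sampler_init(struct cpu_sampler *s) {
    assert(s != NULL);
    s->last_ns = thread_cpu_ns();
}

/* Call from the loop's periodic timer, on the same thread: returns the CPU
 * nanoseconds this thread used since the previous call (or since init). */
uint64_t cpu_sampler_delta(struct cpu_sampler *s) {
    assert(s != NULL);
    uint64_t now = thread_cpu_ns();
    uint64_t delta = now - s->last_ns;
    s->last_ns = now;
    return delta;
}
```

With a 4-second timer, `delta / 4e9` gives the fraction of one core the thread used in that window, which can then be compared across loops.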
Various threading features are required. Reuse port, master listener & slave worker, etc.