wtarreau opened this issue 2 years ago
I've just assisted a user on IRC who was likely running into the NUMA issue. Is it possible to detect the NUMA situation and log a warning until this issue is properly fixed?
Thanks Tim! For NUMA, since 2.4 we detect it, bind only to the node with the most allowed CPUs, and emit a warning mentioning it. However, some machines sold by unscrupulous vendors employ NUMA CPUs like EPYC with a BIOS that does not advertise the NUMA nodes, so neither the OS nor haproxy can know that some cores have a hard time communicating with each other, and that's a big problem. It results in memory being allocated on random nodes and threads being spread over all CPUs, which is really bad. That's not even specific to haproxy: I faced the issue with just a client and a server deployed on the same machine, with performance cut in half when the two were placed on cores from different chiplets :-( That whole mess was the initial reason for my work on the atomic tests and CAS latency tests, where I'm observing between 19 and 260ns between two cores depending on the chosen pair.
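For reference, here is a minimal sketch of such a core-to-core CAS latency test. This is an illustration of the idea, not the actual test program mentioned above; the default CPU numbers are assumptions and should be picked from the machine's real topology:

```c
/* cc -O2 -pthread cas_latency.c -o cas_latency
 * Ping-pong an atomic token between two pinned threads via CAS and
 * derive the average one-way hand-off latency. Illustrative sketch only.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 1000000

static _Atomic int token;

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *peer(void *arg)
{
    pin_to_cpu((int)(intptr_t)arg);
    for (int i = 0; i < ITERS; i++) {
        int expected = 1;
        /* wait for the token, then hand it back */
        while (!atomic_compare_exchange_weak(&token, &expected, 0))
            expected = 1;
    }
    return NULL;
}

int main(int argc, char **argv)
{
    /* CPU numbers are assumptions; pick interesting pairs from lscpu -e */
    int cpu_a = argc > 1 ? atoi(argv[1]) : 0;
    int cpu_b = argc > 2 ? atoi(argv[2]) : 1;
    struct timespec beg, end;
    pthread_t t;

    pthread_create(&t, NULL, peer, (void *)(intptr_t)cpu_b);
    pin_to_cpu(cpu_a);
    clock_gettime(CLOCK_MONOTONIC, &beg);
    for (int i = 0; i < ITERS; i++) {
        int expected = 0;
        /* hand the token over, then wait until it comes back */
        while (!atomic_compare_exchange_weak(&token, &expected, 1))
            expected = 0;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(t, NULL);

    double ns = (end.tv_sec - beg.tv_sec) * 1e9 + (end.tv_nsec - beg.tv_nsec);
    /* each iteration is one full round trip, i.e. two hand-offs */
    printf("%.1f ns per one-way hand-off\n", ns / ITERS / 2.0);
    return 0;
}
```

Running it over various CPU pairs quickly exposes which cores sit on the same chiplet or node and which ones pay the cross-interconnect price.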
So in short, we already do our best to avoid blindly binding to all cores on such systems, but some of them do not advertise the problem and will need to be fixed manually using taskset or cpu-map. The consolation is that this is true for literally everything that runs on these machines, kernel and libc included!
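As a minimal sketch of such a manual fix, assuming 16 threads and a machine whose first NUMA node exposes CPUs 0-15 (both numbers are assumptions to adapt to the real topology):

```
global
    nbthread 16
    # bind threads 1-16 one-to-one to CPUs 0-15 of the first node
    cpu-map auto:1/1-16 0-15
```

A similar effect can be obtained from outside the process by starting haproxy under `taskset -c 0-15`.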
In this case the user was running 2.4.8 on an actual dual-socket Xeon Gold system. HAProxy was running inside a container, though. I'll check whether I can get them to add the details themselves rather than me relaying them.
With a dual-socket system they should definitely get NUMA right. In any case, when users report issues related to high-CPU-count machines, it's very useful to ask for the output of `lscpu -e`. It will show the cache topology and the NUMA nodes, quickly telling which CPUs may optimally work with which others.
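For illustration, here is what a trimmed `lscpu -e` output could look like on a hypothetical dual-socket machine (not the reporter's actual system); the NODE column and the L3 part of the cache column are what tell which CPUs can talk to each other cheaply:

```
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0   0    0      0    0:0:0:0      yes
1   0    0      1    1:1:1:0      yes
16  1    1      16   16:16:16:1   yes
17  1    1      17   17:17:17:1   yes
```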
@wtarreau Hello, I didn't want to mess up this thread, so I opened https://github.com/haproxy/haproxy/issues/1625. I tried assigning the haproxy process to just one CPU but I am still running into issues, unfortunately.
Thanks, I've responded there so that we can try to keep this task list reasonably clean.
Note that per-group thread pools turned out worse, as they result in a huge increase in memory usage, hence they were not merged. The rest is debatable.
Your Feature Request
This is a placeholder to document all the work that remains to be done on thread groups, which started for 2.5. The goal is to support N (<= 64) groups of 64 threads, i.e. 4096 threads max. If some performance tradeoffs require cutting that number down to 1024 or 2048, that's fine as well.
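As a minimal sketch of what the target configuration could look like, assuming a version where thread groups are functional (the `thread-groups` keyword eventually shipped in 2.7; the numbers below are illustrative):

```
global
    # 2 groups of 64 threads each, 128 threads total
    thread-groups 2
    nbthread 128
```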
Among the high-level tasks that were identified are:
- preserve as good performance as possible on NUMA systems
- bind each listener to a single thread group, as the `shards` keyword already does. FD migration between groups ought to be avoided whenever possible, but there is a delicate race between close() and socket() or accept(), because an FD belongs to the whole process and may be reassigned to a thread of a different group. The atomic ops needed to check for migration are particularly complex, but an algorithm was found (it still needs to be turned into code; a sketch of the idea follows this list). An alternative relying on userspace-selected FDs might allow each thread to rely on a private range of FDs that never conflicts with the other ranges. Some work on this was posted at https://lore.kernel.org/lkml/20200423073310.GA169998@localhost/T/ but nothing was merged to date, and for now network syscalls are not covered.
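As an illustration of the kind of check involved, here is a minimal hedged sketch (not HAProxy's actual algorithm): each FD slot carries an atomic generation counter that the close path bumps, so a thread can detect that an event refers to an FD that has since been reassigned to another group. A real implementation must also keep the slot from being reused while handlers are still running, which is where the complexity mentioned above comes from.

```c
/* Minimal sketch: detect FD reuse across thread groups with a per-slot
 * atomic generation counter. Illustrative only, not HAProxy's code.
 */
#include <stdatomic.h>
#include <unistd.h>

#define MAXFD 65536

struct fdslot {
    _Atomic unsigned int gen;  /* bumped on every close() of this FD */
};

static struct fdslot fdtab[MAXFD];

/* capture the current generation when registering the FD with a poller */
static inline unsigned int fd_register(int fd)
{
    return atomic_load(&fdtab[fd].gen);
}

/* close path: bump the generation first so that late events get rejected */
static inline void fd_close(int fd)
{
    atomic_fetch_add(&fdtab[fd].gen, 1);
    close(fd);
}

/* event path: ignore the event if the FD changed hands since registration */
static inline int fd_still_valid(int fd, unsigned int gen_at_register)
{
    return atomic_load(&fdtab[fd].gen) == gen_at_register;
}
```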
What are you trying to do?

Output of `haproxy -vv`