wtarreau opened this issue 2 years ago
I've just assisted a user on IRC who was likely running into the NUMA issue. Is it possible to detect the NUMA situation and log a warning until this issue is properly fixed?
Thanks Tim! For NUMA, since 2.4 we detect it, bind only to the node with the most allowed CPUs, and emit a warning mentioning it. However, some machines sold by unscrupulous vendors employ NUMA CPUs like EPYC with a BIOS that does not advertise the NUMA nodes, so neither the OS nor haproxy can know that some cores have a hard time communicating with each other, and that's a big problem. It results in memory being allocated on random nodes and threads being spread over all CPUs, which is really bad. That's not even specific to haproxy: I faced the issue with just a client and a server deployed on the same machine, with performance cut in half when the two were placed on cores from different chiplets :-( That whole mess was the initial reason for my work on the atomic tests and CAS latency tests, where I'm observing between 19 and 260ns between two cores depending on the chosen pair.
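For reference, here is a minimal sketch of such a core-to-core CAS latency test. This is an illustration of the idea, not the actual test program mentioned above; the default CPU numbers are assumptions and should be picked from the machine's real topology:

```c
/* cc -O2 -pthread cas_latency.c -o cas_latency
 * Ping-pong an atomic token between two pinned threads via CAS and
 * derive the average one-way hand-off latency. Illustrative sketch only.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 1000000

static _Atomic int token;

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *peer(void *arg)
{
    pin_to_cpu((int)(intptr_t)arg);
    for (int i = 0; i < ITERS; i++) {
        int expected = 1;
        /* wait for the token, then hand it back */
        while (!atomic_compare_exchange_weak(&token, &expected, 0))
            expected = 1;
    }
    return NULL;
}

int main(int argc, char **argv)
{
    /* CPU numbers are assumptions; pick interesting pairs from lscpu -e */
    int cpu_a = argc > 1 ? atoi(argv[1]) : 0;
    int cpu_b = argc > 2 ? atoi(argv[2]) : 1;
    struct timespec beg, end;
    pthread_t t;

    pthread_create(&t, NULL, peer, (void *)(intptr_t)cpu_b);
    pin_to_cpu(cpu_a);
    clock_gettime(CLOCK_MONOTONIC, &beg);
    for (int i = 0; i < ITERS; i++) {
        int expected = 0;
        /* hand the token over, then wait until it comes back */
        while (!atomic_compare_exchange_weak(&token, &expected, 1))
            expected = 0;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(t, NULL);

    double ns = (end.tv_sec - beg.tv_sec) * 1e9 + (end.tv_nsec - beg.tv_nsec);
    /* each iteration is one full round trip, i.e. two hand-offs */
    printf("%.1f ns per one-way hand-off\n", ns / ITERS / 2.0);
    return 0;
}
```

Running it over various CPU pairs quickly exposes which cores sit on the same chiplet or node and which ones pay the cross-interconnect price.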
So in short, we already do our best to avoid blindly binding to all cores on such systems, but some of them do not advertise the problem and will need to be fixed manually using taskset or cpu-map. The consolation is that this is true for literally everything that runs on these machines, kernel and libc included!
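As a minimal sketch of such a manual fix, assuming 16 threads and a machine whose first NUMA node exposes CPUs 0-15 (both numbers are assumptions to adapt to the real topology):

```
global
    nbthread 16
    # bind threads 1-16 one-to-one to CPUs 0-15 of the first node
    cpu-map auto:1/1-16 0-15
```

A similar effect can be obtained from outside the process by starting haproxy under `taskset -c 0-15`.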
In this case the user was running 2.4.8 on an actual dual-socket Xeon Gold system. HAProxy was running inside a container, though. I'll check whether I can get them to add the details themselves rather than me relaying them.
With a dual-socket system they should definitely get NUMA right. In any case, when users report issues related to high-CPU-count machines, it's very useful to ask for the output of `lscpu -e`. It will show the cache topology and the NUMA nodes, quickly telling which CPUs may optimally work with which others.
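For illustration, here is what a trimmed `lscpu -e` output could look like on a hypothetical dual-socket machine (not the reporter's actual system); the NODE column and the L3 part of the cache column are what tell which CPUs can talk to each other cheaply:

```
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0   0    0      0    0:0:0:0      yes
1   0    0      1    1:1:1:0      yes
16  1    1      16   16:16:16:1   yes
17  1    1      17   17:17:17:1   yes
```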
@wtarreau Hello, I didn't want to mess up this thread, so I opened https://github.com/haproxy/haproxy/issues/1625. I tried assigning the haproxy process to just one CPU but I am still running into issues, unfortunately.
Thanks, I've responded there so that we can try to keep this task list reasonably clean.
Note that per-group thread pools turned out worse, as they result in a huge increase in memory usage, hence they were not merged. The rest is debatable.
Your Feature Request
This is a placeholder to document all the work that remains to be done on thread groups, which started for 2.5. The goal is to support N (<= 64) groups of 64 threads, i.e. 4096 threads max. If some performance tradeoffs require cutting that number down to 1024 or 2048, that's fine as well.
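As a minimal sketch of what the target configuration could look like, assuming a version where thread groups are functional (the `thread-groups` keyword eventually shipped in 2.7; the numbers below are illustrative):

```
global
    # 2 groups of 64 threads each, 128 threads total
    thread-groups 2
    nbthread 128
```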
Among the high-level tasks that were identified are:
- preserve as good performance as possible on NUMA systems
- bind each listener to a single thread group, as the `shards` keyword already does. FD migration between groups ought to be avoided whenever possible, but there is a delicate race between close() and socket() or accept(), because an FD belongs to the whole process and may be reassigned to a thread of a different group. The atomic ops needed to check for migration are particularly complex, but an algorithm was found (it still needs to be turned into code; a sketch of the idea follows this list). An alternative relying on userspace-selected FDs might allow each thread to rely on a private range of FDs that never conflicts with the other ranges. Some work on this was posted at https://lore.kernel.org/lkml/20200423073310.GA169998@localhost/T/ but nothing was merged to date, and for now network syscalls are not covered.
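As an illustration of the kind of check involved, here is a minimal hedged sketch (not HAProxy's actual algorithm): each FD slot carries an atomic generation counter that the close path bumps, so a thread can detect that an event refers to an FD that has since been reassigned to another group. A real implementation must also keep the slot from being reused while handlers are still running, which is where the complexity mentioned above comes from.

```c
/* Minimal sketch: detect FD reuse across thread groups with a per-slot
 * atomic generation counter. Illustrative only, not HAProxy's code.
 */
#include <stdatomic.h>
#include <unistd.h>

#define MAXFD 65536

struct fdslot {
    _Atomic unsigned int gen;  /* bumped on every close() of this FD */
};

static struct fdslot fdtab[MAXFD];

/* capture the current generation when registering the FD with a poller */
static inline unsigned int fd_register(int fd)
{
    return atomic_load(&fdtab[fd].gen);
}

/* close path: bump the generation first so that late events get rejected */
static inline void fd_close(int fd)
{
    atomic_fetch_add(&fdtab[fd].gen, 1);
    close(fd);
}

/* event path: ignore the event if the FD changed hands since registration */
static inline int fd_still_valid(int fd, unsigned int gen_at_register)
{
    return atomic_load(&fdtab[fd].gen) == gen_at_register;
}
```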
What are you trying to do?

Output of `haproxy -vv`