
Implement thread groups to replace nbproc #1616

Open wtarreau opened 2 years ago

wtarreau commented 2 years ago

Your Feature Request

This is a placeholder to document all the work that remains to be done on thread groups, which started for 2.5. The goal is to support N (<=64) groups of 64 threads, i.e. 4096 threads max. If some performance tradeoffs require cutting that number down to 1024 or 2048, that's fine as well.
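To make the target concrete, here is a hedged sketch of what such a grouped setup could look like with the thread-groups, thread-group and cpu-map global keywords that eventually shipped in 2.7; the figures (128 threads, two NUMA nodes) are purely illustrative:

```
global
    # Illustrative only: 128 threads split into 2 groups of 64,
    # one group per NUMA node (adjust to the actual machine).
    nbthread 128
    thread-groups 2
    thread-group 1 1-64
    thread-group 2 65-128
    # Bind each group's threads 1:1 to the CPUs of its own node.
    cpu-map auto:1/1-64 0-63
    cpu-map auto:2/1-64 64-127
```

The intent of the grouping is that heavily shared structures stay within one group, so that cross-node traffic is kept to a minimum.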

Among the high-level tasks that were identified are:

What are you trying to do?

Output of haproxy -vv

2.5 to 2.7 at least
TimWolla commented 2 years ago

> preserve as good performance as possible on NUMA systems

I've just assisted a user on IRC who was likely running into the NUMA issue. Is it possible to detect the NUMA situation and log a warning until this issue is properly fixed?

wtarreau commented 2 years ago

Thanks Tim! For NUMA, since 2.4 we detect it, only bind to the node with the most allowed CPUs, and emit a warning mentioning it. However, there are machines sold by unscrupulous vendors that use NUMA CPUs like EPYC but whose BIOS does not advertise the NUMA nodes, so neither the OS nor haproxy can know that some cores have a hard time communicating with each other, and that's a big problem. The result is memory being allocated on one node but used by threads on other ones, and threads being spread all over the CPUs, which is really bad. That's not even specific to haproxy: I happened to face the issue with only a client and a server deployed on the same machine, with the performance cut in half when the two were placed on cores from different chiplets :-( All this mess was the initial reason for my work on the atomic tests and CAS latency tests, where I'm observing between 19 and 260ns between two cores depending on the chosen pairs.

So in short, we already do our best to avoid blindly binding to all cores on such systems, but some of them do not advertise the problem and will need to be fixed manually using taskset or cpu-map (see the sketch below). The consolation is that this is true for literally everything that runs on these machines, kernel and libc included!
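For illustration, a hedged example of the kind of manual pinning meant above, assuming (hypothetically) that cores 0-7 share one chiplet/L3 on the affected machine; the real ranges have to be taken from the actual topology:

```
# Option 1: pin the whole process externally.
#   taskset -c 0-7 haproxy -f /etc/haproxy/haproxy.cfg
# Option 2: do the equivalent from the configuration (2.4/2.5, single process).
global
    nbthread 8
    cpu-map auto:1/1-8 0-7
```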

TimWolla commented 2 years ago

> Thanks Tim! For NUMA, since 2.4 we detect it, only bind to the node with the most allowed CPUs, and emit a warning mentioning it. However, there are machines sold by unscrupulous vendors that use NUMA CPUs like EPYC but whose BIOS does not advertise the NUMA nodes, so neither the OS nor haproxy can know that some cores have a hard time communicating with each other, and that's a big problem.

In this case the user was running 2.4.8 on an actual dual-socket Xeon Gold system. HAProxy was running inside a container, though. I'll check whether I can get them to add the details themselves, without me relaying it.
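As an aside, a hedged checklist for seeing what topology a containerized haproxy can actually observe (command availability depends on the image):

```
# Run from inside the container:
nproc                                  # CPUs usable under the current affinity/cgroup limits
cat /sys/devices/system/node/online    # NUMA nodes exposed to the container
taskset -pc 1                          # CPU affinity of PID 1 (often the haproxy master in such images)
numactl --hardware                     # per-node CPU/memory layout, if numactl is installed
```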

wtarreau commented 2 years ago

With a dual-socket system they should definitely get NUMA right. In any case, when users report issues related to machines with high CPU counts, it's very useful to ask for the output of lscpu -e. It shows the cache topology and the NUMA nodes, quickly telling which CPU may optimally work with which other one.
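For reference, a couple of ways to dump that topology (column names can vary slightly between util-linux versions):

```
# One line per logical CPU with its core, socket, NUMA node and cache sharing.
lscpu -e=CPU,CORE,SOCKET,NODE,CACHE

# The same information is available from sysfs, e.g. which CPUs share cpu0's L3:
cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
```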

jakubvojacek commented 2 years ago

@wtarreau Hello, I didn't want to mess up this thread, I opened https://github.com/haproxy/haproxy/issues/1625. I tried assigning the haproxy process to just one CPU but I am still running into issues, unfortunately.

wtarreau commented 2 years ago

Thanks, I've responded there so that we can try to keep this task list approximately clean.

wtarreau commented 1 year ago

Note that thread-pools per group are worse, as they result in a huge increase of memory usage, hence were not merged. The rest is debatable.