HAProxy 2.2 has been released a few days ago, so I've decided to run my load tests against it on my x86_64 and aarch64 VMs. Here is the lscpu output for the aarch64 VM:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: 0x48
Model: 0
Stepping: 0x1
CPU max MHz: 2400.0000
CPU min MHz: 2400.0000
BogoMIPS: 200.00
L1d cache: 512 KiB
L1i cache: 512 KiB
L2 cache: 4 MiB
L3 cache: 32 MiB
NUMA node0 CPU(s): 0-7
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm
Note: the VMs are as close as possible in their hardware capabilities: same type and amount of RAM, same disks, network cards and bandwidth. The CPUs are also as similar as possible, but there are some differences:
CPU frequency: 3000 MHz (x86_64) vs 2400 MHz (aarch64)
BogoMIPS: 6000 (x86_64) vs 200 (aarch64)
Level 1 caches: 128 KiB (x86_64) vs 512 KiB (aarch64)
Both VMs run Ubuntu 20.04 with the latest software updates.
HAProxy is built from source from the master branch, so it might have a few changes since the cut of the haproxy-2.2 tag!
HA-Proxy version 2.3-dev0 2020/07/07 - https://haproxy.org/
Status: development branch - not safe for use in production.
Known bugs: https://github.com/haproxy/haproxy/issues?q=is:issue+is:open
Running on: Linux 5.4.0-40-generic #44-Ubuntu SMP Mon Jun 22 23:59:48 UTC 2020 aarch64
Build options :
TARGET = linux-glibc
CPU = generic
CC = clang-9
CFLAGS = -O2 -Wall -Wextra -Wdeclaration-after-statement -fwrapv -Wno-address-of-packed-member -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-string-plus-int -Wtype-limits -Wshift-negative-value -Wnull-dereference -Werror
OPTIONS = USE_PCRE=1 USE_PCRE_JIT=1 USE_OPENSSL=1 USE_LUA=1 USE_ZLIB=1 USE_DEVICEATLAS=1 USE_51DEGREES=1 USE_WURFL=1 USE_SYSTEMD=1
Feature list : +EPOLL -KQUEUE +NETFILTER +PCRE +PCRE_JIT -PCRE2 -PCRE2_JIT +POLL -PRIVATE_CACHE +THREAD -PTHREAD_PSHARED +BACKTRACE -STATIC_PCRE -STATIC_PCRE2 +TPROXY +LINUX_TPROXY +LINUX_SPLICE +LIBCRYPT +CRYPT_H +GETADDRINFO +OPENSSL +LUA +FUTEX +ACCEPT4 +ZLIB -SLZ +CPU_AFFINITY +TFO +NS +DL +RT +DEVICEATLAS +51DEGREES +WURFL +SYSTEMD -OBSOLETE_LINKER +PRCTL +THREAD_DUMP -EVPORTS
Default settings :
bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
Built with multi-threading support (MAX_THREADS=64, default=8).
Built with OpenSSL version : OpenSSL 1.1.1f 31 Mar 2020
Running on OpenSSL version : OpenSSL 1.1.1f 31 Mar 2020
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.3
Built with DeviceAtlas support (dummy library only).
Built with 51Degrees Pattern support (dummy library).
Built with WURFL support (dummy library version 1.11.2.100)
Built with network namespace support.
Built with zlib version : 1.2.11
Running on zlib version : 1.2.11
Compression algorithms supported : identity("identity"), deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND
Built with PCRE version : 8.39 2016-06-14
Running on PCRE version : 8.39 2016-06-14
PCRE library supports JIT : yes
Encrypted password support via crypt(3): yes
Built with clang compiler version 9.0.1
Available polling systems :
epoll : pref=300, test result OK
poll : pref=200, test result OK
select : pref=150, test result OK
Total: 3 (3 usable), will use epoll.
Available multiplexer protocols :
(protocols marked as <default> cannot be specified using 'proto' keyword)
fcgi : mode=HTTP side=BE mux=FCGI
<default> : mode=HTTP side=FE|BE mux=H1
h2 : mode=HTTP side=FE|BE mux=H2
<default> : mode=TCP side=FE|BE mux=PASS
Available services : none
Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace
[CACHE] cache
[FCGI] fcgi-app
I've tried to fine-tune it as much as I could by following all the best practices I was able to find in the official documentation and on the web.
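The full HAProxy config file is not reproduced in this copy of the post. A minimal sketch consistent with the settings described below (32 threads, cpu-map 1/all 0-7, leastconn balancing, four local Golang backends, and a plain bind line next to a TLS one) could look like this; the backend ports, timeouts and certificate path are assumptions rather than the exact values used in the tests:

global
    nbthread 32
    cpu-map 1/all 0-7

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend fe_main
    bind :8080
    # bind :8080 ssl crt /etc/haproxy/server.pem   # swap with the line above for TLS termination
    default_backend be_golang

backend be_golang
    balance leastconn
    server s1 127.0.0.1:8081
    server s2 127.0.0.1:8082
    server s3 127.0.0.1:8083
    server s4 127.0.0.1:8084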
This way HAProxy is used as a load balancer in front of four HTTP servers.
To also use it as an SSL terminator one just needs to comment out line 34 and uncomment line 35 of the config (i.e. switch from the plain bind line to the one with ssl and a certificate).
I achieved the best results with the multithreaded setup. As the documentation says, this is the recommended setup anyway, but it also gave me almost twice the throughput! The best results were with 32 threads: the throughput kept increasing from 8 to 16 and from 16 to 32 threads, but dropped with 64 threads.
I've also pinned the threads so that each one stays on the same CPU for its lifetime, with cpu-map 1/all 0-7.
The other important setting is the algorithm used to balance between the backends. Just like in Willy Tarreau's tests, leastconn gave the best performance for me.
As recommended in the HAProxy Enterprise documentation, I've disabled irqbalance.
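On Ubuntu 20.04, disabling it amounts to stopping the service and keeping it from starting at boot, for example:

sudo systemctl stop irqbalance
sudo systemctl disable irqbalance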
Finally I've applied a set of kernel (sysctl) settings.
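The exact sysctl values are not included in this copy of the post. A representative set for this kind of tuning might look like the lines below; fs.file-max matches the limits.conf entries that follow, while the other values are illustrative assumptions:

fs.file-max = 500000
net.core.somaxconn = 65535
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_max_syn_backlog = 60000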
fs.file-max is also related to a change in /etc/security/limits.conf:
root soft nofile 500000
root hard nofile 500000
* soft nofile 500000
* hard nofile 500000
For the backends I used very simple HTTP servers written in Golang. They just write “Hello World” back to the client, without reading from or writing to disk or the network:
package main

// run with: env PORT=8081 go run http-server.go

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

func main() {
	port := os.Getenv("PORT")
	if port == "" {
		log.Fatal("Please specify the HTTP port as environment variable, e.g. env PORT=8081 go run http-server.go")
	}

	// Respond to every request with a static body; no disk or network I/O.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hello World")
	})

	log.Fatal(http.ListenAndServe(":"+port, nil))
}
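As the load testing client I have used WRK with the same setup as for testing Apache Tomcat. The exact command line is not shown in this copy, but an invocation consistent with the result headers below (30 second runs, 8 threads, 96 connections) would be along the lines of:

wrk -t8 -c96 -d30s http://192.168.0.232:8080/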
And now the results:

aarch64, HTTP
Running 30s test @ http://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.67ms 8.82ms 196.74ms 89.85%
Req/Sec 2.60k 337.06 5.79k 75.79%
621350 requests in 30.09s, 75.85MB read
Requests/sec: 20651.69
Transfer/sec: 2.52MB
x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.32ms 4.46ms 75.42ms 94.58%
Req/Sec 4.71k 538.41 8.84k 82.41%
1127664 requests in 30.10s, 137.65MB read
Requests/sec: 37464.85
Transfer/sec: 4.57MB
aarch64, HTTPS
Running 30s test @ https://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 7.92ms 12.50ms 248.52ms 91.18%
Req/Sec 2.42k 338.67 4.34k 80.88%
578210 requests in 30.08s, 70.58MB read
Requests/sec: 19220.81
Transfer/sec: 2.35MB
x86_64, HTTPS
Running 30s test @ https://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.56ms 4.83ms 111.51ms 94.25%
Req/Sec 4.46k 609.37 7.23k 85.60%
1066831 requests in 30.07s, 130.23MB read
Requests/sec: 35474.26
Transfer/sec: 4.33MB
What we see here is:
that HAProxy is almost twice as fast on the x86_64 VM as on the aarch64 VM!
and also that TLS offloading decreases the throughput by around 5-8%
Update 1 (Jul 10 2020): To check whether the Golang-based HTTP servers are the bottleneck in the above testing, I've decided to run the same WRK load tests directly against one of the backends, i.e. skipping HAProxy.
aarch64, HTTP
Running 30s test @ http://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 615.31us 586.70us 22.44ms 90.61%
Req/Sec 20.05k 1.57k 42.29k 73.62%
4794299 requests in 30.09s, 585.24MB read
Requests/sec: 159319.75
Transfer/sec: 19.45MB
x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 774.24us 484.99us 36.43ms 97.04%
Req/Sec 15.28k 413.04 16.89k 73.57%
3658911 requests in 30.10s, 446.64MB read
Requests/sec: 121561.40
Transfer/sec: 14.84MB
Here we see that the HTTP server running on aarch64 is around 30% faster than on x86_64!
And the more important observation is that the throughput is several times better when not using a load balancer at all! I think the problem here is in my setup: both HAProxy and the 4 backend servers run on the same VM, so they fight for resources! I will pin the Golang servers to their own CPU cores and let HAProxy use only the other 4 CPU cores! Stay tuned for an update!
Update 2 (Jul 10 2020):
To pin the processes to specific CPUs I will use numactl. I've pinned the Golang HTTP servers with:
numactl --cpunodebind=0 --membind=0 --physcpubind=4 env PORT=8081 go run etc/haproxy/load/http-server.go
i.e. this backend instance is pinned to CPU node 0 and to physical CPU 4. The other three backend servers are pinned respectively to physical CPUs 5, 6 and 7.
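For reference, starting all four pinned backends could be scripted along these lines; the numactl flags and file path follow the command above, while the ports 8081-8084 are assumptions:

for i in 1 2 3 4; do
    numactl --cpunodebind=0 --membind=0 --physcpubind=$((i + 3)) \
        env PORT=$((8080 + i)) go run etc/haproxy/load/http-server.go &
done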
I've also slightly changed the HAProxy configuration.
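The modified configuration lines are not shown in this copy; based on the description that follows, the relevant global-section settings would be something like:

    nbthread 4
    cpu-map 1/all 0-3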
i.e. HAProxy will spawn 4 threads and they will be pinned to physical CPUs 0–3.
With these changes the results stayed the same for aarch64:
Running 30s test @ https://192.168.0.232:8080
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.44ms 2.11ms 36.48ms 88.36%
Req/Sec 4.98k 651.34 6.62k 74.40%
596102 requests in 30.10s, 72.77MB read
Requests/sec: 19804.19
Transfer/sec: 2.42MB
but dropped for x86_64:
Running 30s test @ https://192.168.0.206:8080
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 767.40us 153.24us 19.07ms 97.72%
Req/Sec 5.21k 173.41 5.51k 63.46%
623911 requests in 30.10s, 76.16MB read
Requests/sec: 20727.89
Transfer/sec: 2.53MB
and same for HTTP (no TLS):
aarch64
Running 30s test @ http://192.168.0.232:8080
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.40ms 2.16ms 36.55ms 88.08%
Req/Sec 5.55k 462.65 6.97k 69.85%
665269 requests in 30.10s, 81.21MB read
Requests/sec: 22102.12
Transfer/sec: 2.70MB
x86_64
Running 30s test @ http://192.168.0.206:8080
4 threads and 16 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 726.01us 125.04us 6.42ms 93.95%
Req/Sec 5.51k 165.80 5.80k 57.24%
658777 requests in 30.10s, 80.42MB read
Requests/sec: 21886.50
Transfer/sec: 2.67MB
So now HAProxy is a bit faster on aarch64 than on x86_64 but still far slower than the “no load balancer” approach with 120 000+ requests per second.
Update 3 (Jul 10 2020): After seeing that the performance of the Golang HTTP server is so good (120-160K reqs/sec), and to simplify the setup, I've decided to remove the CPU pinning from Update 2 and to use the backends from the other VM: when hitting HAProxy on the aarch64 VM it will load balance between the backends running on the x86_64 VM, and when WRK hits HAProxy on the x86_64 VM it will use the Golang HTTP servers running on the aarch64 VM. And here are the new results:
aarch64, HTTP
Running 30s test @ http://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.33ms 4.93ms 76.85ms 89.14%
Req/Sec 2.10k 316.84 3.52k 74.50%
501840 requests in 30.07s, 61.26MB read
Requests/sec: 16688.53
Transfer/sec: 2.04MB
x86_64, HTTP
Running 30s test @ http://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 5.32ms 6.71ms 71.29ms 90.25%
Req/Sec 3.26k 639.12 4.14k 65.52%
779297 requests in 30.08s, 95.13MB read
Requests/sec: 25908.50
Transfer/sec: 3.16MB
aarch64, HTTPS
Running 30s test @ https://192.168.0.232:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 6.17ms 5.41ms 292.21ms 91.08%
Req/Sec 2.13k 238.74 3.85k 86.32%
506111 requests in 30.09s, 61.78MB read
Requests/sec: 16821.60
Transfer/sec: 2.05MB
x86_64, HTTPS
Running 30s test @ https://192.168.0.206:8080
8 threads and 96 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 3.40ms 2.54ms 58.66ms 97.27%
Req/Sec 3.82k 385.85 4.55k 92.10%
914329 requests in 30.10s, 111.61MB read
Requests/sec: 30376.95
Transfer/sec: 3.71MB
Happy hacking and stay safe!

Translator: wangxiyuan. Author: Martin Grigorov. Original article: https://medium.com/@martin.grigorov/compare-haproxy-performance-on-x86-64-and-arm64-cpu-architectures-bfd55d1d5566
This is a performance test report of the latest HAProxy, written by Apache Tomcat PMC member Martin Grigorov.