haproxytech / kubernetes-ingress

HAProxy Kubernetes Ingress Controller
https://www.haproxy.com/documentation/kubernetes/
Apache License 2.0

Alpine (musl) based haproxy ingress images performance issue #541

Open amorozkin opened 1 year ago

amorozkin commented 1 year ago

Could you please consider adding an option to use non-alpine based haproxy ingress images?

Alpine's pthread implementation has a drastic CPU overhead (internals/details can be found here: https://stackoverflow.com/questions/73807754/how-one-pthread-waits-for-another-to-finish-via-futex-in-linux/73813907#73813907).
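
For context, the linked answer comes down to how musl implements pthread wake-ups via futexes. A minimal sketch of how to compare the two libcs outside of HAProxy (purely illustrative, not part of the original measurements) is to build the same mutex/condvar ping-pong in a glibc-based image and an Alpine image, run each binary under strace -cf, and compare the futex counts in the two summaries:

/* futex_demo.c - hypothetical micro-benchmark, not from the original report.
 * Two threads take turns incrementing a counter under a mutex/condvar pair,
 * the pattern that shows up as futex(FUTEX_WAKE_PRIVATE) in strace.
 * Build the same source in a glibc image and an Alpine (musl) image, e.g.:
 *   gcc -O2 futex_demo.c -o futex_demo -lpthread
 *   strace -cf ./futex_demo
 * and compare the futex call counts reported by the two summaries. */
#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000L

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    long parity = (long)arg;

    for (;;) {
        pthread_mutex_lock(&lock);
        /* Wait until it is this thread's turn (even/odd counter value). */
        while (counter % 2 != parity && counter < ITERATIONS)
            pthread_cond_wait(&cond, &lock);
        if (counter >= ITERATIONS) {
            pthread_mutex_unlock(&lock);
            break;
        }
        counter++;
        /* Waking the peer thread is where the futex wake-ups come from. */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];

    pthread_create(&t[0], NULL, worker, (void *)0L);
    pthread_create(&t[1], NULL, worker, (void *)1L);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    printf("final counter: %ld\n", counter);
    return 0;
}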

Here are two strace statistics samples for the same load profile (25K RPS via 3 haproxy ingress pods) over the same period of time (about 1 minute):

1. GLIBC based haproxy:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 47.55  147.946790          53   2787268    880506 recvfrom
 26.33   81.933249          88    929414           sendto
 16.81   52.295309          54    962217           epoll_ctl
  3.37   10.486387          51    203040           getpid
  1.48    4.597493          51     90048           clock_gettime
  1.41    4.380619          97     44924           epoll_wait
  0.64    2.003053          54     36497           getsockopt
  0.56    1.731618          97     17829           close
  0.51    1.582058          56     28118           setsockopt
  0.39    1.207813          66     18144      8945 accept4
  0.38    1.188416         116     10223     10223 connect
  0.29    0.903808          88     10223           socket
  0.18    0.548180          53     10223           fcntl
  0.10    0.299368          79      3785      1130 futex
  0.00    0.011658          60       193           timer_settime
  0.00    0.010546          54       193        30 rt_sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00  311.126365               5152339    900834 total

2. MUSL based haproxy:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 68.24  412.454997          96   4259899    419280 futex
 10.00   60.440537         120    502107           madvise
  8.74   52.833292         111    472438    121948 recvfrom
  4.22   25.477060         166    152913           sendto
  2.80   16.921311         107    157293           getpid
  2.26   13.680361         109    125062           epoll_ctl
  1.38    8.351141         119     69682           writev
  0.54    3.254861         106     30535           clock_gettime
  0.37    2.255775         148     15187           epoll_pwait
  0.34    2.033282         178     11419           close
  0.31    1.844610         117     15724      5964 accept4
  0.25    1.530881         110     13850           setsockopt
  0.25    1.509742         107     14001           getsockopt
  0.08    0.466851         157      2966           munmap
  0.06    0.392208         170      2294      2294 connect
  0.06    0.378519         107      3505           mmap
  0.05    0.287839         125      2294           socket
  0.04    0.234976         102      2294           fcntl
  0.00    0.014530          94       154           timer_settime
  0.00    0.014262          92       154        15 rt_sigreturn
  0.00    0.006613         143        46        23 read
  0.00    0.003571         148        24           write
  0.00    0.003377         241        14           shutdown
------ ----------- ----------- --------- --------- ----------------
100.00  604.390596               5853855    549524 total

As you can see, in the MUSL-based case more than 60% of the time is spent on futex system calls (FUTEX_WAKE_PRIVATE, to be exact). As a result, CPU utilisation is more than twice as high for the same load profile, accompanied by spikes in the number of upstream sessions (see attached image).

PKizzle commented 1 year ago

I tested it on my Raspberry Pi but did not encounter such a huge performance difference. What TLS ciphers were used in the graph above?

amorozkin commented 1 year ago

I tested it on my Raspberry Pi but did not encounter such a huge performance difference. What TLS ciphers were used in the graph above?

In both cases the same haproxy config was used with TLS options:

  ssl-default-bind-options no-sslv3 no-tlsv10 no-tlsv11
  ssl-default-bind-ciphers TLS13-AES-256-GCM-SHA384:TLS13-AES-128-GCM-SHA256:TLS13-CHACHA20-POLY1305-SHA256:EECDH+AESGCM:EECDH+CHACHA20
  tune.ssl.default-dh-param 2048

Both HAProxy itself and the upstream (a single one in the test above) use 4096-bit TLS certificates (the annotation haproxy.org/server-ssl: "true" is configured in the ingress).

K8s nodes: KVM VMs (Ubuntu 20.04.4 LTS, 5.4.0-109-generic, k8s version v1.23.4)

PODs:

        resources:
          limits:
            cpu: "12"
            memory: 24Gi
          requests:
            cpu: "10"
            memory: 24Gi
....
      securityContext:
        sysctls:
        - name: net.ipv4.ip_local_port_range
          value: 1024 65535
        - name: net.ipv4.tcp_rmem
          value: 8192 87380 33554432
        - name: net.ipv4.tcp_wmem
          value: 8192 65536 33554432
        - name: net.ipv4.tcp_max_syn_backlog
          value: "20000"
        - name: net.core.somaxconn
          value: "20000"
        - name: net.ipv4.tcp_tw_reuse
          value: "1"
        - name: net.ipv4.tcp_syncookies
          value: "0"
        - name: net.ipv4.tcp_slow_start_after_idle
          value: "0"
        - name: net.ipv4.tcp_fin_timeout
          value: "30"
        - name: net.ipv4.tcp_keepalive_time
          value: "30"
        - name: net.ipv4.tcp_keepalive_intvl
          value: "10"
        - name: net.ipv4.tcp_keepalive_probes
          value: "3"
        - name: net.ipv4.tcp_no_metrics_save
          value: "1"

Haproxy: nbthread: "8"

IMHO TLS handshakes should not play a big role here, since keepalive connections are used on both ends: client<->haproxy and haproxy<->upstream.

dkorunic commented 10 months ago

@amorozkin I am reasonably sure this is not related to Alpine MUSL at all, but related to OpenSSL 3.0/3.1 mutex contention issues. I suspect your Glibc-based distribution is using OpenSSL 1.1.1, isn't it?
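
If that is the case, it can be verified directly: running haproxy -vv inside each container reports the TLS library the binary was built with. As a purely illustrative alternative (assuming the images link libcrypto dynamically and a compiler is available), a few lines of C print the runtime OpenSSL version:

/* openssl_version.c - hypothetical helper, not from the original report.
 * Build inside the image under test, e.g.:
 *   gcc openssl_version.c -o openssl_version -lcrypto
 * and run it to see which OpenSSL the image actually ships. */
#include <stdio.h>
#include <openssl/crypto.h>

int main(void)
{
    /* OpenSSL_version() is available since OpenSSL 1.1.0. */
    printf("%s\n", OpenSSL_version(OPENSSL_VERSION));
    return 0;
}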