loxilb-io / loxilb

eBPF based cloud-native load-balancer for Kubernetes|Edge|Telco|IoT|XaaS.
https://www.loxilb.io
Apache License 2.0

SCTP: >10x BPF program runtime compared to TCP #447

Closed by luisgerhorst 10 months ago

luisgerhorst commented 10 months ago

I ran a variant of your iperf2/3 TCP/SCTP benchmark and also measured the time spent in the BPF programs loaded by loxilb. For a 10s single-threaded iperf2/3 TCP benchmark, around 0.5s are spent in the BPF program (total system CPU time spent is 60-70s) achieving ~10Gbit/s.

For an iperf3 SCTP benchmark, 8.0s are spent in the BPF program, which is much more. Is this expected? The SCTP benchmark achieves approx. 200Mbit/s while also spending approx. 60s of CPU time.
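To put the two runs on a common scale, here is a rough per-byte estimate from the figures above. It is only a sketch: it assumes both runs lasted 10s and that the quoted rates are averages over the whole run. The function name `bpf_ns_per_byte` is made up for this illustration.

```python
# Back-of-the-envelope BPF cost per transferred byte, from the figures quoted above.
# Assumption: both benchmarks ran for 10 s and the rates are run-wide averages.

DURATION_S = 10.0


def bpf_ns_per_byte(bpf_seconds: float, rate_bits_per_s: float) -> float:
    """Nanoseconds of BPF program runtime per byte transferred (hypothetical helper)."""
    bytes_transferred = rate_bits_per_s * DURATION_S / 8
    return bpf_seconds * 1e9 / bytes_transferred


tcp = bpf_ns_per_byte(0.5, 10e9)    # ~10 Gbit/s with 0.5 s spent in BPF
sctp = bpf_ns_per_byte(8.0, 200e6)  # ~200 Mbit/s with 8.0 s spent in BPF

print(f"TCP:   {tcp:.2f} ns/byte")
print(f"SCTP:  {sctp:.2f} ns/byte")
print(f"ratio: {sctp / tcp:.0f}x")
```

Under these assumptions the per-byte gap is far larger than the 16x difference in absolute BPF runtime, because the SCTP run moves much less data in the same time. A per-packet comparison would need packet counts, which I did not record here.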

nik-netlox commented 10 months ago

Hi @luisgerhorst, this may be due to checksum calculation. TCP checksum calculations are offloaded, but there is no support for offloading the SCTP checksum, so loxilb calculates the SCTP checksum in its eBPF core engine. If you disable checksum calculations at your client and server side and run loxilb with --disable-csum option then you will see better performance numbers.
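For readers unfamiliar with the two checksums, the sketch below contrasts them. SCTP uses a CRC32c over the whole packet, while TCP uses the 16-bit one's-complement Internet checksum. This is not loxilb's code. It is a deliberately naive Python illustration, and a real implementation would more likely be table-driven or use hardware support. The point is only that the CRC touches every byte of the packet in software when no offload is available.

```python
# Naive reference implementations, for illustration only (not loxilb's code).

def crc32c(data: bytes) -> int:
    """Bitwise CRC32c (Castagnoli), the checksum algorithm SCTP uses."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):  # eight shift/xor steps for every byte
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum (RFC 1071), the algorithm TCP uses."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


print(hex(crc32c(b"123456789")))
print(hex(internet_checksum(bytes.fromhex("0001f203f4f5f6f7"))))
```

The first call uses the standard CRC check string and the second uses the worked example from RFC 1071, so both results can be compared against published values.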

If you happen to see low throughput for multi-threaded iperf3 SCTP traffic, this may be due to RSS distribution in Linux. For SCTP, the traffic is not well distributed and goes to only a couple of cores, which leads to a bottleneck. However, we have a workaround where loxilb takes care of the RSS distribution. Please run loxilb with --rss-enable option then you will see better performance in that case too.
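As a toy illustration of how uneven distribution can arise, the sketch below compares hashing flows on the full 4-tuple with hashing on the IP pair only. It rests on an assumption not stated in this thread: that the receive path falls back to an address-only hash for SCTP. The addresses, ports, queue count, and the use of `zlib.crc32` in place of a real NIC hash are all made up for the example.

```python
# Toy model of receive-side scaling (RSS); not how any particular NIC or loxilb works.
import zlib
from collections import Counter

NUM_QUEUES = 8  # hypothetical number of receive queues / cores


def queue_for(fields: tuple) -> int:
    """Map a tuple of header fields to a queue (crc32 stands in for the NIC hash)."""
    key = "|".join(map(str, fields)).encode()
    return zlib.crc32(key) % NUM_QUEUES


# 16 hypothetical flows between the same two hosts, differing only in source port.
flows = [("10.0.0.1", "10.0.0.2", 40000 + i, 5201) for i in range(16)]

four_tuple = Counter(queue_for(f) for f in flows)     # addresses + ports
two_tuple = Counter(queue_for(f[:2]) for f in flows)  # addresses only (assumed SCTP fallback)

print("4-tuple hash, flows per queue:", dict(four_tuple))
print("2-tuple hash, flows per queue:", dict(two_tuple))
```

With the address-only hash, every flow between the same pair of hosts lands on a single queue, which is the kind of bottleneck described above.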

luisgerhorst commented 10 months ago

> If you disable checksum calculations at your client and server side and run loxilb with --disable-csum option then you will see better performance numbers.

Thanks! As I am trying to run a realistic test, I assume it is better to keep it enabled. Or would you disagree? (i.e., do you think people frequently disable it for SCTP?)

> multi-threaded iperf3 SCTP traffic

At least on Debian 10, multi-threaded iperf3 does not seem to exist. Or were you referring to running multiple instances of iperf3 against a single server?

man iperf3:

-P, --parallel n number of parallel client streams to run. Note that iperf3 is single threaded, so if you are CPU bound, this will not yield higher throughput.

This also explains why I could not reproduce the TCP iperf2 numbers with iperf3. Thus I decided to stick to single-threaded benchmarks as they should be sufficient to measure the overhead of my kernel BPF runtime patches.

> Please run loxilb with --rss-enable option then you will see better performance in that case too.

As this does not appear to have any downsides I will do that, thanks!

luisgerhorst commented 10 months ago

Closing because the original issue has been resolved.