Open cangyin opened 8 months ago
With the default IPsec tunnels, this happens because the IPsec protocol doesn’t support splitting tunnels, and the encryption and packet ordering must happen on a single core.
We have considered in the past enabling multiple parallel tunnels, on one gateway or across multiple gateways (in the latter case, with added HA benefits); but that requires deciding how to split traffic across the available tunnels.
In parallel, protocol extensions are being discussed to allow IPsec tunnels to be split to avoid these bottlenecks; see the current draft for details. It seems preferable for Submariner to support that, once it becomes available, instead of coming up with its own solution.
For performance-critical scenarios, especially in cases where a dedicated (private) network is available between gateways, Submariner supports VXLAN tunnels instead of IPsec.
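For reference, selecting the VXLAN cable driver is a join-time option in the standard `subctl` workflow. A minimal sketch — the broker-info file name and cluster ID below are placeholders for your environment:

```shell
# Join a cluster to the broker using the unencrypted VXLAN cable driver
# instead of the default IPsec-based one. Only appropriate when the
# network between gateways is already private/trusted.
subctl join broker-info.subm --clusterid cluster-a --cable-driver vxlan
```
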
The 56% performance drop in question is exactly that of the inter-cluster VXLAN tunnel (vxlan-tunnel VTEP), which delivers 44% of the underlying capacity of a single NIC, while the IPsec tunnel delivers 26%.
Intuitively, multiple active VXLAN gateways seem easier to implement.
> protocol extensions are being discussed to allow IPsec tunnels to be split to avoid these bottlenecks
If we go this path, we should likely raise a more specific issue.
Need more investigation before prioritizing. First we want to implement load-balancer mode.
Decided to push this to a following release.
What would you like to be added:
Multiple active gateways for higher inter-cluster data transfer performance.
Why is this needed:
Currently there is only one gateway per cluster. Per the benchmark results in #2890, there is a significant performance drop (about 56%) for pods running on non-gateway nodes. Suppose the gateway node has a 10 Gbit/s NIC: DBMS servers running on non-gateway nodes then share only about 550 MByte/s (44% of 10 Gbit/s), so the whole cluster can theoretically transfer at most about 47 TB of data per day, which is unacceptable for production clusters (FYI, a small- to medium-sized production ClickHouse cluster can receive more than 40 TB of data per day).
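The arithmetic above can be checked directly; recomputing from the 10 Gbit/s line rate and the ~56% drop measured in #2890 (decimal units, 1 TB = 1e12 bytes) gives figures close to those quoted:

```python
# Throughput arithmetic behind the issue's claim.
NIC_GBITS = 10          # gateway NIC line rate, Gbit/s
VXLAN_FRACTION = 0.44   # capacity remaining after the ~56% drop (#2890)

bytes_per_sec = NIC_GBITS * 1e9 * VXLAN_FRACTION / 8
tb_per_day = bytes_per_sec * 86_400 / 1e12

print(f"{bytes_per_sec / 1e6:.0f} MByte/s")  # -> 550 MByte/s
print(f"{tb_per_day:.1f} TB/day")            # -> 47.5 TB/day
```
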