This issue is to plan & discuss the performance optimizations that should go into v1.14.
Problem: Currently STUNner UDP performance is limited at about 100-200 kpps per UDP listener (i.e., per UDP Gateway/listener in the Kubernetes Gateway API terminology). This is because we allocate a single net.PacketConn per UDP listener, which is then drained by a single CPU thread/go-routine. This means that all client allocations made via that listener will share the same CPU thread and there is no way to load-balance client allocations across CPUs; i.e., each listener is restricted to a single CPU. If STUNner is exposed via a single UDP listener (the most common setting) then it will be restricted to about 1200-1500 mcore.
Notes:
This is not a problem in Kubernetes: instead of vertical scaling (letting a single STUNner instance use as many CPUs as available), Kubernetes defaults to horizontal scaling; if a single stunnerd pod is a bottleneck we simply fire up more (e.g., using HPA). In fact, the single-CPU restriction makes HPA simpler, since the CPU triggers are easier to set (e.g., we have to scale out when the average CPU load approaches 1000 mcores); when the application can vertically scale to some arbitrary number of CPUs by itself, we never know how to set the CPU trigger for HPA (this is when vertical scaling interferes with horizontal scaling). Eventually we'll have as many pods as CPU cores and Kubernetes will readily load-balance client connections across our pods. This makes us wonder whether to solve the vertical scaling problem at all, since there is very little use for such a feature in Kubernetes.
The single-CPU restriction applies per UDP listener: if STUNner is exposed via multiple UDP TURN listeners then each listener will receive a separate CPU thread.
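For reference, the HPA sizing logic described in the note above could look roughly like the following sketch; all names and thresholds here are illustrative, not taken from the actual STUNner manifests:

```yaml
# Hypothetical HPA for a stunnerd Deployment: with the single-CPU
# restriction, scaling out as a pod approaches 1000 mcores is a
# natural, easy-to-reason-about trigger.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: stunner-hpa            # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: stunner              # assumed Deployment name
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: AverageValue
          averageValue: 900m   # scale out before a pod saturates its single core
```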
Solution: The plan is to create a separate net.Conn for each UDP allocation, by (1) sharing the same listener server address using REUSEADDR/REUSEPORT, (2) connecting each per-allocation socket back to the client (this turns the net.PacketConn into a connected net.Conn), and (3) firing up a separate read-loop go-routine for each allocation/socket. Extreme care must be taken in implementing this, though: if we blindly created a new socket per received UDP packet, a simple UDP port scan could DoS the TURN listener.
Plan:
Move per-allocation connection creation to after the client has authenticated with the server, e.g., when the TURN allocation request has been successfully processed. Note that this still allows a client with a valid credential to DoS the server, so we need to quota per-client connections.
Implement per-client quotas as per RFC 8656, Section 7.2 ("Receiving an Allocate Request"), point 10:
At any point, the server MAY choose to reject the request with a 486 (Allocation Quota Reached) error if it feels the client is trying to exceed some locally defined allocation quota. The server is free to define this allocation quota any way it wishes, but it SHOULD define it based on the username used to authenticate the request and not on the client's transport address.
Expose the client quota via turn.ServerConfig. Possibly also expose a setting to let users opt in to per-allocation CPU load-balancing.
Test and upstream.
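The per-username quota from the plan above could be sketched as follows; the type and method names are hypothetical, not from STUNner or pion/turn:

```go
package main

import (
	"fmt"
	"sync"
)

// allocQuota counts live allocations per username: RFC 8656, Section 7.2,
// point 10 recommends keying the quota on the username used to authenticate
// the request, not on the client's transport address. Hypothetical sketch.
type allocQuota struct {
	mu    sync.Mutex
	limit int
	live  map[string]int
}

func newAllocQuota(limit int) *allocQuota {
	return &allocQuota{limit: limit, live: make(map[string]int)}
}

// Acquire reserves one allocation slot for user; a false return means the
// server should reject the Allocate request with error 486
// (Allocation Quota Reached).
func (q *allocQuota) Acquire(user string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.live[user] >= q.limit {
		return false
	}
	q.live[user]++
	return true
}

// Release frees the slot when the allocation is deleted or expires.
func (q *allocQuota) Release(user string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.live[user] > 0 {
		q.live[user]--
	}
}

func main() {
	q := newAllocQuota(2)
	fmt.Println(q.Acquire("alice"), q.Acquire("alice"), q.Acquire("alice")) // → true true false
}
```

Acquire would run while processing an Allocate request (after authentication), and Release when the allocation is deleted or times out.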
Feedback appreciated.