Open Raynos opened 7 years ago
Figuring out which serviceNames to enable with this "degraded load balancing" strategy is going to be involved.
One strategy is to find the service names where choosePeer() dominates the CPU.
In theory, there should only be ~10ish service names that have both "high QPS" and a "high number of peers", which is what causes choosePeer() to dominate the flamegraph.
We could also add a timing stat that's cluster-wide and only tagged by service name.
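A rough sketch of what such a timing stat could look like, assuming the selection call can be wrapped at a single site. `ChoosePeerTimer`, `time`, and `meanMs` are hypothetical names for illustration and not part of any existing stats API; the clock is injected because a millisecond clock is probably too coarse for a single choosePeer() call.

```typescript
// Hypothetical accumulator: one entry per service name, no other tags.
interface TimingTotals {
  count: number;
  totalMs: number;
}

class ChoosePeerTimer {
  private totals: Map<string, TimingTotals> = new Map();
  private now: () => number;

  // The clock is injectable; the default is only a placeholder.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Runs fn and records how long it took under the given service name.
  time<T>(serviceName: string, fn: () => T): T {
    const start = this.now();
    try {
      return fn();
    } finally {
      const elapsed = this.now() - start;
      const entry = this.totals.get(serviceName) ?? { count: 0, totalMs: 0 };
      entry.count += 1;
      entry.totalMs += elapsed;
      this.totals.set(serviceName, entry);
    }
  }

  // Mean time per call for a service, or 0 if nothing was recorded.
  meanMs(serviceName: string): number {
    const entry = this.totals.get(serviceName);
    return entry && entry.count > 0 ? entry.totalMs / entry.count : 0;
  }
}
```

Sorting services by total or mean time would then give a candidate list without reading flamegraphs one worker at a time.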
From a flame graph I've observed that some workers / services are really struggling with peer selection.
We could implement a random peer selection strategy and add a flipr property that lets us change the peer selection strategy per serviceName.
We already have boolean logic to enable / disable peer heap per serviceName.
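One way the per-serviceName switch could be shaped, as a minimal sketch. It assumes the existing boolean is generalized to a named strategy; `StrategyConfig`, `setOverride`, and `strategyFor` are hypothetical names, and how the overrides get populated from remote config is left out.

```typescript
// Hypothetical strategy names; "heap" stands for the current peer heap.
type Strategy = "heap" | "random";

class StrategyConfig {
  private overrides: Map<string, Strategy> = new Map();
  private defaultStrategy: Strategy;

  constructor(defaultStrategy: Strategy) {
    this.defaultStrategy = defaultStrategy;
  }

  // Called when the remote config for a service changes.
  setOverride(serviceName: string, strategy: Strategy): void {
    this.overrides.set(serviceName, strategy);
  }

  // Removes an override so the service falls back to the default.
  clearOverride(serviceName: string): void {
    this.overrides.delete(serviceName);
  }

  // Services without an override keep the default strategy.
  strategyFor(serviceName: string): Strategy {
    return this.overrides.get(serviceName) ?? this.defaultStrategy;
  }
}
```

Keeping the heap as the default means only the handful of explicitly listed services would see degraded load balancing.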
A round robin peer selection will reduce CPU utilization and slightly degrade load balancing by increasing variance.
If round robin is too involved, we can also just implement random peer selection.
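For comparison, here is a minimal sketch of both cheaper strategies, assuming the peers for a service are available as a plain array. The `Peer` shape and the two selector classes are hypothetical and do not reflect the actual peer objects or the real choosePeer() signature.

```typescript
// Hypothetical peer shape; a real peer carries much more state.
interface Peer {
  hostPort: string;
}

// Round robin: constant time per call, but needs a cursor.
class RoundRobinSelector {
  private peers: Peer[];
  private next = 0;

  constructor(peers: Peer[]) {
    this.peers = peers;
  }

  choosePeer(): Peer | null {
    if (this.peers.length === 0) return null;
    const peer = this.peers[this.next % this.peers.length] ?? null;
    this.next = (this.next + 1) % this.peers.length;
    return peer;
  }
}

// Random: constant time per call and no state beyond the peer list.
class RandomSelector {
  private peers: Peer[];
  private random: () => number;

  // random must return a value in [0, 1), like Math.random.
  constructor(peers: Peer[], random: () => number = Math.random) {
    this.peers = peers;
    this.random = random;
  }

  choosePeer(): Peer | null {
    if (this.peers.length === 0) return null;
    const index = Math.floor(this.random() * this.peers.length);
    return this.peers[index] ?? null;
  }
}
```

Neither selector looks at pending request counts or peer health, which is where the extra variance in load balancing comes from. The random variant avoids even the cursor bookkeeping when peers are added or removed.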