googleforgames / quilkin

Quilkin is a non-transparent UDP proxy specifically designed for use with large scale multiplayer dedicated game server deployments, to ensure security, access control, telemetry data, metrics and more.
Apache License 2.0

Add session affinity support to load balancer #374

Open iffyio opened 3 years ago

iffyio commented 3 years ago

We currently have a naive load balancer that doesn't account for any packet information like the source address (each packet is routed independently according to the lb policy). That is usually unwanted behavior; instead we should send subsequent packets from the same client to the same server.

Implementation-wise, we can introduce lookup state in the load balancer: if a client->server pair already exists in the map, there is no need to consult the lb policy impl (we already know where to route the packet).
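
As a rough illustration of that lookup (the names here are hypothetical, not existing Quilkin types):

use std::collections::HashMap;
use std::net::SocketAddr;

// Hypothetical client -> upstream table, consulted before the lb policy:
// once a client has an entry here, its packets keep going to the same upstream.
type SessionMap = HashMap<SocketAddr /* client */, SocketAddr /* chosen upstream */>;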

gilesheron commented 3 years ago

PR #381 achieves this using a source hash.

But yes, there are cases where we'd want to track session state :)

iffyio commented 3 years ago

Ah, not exactly, if I understood the PR correctly. The goal here is to remember which endpoint we chose for a client, so that e.g. if the order of endpoints in the list changes (the order isn't guaranteed at all atm) we keep routing to the same endpoint; the hashing in #381 won't handle that properly. And this isn't specific to hashing: the other policies should do the same, e.g. a random policy should select an endpoint at random, but only do that once per client.

gilesheron commented 3 years ago

Well yes, if you want to survive changes in the endpoint lists then you'll need something more complex than a simple hash.

Consistent hashing will solve that without requiring session state. See: https://en.wikipedia.org/wiki/Consistent_hashing

markmandel commented 3 years ago

Are we happy to close this issue now (just merged #381), or do we think we'll be looking at other specific types of hashing?

iffyio commented 3 years ago

We can't close this yet, #381 doesn't resolve it. The idea here is that if the lb selects an upstream A for a client, all packets from that client should go to A, at least until A goes away, regardless of which policy is being used. The current hashing in #381 doesn't do this either: e.g. the endpoint lists [A, B, C] and [B, C, A] are equivalent, but the current implementation will pick a different upstream for a client if the order changes, since we're blindly using offsets to index into the list.
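
To make the order-sensitivity concrete, a toy illustration in plain Rust (the modulo-by-index scheme here is an assumed simplification, not the actual #381 code):

// Toy illustration of why indexing by hash is order-sensitive: the same
// client hash lands on a different endpoint once the list is reordered.
fn pick(endpoints: &[&'static str], client_hash: usize) -> &'static str {
    endpoints[client_hash % endpoints.len()]
}

fn main() {
    let client_hash = 7; // stand-in for a hash of the client's source address
    assert_eq!(pick(&["A", "B", "C"], client_hash), "B"); // 7 % 3 == 1 -> "B"
    assert_eq!(pick(&["B", "C", "A"], client_hash), "C"); // same client, different upstream
}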

gilesheron commented 3 years ago

to fix that we'd have to sort the endpoints before picking one. That way the hash would always land on the same endpoint.

Am happy to make the change.

Beyond that you could implement a consistent hash - as I understand the usual algorithm you map each endpoint onto multiple positions on a "wheel" of hash values. When hashing you pick the "next" endpoint on the wheel. If an endpoint dies you remove all its entries from the wheel so that those flows get remapped across the remaining endpoints, but flows currently mapped to other endpoints don't get remapped.
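
For illustration, a minimal sketch of that wheel in plain Rust (not Quilkin code; DefaultHasher and the number of virtual positions per endpoint are arbitrary choices):

use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

// Each endpoint occupies several virtual positions on the wheel (a sorted map
// keyed by hash). A client is served by the next endpoint clockwise from its hash.
struct Ring {
    wheel: BTreeMap<u64, String>,
    virtual_nodes: u32,
}

impl Ring {
    fn new(virtual_nodes: u32) -> Self {
        Ring { wheel: BTreeMap::new(), virtual_nodes }
    }

    fn add(&mut self, endpoint: &str) {
        for i in 0..self.virtual_nodes {
            self.wheel.insert(hash_of(&(endpoint, i)), endpoint.to_string());
        }
    }

    // Removing an endpoint only remaps the flows that were on its positions;
    // flows mapped to other endpoints stay where they are.
    fn remove(&mut self, endpoint: &str) {
        for i in 0..self.virtual_nodes {
            self.wheel.remove(&hash_of(&(endpoint, i)));
        }
    }

    fn pick(&self, client_addr: &str) -> Option<&str> {
        let h = hash_of(&client_addr);
        self.wheel
            .range(h..) // first position at or after the client's hash...
            .next()
            .or_else(|| self.wheel.iter().next()) // ...or wrap around to the start of the wheel
            .map(|(_, endpoint)| endpoint.as_str())
    }
}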

iffyio commented 3 years ago

Ah, so to clarify, this ticket isn't specific to hashing in any way; it's for all lb policies we'll have. It would be wasteful to sort the list on each packet in this case, and we'd still have the issue that when the list changes, sessions get broken. Consistent hashing won't solve this either unfortunately; it only minimises the number of sessions that break (some will still break).

Thankfully though, since we have all the information we need in memory, there isn't really any reason why adding a new endpoint should break any ongoing sessions. E.g. if we have a client X talking to endpoint C in endpoint list [A, B, C], its session shouldn't break if the list changes to [B, C, D]. We can achieve this with a map that tracks the fact that X is currently talking to C, and that should be enough; the endpoint_chooser policy impls we have won't even know about it (they're only consulted when there's no entry in the map for a client). Something like:

fn read(&self, mut ctx: ReadContext) -> Option<ReadResponse> {
  if let Some(upstream) = self.state.get(&ctx.from) {
    // We already have a session for this client, so reuse its upstream.
    // (this keep isn't by index; we'd need to extend ctx.endpoints to be smarter)
    ctx.endpoints.keep(upstream);
  } else {
    // Otherwise ask the lb policy to select an upstream for this client...
    let upstream = self.endpoint_chooser
      .choose_endpoints(&mut ctx.endpoints, ctx.from);
    ctx.endpoints.keep(upstream);
    // ...and remember it for next time.
    self.state.insert(ctx.from, upstream);
  }
  Some(ctx.into())
}
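
For context, a hypothetical shape for the fields that sketch assumes (the names and types are illustrative only; the sketch above also elides the locking a shared map would need, since read takes &self):

use std::collections::HashMap;
use std::net::SocketAddr;
use std::sync::Mutex;

// Hypothetical filter state backing the sketch above.
struct LoadBalancer {
    endpoint_chooser: Box<dyn EndpointChooser>,
    // client source address -> upstream chosen for it; wrapped in a Mutex for
    // interior mutability (a concurrent map would avoid locking per packet).
    state: Mutex<HashMap<SocketAddr, SocketAddr>>,
}

// Stand-in for the existing policy trait; the real trait and signature may differ.
trait EndpointChooser: Send + Sync {
    fn choose_endpoint(&self, client: SocketAddr) -> SocketAddr;
}
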
gilesheron commented 3 years ago

ah cool - if you already have session state that's great. So there's no need for the hashed implementation if you have this since clients will be pinned to their existing endpoints.

that also neatly solves my next thing. I've implemented an LB mode where we call out to an external API using gRPC. Currently I call the API once per packet, and I was about to add a user cache there. But it'd be easier just to use your solution, I think?

iffyio commented 3 years ago

Yeah, I think that would be easier with the new lb policy since it won't need to include its own cache.

gilesheron commented 3 years ago

yeah - for a demo I may just hack my own cache in and then remove it. I've got the API stuff working now - it's pretty cute. We can route traffic based on the source IP/port (so e.g. you can direct customers from ISP A to backend X and from ISP B to backend Y).