googleforgames / agones

Dedicated Game Server Hosting and Scaling for Multiplayer Games on Kubernetes
https://agones.dev
Apache License 2.0

Multicluster: Add gRPC dial timeout #1700

Closed: adamhosier closed this issue 3 years ago

adamhosier commented 4 years ago

What happened:

When using the multicluster allocator, dialling unhealthy remote clusters can cause allocation requests to block for 2+ minutes.

What you expected to happen:

A sensible default, or a configuration option, to control the dial timeout of multicluster allocation gRPC requests.
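
For illustration, here is a minimal gRPC-Go sketch of what a bounded dial looks like. It is not the Agones allocator code; the helper name, the 10-second value and the endpoint are assumptions for the example.

// A sketch of a dial bounded by a configurable timeout using gRPC-Go; the
// context deadline decides how long a blocking dial to an unresponsive
// endpoint may hang, instead of a multi-minute TCP timeout.
package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

// dialRemoteAllocator is a hypothetical helper, not part of Agones.
func dialRemoteAllocator(endpoint string, dialTimeout time.Duration) (*grpc.ClientConn, error) {
    ctx, cancel := context.WithTimeout(context.Background(), dialTimeout)
    defer cancel()

    // WithBlock makes DialContext wait until the connection is ready (or the
    // context deadline expires), so an unreachable endpoint fails fast.
    return grpc.DialContext(ctx, endpoint,
        grpc.WithBlock(),
        // Plain credentials keep the sketch self-contained; the multicluster
        // allocator itself uses the client certs from the policy's secret.
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )
}

func main() {
    conn, err := dialRemoteAllocator("my-blackhole-server.com:443", 10*time.Second)
    if err != nil {
        log.Printf("remote allocation endpoint unreachable: %v", err)
        return
    }
    defer conn.Close()
}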

How to reproduce it (as minimally and precisely as possible):

Deploy a GameServerAllocationPolicy that allocates to an endpoint that blackholes connections. We have observed this issue with a DNS name that resolves to the IP of a Google load balancer that has no configured upstreams (because the cluster has been taken down for maintenance, for example).

apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: test
spec:
  connectionInfo:
    allocationEndpoints:
    - my-blackhole-server.com
    clusterName: test
    secretName: test
  priority: 10
  weight: 100

Observe that allocation requests following this policy will block for ~150 seconds before responding.

Anything else we need to know?:

Our use case is slightly more involved than the minimal repro scenario: we'd like a second, low-priority cluster configured as a "failover" for when the first cluster is unhealthy or down for maintenance. The behaviour we've observed is that when the high-priority cluster blackholes requests (as is the default behaviour when a GKE cluster with GCP load balancers is taken down), allocation requests take ~150 seconds to succeed because they block waiting for the high-priority cluster to respond.

Environment:

pooneh-m commented 4 years ago

There is retry logic that retries 7 times to send the request to the high-priority cluster, waiting 100ms * 2^try between attempts.

That should take around 25s if the request fails immediately. Can you share your logs showing the type of error the service is encountering? It could be that the request is failing with DEADLINE_EXCEEDED, which suggests the default gRPC timeout is being reached. DEADLINE_EXCEEDED would add the 10s default gRPC deadline per retry, so in this case it could take 70s + 25s, which is close to what you are experiencing.
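
For reference, a small sketch of that backoff schedule (not the allocator's actual retry code), just to show where the ~25s figure comes from:

// The backoff schedule described above: 100ms * 2^try for 7 tries.
package main

import (
    "fmt"
    "time"
)

func main() {
    base := 100 * time.Millisecond
    var totalBackoff time.Duration
    for try := 1; try <= 7; try++ {
        wait := base * time.Duration(1<<try) // 100ms * 2^try: 200ms, 400ms, ..., 12.8s
        totalBackoff += wait
        fmt.Printf("try %d: wait %v\n", try, wait)
    }
    // If every attempt fails immediately, the waits alone sum to ~25.4s.
    // If each attempt instead hits a 10s gRPC deadline (DEADLINE_EXCEEDED),
    // add roughly 7 * 10s on top, giving the 70s + 25s estimate above.
    fmt.Printf("total backoff: %v\n", totalBackoff)
}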

Some potential solutions:

1. Tune the retries for this scenario (e.g. on DEADLINE_EXCEEDED, fail over to the secondary cluster).
2. If a cluster fails after all the retries, take it out of remote allocation for 10m before trying again (see the sketch below).
3. Immediate failover, i.e. don't retry at all.
4. Reduce the gRPC timeout.
5. As you suggested, have a configurable cap on the retry timeout.
6. Verify the health of the service using healthz.
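
For illustration, a rough sketch of what option 2 could look like: a per-endpoint cool-down that skips a cluster for 10 minutes after it fails all retries. All names here are hypothetical; this is not a proposed implementation.

// Skip a remote cluster for a cool-down period after it fails all retries.
package main

import (
    "fmt"
    "sync"
    "time"
)

type clusterCooldown struct {
    mu    sync.Mutex
    until map[string]time.Time // endpoint -> time it becomes eligible again
}

func newClusterCooldown() *clusterCooldown {
    return &clusterCooldown{until: make(map[string]time.Time)}
}

// Skip reports whether the endpoint is still inside its cool-down window.
func (c *clusterCooldown) Skip(endpoint string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    return time.Now().Before(c.until[endpoint])
}

// MarkFailed starts a 10-minute cool-down after all retries to the endpoint fail.
func (c *clusterCooldown) MarkFailed(endpoint string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.until[endpoint] = time.Now().Add(10 * time.Minute)
}

func main() {
    cd := newClusterCooldown()
    ep := "my-blackhole-server.com:443"
    fmt.Println(cd.Skip(ep)) // false: nothing has failed yet
    cd.MarkFailed(ep)
    fmt.Println(cd.Skip(ep)) // true: skipped for the next 10 minutes
}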

adamhosier commented 4 years ago

Hi @pooneh-m, I can see the retry logic taking effect in the logs. I've redacted some information, but the logs are as follows:

2020-07-21T13:57:52.07 "allocation request received"
2020-07-21T13:58:12.07 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:58:32.17 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:58:52.37 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:59:12.77 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:59:33.57 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T13:59:55.17 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T14:00:18.37 "remote allocation failed with: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp <REDACTED ADDRESS>: i/o timeout\""
2020-07-21T14:00:18.37 "remote-allocation failed"

It seems the gRPC dial timeout here is 20 seconds (the per-attempt gaps in the logs are roughly 20.0s, 20.1s, 20.2s, 20.4s, 20.8s, 21.6s and 23.2s), with ~6 seconds added by the backoff, for ~146 seconds blocked in total. Sadly, I imagine there are situations where we would want to keep retrying on "Unavailable" codes, but the other solutions you offered seem reasonable to me.

pooneh-m commented 4 years ago

This is very helpful, thanks! The formula for the retry backoff is sum over try = 1..7 of 100ms * 2^try = 100*2 + 100*2^2 + ... + 100*2^7 ms = 25.4s

It seems that it took 14:00:18 - 13:57:52 = 146s, which is roughly 7 * 20s (timeout) + ~6s (backoff).

I think the gRPC timeout should be set to 10s for (4), with a default cap on the total retry timeout of 30s if the helm config for (5) is not set. Does that seem reasonable for your case?
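
As a sketch of how (4) and (5) could combine: a 10s per-attempt deadline inside a 30s overall budget. The names, values and retry count mirror the discussion above but are illustrative, not the Agones implementation.

// Combine a per-attempt deadline with an overall cap on the retry loop.
package main

import (
    "context"
    "fmt"
    "time"
)

// allocateWithCap retries an allocation attempt up to 7 times. Each attempt
// gets a 10s deadline, and the outer 30s context bounds the whole loop.
func allocateWithCap(ctx context.Context, attempt func(context.Context) error) error {
    ctx, cancel := context.WithTimeout(ctx, 30*time.Second) // total retry budget
    defer cancel()

    backoff := 100 * time.Millisecond
    var lastErr error
    for try := 1; try <= 7; try++ {
        attemptCtx, attemptCancel := context.WithTimeout(ctx, 10*time.Second) // per-attempt deadline
        lastErr = attempt(attemptCtx)
        attemptCancel()
        if lastErr == nil {
            return nil
        }

        backoff *= 2 // 200ms, 400ms, ... following the 100ms * 2^try schedule
        select {
        case <-ctx.Done(): // total budget exhausted, stop retrying
            return fmt.Errorf("retry budget exhausted: %w", lastErr)
        case <-time.After(backoff):
        }
    }
    return lastErr
}

func main() {
    start := time.Now()
    err := allocateWithCap(context.Background(), func(ctx context.Context) error {
        <-ctx.Done() // simulate an endpoint that never answers
        return ctx.Err()
    })
    fmt.Printf("gave up after %v: %v\n", time.Since(start).Round(time.Second), err)
}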

In the meantime, I highly recommend updating the GameServerAllocationPolicy resource during the maintenance window so that traffic is not directed to the cluster that is taken down for maintenance. This cloud solution also provides a management solution for GameServerAllocationPolicy: if your cluster is down for maintenance, it removes the GameServerAllocationPolicy from other clusters in your realm to stop forwarding multi-cluster allocation requests to your cluster.

adamhosier commented 4 years ago

I think the gRPC timeout should be set to 10s for (4), with a default cap on the total retry timeout of 30s if the helm config for (5) is not set. Does that seem reasonable for your case?

Very reasonable, yes - introducing a total cap of 30s would be a great start, with configuration for this option being a bonus for us.

I agree with your interim solution of adjusting the allocation policies manually where possible; we'll definitely be doing that where we can. But it's those unplanned "maintenance windows" that worry us a bit here 😉

aLekSer commented 4 years ago

Another possible solution is adding new fields to the CRD: backoffCap and timeout. It should be ready soon; a colleague of ours is going to open a PR shortly. This would be an enhancement of the previous PR and a more general approach.

robbieheywood commented 4 years ago

@aLekSer - that sounds fantastic - looking forward to it!

pooneh-m commented 3 years ago

Another possible solution is adding new fields to the CRD: backoffCap and timeout. It should be ready soon; a colleague of ours is going to open a PR shortly. This would be an enhancement of the previous PR and a more general approach.

Why do we need to add a per-cluster setting for the timeout and backoffCap? Do we expect every cluster to have a different timeout setting? If not, an environment setting for agones-allocator should suffice.

Since the API is stable, every change to the fields should go through a feature stage, starting with alpha.
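
For illustration, a sketch of the environment-setting alternative mentioned above: one allocator-wide timeout read from an environment variable, with a fallback default. The variable name and the 10s default are hypothetical, not existing Agones settings.

// Read a single allocator-wide remote allocation timeout from the environment.
package main

import (
    "log"
    "os"
    "time"
)

func remoteAllocationTimeout() time.Duration {
    const def = 10 * time.Second
    raw := os.Getenv("REMOTE_ALLOCATION_TIMEOUT") // hypothetical variable name
    if raw == "" {
        return def
    }
    d, err := time.ParseDuration(raw) // accepts values like "10s" or "500ms"
    if err != nil {
        log.Printf("invalid REMOTE_ALLOCATION_TIMEOUT %q, falling back to %v", raw, def)
        return def
    }
    return d
}

func main() {
    log.Printf("using remote allocation timeout %v", remoteAllocationTimeout())
}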