Closed: travisgroth closed this issue 4 years ago.
If you can reliably reproduce this, can you enable logging and include the output here (or in a pastebin)?
Yes, very easy to repro:
INFO: 2020/02/04 01:05:10 parsed scheme: "dns"
INFO: 2020/02/04 01:05:10 ccResolverWrapper: sending update to cc: {[{10.1.6.16:443 <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/02/04 01:05:10 base.baseBalancer: got new ClientConn state: {{[{10.1.6.16:443 <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/02/04 01:05:10 base.baseBalancer: handle SubConn state change: 0xc00007f8b0, CONNECTING
INFO: 2020/02/04 01:05:10 base.baseBalancer: handle SubConn state change: 0xc00007f8b0, READY
INFO: 2020/02/04 01:05:10 roundrobinPicker: newPicker called with info: {map[0xc00007f8b0:{{10.1.6.16:443 <nil> 0 <nil>}}]}
INFO: 2020/02/04 01:06:40 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/02/04 01:06:40 base.baseBalancer: handle SubConn state change: 0xc00007f8b0, CONNECTING
INFO: 2020/02/04 01:06:40 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/02/04 01:06:40 grpc: addrConn.createTransport failed to connect to {10.1.6.16:443 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.16:443: connect: connection refused". Reconnecting...
INFO: 2020/02/04 01:06:40 base.baseBalancer: handle SubConn state change: 0xc00007f8b0, TRANSIENT_FAILURE
WARNING: 2020/02/04 01:06:40 grpc: failed dns A record lookup due to lookup pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 01:06:40 ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/02/04 01:06:40 base.baseBalancer: got new ClientConn state: {{[] <nil> <nil>} <nil>}
INFO: 2020/02/04 01:06:40 base.baseBalancer: handle SubConn state change: 0xc00007f8b0, SHUTDOWN
It looks like I was missing something in the chain of events. I was testing a small footprint deployment and for a brief period there might be zero endpoints, thus the resolution error. This seems to be a state it won't recover from. There were no further log messages even upon subsequent retries.
If I start with one replica of my service, add a second one, then remove the first:
INFO: 2020/02/04 01:41:57 parsed scheme: "dns"
INFO: 2020/02/04 01:41:57 ccResolverWrapper: sending update to cc: {[{10.1.6.21:443 <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/02/04 01:41:57 base.baseBalancer: got new ClientConn state: {{[{10.1.6.21:443 <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/02/04 01:41:57 base.baseBalancer: handle SubConn state change: 0xc000162390, CONNECTING
INFO: 2020/02/04 01:41:57 base.baseBalancer: handle SubConn state change: 0xc000162390, READY
INFO: 2020/02/04 01:41:57 roundrobinPicker: newPicker called with info: {map[0xc000162390:{{10.1.6.21:443 <nil> 0 <nil>}}]}
INFO: 2020/02/04 01:52:35 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/02/04 01:52:35 base.baseBalancer: handle SubConn state change: 0xc000162390, CONNECTING
INFO: 2020/02/04 01:52:35 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/02/04 01:52:35 grpc: addrConn.createTransport failed to connect to {10.1.6.21:443 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.21:443: connect: connection refused". Reconnecting...
INFO: 2020/02/04 01:52:35 base.baseBalancer: handle SubConn state change: 0xc000162390, TRANSIENT_FAILURE
INFO: 2020/02/04 01:52:35 ccResolverWrapper: sending update to cc: {[{10.1.6.28:443 <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/02/04 01:52:35 base.baseBalancer: got new ClientConn state: {{[{10.1.6.28:443 <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/02/04 01:52:35 base.baseBalancer: handle SubConn state change: 0xc00049f290, CONNECTING
INFO: 2020/02/04 01:52:35 roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/02/04 01:52:35 base.baseBalancer: handle SubConn state change: 0xc000162390, SHUTDOWN
INFO: 2020/02/04 01:52:35 base.baseBalancer: handle SubConn state change: 0xc00049f290, READY
INFO: 2020/02/04 01:52:35 roundrobinPicker: newPicker called with info: {map[0xc00049f290:{{10.1.6.28:443 <nil> 0 <nil>}}]}
In this case we've missed the addition of a second endpoint until the first one fails. The good news is there is no user error, but we're still not connecting to all the available endpoints as they change.
Here's the 1.25.1 behavior: (a) I start with 1 service replica and simply restart it. We get some transient failures, but the new endpoint is eventually added:
INFO: 2020/02/04 02:39:04 parsed scheme: "dns"
INFO: 2020/02/04 02:39:04 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:39:04 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:39:04 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:39:04 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0 <nil>}]
INFO: 2020/02/04 02:39:04 base.baseBalancer: got new ClientConn state: {{[{10.1.6.31:443 0 <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:39:04 base.baseBalancer: handle SubConn state change: 0xc00017d180, CONNECTING
INFO: 2020/02/04 02:39:04 base.baseBalancer: handle SubConn state change: 0xc00017d180, READY
INFO: 2020/02/04 02:39:04 roundrobinPicker: newPicker called with readySCs: map[{10.1.6.31:443 0 <nil>}:0xc00017d180]
INFO: 2020/02/04 02:39:34 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:39:34 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:39:34 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:39:34 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0 <nil>}]
INFO: 2020/02/04 02:39:34 base.baseBalancer: got new ClientConn state: {{[{10.1.6.31:443 0 <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:40:04 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:40:04 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:40:04 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:40:04 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0 <nil>}]
INFO: 2020/02/04 02:40:04 base.baseBalancer: got new ClientConn state: {{[{10.1.6.31:443 0 <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:40:34 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:40:34 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:40:34 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:40:34 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0 <nil>}]
INFO: 2020/02/04 02:40:34 base.baseBalancer: got new ClientConn state: {{[{10.1.6.31:443 0 <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:41:04 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:41:04 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:41:04 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:41:04 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0 <nil>}]
INFO: 2020/02/04 02:41:04 base.baseBalancer: got new ClientConn state: {{[{10.1.6.31:443 0 <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:41:08 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/02/04 02:41:08 base.baseBalancer: handle SubConn state change: 0xc00017d180, CONNECTING
INFO: 2020/02/04 02:41:08 roundrobinPicker: newPicker called with readySCs: map[]
WARNING: 2020/02/04 02:41:08 grpc: addrConn.createTransport failed to connect to {10.1.6.31:443 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.31:443: connect: connection refused". Reconnecting...
INFO: 2020/02/04 02:41:08 base.baseBalancer: handle SubConn state change: 0xc00017d180, TRANSIENT_FAILURE
INFO: 2020/02/04 02:41:10 base.baseBalancer: handle SubConn state change: 0xc00017d180, CONNECTING
INFO: 2020/02/04 02:41:10 roundrobinPicker: newPicker called with readySCs: map[]
WARNING: 2020/02/04 02:41:10 grpc: addrConn.createTransport failed to connect to {10.1.6.31:443 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.31:443: connect: connection refused". Reconnecting...
INFO: 2020/02/04 02:41:10 base.baseBalancer: handle SubConn state change: 0xc00017d180, TRANSIENT_FAILURE
INFO: 2020/02/04 02:41:11 base.baseBalancer: handle SubConn state change: 0xc00017d180, CONNECTING
INFO: 2020/02/04 02:41:11 roundrobinPicker: newPicker called with readySCs: map[]
WARNING: 2020/02/04 02:41:31 grpc: addrConn.createTransport failed to connect to {10.1.6.31:443 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.31:443: i/o timeout". Reconnecting...
INFO: 2020/02/04 02:41:31 base.baseBalancer: handle SubConn state change: 0xc00017d180, TRANSIENT_FAILURE
INFO: 2020/02/04 02:41:34 base.baseBalancer: handle SubConn state change: 0xc00017d180, CONNECTING
INFO: 2020/02/04 02:41:34 roundrobinPicker: newPicker called with readySCs: map[]
INFO: 2020/02/04 02:41:34 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:41:35 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:41:35 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:41:35 ccResolverWrapper: sending new addresses to cc: [{10.1.6.44:443 0 <nil>}]
INFO: 2020/02/04 02:41:35 base.baseBalancer: got new ClientConn state: {{[{10.1.6.44:443 0 <nil>}] <nil>} <nil>}
WARNING: 2020/02/04 02:41:35 grpc: addrConn.createTransport failed to connect to {10.1.6.31:443 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.31:443: operation was canceled". Reconnecting...
INFO: 2020/02/04 02:41:35 base.baseBalancer: handle SubConn state change: 0xc00019eaa0, CONNECTING
INFO: 2020/02/04 02:41:35 base.baseBalancer: handle SubConn state change: 0xc00017d180, SHUTDOWN
INFO: 2020/02/04 02:41:35 base.baseBalancer: handle SubConn state change: 0xc00019eaa0, READY
INFO: 2020/02/04 02:41:35 roundrobinPicker: newPicker called with readySCs: map[{10.1.6.44:443 0 <nil>}:0xc00019eaa0]
(b) I start with 1 service replica, then add a second. It shows up ~30 seconds later so that I have two endpoints available:
INFO: 2020/02/04 02:34:59 parsed scheme: "dns"
INFO: 2020/02/04 02:34:59 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:34:59 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:34:59 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:34:59 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0 <nil>}]
INFO: 2020/02/04 02:34:59 base.baseBalancer: got new ClientConn state: {{[{10.1.6.31:443 0 <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:34:59 base.baseBalancer: handle SubConn state change: 0xc0005579c0, CONNECTING
INFO: 2020/02/04 02:34:59 base.baseBalancer: handle SubConn state change: 0xc0005579c0, READY
INFO: 2020/02/04 02:34:59 roundrobinPicker: newPicker called with readySCs: map[{10.1.6.31:443 0 <nil>}:0xc0005579c0]
INFO: 2020/02/04 02:35:29 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:35:29 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:35:29 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:35:29 ccResolverWrapper: sending new addresses to cc: [{10.1.6.31:443 0 <nil>}]
INFO: 2020/02/04 02:35:29 base.baseBalancer: got new ClientConn state: {{[{10.1.6.31:443 0 <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:35:59 grpc: failed dns SRV record lookup due to lookup _grpclb._tcp.pomerium-authorize.default.svc.cluster.local on 10.96.0.10:53: no such host.
INFO: 2020/02/04 02:35:59 ccResolverWrapper: got new service config:
WARNING: 2020/02/04 02:35:59 ccResolverWrapper: error parsing service config: no JSON service config provided
INFO: 2020/02/04 02:35:59 ccResolverWrapper: sending new addresses to cc: [{10.1.6.41:443 0 <nil>} {10.1.6.31:443 0 <nil>}]
INFO: 2020/02/04 02:35:59 base.baseBalancer: got new ClientConn state: {{[{10.1.6.41:443 0 <nil>} {10.1.6.31:443 0 <nil>}] <nil>} <nil>}
INFO: 2020/02/04 02:35:59 base.baseBalancer: handle SubConn state change: 0xc000556b90, CONNECTING
INFO: 2020/02/04 02:35:59 base.baseBalancer: handle SubConn state change: 0xc000556b90, READY
INFO: 2020/02/04 02:35:59 roundrobinPicker: newPicker called with readySCs: map[{10.1.6.31:443 0 <nil>}:0xc0005579c0 {10.1.6.41:443 0 <nil>}:0xc000556b90]
> It looks like I was missing something in the chain of events. I was testing a small footprint deployment and for a brief period there might be zero endpoints, thus the resolution error. This seems to be a state it won't recover from.
Interesting. I think this is an important step for reproducing; I'll take a look at this.
> In this case we've missed the addition of a second endpoint until the first one fails. The good news is there is no user error, but we're still not connecting to all the available endpoints as they change.
This sounds like it's working as intended. The DNS resolver does not regularly poll as of 1.26; it only refreshes when connection errors occur. This is a behavior change from 1.25 (see #3228), which is why the new backend appears there without any error occurring. You can set MaxConnectionAge on the server to cause connections to be gracefully stopped periodically, which will trigger a re-resolve on the client.
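For reference, a minimal sketch of that server-side setting using grpc-go's keepalive package; the durations below are placeholders, not recommendations:

```go
// Sketch: gracefully cycle server connections on a schedule so that clients
// receive a GOAWAY, reconnect, and re-resolve DNS. Durations are illustrative.
package main

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      5 * time.Minute,  // gracefully close connections after ~5 minutes
		MaxConnectionAgeGrace: 30 * time.Second, // give in-flight RPCs time to finish
	}))
}
```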
@travisgroth if you don't mind, could you run your scenario with zero endpoints using #3361 and confirm this fixes the problem, i.e. the ClientConn should now connect to the new endpoint when it is added. Thanks!
(a) Great! It seems to be resolved with that branch. We get a failed SubConn, the balancer continues through the empty address list, and eventually a new SubConn reaches READY:
INFO: 2020/02/05 00:19:40 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2020/02/05 00:19:40 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, CONNECTING
INFO: 2020/02/05 00:19:40 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/02/05 00:19:40 grpc: addrConn.createTransport failed to connect to {10.1.6.71:443 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.71:443: connect: connection refused". Reconnecting...
INFO: 2020/02/05 00:19:40 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, TRANSIENT_FAILURE
INFO: 2020/02/05 00:19:40 ccResolverWrapper: sending update to cc: {[] <nil> <nil>}
INFO: 2020/02/05 00:19:40 base.baseBalancer: got new ClientConn state: {{[] <nil> <nil>} <nil>}
INFO: 2020/02/05 00:19:41 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, CONNECTING
INFO: 2020/02/05 00:19:41 roundrobinPicker: newPicker called with info: {map[]}
WARNING: 2020/02/05 00:20:01 grpc: addrConn.createTransport failed to connect to {10.1.6.71:443 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.71:443: i/o timeout". Reconnecting...
INFO: 2020/02/05 00:20:01 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, TRANSIENT_FAILURE
INFO: 2020/02/05 00:20:03 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, CONNECTING
INFO: 2020/02/05 00:20:03 roundrobinPicker: newPicker called with info: {map[]}
INFO: 2020/02/05 00:20:10 ccResolverWrapper: sending update to cc: {[{10.1.6.76:443 <nil> 0 <nil>}] <nil> <nil>}
INFO: 2020/02/05 00:20:10 base.baseBalancer: got new ClientConn state: {{[{10.1.6.76:443 <nil> 0 <nil>}] <nil> <nil>} <nil>}
INFO: 2020/02/05 00:20:10 base.baseBalancer: handle SubConn state change: 0xc00046e410, CONNECTING
INFO: 2020/02/05 00:20:10 base.baseBalancer: handle SubConn state change: 0xc0005b86f0, SHUTDOWN
WARNING: 2020/02/05 00:20:10 grpc: addrConn.createTransport failed to connect to {10.1.6.71:443 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.1.6.71:443: operation was canceled". Reconnecting...
INFO: 2020/02/05 00:20:10 base.baseBalancer: handle SubConn state change: 0xc00046e410, READY
INFO: 2020/02/05 00:20:10 roundrobinPicker: newPicker called with info: {map[0xc00046e410:{{10.1.6.76:443 <nil> 0 <nil>}}]}
(b) I suspect there are more corner cases that might trip over the complete removal of regular client-side endpoint polling. Artificially closing connections on a schedule seems like a disruptive way to force a rediscovery interval, and it takes a lot of control over discovery semantics away from the client. I wouldn't mind having to explicitly enable the feature, but being able to trigger client-side discovery outside of a server-initiated close sounds much more resilient. Thoughts?
@travisgroth thanks for testing! When this is merged I'll make a patch release with it.
Regarding (b), this is a common concern; however, I don't think it's a problem in practice. Please see https://github.com/grpc/grpc/issues/12295#issuecomment-432833454 (and several other comments in that issue) for more discussion on this topic. Note that HTTP/1 recreates connections far more often, so cycling these connections periodically (every 20 minutes or so) shouldn't have a major impact. If you need more responsive discovery, something other than DNS, i.e. a resolver that doesn't need to be polled at all, would be the recommended approach.
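For completeness, here is a minimal sketch of that alternative using grpc-go's manual resolver, where a watcher pushes endpoint updates instead of anything being polled. The scheme, addresses, and the use of WithInsecure are placeholders, and the API names assume a reasonably recent grpc-go release:

```go
// Sketch: a push-based resolver. Whatever watches the service registry
// (e.g. a Kubernetes endpoints watch) calls UpdateState when endpoints change.
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/resolver"
	"google.golang.org/grpc/resolver/manual"
)

func dialWithPushUpdates() (*grpc.ClientConn, *manual.Resolver, error) {
	r := manual.NewBuilderWithScheme("example")
	r.InitialState(resolver.State{
		Addresses: []resolver.Address{{Addr: "10.1.6.31:443"}},
	})

	cc, err := grpc.Dial(
		r.Scheme()+":///my-service",
		grpc.WithInsecure(), // placeholder; use real transport credentials
		grpc.WithResolvers(r),
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
	)
	return cc, r, err
}

// Later, when the watcher observes a change:
//   r.UpdateState(resolver.State{Addresses: []resolver.Address{{Addr: "10.1.6.44:443"}}})
```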
I appreciate the quick response on this!
After reviewing the thread, I see the reasoning. I'm happy to sideline the DNS discussion for the moment and see how the settings work in practice.
What version of gRPC are you using?
1.26
What version of Go are you using (go version)?
1.13.4
What operating system (Linux, Windows, …) and version?
Linux / Kubernetes / Docker - various.
What did you do?
When using grpc.WithBalancerName(roundrobin.Name) with a dns:/// dial target, errors do not trigger re-resolve.

Detail: a client starts with dns:///my-service as a dial target and the round_robin load balancer. my-service DNS returns an IP, 192.168.0.19. Due to an operational change, my-service is moved and DNS is updated to return 192.168.0.20. The service stops running on 192.168.0.19, refusing connections.

This is approximately the workflow that would occur when using Kubernetes and a headless service if a pod is replaced. This happens for a variety of reasons, including routine updates and rescheduling.
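For illustration, a minimal sketch of the client setup described above, using the grpc-go 1.26-era API mentioned in this issue (the target and credentials are placeholders for the real deployment):

```go
// Sketch of the reported setup: a dns:/// target with the round_robin balancer.
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/balancer/roundrobin"
)

func dial() (*grpc.ClientConn, error) {
	return grpc.Dial(
		"dns:///my-service",                    // DNS resolver; port defaults to 443
		grpc.WithInsecure(),                    // placeholder; the real deployment uses TLS
		grpc.WithBalancerName(roundrobin.Name), // round-robin across resolved addresses
	)
}
```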
What did you expect to see?
(a) Upon receiving a connection refused error, DNS should be re-queried for an updated endpoint list.
(b) Even without receiving an error from an existing endpoint, the load balancer should occasionally be repopulated with the latest DNS records as the endpoint set changes.
This was the behavior prior to 1.26. By rolling back to grpc 1.25.1, we were able to restore it.
https://github.com/grpc/grpc-go/pull/3165 is likely the culprit.
What did you see instead?
(a) The endpoint list is never re-queried and the client just returns a persistent error.
(b) Since every subsequent request fails, the endpoint list does not appear to be getting updated with new DNS records at all.
Additional
I recognize that the APIs around DNS and load balancing may have changed recently (and are still evolving), but I'm not seeing any clear indication of how to retain the expected behavior.