moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0

Distribution DNS Failover #45594

Open jcurtis789 opened 1 year ago

jcurtis789 commented 1 year ago

Description

Hello,

I have an FQDN configured for an on-premises registry such that two servers answer to the FQDN (e.g. www.my-private-repo.com). These servers run a JFrog Artifactory repository, although I believe that is irrelevant to this issue.

When both servers are healthy, performing a docker push/docker pull works flawlessly. However, I noticed that if one server goes down (for maintenance or otherwise), these operations fail.

Reproduce

To reproduce, I have configured a custom FQDN with two DNS entries: one resolving to one of the real servers hosting my image repository, and another resolving to a real record that points at a fake (non-serving) server.

nslookup www.my-private-repo.com ... Address 1.2.3.4 ... Address 5.6.7.8

nslookup 1.2.3.4 ... name = my-fake-server

nslookup 5.6.7.8 ... name = my-real-server

A 'docker pull' times out after 15 seconds. Running with --debug doesn't provide any more information. When my-fake-server is removed from www.my-private-repo.com, pulls begin working again.

Expected behavior

I would expect the Docker CLI to detect that one of the DNS entries is faulty (i.e. my-fake-server), whether via a connection timeout, a 502, or otherwise, and to attempt the request against another entry (i.e. my-real-server). Perhaps I am missing a configuration option to do so.

Docker Hub (registry.hub.docker.com) resolves via DNS to three separate IPs. If one of these were to become unavailable, I would expect similar behavior, but with much more widespread issues across the community, unless I'm perhaps missing something.

docker version

Client: Docker Engine - Community
 Version:           20.10.14
 API version:       1.41
 Go version:        go1.16.15
 Git commit:        a224086
 Built:             Thu Mar 24 01:49:57 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.14
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.15
  Git commit:       87a90dc
  Built:            Thu Mar 24 01:48:24 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.5.11
  GitCommit:        3df54a852345ae127d1fa3092b95168e4a88e2f8
 runc:
  Version:          1.0.3
  GitCommit:        v1.0.3-0-gf46b6ba
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.1-docker)
  scan: Docker Scan (Docker Inc., v0.17.0)

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 20.10.14
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3df54a852345ae127d1fa3092b95168e4a88e2f8
 runc version: v1.0.3-0-gf46b6ba
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1160.88.1.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 7.638GiB
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Live Restore Enabled: false

Additional Info

Thank you!

neersighted commented 1 year ago

As per https://github.com/docker/cli/issues/4296#issuecomment-1553442163, I would say this is by design; any sort of fallback behavior would have to be implemented as part of the daemon itself (or in distribution/distribution), and given that the distribution spec is silent on what to do here, I don't think introducing a new behavior is the right move.

jcurtis789 commented 1 year ago

That's fair, I guess perhaps I'm looking for guidance on my current setup then.

Using Docker Hub as an example, my current understanding is that if one of the three following records becomes unavailable, image pulls would start failing for a large number of folks.

> nslookup registry.hub.docker.com
Non-authoritative answer:
Name:   registry.hub.docker.com
Address: 52.1.184.176
Name:   registry.hub.docker.com
Address: 18.215.138.58
Name:   registry.hub.docker.com
Address: 34.194.164.123

Is this assumption accurate? How would you recommend mitigating against this possible failure scenario? I'm leaning towards putting both of my servers behind a load balancer that is capable of performing health checks but that wouldn't solve the issue stated above.
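
In the meantime, to confirm which of the resolved addresses is actually serving the registry, something like the standalone probe below could help (just a sketch, not Docker code; it assumes the registry is served over HTTPS on 443 and answers the standard /v2/ endpoint):

// probe.go: a standalone diagnostic sketch, not part of Docker. It assumes
// the registry is served over HTTPS on port 443 and answers the standard
// /v2/ endpoint; adjust for your setup.
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	host := "www.my-private-repo.com" // registry name from the example above

	addrs, err := net.LookupHost(host)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}

	for _, addr := range addrs {
		addr := addr // capture for the closure (pre-Go 1.22)

		// Pin the connection to this specific address while keeping the
		// original hostname for the Host header and TLS server name.
		transport := &http.Transport{
			DialContext: func(ctx context.Context, network, _ string) (net.Conn, error) {
				d := net.Dialer{Timeout: 2 * time.Second}
				return d.DialContext(ctx, network, net.JoinHostPort(addr, "443"))
			},
		}
		client := &http.Client{Transport: transport, Timeout: 5 * time.Second}

		resp, err := client.Get("https://" + host + "/v2/")
		if err != nil {
			fmt.Printf("%s: unreachable: %v\n", addr, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%s: HTTP %d\n", addr, resp.StatusCode)
	}
}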

neersighted commented 1 year ago

I think in general, if an IP 'must' respond without fail, the solution is anycast routing. There are other options like low TTLs and DNS trickery as well; the main objective for a highly available registry is that the DNS name must resolve to an IP where a currently functional registry can be located.

jcurtis789 commented 1 year ago

the main objective for a highly available registry is that the DNS name must resolve to an IP where a currently functional registry can be located

So I think this is the crux of my issue with how the Docker CLI currently behaves. Any production-ready registry, including those available over the internet, will employ a round-robin DNS configuration for its domain. Using my example above, registry.hub.docker.com resolves to three distinct IP addresses.

nslookup registry.hub.docker.com
Non-authoritative answer:
Name:   registry.hub.docker.com
Address: 52.1.184.176
Name:   registry.hub.docker.com
Address: 18.215.138.58
Name:   registry.hub.docker.com
Address: 34.194.164.123

In its current state, one-third of 'docker pull/push' commands would fail if the facility that contains 34.194.164.123 were to catch on fire. These failures would continue to occur until either 1) the DNS record for registry.hub.docker.com were updated to include only the two "good" IP addresses and the clients received the DNS update, or 2) the fire is put out and service is restored.

Apache HttpComponents 4+ (Java) addressed this by implementing retry logic around connection-timeout parameters. Perhaps my inquiry goes deeper into the underlying Go libraries being used. I'd love to hear your thoughts on my hypothetical scenario.

neersighted commented 1 year ago

That is by design -- all IP addresses are treated as equal by most software. Going out of your way to retry against another address from the DNS response is pretty involved, and I would generally go so far as to call it an anti-feature.

This is how libc does DNS. resolv.conf has the same semantics (all nameservers are treated equally; you can't "fall through" to a second nameserver because the first one returned an error; musl libc goes so far as to do lookups in parallel and pick the first to return).

neersighted commented 1 year ago

Hmm, I might have to take that back -- apparently the thinking in this area has moved on from gethostbyname() -- specifically I found https://www.rfc-editor.org/rfc/rfc6724#section-2, which states:

As a consequence, we intend that implementations of APIs such as getaddrinfo() will use the destination address selection algorithm specified here to sort the list of IPv6 and IPv4 addresses that they return. Separately, the IPv6 network layer will use the source address selection algorithm when an application or upper layer has not specified a source address. Application of this specification to source address selection in an IPv4 network layer might be possible, but this is not explored further here.

Well-behaved applications SHOULD NOT simply use the first address returned from an API such as getaddrinfo() and then give up if it fails. For many applications, it is appropriate to iterate through the list of addresses returned from getaddrinfo() until a working address is found. For other applications, it might be appropriate to try multiple addresses in parallel (e.g., with some small delay in between) and use the first one to succeed.

net.Dial also claims:

When using TCP, and the host resolves to multiple IP addresses, Dial will try each IP address in order until one succeeds.

There might be a little more to this than my very systems-C biased experience/recollection indicates.

I'm curious as to what @corhere thinks, and I'll need to spend some time figuring out what the Go stdlib actually intends to do (if it's not trying to do a simple gethostbyname()).
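
A quick standalone experiment along these lines (just a sketch; it assumes a name that resolves to one dead and one live address, as in the reproduction above) would show whether the dialer actually walks the address list:

// dialtest.go: a standalone experiment, not Docker code. It assumes
// www.my-private-repo.com resolves to one dead and one live address, as in
// the reproduction above; adjust the port for your registry.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	start := time.Now()

	// Per the net docs, Dialer.Timeout covers the whole Dial call; when the
	// name resolves to multiple addresses, the timeout may be divided between
	// them, and the dialer tries each address in order until one succeeds.
	d := net.Dialer{Timeout: 30 * time.Second}

	conn, err := d.Dial("tcp", "www.my-private-repo.com:443")
	if err != nil {
		fmt.Printf("dial failed after %v: %v\n", time.Since(start), err)
		return
	}
	defer conn.Close()

	fmt.Printf("connected to %s after %v\n", conn.RemoteAddr(), time.Since(start))
}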

jcurtis789 commented 1 year ago

Thanks :-)

That is by design -- all IP addresses are treated as equal by most software. Going out of your way to retry against another address from the DNS response is pretty involved, and I would generally go so far as to call it an anti-feature.

This is how libc does DNS. resolv.conf has the same semantics (all nameservers are treated equally; you can't "fall through" to a second nameserver because the first one returned an error; musl libc goes so far as to do lookups in parallel and pick the first to return).

I just wanted to clarify one point which I don't think I correctly articulated in my original post:

I 100% agree that a retry should not be attempted if a server returns an error response (4xx, etc.). It is when a connection to the attempted server cannot even be established that I would expect another address to be tried.

corhere commented 1 year ago

The documentation for net.Dial does appear to be correct insofar as it does try all resolved addresses, so we already have DNS failover at the transport layer for distribution requests. I have strong doubts that it would be appropriate to fail over to the next resolved address on HTTP 502 or for any other reason after the transport-layer connection has been successfully established. As far as I can tell the HTTP RFCs are silent on the matter of transport establishment so RFC 6724 would seem to apply. Given that RFC 6724 talks about "success" and "failure" in the context of connect()/sendto()/bind(), I believe that "success" means that a connection is established.

Failing over to the next address because the server at one address is responsive but otherwise unable to complete the HTTP request to the client's satisfaction—whatever the reason—could be bad for server operators. In the event that the server is responding with failures because it is overloaded, failing over to the next DNS record would merely compound the problem by overloading the fallback server with a thundering herd of retried requests.

neersighted commented 1 year ago

This entire time I've been reasoning in terms of the transport layer, and the successful establishment of a TCP connection. I very much agree that the application layer should have no implications for the connection semantics.

However, if our HTTP client is already retrying failed TCP handshakes with the next IP returned by DNS, it sounds like there might only be a docs issue here in the end.

jcurtis789 commented 1 year ago

This issue is easily reproduced on my end by setting up a round-robin DNS entry configured with one "good" server and one "bad" server. The docker pull/push commands time out after 15 seconds until the "bad" server is removed from the DNS entry, after which the commands work without issue.

nslookup www.my-private-repo.com Address 1.2.3.4 Address 5.6.7.8

nslookup 1.2.3.4 name = my-fake-server

nslookup 5.6.7.8 name = my-real-server

docker pull www.my-private-repo.com/my-image:1.0

If this sort of configuration is working for both of you then perhaps it's the fact that I have an old-ish Docker binary (version 20.10.14 / go1.16.15)?

corhere commented 1 year ago

@jcurtis789 what's bad about the "bad" server in your tests?

jcurtis789 commented 1 year ago

In this particular scenario, the server is powered down so nothing is listening on port 80. I can also replicate it by simply shutting down the HAProxy instance in front of Artifactory, which accomplishes the same thing.

Restoring power/starting up HAProxy fixes the problem.

corhere commented 1 year ago

https://github.com/moby/moby/blob/f5106148e333be4ad92fc6c9b9a30b0ff1e96f8d/registry/registry.go#L167-L170 In the case of the registry domain name resolving to two addresses, the dialer will fail over to the second address after fifteen seconds.

https://github.com/moby/moby/blob/f5106148e333be4ad92fc6c9b9a30b0ff1e96f8d/registry/auth.go#L121-L124 After fifteen seconds, the http.Client timeout expires and the request fails, defeating the net.Dialer failover. Looks like a bug!
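
If I'm reading those two snippets right, the interaction is roughly the one in the sketch below (illustrative only, not the actual Docker code, and the timeout values are placeholders): the client-level timeout spans DNS, every connection attempt, TLS, and the response, so it can expire before the dialer ever reaches the second address.

// Illustrative sketch of the interaction described above; not the actual
// Docker code, and the timeout values are placeholders.
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	dialer := &net.Dialer{
		// Per the net docs, this timeout is shared across all resolved
		// addresses, so with two A records a blackholed first address can
		// consume half of it before the dialer moves on to the second.
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}

	client := &http.Client{
		Transport: &http.Transport{DialContext: dialer.DialContext},
		// The client timeout spans the whole request: DNS, every dial
		// attempt, TLS, and reading the response. If it is shorter than the
		// time spent on the first (dead) address, the request fails before
		// the dialer's failover can ever help.
		Timeout: 15 * time.Second,
	}

	resp, err := client.Get("https://www.my-private-repo.com/v2/")
	if err != nil {
		// With one dead address first in the DNS answer, this times out at
		// roughly the client timeout rather than failing over.
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}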

jcurtis789 commented 1 year ago

Nice find - thanks! From my perspective, 15 seconds is an order of magnitude higher than I would expect to wait for a connection to get established. Just my two cents, but I would expect that value to be 1-2 seconds at most to further prevent a degradation of service.
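
As a sketch of the idea only (not a proposed patch; the registry name and values are illustrative), a much shorter dial timeout on the transport would bound how long any single unresponsive address can hold up a request before failover kicks in:

// Sketch of the idea only, not a proposed patch; the values are illustrative.
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				// Shared across all resolved addresses: with two A records
				// each address gets roughly half of this, so an unresponsive
				// first address is abandoned quickly and the second address
				// is tried well inside the overall request timeout.
				Timeout:   4 * time.Second,
				KeepAlive: 30 * time.Second,
			}).DialContext,
		},
		Timeout: 15 * time.Second, // overall per-request timeout (15s, per the linked auth.go)
	}

	resp, err := client.Get("https://www.my-private-repo.com/v2/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}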

Thanks again both of you for your time in investigating :-)

thaJeztah commented 1 year ago

possible duplicates;