ori-edge / k8s_gateway

A CoreDNS plugin to resolve all types of external Kubernetes resources
Apache License 2.0
316 stars 65 forks source link

Resolving zone domain and proper fallthrough #35

Closed cubic3d closed 3 years ago

cubic3d commented 3 years ago

Let's take this example

k8s_gateway domain.tld {
    kubeconfig /path/kubeconfig
    fallthrough
}

If querying for domain.tld the current implementation will call serveApex in https://github.com/ori-edge/k8s_gateway/blob/master/gateway.go#L123 and return SOA or NS entries. If fallthrough is declared it will won't handle it here and still return those two possible responses (inconsistent with other plugins).

I built it skipping this case and handle the zone domain same as other subdomains, so having an Ingress defining domain.tld in its hosts works fine and also the default fallthrough handler. I'm not sure why the response handler has an override for the zone domain. If SOA and NS are required we could let pass any other request to the default handler to be able to resolve the zone domain or if it's not desired at least let the zone domain fall through.

Any ideas? I'm happy to PR either way, but would prefer it could resolve the zone domain itself using k8s resources.

networkop commented 3 years ago

Can you add a few examples of current vs expected behavior?

cubic3d commented 3 years ago

Can you add a few examples of current vs expected behavior?

Of course! In the following queries I assume an Ingress resource defining a host with domain.tld and the Corefile from above:

.:1053 {
...
  k8s_gateway domain.tld {
    kubeconfig kubeconfig
    fallthrough
  }
  forward . tls://1.1.1.1 tls://1.0.0.1 tls://2606:4700:4700::1111 tls://2606:4700:4700::1001 {
    tls_servername cloudflare-dns.com
  }
}

Current behaviour (not resolving and sending SOA):

; <<>> DiG 9.16.18 <<>> @localhost -p 1053 domain.tld
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41923
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: efe621840a495f3f (echoed)
;; QUESTION SECTION:
;domain.tld.            IN  A

;; AUTHORITY SECTION:
domain.tld.     60  IN  SOA dns1.kube-system.domain.tld. hostmaster.dns1.kube-system.domain.tld. 12345 7200 1800 86400 60

;; Query time: 3 msec
;; SERVER: 127.0.0.1#1053(127.0.0.1)
;; WHEN: Mon Jul 05 22:19:19 CEST 2021
;; MSG SIZE  rcvd: 162

Expected (resolving query):

; <<>> DiG 9.16.18 <<>> @localhost -p 1053 domain.tld
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45722
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 966178b144c51d99 (echoed)
;; QUESTION SECTION:
;domain.tld.        IN  A

;; ANSWER SECTION:
domain.tld. 60  IN  A   192.168.99.4

;; Query time: 0 msec
;; SERVER: 127.0.0.1#1053(127.0.0.1)
;; WHEN: Mon Jul 05 22:23:17 CEST 2021
;; MSG SIZE  rcvd: 87

In the following examples there is no resource containing a host for domain.tld:

Current behaviour (not finding a resource and responding with SOA):

; <<>> DiG 9.16.18 <<>> @localhost -p 1053 domain.tld
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35567
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: d3c04a2768259411 (echoed)
;; QUESTION SECTION:
;domain.tld.            IN  A

;; AUTHORITY SECTION:
domain.tld.     60  IN  SOA dns1.kube-system.domain.tld. hostmaster.dns1.kube-system.domain.tld. 12345 7200 1800 86400 60

;; Query time: 0 msec
;; SERVER: 127.0.0.1#1053(127.0.0.1)
;; WHEN: Mon Jul 05 22:28:39 CEST 2021
;; MSG SIZE  rcvd: 162

Expected (not finding a resource and falling through):

; <<>> DiG 9.16.18 <<>> @localhost -p 1053 domain.tld
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 18948
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; PAD: (350 bytes)
;; QUESTION SECTION:
;domain.tld.            IN  A

;; AUTHORITY SECTION:
.           85891   IN  SOA a.root-servers.net. nstld.verisign-grs.com. 2021070501 1800 900 604800 86400

;; Query time: 206 msec
;; SERVER: 127.0.0.1#1053(127.0.0.1)
;; WHEN: Mon Jul 05 22:29:27 CEST 2021
;; MSG SIZE  rcvd: 468
networkop commented 3 years ago

ok, so it only happens for queries matching the zone root and it is not a problem for foo.domain.tld, correct?

networkop commented 3 years ago

and also wanted to confirm, when you said it's inconsistent with other plugins, which ones did you mean specifically? I haven't tested it but looks like kubernetes behaves in a similar way https://github.com/coredns/coredns/blob/master/plugin/kubernetes/handler.go#L64

cubic3d commented 3 years ago

ok, so it only happens for queries matching the zone root and it is not a problem for foo.domain.tld, correct?

Yes, it's only about the zone's root that the plugin receives.

and also wanted to confirm, when you said it's inconsistent with other plugins, which ones did you mean specifically?

I didn't look specifically on kubernetes (mostly local domains that are excluded in k8s from fallthrough) but at hosts

hosts /etc/hosts domain.tld {
  fallthrough
}

which will fallthrough for a query for the root domain.tld. I don't see the same behaviour in the referenced code.

To clarify: In the case of k8s_gateway there are two places where fallthrough is happening currently:

But not in https://github.com/ori-edge/k8s_gateway/blob/master/gateway.go#L122-L125 which always skips the other two if the query was made for the zone's root. The idea is to allow a query for the zone's root and not just to return SOA which automatically would lead to propper fallthrough or keep the current behaviour, but also allow fallthrough in this case if it's enabled on the plugin.

networkop commented 3 years ago

so k8s_gateway should never return SOA in case fallthrough is configured as an option, right?

cubic3d commented 3 years ago

so k8s_gateway should never return SOA in case fallthrough is configured as an option, right?

Correct

networkop commented 3 years ago

ok, I think I got it. I'll try to cook something up in the next few days... unless you wanted to do a PR?

cubic3d commented 3 years ago

Sounds good! Up to you, I could but probably not in the next few days, maybe end of weekend or in the course of next week ;)

cubic3d commented 3 years ago

I'm very confused how to tackle that in the correct way. After testing on kubernetes plugin on CoreDNS in-cluster it answers on every SOA query, regardless if it's a subdomain, existent or non-existent with SOA in answer section and never setting the authority bit.

; <<>> DiG 9.16.16 <<>> soa cluster.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21336
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 3dc581ebea467303 (echoed)
;; QUESTION SECTION:
;cluster.local.         IN  SOA

;; ANSWER SECTION:
cluster.local.      5   IN  SOA ns.dns.cluster.local. hostmaster.cluster.local. 1626158767 7200 1800 86400 5

;; Query time: 0 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Wed Jul 14 14:54:26 UTC 2021
;; MSG SIZE  rcvd: 147
; <<>> DiG 9.16.16 <<>> soa non-existent.cluster.local
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20813
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: d49502537dce6e18 (echoed)
;; QUESTION SECTION:
;non-existent.cluster.local.    IN  SOA

;; ANSWER SECTION:
cluster.local.      5   IN  SOA ns.dns.cluster.local. hostmaster.cluster.local. 1626158767 7200 1800 86400 5

;; Query time: 0 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Wed Jul 14 14:55:46 UTC 2021
;; MSG SIZE  rcvd: 160
cubic3d commented 3 years ago

There's an edge case: Fallthrough is on, query for SOA on root zone and root zone does match a resource (contradicts the above agreement on no SOA on fallthrough?)

cubic3d commented 3 years ago

I opened a PR with a proposal for the changes, tests are not adjusted yet.

Let me know if it does make any sense :wink:

networkop commented 3 years ago

There's an edge case: Fallthrough is on, query for SOA on root zone and root zone does match a resource (contradicts the above agreement on no SOA on fallthrough?)

I think if there's at least a single match, fallthrough should not be triggered

networkop commented 3 years ago

I'm very confused how to tackle that in the correct way. After testing on kubernetes plugin on CoreDNS in-cluster it answers on every SOA query, regardless if it's a subdomain, existent or non-existent with SOA in answer section and never setting the authority bit.

Authority bit aside, I think this is the expected behavior. Why do you find this strange?

cubic3d commented 3 years ago

I'm very confused how to tackle that in the correct way. After testing on kubernetes plugin on CoreDNS in-cluster it answers on every SOA query, regardless if it's a subdomain, existent or non-existent with SOA in answer section and never setting the authority bit.

Authority bit aside, I think this is the expected behavior. Why do you find this strange?

At this point I had probably too much playing with fall through and without - I have no idea what confused me there, yes this should be expected.

cubic3d commented 3 years ago

For the sake of search: This PR also unintentionally fixed a wrong RFC behaviour in case of fall through: If AAAA was queried, the plugin would initiate a fall through regardless of existing A records for the requested domain and possibly cause a mismatch cache if next plugin would return a negative response (NXDOMAIN). As always: musl on Alpine manages to fail in this case, even if the resolver queries both AAAAand A.