kubernetes / dns

Kubernetes DNS service
Apache License 2.0

stale service A records returned by dnsmasq? #60

Closed ravilr closed 6 years ago

ravilr commented 7 years ago

kubedns: gcr.io/google_containers/kubedns-amd64:1.8
dnsmasq: gcr.io/google_containers/kube-dnsmasq-amd64:1.4

A Service resource, say qa-svc1, was created and then deleted some time later. When the same qa-svc1 is recreated and assigned a different ClusterIP, we are seeing pods that use kube-dns (dnsPolicy: ClusterFirst) continue to get the old ClusterIP when resolving qa-svc1. I believe this comes from the dnsmasq cache. Should a max-cache-ttl setting be applied to all records cached by dnsmasq, or can kube-dns invalidate the dnsmasq cache?

@bowei @thockin

bowei commented 7 years ago

Record TTL is set to 30 seconds currently. Are you seeing inconsistencies beyond that length of time?

ravilr commented 7 years ago

Yes, I was seeing inconsistent results, but I forgot to capture dig output while it was happening. I have since restarted all kube-dns pods.

Before restarting kube-dns (10.10.10.109 is the correct ClusterIP):

[qa.default.zk2@tachyon-qa-bf1]# host qa-default-zk1
qa-default-zk1.tachyon.svc.starfleet.local has address 10.10.10.33
[qa.default.zk2@tachyon-qa-bf1]# host qa-default-zk1
qa-default-zk1.tachyon.svc.starfleet.local has address 10.10.10.33
[qa.default.zk2@tachyon-qa-bf1]# host qa-default-zk1
qa-default-zk1.tachyon.svc.starfleet.local has address 10.10.10.109
[qa.default.zk2@tachyon-qa-bf1]# host qa-default-zk1
qa-default-zk1.tachyon.svc.starfleet.local has address 10.10.10.109
[qa.default.zk2@tachyon-qa-bf1]# host qa-default-zk1
qa-default-zk1.tachyon.svc.starfleet.local has address 10.10.10.109
[qa.default.zk2@tachyon-qa-bf1]# host qa-default-zk1

dig output after restarting kube-dns (the answer TTL is 30 seconds):

[qa.default.zk2@tachyon-qa-bf1]# dig qa-default-zk1.tachyon.svc.starfleet.local 

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.37.rc1.el6_7.4 <<>> qa-default-zk1.tachyon.svc.starfleet.local
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56373
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;qa-default-zk1.tachyon.svc.starfleet.local. IN A

;; ANSWER SECTION:
qa-default-zk1.tachyon.svc.starfleet.local. 30 IN A 10.10.10.109

;; Query time: 1 msec
;; SERVER: 10.10.10.10#53(10.10.10.10)
;; WHEN: Sat Feb 18 20:44:22 2017
;; MSG SIZE  rcvd: 76

I'll see if I can reproduce this and report back here.
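When trying to reproduce this, it can help to pull the TTL and address out of dig's ANSWER SECTION on every run. With a caching resolver, repeated queries inside the TTL window typically show the TTL counting down below 30, which suggests the answer came from a cache rather than fresh from kube-dns. The Go sketch below parses a single A-record answer line like the one above; aRecord and parseAnswerLine are hypothetical helpers, and the parser deliberately handles only the five-field "name TTL IN A address" form.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// aRecord holds the fields of one dig ANSWER SECTION line for an A record.
type aRecord struct {
	name string
	ttl  int // seconds remaining, as reported by the answering server
	addr string
}

// parseAnswerLine splits a line such as
// "qa-default-zk1.tachyon.svc.starfleet.local. 30 IN A 10.10.10.109"
// into its fields. strings.Fields handles both the tabs and the spaces
// dig uses. Only the 5-field "name TTL IN A address" form is accepted.
func parseAnswerLine(line string) (aRecord, error) {
	f := strings.Fields(line)
	if len(f) != 5 || f[2] != "IN" || f[3] != "A" {
		return aRecord{}, fmt.Errorf("not an IN A answer line: %q", line)
	}
	ttl, err := strconv.Atoi(f[1])
	if err != nil {
		return aRecord{}, fmt.Errorf("bad TTL %q: %w", f[1], err)
	}
	return aRecord{name: f[0], ttl: ttl, addr: f[4]}, nil
}

func main() {
	rec, err := parseAnswerLine("qa-default-zk1.tachyon.svc.starfleet.local. 30 IN A 10.10.10.109")
	if err != nil {
		panic(err)
	}
	fmt.Println(rec.name, rec.ttl, rec.addr)
}
```

Logging (time, TTL, address) for each run makes it easy to see whether a stale address outlives its TTL.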

fejta-bot commented 6 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

fejta-bot commented 6 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

cmluciano commented 6 years ago

/close