mesosphere / mesos-dns

DNS-based service discovery for Mesos.
https://mesosphere.github.com/mesos-dns
Apache License 2.0
484 stars 137 forks source link

Not serving internal records #52

Closed kozyraki closed 9 years ago

kozyraki commented 9 years ago

After >24h of rather intense testing, I got mesos-dns to exhibit a erroneous behavior. Right now it is at the state where it serves external requests but not internal requests.

$ dig @172.16.0.13 -p 31053 chronos.marathon-0.7.6.mesos ; <<>> DiG 9.9.4-RedHat-9.9.4-14.el7_0.1 <<>> @172.16.0.13 -p 31053 chronos.marathon-0.7.6.mesos ; (1 server found) ;; global options: +cmd ;; connection timed out; no servers could be reached

dig @172.16.0.13 -p 31053 www.kathimerini.gr

; <<>> DiG 9.9.4-RedHat-9.9.4-14.el7_0.1 <<>> @172.16.0.13 -p 31053 www.kathimerini.gr ; (1 server found) ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48876 ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 7, ADDITIONAL: 11

;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 4096 ;; QUESTION SECTION: ;www.kathimerini.gr. IN A

;; ANSWER SECTION: www.kathimerini.gr. 79524 IN A 88.198.40.186

;; AUTHORITY SECTION: gr. 78119 IN NS grdns-de.denic.de. gr. 78119 IN NS estia.ics.forth.gr. gr. 78119 IN NS gr-br.ics.forth.gr. gr. 78119 IN NS gr-at.ics.forth.gr. gr. 78119 IN NS gr-m.ics.forth.gr. gr. 78119 IN NS grdns.ics.forth.gr. gr. 78119 IN NS gr-ix.ics.forth.gr.

;; ADDITIONAL SECTION: gr-m.ics.forth.gr. 78119 IN A 194.0.4.10 gr-m.ics.forth.gr. 78119 IN AAAA 2001:678:7::4:10 estia.ics.forth.gr. 4274 IN A 139.91.191.3 estia.ics.forth.gr. 4274 IN AAAA 2001:648:2c30::191:3 gr-at.ics.forth.gr. 78119 IN A 78.104.145.227 gr-br.ics.forth.gr. 78119 IN A 200.160.7.163 gr-ix.ics.forth.gr. 78119 IN A 176.126.38.132 gr-ix.ics.forth.gr. 78119 IN AAAA 2001:7f8:6e:1::132 grdns.ics.forth.gr. 106442 IN A 139.91.1.1 grdns-de.denic.de. 78119 IN A 81.91.161.80

;; Query time: 3 msec ;; SERVER: 172.16.0.13#31053(172.16.0.13) ;; WHEN: Sat Jan 03 17:53:23 UTC 2015 ;; MSG SIZE rcvd: 690

There is absolutely nothing in stderror that can help me debug this (no messages about issues with internal requests). Let's add some.

We should also consider some periodic health messages on stderr or stdout. In addition to what we have right now, we should probably print # of internal records, print error messages when internal state is inconsistent, etc.

In general, scan through places where errors can happen and add some logging.

kozyraki commented 9 years ago

I was able to recreate the behavior quite quickly after a restart of mesos-dns. Even in verbose mode I have no way of seeing what's wrong:

[christos@srv3.hw.ca1 mesos-dns]$ nslookup -port=31053 www.google.com 172.16.0.13 Server: 172.16.0.13 Address: 172.16.0.13#31053

Non-authoritative answer: Name: www.google.com Address: 64.233.171.147 Name: www.google.com Address: 64.233.171.99 Name: www.google.com Address: 64.233.171.103 Name: www.google.com Address: 64.233.171.104 Name: www.google.com Address: 64.233.171.105 Name: www.google.com Address: 64.233.171.106

[christos@srv3.hw.ca1 mesos-dns]$ nslookup -port=31053 chronos.marathon-0.7.6.mesos 172.16.0.13 ;; connection timed out; trying next origin ;; connection timed out; trying next origin ;; connection timed out; no servers could be reached

Another worrisome issue is the statistics: VERBOSE: 2015/01/03 18:25:56 logging.go:32: {MesosRequests:3243 MesosSuccess:3243 MesosNXDomain:3242 MesosFailed:0 NonMesosRequests:477145 NonMesosSuccess:423833 NonMesosNXDomain:40987 NonMesosFailed:53312 NonMesosRecursed:112429}

This says that the vast majority of requests for the mesos domain return NXDomain. Even though the names used for the query show up in the verbose messages about records built. This is strange.

Given the way I have written the tests, MesosRequests and NonMesosRequests should also be equal (more or less at least). In this case they are off by a factor or 100x. Something is off here too.

kozyraki commented 9 years ago

On further inspection, the problem is just with A records for mesos domain. SRV records are served just fine after all.

[christos@srv3.hw.ca1 mesos-dns]$ dig @172.16.0.13 -p 31053 _kafka.service.mm._tcp.marathon-0.7.6.mesos. SRV

; <<>> DiG 9.9.4-RedHat-9.9.4-14.el7_0.1 <<>> @172.16.0.13 -p 31053 _kafka.service.mm._tcp.marathon-0.7.6.mesos. SRV ; (1 server found) ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11714 ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION: ;_kafka.service.mm._tcp.marathon-0.7.6.mesos. IN SRV

;; ANSWER SECTION: _kafka.service.mm._tcp.marathon-0.7.6.mesos. 60 IN SRV 0 0 31000 srv3.hw.ca1.mesosphere.com. _kafka.service.mm._tcp.marathon-0.7.6.mesos. 60 IN SRV 0 0 31258 srv4.hw.ca1.mesosphere.com. _kafka.service.mm._tcp.marathon-0.7.6.mesos. 60 IN SRV 0 0 31484 srv2.hw.ca1.mesosphere.com.

;; Query time: 0 msec ;; SERVER: 172.16.0.13#31053(172.16.0.13) ;; WHEN: Sat Jan 03 19:14:25 UTC 2015 ;; MSG SIZE rcvd: 328

[christos@srv3.hw.ca1 mesos-dns]$ dig @172.16.0.13 -p 31053 cassandra.database.mm.marathon-0.7.6.mesos. A

; <<>> DiG 9.9.4-RedHat-9.9.4-14.el7_0.1 <<>> @172.16.0.13 -p 31053 cassandra.database.mm.marathon-0.7.6.mesos. A ; (1 server found) ;; global options: +cmd ;; connection timed out; no servers could be reached

kozyraki commented 9 years ago

And in case it helps, here is the output from generator.go

VERBOSE: 2015/01/03 23:20:00 logging.go:32: {MesosRequests:79463 MesosSuccess:79463 MesosNXDomain:79460 MesosFailed:0 NonMesosRequests:6518707 NonMesosSuccess:5362819 NonMesosNXDomain:207524 NonMesosFailed:1155888 NonMesosRecursed:1727526} VERBOSE: 2015/01/03 23:21:02 generator.go:125: reloading from master 172.16.0.14 VERBOSE: 2015/01/03 23:21:05 generator.go:274: ChronosTaskdns-test-external.chronos-2.3.0_mesos-0.20.1-SNAPSHOT.mesos. srv2.hw.ca1.mesosphere.com VERBOSE: 2015/01/03 23:21:05 generator.go:274: mesos-dns.slave.buildbot.tools.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com VERBOSE: 2015/01/03 23:21:05 generator.go:274: _registry.docker.tools._tcp.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31485 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _registry.docker.tools._udp.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31485 VERBOSE: 2015/01/03 23:21:05 generator.go:274: registry.docker.tools.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31485 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _portal.application.mm._tcp.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31263 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _portal.application.mm._udp.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31263 VERBOSE: 2015/01/03 23:21:05 generator.go:274: portal.application.mm.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31263 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _portal.application.mm._tcp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31756 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _portal.application.mm._udp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31756 VERBOSE: 2015/01/03 23:21:05 generator.go:274: portal.application.mm.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31756 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _kafka.service.mm._tcp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31000 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _kafka.service.mm._udp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31000 VERBOSE: 2015/01/03 23:21:05 generator.go:274: kafka.service.mm.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31000 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _kafka.service.mm._tcp.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31258 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _kafka.service.mm._udp.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31258 VERBOSE: 2015/01/03 23:21:05 generator.go:274: kafka.service.mm.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31258 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _collector.service.mm._tcp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31912 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _collector.service.mm._udp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31912 VERBOSE: 2015/01/03 23:21:05 generator.go:274: collector.service.mm.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31912 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _cassandra.database.mm._tcp.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31777 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _cassandra.database.mm._udp.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31777 VERBOSE: 2015/01/03 23:21:05 generator.go:274: cassandra.database.mm.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31777 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _cassandra.database.mm._tcp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31992 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _cassandra.database.mm._udp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31992 VERBOSE: 2015/01/03 23:21:05 generator.go:274: cassandra.database.mm.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31992 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _cassandra.database.mm._tcp.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31306 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _cassandra.database.mm._udp.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31306 VERBOSE: 2015/01/03 23:21:05 generator.go:274: cassandra.database.mm.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31306 VERBOSE: 2015/01/03 23:21:05 generator.go:274: datadog.infra.mm.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com VERBOSE: 2015/01/03 23:21:05 generator.go:274: _chronos._tcp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31820 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _chronos._udp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31820 VERBOSE: 2015/01/03 23:21:05 generator.go:274: chronos.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31820 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _storm.service.mm._tcp.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31140 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _storm.service.mm._udp.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31140 VERBOSE: 2015/01/03 23:21:05 generator.go:274: storm.service.mm.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31140 VERBOSE: 2015/01/03 23:21:05 generator.go:274: otto.slave.buildbot.tools.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com VERBOSE: 2015/01/03 23:21:05 generator.go:274: _portal.application.mm._tcp.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31824 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _portal.application.mm._udp.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31824 VERBOSE: 2015/01/03 23:21:05 generator.go:274: portal.application.mm.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31824 VERBOSE: 2015/01/03 23:21:05 generator.go:274: mesos.slave.buildbot.tools.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com VERBOSE: 2015/01/03 23:21:05 generator.go:274: _kairosdb-finagle-server.service.mm._tcp.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31077 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _kairosdb-finagle-server.service.mm._udp.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31077 VERBOSE: 2015/01/03 23:21:05 generator.go:274: kairosdb-finagle-server.service.mm.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com:31077 VERBOSE: 2015/01/03 23:21:05 generator.go:274: datadog.infra.mm.marathon-0.7.6.mesos. srv4.hw.ca1.mesosphere.com VERBOSE: 2015/01/03 23:21:05 generator.go:274: _kafka.service.mm._tcp.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31484 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _kafka.service.mm._udp.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31484 VERBOSE: 2015/01/03 23:21:05 generator.go:274: kafka.service.mm.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com:31484 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _github-hook.buildbot.tools._tcp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31001 VERBOSE: 2015/01/03 23:21:05 generator.go:274: _github-hook.buildbot.tools._udp.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31001 VERBOSE: 2015/01/03 23:21:05 generator.go:274: github-hook.buildbot.tools.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com:31001 VERBOSE: 2015/01/03 23:21:05 generator.go:274: master.buildbot.tools.marathon-0.7.6.mesos. srv2.hw.ca1.mesosphere.com VERBOSE: 2015/01/03 23:21:05 generator.go:274: datadog.infra.mm.marathon-0.7.6.mesos. srv3.hw.ca1.mesosphere.com VERBOSE: 2015/01/03 23:21:05 logging.go:32: {MesosRequests:79463 MesosSuccess:79463 MesosNXDomain:79460 MesosFailed:0 NonMesosRequests:6555818 NonMesosSuccess:5394548 NonMesosNXDomain:208598 NonMesosFailed:1161270 NonMesosRecursed:1738600}

kozyraki commented 9 years ago

Closing because so far I have not been able to recreate.