Closed tas50 closed 4 years ago
I've had this happen as well, though not as frequently as once a day. No indication of error in nextdns logs, nextdns status says running, but DNS does not resolve. I'm using an EdgeRouter X SFP v2.0.8-hotfix.1 and keeping current with nextdns release versions.
Mine has that issue as well after enabling discovery services. For example, this morning it wasn't resolving any domains and there was no error message; the last log entry was an mDNS message. Sometimes it kernel panics, the log console goes haywire, and then it restarts itself just fine.
I will just disable discovery and go back to the first config, where there were no issues. After all, it can't even discover host names in /etc/hosts or reverse-lookup local IPs with a conditional forwarder set.
It does discover IPs in /etc/hosts. Doesn’t it work for you?
I reset OpenWrt, installed the latest NextDNS CLI, set it to listen on 192.168.1.253:5342, set the dnsmasq forwarder to it, then populated /etc/hosts, and it worked.
As far as the hang issue goes, let's see if the latest version does this...
Still hangs without giving any error. Can't resolve anything until I restart nextdns and dnsmasq. I'm going to automate it with the scheduler: restart those two once a day and see how it goes.
edit: openwrt 19.07.2
It's dnsmasq that's hanging! I just found that out. That is also why we see no NextDNS errors in the log: it just keeps listening and waiting.
I set up a script that runs every 5 minutes, queries the name of one of my local devices, and checks whether its IP is returned (the device name and IP are in /etc/hosts, so no query goes to NextDNS unnecessarily). If no IP comes back, dnsmasq is not working, so the script restarts it.
If it helps anyone, just ask and I will share the script.
@Xtreme512 Care to share the script? I'm seeing this on a brand new install of an Edgerouter X with v1.10.11 and NextDNS nextdns-v1.5.8 installed. A restart fixes it each time.
If you can, try master, and the next time it hangs, send a kill -QUIT to its pid before restarting so it dumps a full stack trace in the logs. This trace would be useful to understand what's going on.
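The request above could be wrapped in a small helper so the signal is not forgotten before a restart. This is only a sketch: `request_stack_dump` is a hypothetical name, and it assumes the daemon's PID is looked up by hand and passed in (for example from the process list on the router).

```sh
#!/bin/sh
# Hypothetical helper: send SIGQUIT to a running daemon so that it can dump
# a stack trace to its log before the service is restarted.
# Usage: request_stack_dump <pid>
request_stack_dump() {
    pid=$1
    # Reject a missing or non-numeric PID.
    case $pid in
        ''|*[!0-9]*)
            echo "usage: request_stack_dump <pid>" >&2
            return 2
            ;;
    esac
    # kill -0 sends no signal; it only checks that the process exists
    # and that we are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "cannot signal pid $pid" >&2
        return 1
    fi
    kill -QUIT "$pid"
}
```

After calling it, wait a second or two before restarting the service so the trace has time to reach the log.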
It's a very simple shell script, works on my OpenWrt without bash or fancy packages.
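#!/bin/sh
# Replace device-etc/host-name-here and 192.168.1.2 with your device's name and IP.
# The device is listed in /etc/hosts, so dnsmasq answers the query itself
# and nothing is sent to NextDNS.
resolvedIP=$(nslookup device-etc/host-name-here | tail -2 | head -1 | awk '{print $3}')
# Anything other than the expected address means dnsmasq is not answering.
if [ "$resolvedIP" != "192.168.1.2" ]
then
    /etc/init.d/dnsmasq restart
fi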
Crontab entry to run it every 5 minutes:
*/5 * * * * ./dnscheck.sh
The .sh file must use UNIX (LF) line endings if it was edited on Windows.
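If the script was saved on Windows, the carriage returns can be stripped on the router itself. A minimal sketch, assuming `tr` is available (it is part of BusyBox); `strip_cr` is a hypothetical helper name:

```sh
#!/bin/sh
# Strip carriage returns so a script saved with Windows (CRLF) line endings
# becomes a plain UNIX (LF) file. Reads stdin, writes stdout.
strip_cr() {
    tr -d '\r'
}
```

For example, `strip_cr < dnscheck.sh > dnscheck.tmp && mv dnscheck.tmp dnscheck.sh`, followed by `chmod +x dnscheck.sh` since the new file does not inherit the executable bit.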
Happened again today, used kill -QUIT
If it’s Merlin, in /jffs/syslog.log
@rs I’m using stock EdgeOS.
Just hung again for the 3rd time today. Here is what I can see from the log:
May 20 14:29:41 ubnt nextdns[23535]: Connected 45.90.28.0:443 (con=10ms tls=13ms, TLS<0>)
May 20 14:50:35 ubnt nextdns[23535]: Connected 108.61.155.162:443 (con=14ms tls=103ms, TLS13)
May 20 14:50:35 ubnt nextdns[23535]: Switching endpoint: https://vultr-ewr-1.edge.nextdns.io#108.61.155.162,2001:19f0:5:59ed:5400:2ff:fea1:38c5
May 20 14:56:35 ubnt nextdns[23535]: Connected 108.61.155.162:443 (con=16ms tls=122ms, TLS13)
May 20 15:26:56 ubnt nextdns[23535]: Discovered(DHCP) 10.0.0.229 = localhost
May 20 15:33:56 ubnt nextdns[23535]: Received signal: quit (ignored)
May 20 15:34:05 ubnt nextdns[23535]: Received signal: terminated
May 20 15:34:05 ubnt nextdns[23535]: Stopping NextDNS 1.5.8/linux
May 20 15:34:05 ubnt nextdns[23535]: Restore router settings
May 20 15:34:07 ubnt nextdns[23535]: Deactivating
May 20 15:34:07 ubnt nextdns[23535]: NextDNS 1.5.8/linux stopped
May 20 15:34:08 ubnt nextdns[27598]: Starting NextDNS 1.5.8/linux on 127.0.0.1:5342
May 20 15:34:08 ubnt nextdns[27598]: Starting discovery resolver
May 20 15:34:08 ubnt nextdns[27598]: Listening on TCP/127.0.0.1:5342
May 20 15:34:08 ubnt nextdns[27598]: Listening on UDP/127.0.0.1:5342
It freezes each time there is a "Discovered" event.
I also found this in another log, but only once; I didn't see it on the other hangs:
May 20 12:41:34 ubnt nextdns[19956]: Discovered(MDNS) fe88::163b:2b8b:71f4:969d =MacBook
May 20 12:50:15 ubnt kernel: Process 20003 (dnsmasq) has crashed (parent 1 (init) signal 3, code 0, addr 00005bb3), coredumps disabled
May 20 12:50:21 ubnt nextdns[19956]: Received signal: terminated
You need to run master for the QUIT signal to work
@rs I'm sorry I don't know what that means. Any documentation on how to do that?
It means checking out the code and compiling it. I will create snapshots so you can test without going through that.
@rs Ok thank you. Let me know when they are ready and I can test ASAP. I’m getting about 3-5 freezes a day so it shouldn’t take long to get a log file.
You will find binaries here: https://drive.google.com/drive/folders/1-uurvV67jBtBOH6Y_SQv2-W8O6e4fHI3
Thanks for the files. Running into a problem installing.
ubnt@ubnt:~$ sh -c 'sh -c "$(curl -sL https://nextdns.io/install)"'
INFO: OS: edgeos
INFO: GOARCH: mipsle_softfloat
INFO: GOOS: linux
c) Configure NextDNS
r) Remove NextDNS
e) Exit
ubnt@ubnt:~$ sudo dpkg -i nextdns_v1.5.8-SNAPSHOT-394b795_linux_mipsle_softfloat.deb
dpkg: error processing nextdns_v1.5.8-SNAPSHOT-394b795_linux_mipsle_softfloat.deb (--install):
 package architecture (mipslesoftfloat) does not match system (mipsel)
Errors were encountered while processing:
 nextdns_v1.5.8-SNAPSHOT-394b795_linux_mipsle_softfloat.deb
Try the tarball instead of deb
I think I reproduced the issue at home. I tracked it down to a nasty bug in the Go http2 library: https://github.com/golang/go/issues/23559.
Please try master again.
Snapshot: https://drive.google.com/drive/folders/1W73Er37Do9Lg50rMQ0yunBEvBWexG6YW
New build up and running, I'll keep an eye on it over the weekend.
Happened again this morning.
ubnt@ubnt:~$ sudo nextdns version
nextdns version v1.5.8-SNAPSHOT-2ebe526
ubnt@ubnt:~$ sudo nextdns status
running
ubnt@ubnt:~$ sudo kill -QUIT 6784
ubnt@ubnt:~$ sudo nextdns restart
May 22 08:05:10 ubnt nextdns[6784]: Connected 108.61.155.162:443 (con=14ms tls=137ms, TLS13)
May 22 08:22:17 ubnt nextdns[6784]: Received signal: quit (ignored)
May 22 08:22:31 ubnt nextdns[6784]: Received signal: terminated
May 22 08:22:31 ubnt nextdns[6784]: Stopping NextDNS 1.5.8/linux
May 22 08:22:31 ubnt nextdns[6784]: Restore router settings
May 22 08:22:33 ubnt nextdns[6784]: Deactivating
May 22 08:22:33 ubnt nextdns[6784]: NextDNS 1.5.8/linux stopped
May 22 08:22:34 ubnt nextdns[18308]: Starting NextDNS 1.5.8/linux on 127.0.0.1:5342
May 22 08:22:34 ubnt nextdns[18308]: Starting discovery resolver
May 22 08:22:34 ubnt nextdns[18308]: Listening on TCP/127.0.0.1:5342
May 22 08:22:34 ubnt nextdns[18308]: Listening on UDP/127.0.0.1:5342
The version running is apparently not the version installed. The "quit (ignored)" line proves it. Where did you copy the binary?
You were right, it's installed in /usr/bin/nextdns, but the old version must have still been in memory. Uninstalled, rebooted and installed again. Now it's showing the snapshot version in the logs.
Fingers crossed.
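The mismatch discussed above can be spotted from the log alone. A rough sketch, assuming the log line formats shown in this thread ("Received signal: quit (ignored)" and "Starting NextDNS <version>/linux on ..."); both function names are hypothetical:

```sh
#!/bin/sh
# Both helpers read daemon log lines on stdin.

# Succeeds if the log shows that a QUIT signal was ignored, which in this
# thread meant the running binary was not the snapshot build.
quit_was_ignored() {
    grep -q 'Received signal: quit (ignored)'
}

# Prints the version from the most recent "Starting NextDNS <version>/<os>"
# line, i.e. the version of the binary that actually started.
running_version() {
    sed -n 's/.*Starting NextDNS \([^/]*\)\/.*/\1/p' | tail -n 1
}
```

Feeding it the log (for instance `nextdns log | running_version`) and comparing the result with the output of `nextdns version` shows whether the daemon in memory is the one on disk.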
I've been running into what appears to be the same issue ever since I started using nextdns on my OpenWRT device a few days back.
netstat showed this: two connections (I have a conditional forwarder too) with a non-zero send-q. It remained this way until the two connections eventually timed out; I think the broken pipe message in the log coincides with the timeout. After that, resolution was working again.
tcp 0 11314 59.92.184.56:38830 45.90.28.0:443 ESTABLISHED 2105/nextdns
tcp 0 0 25.94.144.119:44908 34.93.164.22:443 ESTABLISHED 2105/nextdns
tcp 0 0 :::53 :::* LISTEN 2105/nextdns
tcp 0 26960 2001:4490:4e4d:182c::1:56520 2001:4860:4860::8844:443 ESTABLISHED 2105/nextdns
root@bumblebee:~# nextdns version
nextdns version 1.5.8
2020 May 24 22:52:48 bumblebee err nextdns[2105]: INFO: 22:52:48 Received signal: broken pipe (ignored)
2020 May 24 22:52:48 bumblebee err nextdns[2105]: INFO: 22:52:48 Query 192.168.1.163 UDP A play.google.com. (qry=33/res=-1) 486312ms : doh resolve: context deadline exceeded
2020 May 24 22:52:48 bumblebee err nextdns[2105]: INFO: 22:52:48 Query 192.168.1.163 UDP A www.google.com. (qry=32/res=-1) 334415ms : doh resolve: context deadline exceeded
2020 May 24 22:52:48 bumblebee err nextdns[2105]: INFO: 22:52:48 Query 192.168.1.163 UDP A www.google.com. (qry=32/res=-1) 411819ms : doh resolve: context deadline exceeded
2020 May 24 22:52:48 bumblebee err nextdns[2105]: INFO: 22:52:48 Query 192.168.1.163 UDP A play.google.com. (qry=33/res=-1) 307160ms : doh resolve: context deadline exceeded
2020 May 24 22:52:48 bumblebee err nextdns[2105]: INFO: 22:52:48 Query 192.168.1.163 UDP A www.google.com. (qry=32/res=-1) 333760ms : doh resolve: context deadline exceeded
root@bumblebee:~#
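The stuck connections in the netstat output above can be picked out automatically. A sketch, assuming the seven-column layout shown here (Proto, Recv-Q, Send-Q, local address, foreign address, state, PID/program); `stuck_connections` is a hypothetical name:

```sh
#!/bin/sh
# Reads netstat output on stdin and prints "<Send-Q> <foreign address>" for
# every TCP connection owned by nextdns that has unsent data queued.
stuck_connections() {
    awk '$1 ~ /^tcp/ && $3 + 0 > 0 && $NF ~ /\/nextdns$/ { print $3, $5 }'
}
```

It could be fed with something like `netstat -tnp | stuck_connections`, provided the router's netstat build supports those options. A Send-Q that stays non-zero across several runs suggests the upstream connection is no longer draining.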
Also, it appears to log the time taken for domains that are not using nextdns. In my case, I am using Google DNS for google.com (records returned by nextdns send me to endpoints in France and I'm in India 😉).
@rs Would there be a new release with this bug fix soon? For now, I have installed the snapshot build from the link that was shared earlier.
2020 May 24 23:55:04 bumblebee notice nextdns[2895]: Starting NextDNS v1.5.8-SNAPSHOT-2ebe526/linux on :53
Will report back if I do run into the same issue again.
A new release is coming soon. I wanted to validate that it fixes the issue first. Please report whether it fixed the issue for you.
Fix is looking good. No hangs after 4 days.
Fix is working great for me as well, no more DNS outages like before 😅
This could be unrelated to this bug. I have my OpenWRT router set up with multi-WAN failover, and at times when there is a failover, there is a short DNS outage even after the failover to the backup WAN is complete (the failover thresholds usually trigger a failover in less than 45s or so). I am able to ping an external IP address, but DNS resolution, if not already cached, just times out (due to https://github.com/nextdns/nextdns/issues/230, it's a bit obvious in my case as I use Google DNS for the domain google.com). It recovers on its own after 1-2 minutes, when it tries to reconnect and uses the backup WAN to go out.
Perhaps it should be a bit more aggressive in trying to detect connectivity failures and re-establish connectivity?
Yes, we'll work on that.
@rs unfortunately I experienced "the hang" twice today. I am running version 1.6.3 (latest) on my OpenWRT router. Restarting the service does remedy the issue, but it's no resolution.
Please send a kill -QUIT to the daemon pid when this happens. It will print a stack trace in the logs. That would help me understand what's going on.
@rs here are some logs
I may have been suffering from https://github.com/nextdns/nextdns/issues/238. I will monitor this today and see if the resolution to that issue helps with mine. I will post an update tomorrow.
Sadly, I am still suffering from this issue. It has become increasingly unreliable: for me it now happens not once a day but once every hour. I also cannot post any new stack traces, as they are being truncated. I already pointed this out in https://github.com/nextdns/nextdns/issues/238#issuecomment-636312316
It would be nice if this issue were reopened, as it's not really resolved.
When it hangs, what is the behavior of a dig?
It's a pretty straightforward timeout.
root@OpenWrt:~# time dig chaos test.com
; <<>> DiG 9.16.3 <<>> chaos test.com
;; global options: +cmd
;; connection timed out; no servers could be reached
Command exited with non-zero status 9
real 0m 15.03s
user 0m 0.03s
sys 0m 0.02s
root@OpenWrt:~# nextdns restart
root@OpenWrt:~# time dig chaos test.com
;; Warning: Message parser reports malformed message packet.
; <<>> DiG 9.16.3 <<>> chaos test.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54366
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 9
;; QUESTION SECTION:
;test.com. CH A
;; ANSWER SECTION:
test.com. 300 IN A 91.207.57.193
;; ADDITIONAL SECTION:
conf.nextdns.io. 0 CH TXT "xxxxxx"
proto.nextdns.io. 0 CH TXT "DOH"
server.nextdns.io. 0 CH TXT "zepto-bru-1"
client.nextdns.io. 0 CH TXT "x.x.x.x"
client-name.nextdns.io. 0 CH TXT "nextdns-cli"
device-name.nextdns.io. 0 CH TXT "OpenWrt"
device-id.nextdns.io. 0 CH TXT "xxxx"
lists.nextdns.io. 0 CH TXT "blocklist:disconnect-ads" "blocklist:disconnect-malvertising" "blocklist:goodbye-ads" "blocklist:notracking"
smart-ecs.nextdns.io. 0 CH TXT "not sent"
;; Query time: 9 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Sun May 31 13:00:25 UTC 2020
;; MSG SIZE rcvd: 420
real 0m 0.02s
user 0m 0.01s
sys 0m 0.00s
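The timeout shown in the first dig run above is easy to detect from a script. A minimal sketch, assuming dig prints the same "connection timed out; no servers could be reached" message as in this output; `dig_timed_out` is a hypothetical name:

```sh
#!/bin/sh
# Reads dig output on stdin and succeeds if dig reported that no server
# answered, as in the hung state shown in the first run.
dig_timed_out() {
    grep -q 'connection timed out; no servers could be reached'
}
```

A check along the lines of `dig +time=2 +tries=1 @127.0.0.1 test.com 2>&1 | dig_timed_out && nextdns restart` would shorten the 15 second wait seen above, though it only works around the hang and does not explain it.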
What do you get for nextdns log | grep Start
root@OpenWrt:~# nextdns log | grep Start
Sun May 31 22:19:57 2020 daemon.notice nextdns[7027]: Starting NextDNS 1.6.3/linux on :53
Sun May 31 22:19:57 2020 daemon.notice nextdns[7027]: Starting mDNS discovery
Please contact us on the support chat so we can debug together.
For the record, I haven't experienced any hanging since I started using the script.
The daemon seems to randomly hang about once a day. The status command shows it as running and the logs show no errors before it fails. Restarting fixes it, but the daemon goes down causing a full internet outage each time. Is there a way to gather additional logs from the daemon?