systemd / systemd

The systemd System and Service Manager
https://systemd.io
GNU General Public License v2.0

systemd-resolved sometimes stops resolving #21123

Open gudvinr opened 3 years ago

gudvinr commented 3 years ago

systemd version the issue has been seen with

systemd 249 (249.5-2-arch)

Used distribution

Arch Linux

Linux kernel version used (uname -a)

5.14.14-arch1-1 #1 SMP PREEMPT Wed, 20 Oct 2021 21:35:18 +0000 x86_64 GNU/Linux

CPU architecture issue was seen on

x86_64

Expected behaviour you didn't see

Resolving domains in custom TLD from local DNS server.

Unexpected behaviour you saw

Not resolving said domains

Steps to reproduce the problem

  1. Add domains to static DNS records on the router (MikroTik hAP ac²)
  2. DHCP pushes DNS server address to PC with systemd-resolved enabled
  3. NetworkManager receives DNS address and adds it to systemd-resolved
  4. systemd-resolved correctly resolves static DNS records for a while
  5. Observe errors when resolving domains

I do not see errors or any logs at the time I notice these issues. It just stops resolving, and the last log messages for systemd-resolved.service are about it being started when I boot the PC.

Additional program output to the terminal or log subsystem illustrating the issue

$ resolvectl status
Global
           Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
    resolv.conf mode: stub
  Current DNS Server: 9.9.9.9#dns.quad9.net
         DNS Servers: 2620:fe::fe#dns.quad9.net 2620:fe::9#dns.quad9.net 9.9.9.9#dns.quad9.net 149.112.112.112#dns.quad9.net
Fallback DNS Servers: 8.8.8.8#dns.google 8.8.4.4#dns.google 2001:4860:4860::8888#dns.google 2001:4860:4860::8844#dns.google

Link 2 (enp5s0)
Current Scopes: none
     Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported

Link 3 (eno1)
Current Scopes: none
     Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported

Link 4 (wlp6s0)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.0.1
       DNS Servers: 192.168.0.1

Link 5 (docker0)
Current Scopes: none
     Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported

$ resolvectl query network-device.mytld
network-device.mytld: resolve call failed: 'network-device.mytld' not found
$ resolvectl -i wlp6s0 query network-device.mytld
network-device.mytld: resolve call failed: 'network-device.mytld' not found

$ drill @192.168.0.1 network-device.mytld
;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 32879
;; flags: qr rd ra ; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0 
;; QUESTION SECTION:
;; network-device.mytld.   IN      A

;; ANSWER SECTION:
network-device.mytld.      3600    IN      A       192.168.0.2

;; AUTHORITY SECTION:

;; ADDITIONAL SECTION:

;; Query time: 1 msec
;; SERVER: 192.168.0.1
;; WHEN: Mon Oct 25 22:26:18 2021
;; MSG SIZE  rcvd: 51

$ resolvectl flush-caches 
$ resolvectl query network-device.mytld
network-device.mytld: 192.168.0.2                -- link: wlp6s0

Configuration

$ cat /etc/systemd/resolved.conf.d/dns.conf
[Resolve]
DNS=2620:fe::fe#dns.quad9.net 2620:fe::9#dns.quad9.net 9.9.9.9#dns.quad9.net 149.112.112.112#dns.quad9.net
FallbackDNS=8.8.8.8#dns.google 8.8.4.4#dns.google 2001:4860:4860::8888#dns.google 2001:4860:4860::8844#dns.google
DNSSEC=no
DNSOverTLS=no
MulticastDNS=no
LLMNR=no
Cache=yes
$ cat /etc/NetworkManager/conf.d/dns.conf 
[main]
dns=systemd-resolved

poettering commented 3 years ago

Please enable debug logging for resolved, then reproduce the issue and provide the generated log output of resolved here. I.e. run systemctl service-log-level systemd-resolved debug, and then collect the logs with journalctl -e -u systemd-resolved.
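
Spelled out as a short shell sketch (same commands as above):

# raise the log level of the running daemon
systemctl service-log-level systemd-resolved debug
# ... reproduce the failing lookups ...
# then collect the debug output, jumping to the end of the journal
journalctl -e -u systemd-resolved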

gudvinr commented 3 years ago

systemctl service-log-level systemd-resolved debug

Will this configuration persist after system/daemon restart or do I have to run it manually every time after boot?

poettering commented 3 years ago

that command line only has an effect until resolved exits/is restarted.

You can make the change persistent. Type "systemctl edit systemd-resolved", then enter:

[Service]
Environment=SYSTEMD_LOG_LEVEL=debug

Then save and restart resolved.
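
For reference, a minimal sketch of what that edit amounts to on disk (override.conf is the file systemctl edit creates by default):

mkdir -p /etc/systemd/system/systemd-resolved.service.d
cat > /etc/systemd/system/systemd-resolved.service.d/override.conf <<'EOF'
[Service]
Environment=SYSTEMD_LOG_LEVEL=debug
EOF
systemctl daemon-reload
systemctl restart systemd-resolved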

gudvinr commented 3 years ago

@poettering I was able to reproduce the error. Can I send the logs via a more private channel (email, maybe)? They contain quite sensitive data.

rfried-nrl commented 3 years ago

I experience the same. I can also try to reproduce with more logs. I added the environment variable above; where should I see the logs? I don't see anything added to journalctl other than the default stuff systemd-resolved spews.

gudvinr commented 3 years ago

@rfried-nrl you should see it in journalctl -u systemd-resolved (you can also add -f to follow it live or -e to jump to the end), but don't forget to run systemctl daemon-reload and systemctl restart systemd-resolved first.
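
As a compact sketch of that sequence:

systemctl daemon-reload                # pick up the edited unit file
systemctl restart systemd-resolved     # restart with debug logging active
journalctl -f -u systemd-resolved      # follow the log live (-e jumps to the end instead)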

rfried-nrl commented 3 years ago

@gudvinr I think I understood the problem I had. For some reason our DHCP server offered 3 DNS servers: two in AWS and 8.8.8.8 as the third one. It appears that sometimes, because of a timeout or something like that, systemd-resolved cycled to 8.8.8.8, and that's why local domains stopped being resolved. I removed 8.8.8.8 from the DHCP configuration and it looks like it's working now.

I see that you declared 8.8.8.8 as a fallback in your setup. When it occurs, can you paste the output of resolvectl status?

gudvinr commented 3 years ago

For some reason our DHCP server offered 3 DNS servers: two in AWS and 8.8.8.8 as the third one. It appears that sometimes, because of a timeout or something like that, systemd-resolved cycled to 8.8.8.8, and that's why local domains stopped being resolved. I removed 8.8.8.8 from the DHCP configuration and it looks like it's working now.

In my case the DHCP server only advertises itself (as seen from resolvectl status, there's only 192.168.0.1 on wlp6s0).

gudvinr commented 3 years ago

Right now I don't have any resolvers except the one from DHCP:

Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (enp5s0)
Current Scopes: none
     Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported

Link 3 (eno1)
Current Scopes: none
     Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported

Link 4 (wlp6s0)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.0.1
       DNS Servers: 192.168.0.1

Link 5 (docker0)
Current Scopes: none
     Protocols: -DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported

And local domains are not resolving while I can browse websites as usual.

Logs from some time ago show this:

Nov 01 09:34:49 systemd-resolved[772628]: Removing cache entry for hostname.domain IN A (expired 1s ago)
Nov 01 09:34:55 systemd-resolved[772628]: idn2_lookup_u8: hostname.domain → hostname.domain
Nov 01 09:34:55 systemd-resolved[772628]: Looking up RR for hostname.domain IN A.
Nov 01 09:34:55 systemd-resolved[772628]: Looking up RR for hostname.domain IN AAAA.
Nov 01 09:34:55 systemd-resolved[772628]: varlink-17: Sending message: {"error":"io.systemd.Resolve.DNSError","parameters":{"rcode":3}}

At some point the cache entry expired, and resolved never tried to fetch it again, returning an error every time after that.

gudvinr commented 3 years ago

I'd say this is really disturbing. Over the last day I flushed the caches manually five times or so.

Drc-DEV commented 3 years ago

Disabling systemd-resolved and switching to dnsmasq solves the issue for me.

ronalde commented 2 years ago

Same here: a year ago, and now again. I tried to follow the diagnostic instructions here and in an older, closely related issue (this one seems a duplicate of #12142), but gave up again. I ran resolved on an Arch desktop, wirelessly connected to a hotspot (i.e. as a DHCP client), with dnsmasq bound to the (statically configured) wired interface, which serves as both DHCP and (caching/forwarding) DNS server for wired hosts connected to the same switch/segment.

gudvinr commented 2 years ago

This bug finally became annoying enough to ditch resolved for dnsmasq. No sudden issues with DNS ever since.

tve commented 2 years ago

I seem to have the same issue. The server ran fine for months. I have not done any updates, but some of the application DNS query patterns have undoubtedly changed. Suddenly all DNS queries fail. Restarting systemd-resolved brings everything back to life. This happens several times a day.

It just happened, and I typed "ping ocore.voneicken.com" into a shell, which responded with ping: ocore.voneicken.com: No address associated with hostname. I captured the log below. I then did a systemctl restart systemd-resolved and that, as always, restored normal operation.

Aug 31 11:21:43 ncore systemd-resolved[1766412]: Got DNS stub UDP query packet for id 36949
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Looking up RR for ocore.voneicken.com IN A.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Cache miss for ocore.voneicken.com IN A
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Transaction 6568 for <ocore.voneicken.com IN A> scope dns on enp2s0/*.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Using feature level UDP+EDNS0 for transaction 6568.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Using DNS server fdad:eefa:a6dc::1 for transaction 6568.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Sending query packet with id 6568.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Processing query...
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Got DNS stub UDP query packet for id 41489
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Looking up RR for ocore.voneicken.com IN AAAA.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Cache miss for ocore.voneicken.com IN AAAA
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Transaction 7847 for <ocore.voneicken.com IN AAAA> scope dns on enp2s0/*.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Using feature level UDP+EDNS0 for transaction 7847.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Using DNS server fdad:eefa:a6dc::1 for transaction 7847.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Sending query packet with id 7847.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Processing query...
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Processing incoming packet on transaction 6568 (rcode=SUCCESS).
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Not caching negative entry without a SOA record: ocore.voneicken.com IN A
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Transaction 6568 for <ocore.voneicken.com IN A> on scope dns on enp2s0/* now complete with <success> from network (unsigned).
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Sending response packet with id 36949 on interface 1/AF_INET.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Freeing transaction 6568.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Processing incoming packet on transaction 7847 (rcode=SUCCESS).
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Not caching negative entry without a SOA record: ocore.voneicken.com IN AAAA
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Transaction 7847 for <ocore.voneicken.com IN AAAA> on scope dns on enp2s0/* now complete with <success> from network (unsigned).
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Sending response packet with id 41489 on interface 1/AF_INET.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Freeing transaction 7847.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Got DNS stub UDP query packet for id 49995
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Looking up RR for ocore.voneicken.com.voneicken.com IN A.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Cache miss for ocore.voneicken.com.voneicken.com IN A
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Transaction 6362 for <ocore.voneicken.com.voneicken.com IN A> scope dns on enp2s0/*.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Using feature level UDP+EDNS0 for transaction 6362.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Using DNS server fdad:eefa:a6dc::1 for transaction 6362.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Sending query packet with id 6362.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Processing query...
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Got DNS stub UDP query packet for id 53098
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Looking up RR for ocore.voneicken.com.voneicken.com IN AAAA.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Cache miss for ocore.voneicken.com.voneicken.com IN AAAA
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Transaction 1533 for <ocore.voneicken.com.voneicken.com IN AAAA> scope dns on enp2s0/*.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Using feature level UDP+EDNS0 for transaction 1533.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Using DNS server fdad:eefa:a6dc::1 for transaction 1533.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Sending query packet with id 1533.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Processing query...
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Processing incoming packet on transaction 6362 (rcode=NXDOMAIN).
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Server returned error NXDOMAIN in EDNS0 mode, retrying transaction with reduced feature level UDP (DVE-2018-0001 mitigation)
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Retrying transaction 6362.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Cache miss for ocore.voneicken.com.voneicken.com IN A
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Transaction 6362 for <ocore.voneicken.com.voneicken.com IN A> scope dns on enp2s0/*.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Using feature level UDP for transaction 6362.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Sending query packet with id 6362.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Processing incoming packet on transaction 1533 (rcode=NXDOMAIN).
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Server returned error NXDOMAIN in EDNS0 mode, retrying transaction with reduced feature level UDP (DVE-2018-0001 mitigation)
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Retrying transaction 1533.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Cache miss for ocore.voneicken.com.voneicken.com IN AAAA
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Transaction 1533 for <ocore.voneicken.com.voneicken.com IN AAAA> scope dns on enp2s0/*.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Using feature level UDP for transaction 1533.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Sending query packet with id 1533.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Processing incoming packet on transaction 6362 (rcode=NXDOMAIN).
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Not caching negative entry without a SOA record: ocore.voneicken.com.voneicken.com IN A
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Transaction 6362 for <ocore.voneicken.com.voneicken.com IN A> on scope dns on enp2s0/* now complete with <rcode-failure> from network >
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Sending response packet with id 49995 on interface 1/AF_INET.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Freeing transaction 6362.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Processing incoming packet on transaction 1533 (rcode=NXDOMAIN).
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Not caching negative entry without a SOA record: ocore.voneicken.com.voneicken.com IN AAAA
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Transaction 1533 for <ocore.voneicken.com.voneicken.com IN AAAA> on scope dns on enp2s0/* now complete with <rcode-failure> from netwo>
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Sending response packet with id 53098 on interface 1/AF_INET.
Aug 31 11:21:43 ncore systemd-resolved[1766412]: Freeing transaction 1533.

The server on which this happens is running Ubuntu 20.04.3 LTS x64 with systemd package version 245.4-4ubuntu3.17. The DNS server to which queries are forwarded runs Dnsmasq 2.80 on OpenWrt.

fansari commented 1 year ago

We have Red Hat and Debian systems, and systemd-resolved has caused much trouble in our production environment by suddenly stopping to resolve. We will migrate to dnsmasq.

dominichayesferen commented 1 year ago

I have this exact issue on completely vanilla Ubuntu 22.04 systemd packages. What output should I send to help with debugging?

EntityinArray commented 1 year ago

Experiencing this issue on Arch Linux, systemctl restart systemd-resolved fixes the issue. Disabling this thing for good

jimdigriz commented 1 year ago

Debian "bookworm" 12 here running systemd-resolved version 252.12-1~deb12u1 and I see the same.

Like @EntityinArray, I find restarting it fixes it every time.

My /etc/systemd/resolved.conf is no more than:

[Resolve]
FallbackDNS=1.1.1.1#cloudflare-dns.com 1.0.0.1#cloudflare-dns.com 2606:4700:4700::1111#cloudflare-dns.com 2606:4700:4700::1001#cloudflare-dns.com
Cache=no

Using systemd-networkd too.

I'm trying to figure out how to migrate off systemd-resolved, as this is a laptop and when roaming I need to use the local resolvers, including the ones provided via IPv6 RAs...

ei-grad commented 1 year ago

For me this can often be reproduced by launching cloudquery sync, which generates a fair number of DNS queries due to the nature of Go and AWS. Also, journalctl -u systemd-resolved is full of Failed to generate query object: Device or resource busy messages.

ei-grad commented 1 year ago

Even when it runs without crashing, systemd-resolved generates heavy CPU load and randomly refuses to respond to requests. I moved to dnsmasq + cloudflared DoH: it just works, with no CPU load.
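
For anyone taking the same route, a minimal sketch of such a setup; the 127.0.0.1#5053 upstream assumes cloudflared's DNS proxy has been configured to listen on that port:

# /etc/dnsmasq.conf
no-resolv                  # ignore upstream servers from /etc/resolv.conf
server=127.0.0.1#5053      # forward all queries to the local cloudflared DoH proxy
listen-address=127.0.0.1   # serve local clients only
cache-size=1000            # cache responses locally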

jimdigriz commented 1 year ago

Replaced here with unbound, resolvconf and libnss-mdns, 100% reliable and Just Works(tm).

tanji commented 11 months ago

I would say this mostly occurs because of the nss-resolve man page's misleading suggestion to set the hosts line in nsswitch.conf to hosts: resolve [!UNAVAIL=return]. If you do so, every status except systemd-resolved being unavailable will cause NSS to return. For example, if resolve is overloaded, the lookup will fail, because that configuration negates [TRYAGAIN=continue], which is the default. In other words, it effectively disables the retry mechanism behind NSS.

Removing that configuration parameter from nsswitch.conf fixed the issues for me.
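
As an illustration of the change being described (the exact module list varies per distribution):

# /etc/nsswitch.conf
# With the man page's suggestion, any status except UNAVAIL returns immediately,
# so a TRYAGAIN from an overloaded resolved is surfaced to the caller as a failure:
hosts: resolve [!UNAVAIL=return] files myhostname dns
# Without the action modifier, the glibc default [TRYAGAIN=continue] applies
# and the lookup falls through to the remaining modules:
hosts: resolve files myhostname dns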

Jeansen commented 10 months ago

Same here. Running Debian Bookworm on a Jenkins server. It very often hangs for good. If I restart the service, it works fine again. No idea what the problem is. It also did not help to replace the stub file linked at /etc/resolv.conf with a real resolv.conf containing the address of the DNS server to use. It's a real PITA at the moment, and nothing in the logs... ;-(

Installed package: systemd-resolved/stable,now 252.19-1~deb12u1 amd64

flokli commented 8 months ago

I would say this mostly occurs because of the nss-resolve man page's misleading suggestion to set the hosts line in nsswitch.conf to hosts: resolve [!UNAVAIL=return]. If you do so, every status except systemd-resolved being unavailable will cause NSS to return. For example, if resolve is overloaded, the lookup will fail, because that configuration negates [TRYAGAIN=continue], which is the default. In other words, it effectively disables the retry mechanism behind NSS.

Removing that configuration parameter from nsswitch.conf fixed the issues for me.

Can you elaborate on the "misleading suggestion" part? If the docs are misleading, they should be updated, ideally by a PR explaining the reasoning.

poettering commented 8 months ago

The requested logs were never provided. Closing.

camoz commented 8 months ago

@poettering In case you missed it, they collected the logs and asked if they could send them privately: https://github.com/systemd/systemd/issues/21123#issuecomment-952334702. Since several people have reported the same issue, could we maybe keep it open and ask whether someone else could provide logs?

@dominichayesferen Please see the first three responses to this issue which describe how to collect logs (in case you still want to help).

dominichayesferen commented 8 months ago

Alright, I'll keep this tab open for Friday when I'm next available to do debugging.

mrc0mmand commented 8 months ago

@poettering: this happens from time to time even in our CI with TEST-75-RESOLVED:

[ 1554.965934] testsuite-75.sh[53]: + run dig stale1.unsigned.test -t A
[ 1554.966373] testsuite-75.sh[2171]: + dig stale1.unsigned.test -t A
[ 1554.966779] testsuite-75.sh[2172]: + tee /tmp/tmp.7o55Zn8Qey
[ 1554.980412] systemd-resolved[2150]: Received dns UDP packet of size 61, ifindex=0, ttl=64, fragsize=0, sender=127.0.0.1, destination=127.0.0.53
[ 1554.980822] systemd-resolved[2150]: Got DNS stub UDP query packet for id 25311
[ 1554.980855] systemd-resolved[2150]: Looking up RR for stale1.unsigned.test IN A.
[ 1554.980882] systemd-resolved[2150]: Requested with no stale and TTL expired for stale1.unsigned.test IN A
[ 1554.980910] systemd-resolved[2150]: Firing regular transaction 13889 for <stale1.unsigned.test IN A> scope dns on dns0/* (validate=yes).
[ 1554.980936] systemd-resolved[2150]: Using feature level UDP+EDNS0+DO for transaction 13889.
[ 1554.980961] systemd-resolved[2150]: Using DNS server 10.0.0.1 for transaction 13889.
[ 1554.980985] systemd-resolved[2150]: Announcing packet size 1472 in egress EDNS(0) packet.
[ 1554.981010] systemd-resolved[2150]: Emitting UDP, link MTU is 1500, socket MTU is 65535, minimal MTU is 40
[ 1554.981035] systemd-resolved[2150]: Sending query packet with id 13889 of size 72.
[ 1554.981060] systemd-resolved[2150]: Sending query via TCP since UDP is blocked.
[ 1554.981090] systemd-resolved[2150]: Added socket 28 to graveyard
[ 1554.981118] systemd-resolved[2150]: Using feature level UDP+EDNS0+DO for transaction 13889.
[ 1554.981141] systemd-resolved[2150]: Announcing packet size 1472 in egress EDNS(0) packet.
[ 1554.981165] systemd-resolved[2150]: Processing query...
[ 1559.985764] systemd-resolved[2150]: Received dns UDP packet of size 61, ifindex=0, ttl=64, fragsize=0, sender=127.0.0.1, destination=127.0.0.53
[ 1559.986156] systemd-resolved[2150]: Got DNS stub UDP query packet for id 25311
[ 1559.986189] systemd-resolved[2150]: Looking up RR for stale1.unsigned.test IN A.
[ 1559.986216] systemd-resolved[2150]: Processing query...
[ 1564.990464] systemd-resolved[2150]: Connection failure for DNS TCP stream: Connection timed out
[ 1564.990818] systemd-resolved[2150]: Retrying transaction 13889, after switching servers.
[ 1564.990869] systemd-resolved[2150]: dns0: Switching to DNS server fd00:dead:beef:cafe::1.
[ 1564.990898] systemd-resolved[2150]: Positive cache hit for stale1.unsigned.test IN A
[ 1564.990928] systemd-resolved[2150]: Serve Stale response rcode=SUCCESS for stale1.unsigned.test IN A
[ 1564.990955] systemd-resolved[2150]: Regular transaction 13889 for <stale1.unsigned.test IN A> on scope dns on dns0/* now complete with <success> from cache (unsigned; non-confidential).
[ 1564.990981] systemd-resolved[2150]: Sending response packet with id 25311 on interface 1/AF_INET of size 65.
[ 1564.991008] systemd-resolved[2150]: Sending response packet with id 25311 on interface 1/AF_INET of size 65.
[ 1564.991031] systemd-resolved[2150]: Freeing transaction 13889.
[ 1564.991057] systemd-resolved[2150]: Received dns UDP packet of size 61, ifindex=0, ttl=64, fragsize=0, sender=127.0.0.1, destination=127.0.0.53
[ 1564.991089] systemd-resolved[2150]: Got DNS stub UDP query packet for id 25311
[ 1564.991115] systemd-resolved[2150]: Looking up RR for stale1.unsigned.test IN A.
[ 1564.991141] systemd-resolved[2150]: Requested with no stale and TTL expired for stale1.unsigned.test IN A
[ 1564.991166] systemd-resolved[2150]: Firing regular transaction 52306 for <stale1.unsigned.test IN A> scope dns on dns0/* (validate=yes).
[ 1564.991192] systemd-resolved[2150]: Using feature level UDP+EDNS0+DO for transaction 52306.
[ 1564.991218] systemd-resolved[2150]: Using DNS server fd00:dead:beef:cafe::1 for transaction 52306.
[ 1564.991244] systemd-resolved[2150]: Closing graveyard socket fd 28
[ 1564.991271] systemd-resolved[2150]: Announcing packet size 1452 in egress EDNS(0) packet.
[ 1564.991296] systemd-resolved[2150]: Emitting UDP, link MTU is 1500, socket MTU is 65536, minimal MTU is 60
[ 1564.991326] systemd-resolved[2150]: Sending query packet with id 52306 of size 72.
[ 1564.991352] systemd-resolved[2150]: Sending query via TCP since UDP is blocked.
[ 1564.991376] systemd-resolved[2150]: Added socket 28 to graveyard
[ 1564.991402] systemd-resolved[2150]: Using feature level UDP+EDNS0+DO for transaction 52306.
[ 1564.991425] systemd-resolved[2150]: Announcing packet size 1452 in egress EDNS(0) packet.
[ 1564.991449] systemd-resolved[2150]: Processing query...
[ 1570.007839] testsuite-75.sh[2172]: ;; communications error to 127.0.0.53#53: timed out
[ 1570.007839] testsuite-75.sh[2172]: ;; communications error to 127.0.0.53#53: timed out
[ 1570.007839] testsuite-75.sh[2172]: ;; communications error to 127.0.0.53#53: timed out
[ 1570.007839] testsuite-75.sh[2172]: ; <<>> DiG 9.18.24 <<>> stale1.unsigned.test -t A
[ 1570.007839] testsuite-75.sh[2172]: ;; global options: +cmd
[ 1570.007839] testsuite-75.sh[2172]: ;; no servers could be reached

Full journal: https://mrc0mmand.fedorapeople.org/journals/TEST-75-RESOLVE-timeout.journal

poettering commented 8 months ago

@poettering In case you missed it, they collected the logs and asked if they could send them privately: #21123 (comment). Since several people have reported the same issue, could we maybe keep it open and ask whether someone else could provide logs?

Well, the problem with reports like this one is that if three people comment on the same issue, it's not a given that they are talking about the same issue. It then gets very confusing to follow the thread, because it suggests we are talking about the same issue when more often than not we are not.

For example @mrc0mmand's report has this line:

Sending query via TCP since UDP is blocked.

this happens if we get EPERM from send(), which typically suggests some kind of firewalling/sandboxing situation.
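
A hedged way to check for that on an affected machine (assuming strace is installed and resolved is running):

# watch resolved's outgoing sends for EPERM, which would point at a firewall/sandbox
strace -f -e trace=sendto,sendmsg -p "$(pidof systemd-resolved)" 2>&1 | grep EPERM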

The original reporter reported some issue with some specific router though.

I am pretty sure the two issues are unrelated. But at this point things have already gotten very, very confusing.

Hence: in cases like this, let the maintainers figure out whether two reports are the same case; don't make assumptions that experience tells the maintainers are more often false than true...

Or in other words: if you have an issue and can provide debug logs, just open a separate issue and let's track it there. Do not add noise to possibly unrelated issues. Assume different issues, not identical issues, by default please.

mrc0mmand commented 8 months ago

Or in other words: if you have an issue and can provide debug logs, just open a separate issue and let's track it there. Do not add noise to possibly unrelated issues. Assume different issues, not identical issues, by default please.

Agreed that this issue is indeed hard to follow. I extracted my comment into a separate ticket: https://github.com/systemd/systemd/issues/31639

camoz commented 8 months ago

@poettering Makes sense, thanks for the explanation!

dominichayesferen commented 8 months ago

please enable debug logging for resolved, then reproduce the issue and provide the generated log output of resolved here. i.e. systemctl service-log-level systemd-resolved debug and then to collect the logs do journalctl -e -u systemd-resolved

The following was captured immediately after resolvectl status showed symptoms of the bug (screenshot attached):

unsuccessful launch.txt
unsuccessful launch - run 2.txt

Two runs are uploaded, as the second run of journalctl doesn't show the initial output for some reason.

dominichayesferen commented 8 months ago

OK, here's output from a resolved boot-time launch that worked successfully:

successful launch.txt

brauliobo commented 8 months ago

I'm seeing this issue when running a heavy multithreaded application (1000+ threads), all making requests to collect data, using the latest systemd and Linux from Arch Linux.

$  sudo systemctl status systemd-resolved             
● systemd-resolved.service - Network Name Resolution
     Loaded: loaded (/usr/lib/systemd/system/systemd-resolved.service; enabled; preset: enabled)
     Active: active (running) since Fri 2024-03-22 23:22:22 -03; 3 days ago
       Docs: man:systemd-resolved.service(8)
             man:org.freedesktop.resolve1(5)
             https://www.freedesktop.org/wiki/Software/systemd/writing-network-configuration-managers
             https://www.freedesktop.org/wiki/Software/systemd/writing-resolver-clients
   Main PID: 529 (systemd-resolve)
     Status: "Processing requests..."
      Tasks: 1 (limit: 135079)
     Memory: 87.5M (peak: 98.9M)
        CPU: 2h 20min 4.611s
     CGroup: /system.slice/systemd-resolved.service
             └─529 /usr/lib/systemd/systemd-resolved

mar 26 16:42:41 bhavapower systemd-resolved[529]: Failed to generate query object: Device or resource busy
mar 26 16:42:41 bhavapower systemd-resolved[529]: Failed to generate query object: Device or resource busy
mar 26 16:42:41 bhavapower systemd-resolved[529]: Failed to generate query object: Device or resource busy
mar 26 16:42:41 bhavapower systemd-resolved[529]: Failed to generate query object: Device or resource busy
mar 26 16:42:41 bhavapower systemd-resolved[529]: Failed to generate query object: Device or resource busy
mar 26 16:42:41 bhavapower systemd-resolved[529]: Failed to generate query object: Device or resource busy
mar 26 16:42:41 bhavapower systemd-resolved[529]: Failed to generate query object: Device or resource busy
mar 26 16:42:41 bhavapower systemd-resolved[529]: Failed to generate query object: Device or resource busy
mar 26 16:42:41 bhavapower systemd-resolved[529]: Failed to generate query object: Device or resource busy
mar 26 16:42:41 bhavapower systemd-resolved[529]: Failed to generate query object: Device or resource busy

honomoa commented 7 months ago

I think it's something caused by IPv6; disabling IPv6 on all interfaces brings everything back:

net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
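
If going this route, a sketch for making the change persistent (the file name under /etc/sysctl.d/ is illustrative):

# /etc/sysctl.d/90-disable-ipv6.conf
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1

# apply without rebooting
sysctl --system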

Jeansen commented 7 months ago

I think it's something caused by IPv6; disabling IPv6 on all interfaces brings everything back:

net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1

If this is a feature, then it's a bug. If it is a bug, it needs fixing ;-)

tiagonix commented 7 months ago

Hey folks, @poettering,

I'm facing this issue over and over again. It seems very easy to reproduce.

The issue appears when I try to deploy OpenStack using OpenStack Ansible All-In-One (OSA AIO). I tried from two different locations, from home and the office. The problem persists no matter where I am, or which hardware I use.

Mid-deployment, the operating system stops resolving DNS names. When it fails, I run a quick test, ping -c1 google.com; if that fails I run systemctl restart systemd-resolved.service and try the Ansible playbooks again. It proceeds past the failed task, but then it fails again, and again, over and over, until the playbook finally finishes successfully.

As a workaround (see the Bash script below), I created a super simple restart_resolved() function and an if statement to "monitor" the playbook run; if it fails, I assume it's systemd-resolved going berserk and restart it.

Please don't take this the wrong way! I admire your work, but I'm very disappointed that systemd-resolved is such an unreliable piece of software, and I honestly don't understand why Canonical enables it by default on Ubuntu when it isn't ready for prime time yet. Anyway, let's try to fix this! I like how easy it is to enable DNS over TLS with resolved! I also enjoy the way it's possible to choose different nameservers based on the DNS domain (or network interface? I don't remember exactly), which is unique and cool!

I'll try disabling IPv6 to see if it helps! But I already tried in a place without IPv6 (the office) and the result was the same (though I hadn't explicitly disabled it).

Here's how I'm facing the issue:

Minimum Requirements

1 ThinkPad laptop (or desktop PC) with 8+ CPUs, 16+ GB RAM, 64+ GB SSD
1 Ubuntu Desktop 22.04.4, freshly installed from ISO and upgraded:

apt update
apt upgrade -y
apt autoremove -y
apt install -y vim git
snap refresh
reboot

Then, on the next boot, I run this openstack-ansible-deploy.sh script:

#!/bin/bash

#
# Bootstrap Instance
#

echo '***'
echo '*** Openstack Ansible AIO Scripted Deployment'
echo '***'

#
# OpenStack Ansible AIO Default: 'aio_lxc'
#

# The default deployment, no Ceph, just the most basic with a "data disk" as a Loopback device.
#export SCENARIO='aio_lxc'

#
# OpenStack Ansible AIO with Ceph: 'aio_lxc_ceph'
#

# The OSA + Ceph (installed via `ceph-ansible`)
export SCENARIO='aio_lxc_ceph'

# Reference:
# https://docs.openstack.org/openstack-ansible/2023.2/user/aio/quickstart.html

#
# Download OpenStack Ansible and Bootstrap Ansible
#

git clone https://opendev.org/openstack/openstack-ansible /opt/openstack-ansible

pushd /opt/openstack-ansible

echo
echo "Git checkout OSA 'stable/2023.2' branch."
echo

git checkout stable/2023.2

# Workaround so OSA TASK can find the partitions in the Loopback device (`loop20`):
if [[ "$SCENARIO" == aio_lxc ]]
then
    echo
    echo "OSA AIO Configuring 'loop20' Data Disk and patching 'prepare_data_disk.yml'"
    echo

    fallocate -l 50G /loopfile1.img
    losetup /dev/loop20 /loopfile1.img

    export BOOTSTRAP_OPTS='bootstrap_host_data_disk_device=loop20'

    sed -i 's/}}1/}}p1/g' /opt/openstack-ansible/tests/roles/bootstrap-host/tasks/prepare_data_disk.yml
    sed -i 's/}}2/}}p2/g' /opt/openstack-ansible/tests/roles/bootstrap-host/tasks/prepare_data_disk.yml
fi

if scripts/bootstrap-ansible.sh
then
        echo
        echo "Script: 'scripts/bootstrap-ansible.sh' finished."
        echo
else
        echo
        echo "Script: 'scripts/bootstrap-ansible.sh' FAILED!"
        echo
        exit 1
fi

popd

#
# Enable Some Stuff (Optional)
#

#pushd /opt/openstack-ansible
#
#cp etc/openstack_deploy/conf.d/{aodh,gnocchi,ceilometer}.yml.aio /etc/openstack_deploy/conf.d/
#for f in $(ls -1 /etc/openstack_deploy/conf.d/*.aio); do mv -v ${f} ${f%.*}; done
#
#popd

#
# Bootstrap OSA AIO
#

pushd /opt/openstack-ansible

if scripts/bootstrap-aio.sh
then
        echo
        echo "Script: 'scripts/bootstrap-aio.sh' finished."
        echo
else
        echo
        echo "Script: 'scripts/bootstrap-aio.sh' FAILED!"
        echo
        exit 1
fi
popd

# Basic settings:
sed -i -e 's/install_method:.*/install_method: distro/' /etc/openstack_deploy/user_variables.yml
echo 'apply_security_hardening: false' >> /etc/openstack_deploy/user_variables.yml
echo 'rabbitmq_install_method: distro' >> /etc/openstack_deploy/user_variables.yml
echo 'ceph_origin: distro' >> /etc/openstack_deploy/user_variables.yml
echo 'ceph_stable_release: quincy' >> /etc/openstack_deploy/user_variables.yml
echo 'ceph_pkg_source: distro' >> /etc/openstack_deploy/user_variables.yml

# Disable undesirable/failing playbooks:
#sed -i '/os-tempest-install\.yml/d' /opt/openstack-ansible/playbooks/setup-openstack.yml

#
# Individual Top-Level Playbooks, all included in setup-everything.yml
#

pushd /opt/openstack-ansible/playbooks

# Function to restart systemd-resolved service
# Workaround bug: https://github.com/systemd/systemd/issues/21123
restart_resolved() {
    echo
    echo "WARNING! Restarting 'systemd-resolved'!"
    echo
    systemctl restart systemd-resolved.service
    echo "sleep 3..."
    sleep 3
}

# Number of retries
max_retries_1=9
retry_count_1=0

# Deploy bare metal and Containers
while true
do
    if openstack-ansible setup-hosts.yml
    then
        echo
        echo "Playbook: 'setup-hosts.yml' finished."
        echo
        break
    else
        echo
        echo "ping -c1 google.com ..."
        ping -c1 google.com
        if [[ $? -ne 0 ]]
        then
            restart_resolved
        fi
        ((retry_count_1++))
        if ((retry_count_1 >= max_retries_1))
        then
            echo
            echo "Playbook: 'setup-hosts.yml' FAILED!"
            echo
            exit 1
        else
            echo
            echo "Retrying playbook (retry $retry_count_1)..."
            echo
        fi
    fi
done

# Number of retries
max_retries_2=9
retry_count_2=0

# Deploy Infra Services, Galera, Rabbit, Memcached, Ceph, etc.
while true
do
    if openstack-ansible setup-infrastructure.yml
    then
        echo
        echo "Playbook: 'setup-infrastructure.yml' finished."
        echo
        break
    else
        echo
        echo "ping -c1 google.com ..."
        ping -c1 google.com
        if [[ $? -ne 0 ]]
        then
            restart_resolved
        fi
        ((retry_count_2++))
        if ((retry_count_2 >= max_retries_2))
        then
            echo
            echo "Playbook: 'setup-infrastructure.yml' FAILED!"
            echo
            exit 1
        else
            echo
            echo "Retrying playbook (retry $retry_count_2)..."
            echo
        fi
    fi
done

# Number of retries
max_retries_3=9
retry_count_3=0

# Deploy OpenStack
while true
do
    if openstack-ansible setup-openstack.yml
    then
        echo
        echo "Playbook: 'setup-openstack.yml' finished."
        echo
        break
    else
        echo
        echo "ping -c1 google.com ..."
        ping -c1 google.com
        if [[ $? -ne 0 ]]
        then
            restart_resolved
        fi
        ((retry_count_3++))
        if ((retry_count_3 >= max_retries_3))
        then
            echo
            echo "Playbook: 'setup-openstack.yml' FAILED!"
            echo
            exit 1
        else
            echo
            echo "Retrying playbook (retry $retry_count_3)..."
            echo
        fi
    fi
done

popd

#
# Galera Rolling Reboot for Fun & Hmmm... Updates!
#

#openstack-ansible /opt/openstack-ansible/scripts/upgrade-utilities/galera-cluster-rolling-restart.yml

#
# OSA Major Upgrade, from Bobcat to Caracal (TODO)
#

#cd /opt/openstack-ansible ; git pull ; git checkout master
#cd /opt/openstack-ansible ; ./scripts/run-upgrade.sh

# Clear PageCache only (Freeing RAM)
sync
echo 1 > /proc/sys/vm/drop_caches

echo '***'
echo "*** OSA Boostrap Done! Your OpenStack Cloud is up and running!!!"
echo '***'

Let me know if I can help somehow, perhaps by collecting more info during my tests.

d33pjs commented 7 months ago

I switched from systemd-resolved to resolvconf even though I need to configure resolv.conf manually instead of via DHCP (because it seems that netplan and NetworkManager don't have interfaces to resolvconf).

Anyway, I'm shocked that: a) Ubuntu really did switch to systemd-resolved even though it doesn't seem reliable or "finished", b) we still have issues with DNS in 2024 (yeah, I know: "it's always DNS") and/or IPv6, and c) Ubuntu is using a whole caching service instead of a small(er) resolver/forwarder (and if someone needs caching, they can add it).

My way of maybe fixing finally my DNS issues:

# apt install resolvconf
# systemctl stop systemd-resolved
# systemctl disable systemd-resolved
# rm /etc/resolv.conf
# vi /etc/resolv.conf
  as I said: manually creating resolv.conf incl. the right contents
# dig a testdomain.com
# service docker restart

Edit: By the way, I'm seeing some strange DNS requests in my systemd-resolved debug log, somewhat similar to the ones from @tve: after an AAAA request, it somehow just adds the local domain to it, turning "Request some.domain.de AAAA" into "Request some.domain.de.local A" (like the example log from tve: "ocore.voneicken.com.voneicken.com"). My Pi-hole on the other side also seems to work as expected: no failures, no problems, no delays. But, to rule out the DNS server, I already switched from AdGuard to Pi-hole just for testing.
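
The doubled-suffix queries are consistent with ordinary search-domain processing rather than something resolved invents on its own: when a lookup yields no answer, the stub resolver retries it with the configured search suffix appended. A sketch, assuming a search line like this:

# /etc/resolv.conf (illustrative)
search voneicken.com
# A lookup of "ocore.voneicken.com" that yields no answer is retried as
# "ocore.voneicken.com.voneicken.com" -- the pattern seen in the logs above.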

gudvinr commented 6 months ago

even though I need to configure resolv.conf manually instead of via DHCP

@d33pjs try dnsmasq instead; it can receive DNS servers from NM. I got fed up with resolved too and ditched it in favor of dnsmasq, and have had no issues since then.
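
A minimal sketch of that NetworkManager setup, mirroring the dns.conf shown earlier in the thread:

# /etc/NetworkManager/conf.d/dns.conf
[main]
dns=dnsmasq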

Liamlu28 commented 6 months ago

I am using the Xena version of OpenStack, but after rebooting my virtual machines, the systemd-resolved service consistently fails. I'm also a bit confused; it might be related to cloud-init. This service significantly impacts the stability of my system, as my virtual machines also need to run Kubernetes nodes. When pods are starting, they need to read /run/systemd/resolve/resolv.conf; if they cannot read this file, the pods fail to start.

d33pjs commented 6 months ago

even though I need to configure resolv.conf manually instead of via DHCP

@d33pjs try dnsmasq instead, it can receive DNS servers from NM. I fed up with resolved too and ditched it in favor of dnsmasq and had no issues since then.

Thank you. I was also thinking about dnsmasq instead of resolvconf, but unfortunately it seems to not have an interface to netplan or systemd-networkd either. Because I'm running the minimized version of Ubuntu, I don't have NetworkManager installed.

PS. Sorry for the confusion, I thought netplan was the alternative to NetworkManager, but netplan seems to be a configuration tool for the backends NetworkManager or systemd-networkd.

ei-grad commented 3 months ago

Another alternative to consider is CoreDNS, which offers a user-friendly configuration file, a variety of additional features, and the ability to scale across multiple CPU cores. Additionally, while dnsmasq can fail to respond under certain extreme conditions (though it can still handle 10-100x more load than systemd-resolved), CoreDNS remains reliable in these scenarios.

Example /etc/coredns/Corefile:

# This block configures settings for queries to the root zone (".").
# You can also configure other zones for flexible DNS routing from your machine.
. {

    # Specifies the IP address where CoreDNS listens for incoming DNS requests.
    # "lo" refers to the local loopback interface (127.0.0.1).
    bind lo

    # If uncommented, it would forward all DNS queries to the resolvers defined in /etc/resolv.dhcp.conf.
    #forward . /etc/resolv.dhcp.conf

    # Forwards DNS queries to Cloudflare DNS servers (1.1.1.1 and 1.0.0.1) using TLS.
    forward . tls://1.1.1.1 tls://1.0.0.1 {
        # Sets the TLS server name for certificate verification.
        tls_servername cloudflare-dns.com
    }

    # Configures caching of DNS responses to enhance performance.
    cache {
        # Caches up to 99840 entries; cache duration is between 600 seconds (min) and 259200 seconds (max).
        success 99840 259200 600
        # Allows serving of stale cache entries if upstream DNS servers are unavailable.
        serve_stale
    }

    # Enables error logging for easier debugging and monitoring.
    # Logs should be available via `journalctl -u coredns`.
    errors
}
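
To actually use such an instance, the system resolver just needs to point at it; a sketch, assuming CoreDNS is bound to the loopback interface as configured above:

# /etc/resolv.conf
nameserver 127.0.0.1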

VGerris commented 1 month ago

Same problem on a new installation of Ubuntu 24.04 LTS. Utterly useless: it often works for a few seconds after a restart, then not. Unstable, unpredictable, using a DNS server that is on the local network. dnsmasq it is until this is fixed. In case this reference was missed: https://www.reddit.com/r/linux/comments/18kh1r5/im_shocked_that_almost_no_one_is_talking_about/

palapapa commented 1 month ago

It is still happening. For me, it stops resolving after I haven't used my PC for a while. When I come back to it, it won't resolve anything until several minutes later.

EDIT: I fixed it by disabling DNSSEC.
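
For reference, that's the same DNSSEC= switch shown in the original report's configuration; a sketch as a drop-in (file name illustrative):

# /etc/systemd/resolved.conf.d/no-dnssec.conf
[Resolve]
DNSSEC=no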

wsdookadr commented 2 weeks ago

gudvinr opened this issue on Oct 25, 2021

I suppose it's an ongoing issue. Here's another interesting read on a related topic.