matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.82k stars 2.13k forks source link

Federation errors when DNS server is on same IP #8118

Closed freaktechnik closed 4 years ago

freaktechnik commented 4 years ago

Description

I'm running the DNS server that synapse is using to resolve names on the same machine as synapse itself. As such, the IPv4 and IPv6 addresses (the IPv6 address is not link local, and I don't even see it in the resolv.conf inside the docker container) are identical to the ones synapse would have in the network.

Steps to reproduce

All HTTP requests by synapse fail due to not being able to resolve any domain names. nslookup on the parent machine is fine. I'm quite certain the STR aren't fully reduced, however I'd rather have a working synapse than try to find the exact case that breaks it.

Version information

If not matrix.org:

clokep commented 4 years ago

Is this a setup that was previously working and recently broke or is it a new setup?

Asking in #synapse:matrix.org might be a good place to start, it sounds like something is broken in the docker config -- it could be interesting to modify the docker file to simply run a DNS lookup and see if that works.

This sounds somewhat like #5846, but you do mention it isn't link-local, but maybe it runs into the same issue eventually?

freaktechnik commented 4 years ago

I recently changed the DNS setup and it appears that's when synapse broken (once restarted for the recent version update). So I'm sure it's related to changing the DNS servers. Setting the DNS servers to a different one on the container with --dns also fixed the issue - I'm now trying if I can instead set it to the local DNS server I'm running by its internal docker network IP.

Sadly I can't seem to run anything inside the docker image that does dns lookups since it's a pretty slim image.

freaktechnik commented 4 years ago

I have tried the script from the twisted bug on the host machine and everything seems fine there with any of the DNS IPs the container should possibly be getting, so it must be something in combination with the docker network - though I know for a fact other containers have no issue accessing services running on the same machine via the local network IP.

freaktechnik commented 4 years ago

I've tried the twisted snipped via python repl in docker exec and it seems fine for the IPv4 address, even though explicitly defining it on the container didn't.

It fails for the IPv6 address with "can't assign requested address" - possibly the docker network not having full ipv6. However I'm sure this is unrelated to IPv6 given that resolv.conf only had IPv4 addresses in it and it failed for domains that only resolved to A records (and/or had a SRV record that pointed to something with an A record). The other case I managed to get it to actually time out was when using the local docker network IP.

babolivier commented 4 years ago

Hey, sorry it took some time to get back to you. Would you be able to turn Synapse's logging to DEBUG (in the logging config, it's usually a separate file in the same directory as the main config) and see if it logs anything that seems related to this issue?

freaktechnik commented 4 years ago

With debug logging it just seems slightly more verbose, but I don't see anything that is useful in addition to the basic issue:

2020-08-28 20:04:23,501 - synapse.http.federation.matrix_federation_agent - 261 - INFO - federation_transaction_transmission_loop-3 - Failed to connect to matrix-federation.matrix.org:443: DNS lookup failed: Couldn't find the hostname 'matrix-federation.matrix.org'.
2020-08-28 20:04:23,501 - synapse.http.matrixfederationclient - 543 - INFO - federation_transaction_transmission_loop-3 - {PUT-O-44} [matrix.org] Request failed: PUT matrix://matrix.org/_matrix/federation/v1/send/1598637689015: DNSLookupError("Couldn't find the hostname 'matrix-federation.matrix.org'")
2020-08-28 20:04:23,512 - synapse.http.federation.well_known_resolver - 263 - INFO - federation_transaction_transmission_loop-30 - Error fetching https://archlinux.org/.well-known/matrix/server: DNS lookup failed: Couldn't find the hostname 'www.archlinux.org'.
2020-08-28 20:04:23,513 - synapse.logging.context - 351 - WARNING - federation_transaction_transmission_loop-30 - Expected logging context federation_transaction_transmission_loop-30 was lost
2020-08-28 20:04:23,513 - synapse.http.matrixfederationclient - 543 - INFO - federation_transaction_transmission_loop-30 - {PUT-O-45} [archlinux.org] Request failed: PUT matrix://archlinux.org/_matrix/federation/v1/send/1598637689016: TimeoutError(60, 'Deferred')
2020-08-28 20:04:23,530 - synapse.http.federation.matrix_federation_agent - 261 - INFO - federation_transaction_transmission_loop-71 - Failed to connect to mtx.koyax.org:443: DNS lookup failed: Couldn't find the hostname 'mtx.koyax.org'.
2020-08-28 20:04:23,531 - synapse.http.matrixfederationclient - 543 - INFO - federation_transaction_transmission_loop-71 - {PUT-O-46} [mtx.koyax.org] Request failed: PUT matrix://mtx.koyax.org/_matrix/federation/v1/send/1598637689017: DNSLookupError("Couldn't find the hostname 'mtx.koyax.org'")
2020-08-28 20:04:23,538 - synapse.http.federation.matrix_federation_agent - 261 - INFO - federation_transaction_transmission_loop-74 - Failed to connect to matrix.2gather.community:8448: DNS lookup failed: Couldn't find the hostname 'matrix.2gather.community'.

...

2020-08-28 20:04:24,156 - synapse.http.federation.well_known_resolver - 263 - INFO - federation_transaction_transmission_loop-256 - Error fetching https://destrocodpiece.wtf/.well-known/matrix/server: DNS lookup failed: Couldn't find the hostname 'destrocodpiece.wtf'.
2020-08-28 20:04:24,156 - synapse.logging.context - 351 - WARNING - federation_transaction_transmission_loop-256 - Expected logging context federation_transaction_transmission_loop-256 was lost
2020-08-28 20:04:24,156 - synapse.http.matrixfederationclient - 543 - INFO - federation_transaction_transmission_loop-256 - {PUT-O-274} [destrocodpiece.wtf] Request failed: PUT matrix://destrocodpiece.wtf/_matrix/federation/v1/send/1598637689245: TimeoutError(60, 'Deferred')
2020-08-28 20:04:24,156 - synapse.http.federation.well_known_resolver - 263 - INFO - federation_transaction_transmission_loop-264 - Error fetching https://mathur.io/.well-known/matrix/server: DNS lookup failed: Couldn't find the hostname 'mathur.io'.
2020-08-28 20:04:24,157 - synapse.logging.context - 351 - WARNING - federation_transaction_transmission_loop-264 - Expected logging context federation_transaction_transmission_loop-264 was lost
2020-08-28 20:04:24,157 - synapse.http.matrixfederationclient - 543 - INFO - federation_transaction_transmission_loop-264 - {PUT-O-275} [mathur.io] Request failed: PUT matrix://mathur.io/_matrix/federation/v1/send/1598637689246: TimeoutError(60, 'Deferred')
2020-08-28 20:04:24,157 - synapse.http.federation.well_known_resolver - 263 - INFO - federation_transaction_transmission_loop-257 - Error fetching https://deadgoldfish.party/.well-known/matrix/server: DNS lookup failed: Couldn't find the hostname 'deadgoldfish.party'.
2020-08-28 20:04:24,158 - synapse.logging.context - 351 - WARNING - federation_transaction_transmission_loop-257 - Expected logging context federation_transaction_transmission_loop-257 was lost
2020-08-28 20:04:24,158 - synapse.http.matrixfederationclient - 543 - INFO - federation_transaction_transmission_loop-257 - {PUT-O-276} [deadgoldfish.party] Request failed: PUT matrix://deadgoldfish.party/_matrix/federation/v1/send/1598637689247: TimeoutError(60, 'Deferred')

...

020-08-28 20:04:37,688 - synapse.http.federation.matrix_federation_agent - 261 - INFO - PUT-310 - Failed to connect to matrix-federation.matrix.org:443: DNS lookup failed: Couldn't find the hostname 'matrix-federation.matrix.org'.
2020-08-28 20:04:37,688 - synapse.http.matrixfederationclient - 543 - INFO - PUT-310 - {POST-O-738} [matrix.org] Request failed: POST matrix://matrix.org/_matrix/key/v2/query: DNSLookupError("Couldn't find the hostname 'matrix-federation.matrix.org'")
2020-08-28 20:04:37,689 - synapse.http.federation.well_known_resolver - 263 - INFO - PUT-310 - Error fetching https://mozilla.org/.well-known/matrix/server: DNS lookup failed: Couldn't find the hostname 'mozilla.org'.

Interestingly it seems like some URLs are resolvable, since it does a bunch of .well-known lookups which look ok before it reached the logs above:

2020-08-28 20:04:13,675 - synapse.http.federation.well_known_resolver - 192 - INFO - federation_transaction_transmission_loop-56 - Error parsing well-known for b'macbeth.cc': Non-200 response 404
2020-08-28 20:04:13,708 - synapse.http.federation.well_known_resolver - 185 - INFO - federation_transaction_transmission_loop-55 - Response from .well-known: {'m.server': 'bau-ha.us:8448'}
2020-08-28 20:04:13,729 - synapse.http.federation.well_known_resolver - 185 - INFO - federation_transaction_transmission_loop-63 - Response from .well-known: {'m.server': 'matrix.mailus.de:443'}
2020-08-28 20:04:13,731 - synapse.http.federation.well_known_resolver - 185 - INFO - federation_transaction_transmission_loop-59 - Response from .well-known: {'m.server': 'matrix.cybre.space:8448'}
2020-08-28 20:04:13,786 - synapse.http.federation.well_known_resolver - 192 - INFO - federation_transaction_transmission_loop-38 - Error parsing well-known for b'riotchat.de': Non-200 response 404
2020-08-28 20:04:13,913 - synapse.http.federation.well_known_resolver - 185 - INFO - federation_transaction_transmission_loop-54 - Response from .well-known: {'m.server': 'd0.ee:443'}
2020-08-28 20:04:13,992 - synapse.http.federation.well_known_resolver - 185 - INFO - federation_transaction_transmission_loop-64 - Response from .well-known: {'m.server': 'sorunome.de:443'}
2020-08-28 20:04:14,491 - synapse.http.federation.well_known_resolver - 192 - INFO - federation_transaction_transmission_loop-65 - Error parsing well-known for b'eggy.cc': Non-200 response 404
2020-08-28 20:04:14,608 - synapse.http.federation.well_known_resolver - 192 - INFO - federation_transaction_transmission_loop-62 - Error parsing well-known for b'smokeonthewater.foobar.rocks': Non-200 response 404
freaktechnik commented 4 years ago

Between the .well-known lookups above and the first DNS errors there is one more federation log:

2020-08-28 20:04:19,921 - synapse.http.federation.well_known_resolver - 263 - INFO - federation_transaction_transmission_loop-86 - Error fetching https://cirr.com/.well-known/matrix/server: [<twisted.python.failure.Failure service_identity.exceptions.VerificationError: VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'cirr.com'))])>]

Other than that there's just

2020-08-28 20:04:22,287 - synapse.http.site - 217 - INFO - PUT-310 - Connection from client lost before response was sent

That I don't know exactly what it means.

freaktechnik commented 4 years ago

I've enabled query logging on my local DNS server and it seems like synapse looks up the domains with the DNS server every time it tries to connect to something, no DNS cache at all? Either way, I only see successful queries there. Some of the ones where DNS lookup fail that I've randomly chosen don't even show up in there. However, the resolver config for the container also allows requests to my router, so that may be why some don't show up.

richvdh commented 4 years ago

Interesting that .well-known lookups succeed where federation requests fail. I suspect the problem is that specifically SRV lookups are timing out; the key difference between SRV lookups and most other DNS lookups is that the SRV lookups use a pure-python DNS library (twisted.names) rather than getaddrinfo.

Looking at the source of twisted.names, it looks like it will fall back to trying to send DNS queries to 127.0.0.1:53 if it fails to parse either /etc/resolv.conf or /etc/hosts. It might be worth checking that those two files are valid inside your docker container.

freaktechnik commented 4 years ago

There are .well-known lookups failing though?

The resolv.conf inside the container looks fine and as mentioned earlier I did run the twisted.names resolver snippet from the other issue inside tge container and it worked fine.

richvdh commented 4 years ago

oh I see, sorry.

The only thing I can really think of is that DNS lookups are happening slowly enough that a backlog is building up and they are then timing out. (UDP lookups might be failing, causing a fallback to TCP?) I wonder if a tcpdump on the traffic from the docker container would be informative.

MayeulC commented 4 years ago

I got around the slow DNS queries by increasing dnsmasq cache size to be greater than the number of failing DNS requests in my log (the latter was around 2000). I also increased the number of simultaneous connections to 300 from the 150 default, as I sometimes saw that issue popping out in dnsmasq logs.

However, the root cause might have been that my server was sending a lot of data on start-up, which was increasing latency, making it hard to catch-up. (#8338).

If this stems from the same root cause, you might have luck increasing this timeout, without necessarily having to change your DNS setup:

https://github.com/matrix-org/synapse/blob/837293c314b47e988fe9532115476a6536cd6406/synapse/http/matrixfederationclient.py#L261

(I doubled it to 120s).

freaktechnik commented 4 years ago

Increasing the DNS cache size on the DNS side seems to work around this, indeed.

richvdh commented 4 years ago

sounds like a problem on the dns server side then.