Closed: MayeulC closed this issue 2 years ago
I think we'll need more comprehensive logs from Synapse to understand what's going on here.
From the brief bit of Synapse logs, it looks like Dendrite isn't responding to the /send request, so Synapse will retry, and that is the correct behaviour. If Dendrite has finished processing an event (including if it's rejected) then it needs to respond to the request with a 200 response (suitably quickly that the request isn't timed out).
More logs would be useful though.
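(For context: per the Matrix federation API, the receiver of `PUT /_matrix/federation/v1/send/{txnId}` should reply 200 with a per-PDU result map even when individual events are rejected. Below is a minimal sketch of that shape, assuming a Flask-style handler; `process_event` and `compute_event_id` are hypothetical helpers, not Synapse or Dendrite code.)

```python
# Minimal sketch of a spec-shaped /send handler; illustrative only,
# not Synapse or Dendrite code.
from flask import Flask, request, jsonify

app = Flask(__name__)

def compute_event_id(pdu: dict) -> str:
    """Hypothetical helper: in modern room versions the event ID is
    derived from the event's reference hash, not sent on the wire."""
    raise NotImplementedError

def process_event(pdu: dict) -> None:
    """Hypothetical helper: validate/persist the event, raising if it
    is rejected."""
    raise NotImplementedError

@app.route("/_matrix/federation/v1/send/<txn_id>", methods=["PUT"])
def on_send(txn_id: str):
    results = {}
    for pdu in request.get_json().get("pdus", []):
        event_id = compute_event_id(pdu)
        try:
            process_event(pdu)
            results[event_id] = {}                   # accepted
        except Exception as exc:
            results[event_id] = {"error": str(exc)}  # rejected
    # The key point from the comment above: reply 200 even when events
    # were rejected, otherwise the sender keeps retrying the transaction.
    return jsonify({"pdus": results}), 200
```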
I am not actually sure this is the cause, mind you. But our server troubles started when I created that bridge.
I hope this might help: I just grepped my most recent `homeserver.log` (16 GB) for the offending homeserver, and redacted it, since its owner did so in their issue. Note that `robotsignal` is my `signalbot`'s custom username on my server.
I hadn't realized this, but `homeserver.log`'s output is substantially different from the systemd unit's (python console). I hope this one gives more info.
Some interesting tidbits usually start with `Marking origin 'dendrite.example.com' as up`, like at `2021-10-28 13:34:13,512`. You can clearly see exponential backoff kicking in until the server receives an event.
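(For reference, the retry pattern those `Marking origin ... as up` lines reflect is capped exponential backoff between delivery attempts. This is just the general shape, not Synapse's actual code:)

```python
import random
import time

def send_with_backoff(attempt_send, base: float = 1.0, cap: float = 3600.0):
    """Capped exponential backoff: retry a failing federation send with
    ever-longer waits. Illustrative only, not Synapse's implementation."""
    delay = base
    while True:
        try:
            attempt_send()
            return  # success: the destination would be marked as "up" again
        except Exception:
            # wait with a little jitter, then double the delay up to the cap
            time.sleep(delay * random.uniform(0.8, 1.2))
            delay = min(delay * 2, cap)
```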
errors.log.zst.not.gz (6 MB -> 360 kB, compressed with zstd: not actually gzip, though I had to use another extension for GitHub; you can use `zstdcat` or `zstdless` to read it). I can send a `.gz` too; it's 866 kB.
These earlier logs intersect with the posted logs from dendrite (unfortunately the logs are not marked as UTC). Bigger log file: oldlogs-redact.log.zst.not.gz
Thanks for including your logs. I notice a lot of timeouts when connecting to dendrite.example.com.
I got a little bit distracted by these log entries from March (maybe send a shorter log file next time? :)):
```
2021-03-30 08:38:12,496 - synapse.http.federation.well_known_resolver - 191 - INFO - federation_transaction_transmission_loop-11158951- Response from .well-known: {'m.server': 'matrix.dendrite.example.com:443'}
```
though it's followed up by a DNS lookup problem:
```
2021-03-30 08:40:26,722 - synapse.http.federation.matrix_federation_agent - 290 - INFO - federation_transaction_transmission_loop-11158951- Failed to connect to matrix.dendrite.example.com:443: DNS lookup failed: Couldn't find the hostname 'matrix.dendrite.example.com'.
```
More recently, I see the following:
```
2021-10-05 21:47:39,963 - synapse.http.federation.well_known_resolver - 253 - INFO - federation_transaction_transmission_loop-6725332- Fetching https://dendrite.example.com/.well-known/matrix/server
2021-10-05 21:47:46,416 - synapse.http.federation.well_known_resolver - 197 - INFO - federation_transaction_transmission_loop-6725332- Error parsing well-known for b'dendrite.example.com': Non-200 response 404
2021-10-05 21:48:20,801 - synapse.http.federation.matrix_federation_agent - 362 - INFO - federation_transaction_transmission_loop-6725332- Failed to connect to matrix.dendrite.example.com:8448: User timeout caused connection failure.
```
It looks as though the homeserver has changed from `.well-known` delegation to SRV record delegation (and to port 8448): that strikes me as odd, since I wouldn't expect admins to be moving away from port 443 to port 8448 (use of port 8448 has largely been a pain).
(Perhaps the SRV record always existed, but `.well-known` was working back in March and so the SRV record was being ignored?)
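(For reference, the resolution order the spec defines is: `/.well-known/matrix/server` first, then the `_matrix._tcp` SRV record if that fails, then the hostname itself on port 8448. A simplified sketch of that order, assuming the `requests` and `dnspython` packages; it ignores IP literals, explicit ports, and the SRV lookup the spec also applies to a delegated name:)

```python
import requests
import dns.resolver

def resolve_server(server_name: str) -> tuple[str, int]:
    """Simplified Matrix server-name resolution (2021-era spec order);
    illustrative only."""
    # 1. .well-known delegation takes priority...
    try:
        resp = requests.get(
            f"https://{server_name}/.well-known/matrix/server", timeout=10
        )
        if resp.status_code == 200:
            host, _, port = resp.json()["m.server"].partition(":")
            return host, int(port) if port else 8448
    except Exception:
        pass  # fall through on timeouts, 404s, bad JSON, ...
    # 2. ...so the SRV record is only consulted when .well-known fails,
    # which is why a broken .well-known suddenly exposes a faulty SRV.
    try:
        srv = dns.resolver.resolve(f"_matrix._tcp.{server_name}", "SRV")[0]
        return str(srv.target).rstrip("."), srv.port
    except Exception:
        pass
    # 3. Last resort: the server name itself on port 8448.
    return server_name, 8448
```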
In any case, this seems not to be an issue caused by a rejected event, but by persistent connection errors. To my understanding, event rejection will result in a 200 OK response that tells the sending homeserver which events were rejected; the homeserver doesn't just drop off the network (this is as Erik mentioned above). :)
Since you've redacted the server name, I can't look at the situation with the DNS and well-known records myself (if you'd be happy sending the server name in a PM to https://matrix.to/#/@reivilibre.element:librepush.net then I am happy to at least have a look).
Thanks a lot for looking at the logs!
I'm really sorry to have taken some of your time with that issue; we solved it independently yesterday. It is exactly as you say:
A `.well-known` record was previously working, then it suddenly stopped working, and a faulty SRV record was there the whole time.
I've also noticed these `.well-known` entries in the logs, and asked the homeserver owner to fix the record. Connectivity was restored a bit after that was done, ~30 hours ago (I see your message is 22 hours old; sorry for not commenting here right away). I'm also going to ask for the `SRV` records to be removed.
I am still puzzled by the fact that it was federating with matrix.org despite the broken `.well-known` and `SRV` records.
I redacted the server name due to concerns from its owner, though if you are still interested I can ask them to share it privately.
I also removed the offending appservice a week ago, pending packaging improvements. I'll see if the situation improves when I restore that.
> I am still puzzled by the fact that it was federating with matrix.org despite the broken `.well-known` and `SRV` records.

It's possible that matrix.org had an open HTTP connection the entire time, and so didn't need to requery the `.well-known` and `SRV` records? :shrug:
Glad it got sorted! :)
> > I am still puzzled by the fact that it was federating with matrix.org despite the broken `.well-known` and `SRV` records.
>
> It's possible that matrix.org had an open HTTP connection the entire time, and so didn't need to requery the `.well-known` and `SRV` records? :shrug:
For multiple months? Given that the other server has about one user? Possible, but I consider it unlikely, especially as the other server had some downtime.
I would have to ask what the previous configuration was, but maybe matrix.org accepted that one (although it's unlikely, since I'm using Synapse too).
Anyway, really sorry for the noise, but I'm glad this was solved too :)
Description
This issue is linked to https://github.com/matrix-org/dendrite/issues/1882
Dendrite rejects an event as invalid. It seems that it will also reject all subsequent events in that room.
That's fair, and a Dendrite issue. However, Synapse's per-server federation queue gets stuck trying to send these events again and again, which causes federation between these servers to break.
A possible solution would be to rotate events in that federation queue, per room, especially if they are rejected (see the sketch below). I am not sure what rejection mechanism is in place. The spec says that the transaction should not be answered with an error response, so maybe that cycling should be attempted by default?
Note that I could be completely misdirected on that one. Please tell me so if that's the case.
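To make the proposal concrete, here is a hypothetical sketch of per-room rotation in a per-destination queue, so a room whose events keep being rejected can't starve the others. This is not how Synapse's federation sender actually works; all names are illustrative:

```python
from collections import OrderedDict, deque

class PerDestinationQueue:
    """Hypothetical per-destination queue with per-room round-robin;
    not Synapse's actual implementation."""

    def __init__(self) -> None:
        # room_id -> FIFO of pending events for that room
        self._by_room: OrderedDict[str, deque] = OrderedDict()

    def enqueue(self, room_id: str, event: object) -> None:
        self._by_room.setdefault(room_id, deque()).append(event)

    def next_event(self):
        """Take the next event, rotating across rooms so one stuck
        room does not block the rest of the transaction queue."""
        while self._by_room:
            room_id, pending = next(iter(self._by_room.items()))
            self._by_room.move_to_end(room_id)  # rotate this room to the back
            if pending:
                return room_id, pending.popleft()
            del self._by_room[room_id]  # drop exhausted rooms
        return None
```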
Steps to reproduce
Version information
Homeserver: mayeul.net
Version: 1.42.0
Install method: yunohost
Platform: yunohost (debian bare metal)
Log excerpts from the Dendrite issue:
Dendrite:
Synapse (much more recent logs):