matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Event rejected by dendrite breaks federation with that server #11160

Closed MayeulC closed 2 years ago

MayeulC commented 2 years ago

Description

This issue is linked to https://github.com/matrix-org/dendrite/issues/1882

Dendrite rejects an event as invalid. It seems that it will also reject all subsequent events in that room.

That's fair, and a dendrite issue. However, synapse's per-server federation queue gets stuck in trying to send these events again and again, which causes federation between these servers to break.

A possible solution would be to rotate events in that federation queue on a per-room basis, especially when an event is rejected.

I am not sure what rejection mechanism is in place. The spec says that the receiver should not respond to the transaction with an error response, so maybe that cycling should be attempted by default?

Note that I could be completely off track on this one; please tell me if that's the case.
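
If the rotation idea were implemented, it might look roughly like this (a hypothetical Python sketch, not Synapse's actual per-destination queue; all names are made up). Each room keeps its own FIFO, and a batch takes at most the head event of each room, so one rejected event can no longer stall every other room's traffic to that destination:

```python
from collections import OrderedDict, deque


class PerDestinationQueue:
    """Hypothetical sketch of per-room rotation for a federation send queue.

    Instead of one global FIFO per destination, keep one FIFO per room and
    rotate across rooms, so a single stuck event only stalls its own room.
    """

    def __init__(self):
        # room_id -> deque of pending events, in insertion order
        self._rooms = OrderedDict()

    def enqueue(self, room_id, event):
        self._rooms.setdefault(room_id, deque()).append(event)

    def next_batch(self):
        """Return at most one head event per room, rotating through rooms."""
        return [
            (room_id, events[0])
            for room_id, events in list(self._rooms.items())
            if events
        ]

    def ack(self, room_id):
        """Called when the destination accepted the room's head event."""
        events = self._rooms.get(room_id)
        if events:
            events.popleft()
            if not events:
                del self._rooms[room_id]
```

With this shape, a destination that keeps rejecting one room's head event still receives fresh events for every other room on each pass.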

Steps to reproduce

  1. On a synapse server, have someone gain PL > 100 as part of the room creation
  2. Invite a dendrite user in that room
  3. Observe how the federation between both servers melts down: encryption keys are not forwarded, and communication is only possible through a third-party server in the same room.

Version information

log excerpts from the Dendrite issue:

Dendrite:

```
dentrite | time="2021-08-07T15:48:50.089392580Z" level=info msg="received data from client API server" func="onMessage\n\t" file=" [clientapi.go:89]" room_id="" type=im.vector.setting.breadcrumbs
dentrite | time="2021-08-07T15:48:52.933145593Z" level=info msg="Received transaction \"1628342442615\" from \"$BRIDGESERVER$.tld\" containing 0 PDUs, 1 EDUs" func="Send\n\t" file=" [send.go:206]" req.id=NBSUGY4uu24v req.method=PUT req.path=/_matrix/federation/v1/send/1628342442615
dentrite | time="2021-08-07T15:48:52.944073619Z" level=info msg="Checking event signatures for 14 events of room state" func="Check\n\t" file=" [federationtypes.go:489]" context=missing
dentrite | time="2021-08-07T15:48:52.974127673Z" level=error msg="Event \"$gdzUf8gGUxP3vXlGDTcxRL1UyLISPVW-D8dr5qSKbAc\" is not allowed by its auth events" func="Check\n\t" file=" [federationtypes.go:518]" error="gomatrixserverlib: event with ID \"$gdzUf8gGUxP3vXlGDTcxRL1UyLISPVW-D8dr5qSKbAc\" is not allowed by its auth_events: eventauth: sender with level 100 is not allowed change user level from 100 to 9001 because the new level is above the level of the sender"
dentrite | time="2021-08-07T15:48:52.974234831Z" level=error msg="Event \"$gdzUf8gGUxP3vXlGDTcxRL1UyLISPVW-D8dr5qSKbAc\" is not allowed by its auth events" func="Check\n\t" file=" [federationtypes.go:518]" error="gomatrixserverlib: event with ID \"$gdzUf8gGUxP3vXlGDTcxRL1UyLISPVW-D8dr5qSKbAc\" is not allowed by its auth_events: eventauth: sender with level 100 is not allowed change user level from 100 to 9001 because the new level is above the level of the sender"
dentrite | time="2021-08-07T15:48:52.974483708Z" level=warning msg="Discarding 1 auth/state event(s) due to invalid signatures" func="Check\n\t" file=" [federationtypes.go:526]" context=missing
dentrite | time="2021-08-07T15:48:53.183057018Z" level=error msg="helpers.CheckAuthEvents failed for event, rejecting event" func="processRoomEvent\n\t" file=" [input_events.go:106]" auth_event_ids="[$BWTXyDe2mX149rgVnZ2jaIQX3T27uNgJfp-_daqsFYw $pixruZKarWEuUOcWtQ-CVJEk_-qGdXLoxBxHU_c9cqk $gdzUf8gGUxP3vXlGDTcxRL1UyLISPVW-D8dr5qSKbAc]" error="db.StateEntriesForEventIDs: storage: state event IDs missing from the database (2 != 3)" event_id="$ntLpjV-wGLTitM0oKA40FKm1OJgee_TaT13XRW3TolA"
dentrite | time="2021-08-07T15:48:53.230620706Z" level=error msg="helpers.CheckAuthEvents failed for event, rejecting event" func="processRoomEvent\n\t" file=" [input_events.go:106]" auth_event_ids="[$BWTXyDe2mX149rgVnZ2jaIQX3T27uNgJfp-_daqsFYw $pixruZKarWEuUOcWtQ-CVJEk_-qGdXLoxBxHU_c9cqk $gdzUf8gGUxP3vXlGDTcxRL1UyLISPVW-D8dr5qSKbAc]" error="db.StateEntriesForEventIDs: storage: state event IDs missing from the database (2 != 3)" event_id="$grlSz6tZ0IjRv7qpo4AuSwdNSNzEKLPJqa78IUjbofc"
```
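
For context, the auth failure in these logs corresponds to the event-auth rule for `m.room.power_levels`: a sender cannot raise a user's level above their own, nor change the level of a more powerful user. A simplified Python sketch of that rule (illustrative only, not gomatrixserverlib's actual code):

```python
def check_power_level_change(sender_level: int, old_level: int, new_level: int) -> None:
    """Simplified m.room.power_levels auth check.

    Rejects the change when the new level exceeds the sender's own level,
    or when the target user currently outranks the sender.
    """
    if new_level > sender_level:
        raise ValueError(
            f"sender with level {sender_level} is not allowed to change "
            f"user level from {old_level} to {new_level} because the new "
            f"level is above the level of the sender"
        )
    if old_level > sender_level:
        raise ValueError(
            f"sender with level {sender_level} cannot change the level "
            f"of a user at level {old_level}"
        )
```

In the logged case the sender had level 100 and tried to set a user to 9001, which trips the first branch, exactly the error Dendrite reports.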

Synapse (much more recent logs):

```
synapse.federation.sender.per_destination_queue - 354 - WARNING - federation_transaction_transmission_loop-21109482- TX [dendrite.example.com] Failed to send transaction: Failed to send request: TimeoutError: User timeout caused connection failure.
synapse.federation.federation_server - 249 - WARNING - PUT-6166836- Received another txn 1634816968887-13 from dendrite.example.com while still processing 1634816968846-13
synapse.federation.federation_server - 249 - WARNING - PUT-6168829- Received another txn 1634817506390-27 from dendrite.example.com while still processing 1634817502881-27
synapse.federation.federation_server - 249 - WARNING - PUT-6168830- Received another txn 1634817506867-27 from dendrite.example.com while still processing 1634817502881-27
```
richvdh commented 2 years ago

I think we'll need more comprehensive logs from synapse to understand what's going on here.

erikjohnston commented 2 years ago

From the brief bit of Synapse logs, it looks like Dendrite isn't responding to the /send request, so Synapse retries it, and that is the correct behaviour. If Dendrite has finished processing an event (including rejecting it), then it needs to respond to the request with a 200 response (quickly enough that the request isn't timed out).
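
Erik's point can be illustrated with a sketch of a federation `/send` handler: the transaction as a whole gets a 200 even when individual PDUs are rejected, with per-event errors reported in the response body (hypothetical Python; `process_event` is injected and all names are made up):

```python
def handle_send_transaction(txn_id, pdus, process_event):
    """Process a federation transaction.

    Rejected events still yield a 200 response; the rejection is reported
    per-PDU in the body, so the sender can advance its queue instead of
    retrying the same transaction forever.
    """
    results = {}
    for pdu in pdus:
        try:
            process_event(pdu)  # may raise, e.g. on an auth failure
            results[pdu["event_id"]] = {}
        except Exception as e:
            results[pdu["event_id"]] = {"error": str(e)}
    # 200 regardless: the transaction was *received and processed*,
    # even if some of its events were rejected.
    return 200, {"pdus": results}
```

The failure mode in this issue is different: no response arrives at all, so the sender's only safe option is to retry.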

More logs would be useful, though.

MayeulC commented 2 years ago

I am not actually sure this is the cause, mind you. But our server troubles started when I created that bridge.

I hope this might help: I just grepped my most recent homeserver.log (16 GB) for the offending homeserver and redacted it, since its owner did so in their issue. Note that robotsignal is my Signal bot's custom username on my server.

I hadn't realized this, but homeserver.log's output is substantially different from the systemd unit's (python console). I hope this one gives more info.

Some interesting tidbits usually start with `Marking origin 'dendrite.example.com' as up`, like at 2021-10-28 13:34:13,512. You can clearly see the exponential backoff kicking in until the server receives an event.
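
The backoff pattern visible in the log can be sketched as a simple geometric retry schedule (illustrative Python with made-up constants, not Synapse's actual values):

```python
# Illustrative constants only; Synapse's real retry schedule differs.
MIN_RETRY_MS = 10 * 60 * 1000          # first retry after 10 minutes
RETRY_MULTIPLIER = 5                   # interval grows 5x per failure
MAX_RETRY_MS = 24 * 60 * 60 * 1000     # capped at one day


def next_retry_interval(last_interval_ms: int, succeeded: bool) -> int:
    """Exponential backoff for an unreachable federation destination.

    A success marks the origin as up and resets the interval to zero;
    each failure multiplies the interval, up to a cap.
    """
    if succeeded:
        return 0
    if last_interval_ms == 0:
        return MIN_RETRY_MS
    return min(last_interval_ms * RETRY_MULTIPLIER, MAX_RETRY_MS)
```

This is why, once a destination has been failing for a while, it can take hours for federation to resume even after the underlying problem is fixed, unless an incoming event marks the origin as up again.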

errors.log.zst.not.gz (6 MB -> 360 kB, compressed with zstd; it is not actually gzip, but I had to use another extension for GitHub, so read it with zstdcat or zstdless). I can send a .gz too; it's 866 kB.

Edit: it looks like these "User timeout" errors started appearing later, on 2021-08-16:

```
2021-08-15 23:04:53,268 - synapse.federation.sender.transaction_manager - 158 - INFO - federation_transaction_transmission_loop-2317951- TX [dendrite.example.com] {1628358076153} got 200 response
2021-08-16 07:26:40,982 - synapse.federation.transport.server - 450 - INFO - PUT-2311411- Received txn 1629091600572-0 from dendrite.example.com. (PDUs: 0, EDUs: 1)
2021-08-16 07:26:41,012 - synapse.access.https.8448 - 410 - INFO - PUT-2311411- ::ffff:XX.XX.XX.XX - 8448 - {dendrite.example.com} Processed request: 0.042sec/-0.000sec (0.006sec, 0.002sec) (0.005sec/0.019sec/8) 11B 200 "PUT /_matrix/federation/v1/send/1629091600572-0 HTTP/1.1" "Dendrite/0.4.1" [0 dbevts]
2021-08-16 07:27:08,141 - synapse.federation.transport.server - 450 - INFO - PUT-2311464- Received txn 1629091628031-1 from dendrite.example.com. (PDUs: 0, EDUs: 1)
2021-08-16 07:27:08,166 - synapse.access.https.8448 - 410 - INFO - PUT-2311464- ::ffff:XX.XX.XX.XX - 8448 - {dendrite.example.com} Processed request: 0.029sec/-0.000sec (0.006sec, 0.000sec) (0.003sec/0.015sec/7) 11B 200 "PUT /_matrix/federation/v1/send/1629091628031-1 HTTP/1.1" "Dendrite/0.4.1" [0 dbevts]
2021-08-16 07:27:31,289 - synapse.federation.transport.server - 450 - INFO - PUT-2311508- Received txn 1629091651186-2 from dendrite.example.com. (PDUs: 0, EDUs: 1)
2021-08-16 07:27:31,293 - synapse.handlers.typing - 344 - INFO - PUT-2311508- Got typing update from @someuser:dendrite.example.com: {'room_id': '!GcrWFZZBxiRwfXJmFp:mayeul.net', 'typing': True, 'user_id': '@someuser:dendrite.example.com'}
2021-08-16 07:27:31,307 - synapse.access.https.8448 - 410 - INFO - PUT-2311508- ::ffff:XX.XX.XX.XX - 8448 - {dendrite.example.com} Processed request: 0.021sec/-0.000sec (0.003sec, 0.000sec) (0.002sec/0.006sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1629091651186-2 HTTP/1.1" "Dendrite/0.4.1" [0 dbevts]
2021-08-16 07:27:41,230 - synapse.federation.transport.server - 450 - INFO - PUT-2311532- Received txn 1629091661127-3 from dendrite.example.com. (PDUs: 0, EDUs: 1)
2021-08-16 07:27:41,235 - synapse.handlers.typing - 344 - INFO - PUT-2311532- Got typing update from @someuser:dendrite.example.com: {'room_id': '!GcrWFZZBxiRwfXJmFp:mayeul.net', 'typing': False, 'user_id': '@someuser:dendrite.example.com'}
2021-08-16 07:27:41,241 - synapse.access.https.8448 - 410 - INFO - PUT-2311532- ::ffff:XX.XX.XX.XX - 8448 - {dendrite.example.com} Processed request: 0.014sec/-0.000sec (0.003sec, 0.000sec) (0.002sec/0.004sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1629091661127-3 HTTP/1.1" "Dendrite/0.4.1" [0 dbevts]
2021-08-16 07:27:41,410 - synapse.federation.transport.server - 450 - INFO - PUT-2311533- Received txn 1629091661305-4 from dendrite.example.com. (PDUs: 0, EDUs: 1)
2021-08-16 07:27:41,415 - synapse.handlers.typing - 344 - INFO - PUT-2311533- Got typing update from @someuser:dendrite.example.com: {'room_id': '!GcrWFZZBxiRwfXJmFp:mayeul.net', 'typing': True, 'user_id': '@someuser:dendrite.example.com'}
2021-08-16 07:27:41,422 - synapse.access.https.8448 - 410 - INFO - PUT-2311533- ::ffff:XX.XX.XX.XX - 8448 - {dendrite.example.com} Processed request: 0.015sec/-0.000sec (0.005sec, 0.001sec) (0.003sec/0.003sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1629091661305-4 HTTP/1.1" "Dendrite/0.4.1" [0 dbevts]
2021-08-16 07:27:51,318 - synapse.federation.transport.server - 450 - INFO - PUT-2311553- Received txn 1629091671204-5 from dendrite.example.com. (PDUs: 0, EDUs: 1)
2021-08-16 07:27:51,323 - synapse.handlers.typing - 344 - INFO - PUT-2311553- Got typing update from @someuser:dendrite.example.com: {'room_id': '!GcrWFZZBxiRwfXJmFp:mayeul.net', 'typing': False, 'user_id': '@someuser:dendrite.example.com'}
2021-08-16 07:27:51,329 - synapse.access.https.8448 - 410 - INFO - PUT-2311553- ::ffff:XX.XX.XX.XX - 8448 - {dendrite.example.com} Processed request: 0.014sec/-0.000sec (0.003sec, 0.000sec) (0.003sec/0.003sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1629091671204-5 HTTP/1.1" "Dendrite/0.4.1" [0 dbevts]
2021-08-16 07:27:51,576 - synapse.federation.transport.server - 450 - INFO - PUT-2311554- Received txn 1629091671468-6 from dendrite.example.com. (PDUs: 0, EDUs: 1)
2021-08-16 07:27:51,580 - synapse.handlers.typing - 344 - INFO - PUT-2311554- Got typing update from @someuser:dendrite.example.com: {'room_id': '!GcrWFZZBxiRwfXJmFp:mayeul.net', 'typing': True, 'user_id': '@someuser:dendrite.example.com'}
2021-08-16 07:27:51,587 - synapse.access.https.8448 - 410 - INFO - PUT-2311554- ::ffff:XX.XX.XX.XX - 8448 - {dendrite.example.com} Processed request: 0.017sec/-0.000sec (0.004sec, 0.000sec) (0.002sec/0.004sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1629091671468-6 HTTP/1.1" "Dendrite/0.4.1" [0 dbevts]
2021-08-16 07:28:01,437 - synapse.federation.transport.server - 450 - INFO - PUT-2311578- Received txn 1629091681332-7 from dendrite.example.com. (PDUs: 0, EDUs: 1)
2021-08-16 07:28:01,448 - synapse.handlers.typing - 344 - INFO - PUT-2311578- Got typing update from @someuser:dendrite.example.com: {'room_id': '!GcrWFZZBxiRwfXJmFp:mayeul.net', 'typing': False, 'user_id': '@someuser:dendrite.example.com'}
2021-08-16 07:28:01,479 - synapse.access.https.8448 - 410 - INFO - PUT-2311578- ::ffff:XX.XX.XX.XX - 8448 - {dendrite.example.com} Processed request: 0.047sec/-0.000sec (0.006sec, 0.000sec) (0.026sec/0.004sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1629091681332-7 HTTP/1.1" "Dendrite/0.4.1" [0 dbevts]
2021-08-16 07:28:02,353 - synapse.federation.transport.server - 450 - INFO - PUT-2311581- Received txn 1629091682253-8 from dendrite.example.com. (PDUs: 0, EDUs: 1)
2021-08-16 07:28:02,357 - synapse.handlers.typing - 344 - INFO - PUT-2311581- Got typing update from @someuser:dendrite.example.com: {'room_id': '!GcrWFZZBxiRwfXJmFp:mayeul.net', 'typing': True, 'user_id': '@someuser:dendrite.example.com'}
2021-08-16 07:28:02,363 - synapse.access.https.8448 - 410 - INFO - PUT-2311581- ::ffff:XX.XX.XX.XX - 8448 - {dendrite.example.com} Processed request: 0.013sec/-0.000sec (0.002sec, 0.001sec) (0.001sec/0.004sec/2) 11B 200 "PUT /_matrix/federation/v1/send/1629091682253-8 HTTP/1.1" "Dendrite/0.4.1" [0 dbevts]
2021-08-16 07:28:08,520 - synapse.federation.transport.server - 450 - INFO - PUT-2311599- Received txn 1629091688411-9 from dendrite.example.com. (PDUs: 0, EDUs: 1)
2021-08-16 07:28:08,528 - synapse.handlers.typing - 344 - INFO - PUT-2311599- Got typing update from @someuser:dendrite.example.com: {'room_id': '!GcrWFZZBxiRwfXJmFp:mayeul.net', 'typing': False, 'user_id': '@someuser:dendrite.example.com'}
2021-08-16 07:28:08,536 - synapse.access.https.8448 - 410 - INFO - PUT-2311599- ::ffff:XX.XX.XX.XX - 8448 - {dendrite.example.com} Processed request: 0.018sec/-0.000sec (0.004sec, 0.000sec) (0.004sec/0.005sec/3) 11B 200 "PUT /_matrix/federation/v1/send/1629091688411-9 HTTP/1.1" "Dendrite/0.4.1" [0 dbevts]
2021-08-16 07:28:08,923 - synapse.federation.transport.server - 450 - INFO - PUT-2311600- Received txn 1629091688823-10 from dendrite.example.com. (PDUs: 1, EDUs: 0)
2021-08-16 07:28:08,943 - synapse.access.https.8448 - 410 - INFO - PUT-2311600- ::ffff:XX.XX.XX.XX - 8448 - {dendrite.example.com} Processed request: 0.023sec/-0.000sec (0.006sec, 0.000sec) (0.003sec/0.010sec/5) 60B 200 "PUT /_matrix/federation/v1/send/1629091688823-10 HTTP/1.1" "Dendrite/0.4.1" [0 dbevts]
2021-08-16 08:08:53,935 - synapse.federation.sender.transaction_manager - 124 - INFO - federation_transaction_transmission_loop-2323582- TX [dendrite.example.com] {1628358078992} Sending transaction [1628358078992], (PDUs: 0, EDUs: 1)
2021-08-16 08:08:53,936 - synapse.http.federation.well_known_resolver - 254 - INFO - federation_transaction_transmission_loop-2323582- Fetching https://dendrite.example.com/.well-known/matrix/server
2021-08-16 08:08:54,059 - synapse.http.federation.well_known_resolver - 198 - INFO - federation_transaction_transmission_loop-2323582- Error parsing well-known for b'dendrite.example.com': Non-200 response 404
2021-08-16 08:09:24,103 - synapse.http.federation.matrix_federation_agent - 288 - INFO - federation_transaction_transmission_loop-2323582- Failed to connect to matrix.dendrite.example.com:8448: User timeout caused connection failure.
2021-08-16 08:09:24,104 - synapse.http.matrixfederationclient - 637 - INFO - federation_transaction_transmission_loop-2323582- {PUT-O-3728753} [dendrite.example.com] Request failed: PUT matrix://dendrite.example.com/_matrix/federation/v1/send/1628358078992: TimeoutError('')
2021-08-16 08:09:58,562 - synapse.http.federation.matrix_federation_agent - 288 - INFO - federation_transaction_transmission_loop-2323582- Failed to connect to matrix.dendrite.example.com:8448: User timeout caused connection failure.
2021-08-16 08:09:58,563 - synapse.http.matrixfederationclient - 637 - INFO - federation_transaction_transmission_loop-2323582- {PUT-O-3728753} [dendrite.example.com] Request failed: PUT matrix://dendrite.example.com/_matrix/federation/v1/send/1628358078992: TimeoutError('')
```

These earlier logs intersect with the posted logs from dendrite (unfortunately the logs are not marked as UTC). Bigger log file: oldlogs-redact.log.zst.not.gz

reivilibre commented 2 years ago

Thanks for including your logs. I notice a lot of timeouts when connecting to dendrite.example.com.

I got a little bit distracted by these log entries from March (maybe send a shorter log file next time? :)):

```
2021-03-30 08:38:12,496 - synapse.http.federation.well_known_resolver - 191 - INFO - federation_transaction_transmission_loop-11158951- Response from .well-known: {'m.server': 'matrix.dendrite.example.com:443'}
```

though it's followed up by a DNS lookup problem:

```
2021-03-30 08:40:26,722 - synapse.http.federation.matrix_federation_agent - 290 - INFO - federation_transaction_transmission_loop-11158951- Failed to connect to matrix.dendrite.example.com:443: DNS lookup failed: Couldn't find the hostname 'matrix.dendrite.example.com'.
```

More recently, I see the following:

```
2021-10-05 21:47:39,963 - synapse.http.federation.well_known_resolver - 253 - INFO - federation_transaction_transmission_loop-6725332- Fetching https://dendrite.example.com/.well-known/matrix/server
2021-10-05 21:47:46,416 - synapse.http.federation.well_known_resolver - 197 - INFO - federation_transaction_transmission_loop-6725332- Error parsing well-known for b'dendrite.example.com': Non-200 response 404
2021-10-05 21:48:20,801 - synapse.http.federation.matrix_federation_agent - 362 - INFO - federation_transaction_transmission_loop-6725332- Failed to connect to matrix.dendrite.example.com:8448: User timeout caused connection failure.
```

It looks as though the homeserver has changed from .well-known delegation to SRV record delegation (and to port 8448): that strikes me as odd, since I wouldn't expect admins to be moving away from port 443 to port 8448 (since use of port 8448 has largely been a pain). (Perhaps the SRV record always existed but .well-known was working back in March and so the SRV record was being ignored?)
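
For reference, a rough sketch of the server discovery precedence being discussed, .well-known first, then SRV, then the server name itself on port 8448 (hypothetical Python; the lookup helpers are injected and all names are made up):

```python
def resolve_server(server_name, fetch_well_known, lookup_srv):
    """Rough sketch of Matrix server-name resolution precedence:

    1. https://<server_name>/.well-known/matrix/server, if it parses;
    2. the _matrix._tcp SRV record, if one exists;
    3. otherwise the server name itself on port 8448.

    fetch_well_known returns e.g. "matrix.example.com:443" or None;
    lookup_srv returns e.g. ("matrix.example.com", 8448) or None.
    """
    delegated = fetch_well_known(server_name)
    if delegated:
        host, _, port = delegated.partition(":")
        return host, int(port) if port else 8448
    srv = lookup_srv(server_name)
    if srv:
        return srv
    return server_name, 8448
```

This ordering explains the observed behaviour: while the .well-known response parsed, the SRV record was never consulted; once .well-known started returning 404, resolution fell through to the (faulty) SRV record and port 8448.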

In any case, this seems to be an issue caused not by a rejected event but by persistent connection errors. To my understanding, event rejection results in a 200 OK response that tells the sending homeserver which events were rejected; the receiving homeserver doesn't just drop off the network (as Erik mentioned above). :)

Since you've redacted the server name, I can't look at the situation with the DNS and well-known records myself (if you'd be happy sending the server name in a PM to https://matrix.to/#/@reivilibre.element:librepush.net then I am happy to at least have a look).

MayeulC commented 2 years ago

Thanks a lot for looking at the logs!

I'm really sorry to have taken up some of your time with this issue; we solved it independently yesterday. It is exactly as you say:

A .well-known record was previously working, then it suddenly stopped working, and a faulty SRV record was there the whole time.

I've also noticed these .well-known entries in the logs, and asked the homeserver owner to fix the record. Connectivity was restored a bit after that was done ~30 hours ago (I see your message is 22 hours old, sorry for not commenting here right away). I'm also going to ask for the SRV records to be removed.

I am still puzzled by the fact that it was federating with matrix.org despite the broken .well-known and SRV records.

I redacted the server name due to concerns from its owner, though if you are still interested I can ask them to share it privately.

I also removed the offending appservice a week ago, pending packaging improvements. I'll see if the situation improves when I restore that.

erikjohnston commented 2 years ago

> I am still puzzled by the fact that it was federating with matrix.org despite the broken .well-known and SRV records.

It's possible that matrix.org had an open HTTP connection the entire time, and so didn't need to requery .well-known and SRV records? :shrug:

reivilibre commented 2 years ago

Glad it got sorted! :)

MayeulC commented 2 years ago

> > I am still puzzled by the fact that it was federating with matrix.org despite the broken .well-known and SRV records.
>
> It's possible that matrix.org had an open HTTP connection the entire time, and so didn't need to requery .well-known and SRV records? :shrug:

For multiple months? Given that the other server has about one user? Possible, but I consider it unlikely, especially as the other server had some downtime.

I would have to ask what the previous configuration was, but maybe matrix.org had accepted that one (though that seems unlikely, since I'm using Synapse too).

Anyway, really sorry for the noise, but I'm glad this was solved too :)