matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.79k stars 2.13k forks source link

Provide ways to analyze incoming federation delay issues and other problems with incoming message delays #10562

Open MurzNN opened 3 years ago

MurzNN commented 3 years ago

Our public homeserver ru-matrix.org with not so large userbase (less than 20 active users) have a big problem with incoming federation: messages from popular homeservers (eg matrix.org, tchncs.de) are delayed by several hours! And same thing with encryption keys. But from non-popular homeservers messages are delivered in several seconds.

And I, as admin of this server, can't even understand where can be the problem!

Server have enough free space, CPU is not so busy (8 CPU with average load less than 1), memory is not full too (16 gigabytes). I see no errors in Synapse and PostgreSQL logs too. I have all prometheus metrics for Synapse, Linux, Postgres and see no problems in them too...

I even tried to start 6 separate workers for federation in 10 hours span - but this isn't help, delays stays at more than hour spans.

So, please, give here for me (and for other Synapse users that have similar problems) some information to analyze, where can be the problem, how to investigate it, and what can I try more to solve it?

Please don't file github issues asking for support.

I already asked about this in #synapse:matrix.org, but not receive any useful advice, so filling long-standing issue to hope receive something useful :)

anoadragon453 commented 3 years ago

Looks like matrix.org is repeatedly receiving a 404 on PUT https://ru-matrix.org:8448/_matrix/federation/v1/send/1234, and then backing off as a result:

2021-08-11 13:46:23,171 - synapse.http.matrixfederationclient - 603 - INFO - federation_transaction_transmission_loop-2681741-- - {PUT-O-2058241} [ru-matrix.org] Got response headers: 404 Not Found
2021-08-11 13:46:23,171 - synapse.http.matrixfederationclient - 682 - WARNING - federation_transaction_transmission_loop-2681741-- - {PUT-O-2058241} [ru-matrix.org] Request failed: PUT matrix://ru-matrix.org/_matrix/federation/v1/send/1628609648565: HttpResponseException('404: Not Found')
2021-08-11 13:46:23,172 - synapse.util.retryutils - 212 - INFO - federation_transaction_transmission_loop-2681741- - Connection to ru-matrix.org was unsuccessful (<class 'synapse.api.errors.HttpResponseException'>(404: Not Found)); backoff now 600000
2021-08-11 13:46:23,172 - synapse.federation.sender.transaction_manager - 155 - INFO - federation_transaction_transmission_loop-2681741- - TX [ru-matrix.org] {1628609648565} got 404 response
2021-08-11 13:46:23,172 - synapse.federation.sender.per_destination_queue - 349 - WARNING - federation_transaction_transmission_loop-2681741 - TX [ru-matrix.org] Received 404 response to transaction: 404: Not Found

The same can be found throughout the day on August 9th. It's likely the cause of failing transactions inbound transactions. Would you happen to know why this might be?

MurzNN commented 3 years ago

@anoadragon453 Thank you for information! I've found that time span in my logs, here is log lines in worker, matched to 1628609648565 id:

2021-08-11 16:46:23,111 - synapse.federation.transport.server - 445 - INFO - PUT-1593680- Received txn 1628609648565 from matrix.org. (PDUs: 1, EDUs: 0)
2021-08-11 16:46:23,126 - synapse.http.server - 88 - INFO - PUT-1593680- <SynapseRequest at 0x7f2bae19d430 method='PUT' uri='/_matrix/federation/v1/send/1628609648565' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-11 16:46:23,128 - synapse.access.http.12011 - 389 - INFO - PUT-1593680- ::ffff:127.0.0.1 - 12011 - {matrix.org} Processed request: 0.020sec/-0.000sec (0.007sec, 0.000sec) (0.001sec/0.002sec/1) 101B 404 "PUT /_matrix/federation/v1/send/1628609648565 HTTP/1.0" "Synapse/1.40.0rc2 (b=matrix-org-hotfixes,53d7d9251)" [1 dbevts]
2021-08-11 16:53:08,440 - synapse.http.server - 88 - INFO - GET-1596394- <SynapseRequest at 0x7f2bf2aa10d0 method='GET' uri='/_matrix/federation/v1/send/1628609648565' clientproto='HTTP/1.0' site=12011> SynapseError: 400 - Unrecognized request
2021-08-11 16:53:08,441 - synapse.access.http.12011 - 389 - INFO - GET-1596394- ::ffff:127.0.0.1 - 12011 - {None} Processed request: 0.001sec/-0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 59B 400 "GET /_matrix/federation/v1/send/1628609648565 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" [0 dbevts]
2021-08-11 16:53:09,662 - synapse.http.server - 88 - INFO - GET-1596399- <SynapseRequest at 0x7f2c4436bb50 method='GET' uri='/_matrix/federation/v1/send/1628609648565' clientproto='HTTP/1.0' site=12011> SynapseError: 400 - Unrecognized request
2021-08-11 16:53:09,663 - synapse.access.http.12011 - 389 - INFO - GET-1596399- ::ffff:127.0.0.1 - 12011 - {None} Processed request: 0.002sec/-0.000sec (0.002sec, 0.002sec) (0.000sec/0.000sec/0) 59B 400 "GET /_matrix/federation/v1/send/1628609648565 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36" [0 dbevts]

And there is no one line matched to transaction id (1628609648565) or operation id (1593680) in Synapse log of main process (not worker)...

Seems the main problem is the Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4 error?

As I understand, this means that my homeserver misses some chunk of data from room, included in that federation sync, right? So, what the solutions for that problem can be?

MurzNN commented 3 years ago

And here is example of last log lines, matched to "Could not find event" text (excuiding any id):

2021-08-12 12:00:57,947 - synapse.http.server - 88 - INFO - PUT-1922933- <SynapseRequest at 0x7f2bf47735b0 method='PUT' uri='/_matrix/federation/v1/send/1628611386500' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:00:58,348 - synapse.http.server - 88 - INFO - PUT-1922936- <SynapseRequest at 0x7f2c74c21520 method='PUT' uri='/_matrix/federation/v1/send/1628611386505' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:00:58,556 - synapse.http.server - 88 - INFO - PUT-1922937- <SynapseRequest at 0x7f2caf357880 method='PUT' uri='/_matrix/federation/v1/send/1628611386507' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:00:58,793 - synapse.http.server - 88 - INFO - PUT-1922939- <SynapseRequest at 0x7f2cdb6f8100 method='PUT' uri='/_matrix/federation/v1/send/1628611386509' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:01:19,976 - synapse.http.server - 88 - INFO - PUT-1922992- <SynapseRequest at 0x7f2cdb6f8490 method='PUT' uri='/_matrix/federation/v1/send/1628611387179' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:01:20,003 - synapse.http.server - 88 - INFO - PUT-1922979- <SynapseRequest at 0x7f2bdc4407f0 method='PUT' uri='/_matrix/federation/v1/send/1626942950669' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:01:20,339 - synapse.http.server - 88 - INFO - PUT-1923013- <SynapseRequest at 0x7f2cac0fc8e0 method='PUT' uri='/_matrix/federation/v1/send/1626942950670' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:01:24,339 - synapse.http.server - 88 - INFO - PUT-1923036- <SynapseRequest at 0x7f2c456d6100 method='PUT' uri='/_matrix/federation/v1/send/1626942950671' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:01:41,420 - synapse.http.server - 88 - INFO - PUT-1923115- <SynapseRequest at 0x7f2ca50ef820 method='PUT' uri='/_matrix/federation/v1/send/1626942950709' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:03:05,176 - synapse.http.server - 88 - INFO - PUT-1923526- <SynapseRequest at 0x7f2cfc24c280 method='PUT' uri='/_matrix/federation/v1/send/1626942950740' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:04:06,764 - synapse.http.server - 88 - INFO - PUT-1923821- <SynapseRequest at 0x7f2cd043acd0 method='PUT' uri='/_matrix/federation/v1/send/1626942950742' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:04:25,058 - synapse.http.server - 88 - INFO - PUT-1923928- <SynapseRequest at 0x7f2ca938c040 method='PUT' uri='/_matrix/federation/v1/send/1628602877901' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:05:09,543 - synapse.http.server - 88 - INFO - PUT-1924122- <SynapseRequest at 0x7f2c78d13ee0 method='PUT' uri='/_matrix/federation/v1/send/1626942950764' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:05:22,727 - synapse.http.server - 88 - INFO - PUT-1924194- <SynapseRequest at 0x7f2c09a62dc0 method='PUT' uri='/_matrix/federation/v1/send/1628611392564' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:05:27,203 - synapse.http.server - 88 - INFO - PUT-1924219- <SynapseRequest at 0x7f2c59ac3400 method='PUT' uri='/_matrix/federation/v1/send/1628450166484' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4

Interesting thing, that they all (and all other from last 12 hours) are pointing to same event id $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4, I don't see mentions of any other ids in logs...

And in longer time span of 7 days, I see only 14 lines with that error, matching to other event id, here is all of them:

2021-08-11 23:21:25,694 - synapse.http.server - 88 - INFO - GET-1736041- <SynapseRequest at 0x7f2caa626910 method='GET' uri='/_matrix/federation/v1/event_auth/%21OGEhHVWSdvArJzumhm%3Amatrix.org/%24VpPZCkdVv0TgfLCLuELsP12Koe-mK6eA5BEr-1696k0' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $VpPZCkdVv0TgfLCLuELsP12Koe-mK6eA5BEr-1696k0
2021-08-10 20:28:23,337 - synapse.http.server - 88 - INFO - GET-1248236- <SynapseRequest at 0x7f2c2f62d550 method='GET' uri='/_matrix/federation/v1/state_ids/%21rCWNvpCTZHQkiRYUDE%3Amatrix.org/?event_id=%248LSVfUqk4balwKDerq6Q4UjsKvigJpChPIBE5amo3fw&auth_chain_ids=0' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $8LSVfUqk4balwKDerq6Q4UjsKvigJpChPIBE5amo3fw
2021-08-10 20:27:34,782 - synapse.http.server - 88 - INFO - GET-1247951- <SynapseRequest at 0x7f2cae643160 method='GET' uri='/_matrix/federation/v1/state_ids/%21rCWNvpCTZHQkiRYUDE%3Amatrix.org/?event_id=%2414w-9iX0zAPKZt1Pu31zdb5joK6KGcySxiSNaBX0-CY&auth_chain_ids=0' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $14w-9iX0zAPKZt1Pu31zdb5joK6KGcySxiSNaBX0-CYShow context
2021-08-09 22:32:12,675 - synapse.http.server - 88 - INFO - GET-869628- <SynapseRequest at 0x7f2cefabb7c0 method='GET' uri='/_matrix/federation/v1/event_auth/%21OGEhHVWSdvArJzumhm%3Amatrix.org/%24IFLxPa2myf6iKJqltCrXvDCYfpY1FiNoj6ZSKkaFfIY' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $IFLxPa2myf6iKJqltCrXvDCYfpY1FiNoj6ZSKkaFfIY
2021-08-09 21:55:41,479 - synapse.http.server - 88 - INFO - GET-855818- <SynapseRequest at 0x7f2bff23b370 method='GET' uri='/_matrix/federation/v1/event_auth/%21OGEhHVWSdvArJzumhm%3Amatrix.org/%245XstO3dpyT-GhrCV7TM9jn-AFr04bskPdb35Rwn2ot8' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $5XstO3dpyT-GhrCV7TM9jn-AFr04bskPdb35Rwn2ot8
2021-08-09 17:20:48,071 - synapse.http.server - 88 - INFO - GET-763628- <SynapseRequest at 0x7f2c737e7a60 method='GET' uri='/_matrix/federation/v1/state_ids/%21rCWNvpCTZHQkiRYUDE%3Amatrix.org/?event_id=%24gonqWqRG89u2UTD5wcA13s-Hgq-S4XvKlOmiInzwHQU&auth_chain_ids=0' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $gonqWqRG89u2UTD5wcA13s-Hgq-S4XvKlOmiInzwHQU
2021-08-09 10:18:00,718 - synapse.http.server - 88 - INFO - GET-224632 - <XForwardedForRequest at 0x7fdfa303c940 method='GET' uri='/_matrix/client/r0/rooms/%21OGEhHVWSdvArJzumhm%3Amatrix.org/event/%24hH1sSQtLCpgr13_oGBHWpPiA3YtZd9x6c2PR5NjrA9E' clientproto='HTTP/1.0' site='8008'> SynapseError: 404 - Could not find event $hH1sSQtLCpgr13_oGBHWpPiA3YtZd9x6c2PR5NjrA9E
2021-08-09 10:18:00,607 - synapse.http.server - 88 - INFO - GET-224628 - <XForwardedForRequest at 0x7fdfa28b46a0 method='GET' uri='/_matrix/client/r0/rooms/%21OGEhHVWSdvArJzumhm%3Amatrix.org/event/%24hH1sSQtLCpgr13_oGBHWpPiA3YtZd9x6c2PR5NjrA9E' clientproto='HTTP/1.0' site='8008'> SynapseError: 404 - Could not find event $hH1sSQtLCpgr13_oGBHWpPiA3YtZd9x6c2PR5NjrA9E
2021-08-09 10:18:00,384 - synapse.http.server - 88 - INFO - GET-224621 - <XForwardedForRequest at 0x7fdfe3143880 method='GET' uri='/_matrix/client/r0/rooms/%21OGEhHVWSdvArJzumhm%3Amatrix.org/event/%24hH1sSQtLCpgr13_oGBHWpPiA3YtZd9x6c2PR5NjrA9E' clientproto='HTTP/1.0' site='8008'> SynapseError: 404 - Could not find event $hH1sSQtLCpgr13_oGBHWpPiA3YtZd9x6c2PR5NjrA9E
2021-08-09 10:18:00,355 - synapse.http.server - 88 - INFO - GET-224619 - <XForwardedForRequest at 0x7fdfa2b89910 method='GET' uri='/_matrix/client/r0/rooms/%21OGEhHVWSdvArJzumhm%3Amatrix.org/event/%24hH1sSQtLCpgr13_oGBHWpPiA3YtZd9x6c2PR5NjrA9E' clientproto='HTTP/1.0' site='8008'> SynapseError: 404 - Could not find event $hH1sSQtLCpgr13_oGBHWpPiA3YtZd9x6c2PR5NjrA9E
2021-08-08 18:45:35,020 - synapse.http.server - 88 - INFO - GET-403528- <SynapseRequest at 0x7f2c570ed6d0 method='GET' uri='/_matrix/federation/v1/state_ids/%21rCWNvpCTZHQkiRYUDE%3Amatrix.org/?event_id=%24UYQ1EKG0Yp3gtG3YKMlaeOHr_CV5r9KbZ5Y0BVGMPzw&auth_chain_ids=0' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $UYQ1EKG0Yp3gtG3YKMlaeOHr_CV5r9KbZ5Y0BVGMPzw
2021-08-08 16:38:20,160 - synapse.http.server - 88 - INFO - GET-365646- <SynapseRequest at 0x7f2ccc4815e0 method='GET' uri='/_matrix/federation/v1/state_ids/%21rCWNvpCTZHQkiRYUDE%3Amatrix.org/?event_id=%24HMqnXDjE_YQ9bOdxektQvRmGthO5asuiG2bz79hPRzI&auth_chain_ids=0' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $HMqnXDjE_YQ9bOdxektQvRmGthO5asuiG2bz79hPRzI
2021-08-08 00:36:25,778 - synapse.http.server - 88 - INFO - GET-31847 - <XForwardedForRequest at 0x7fe08dddb040 method='GET' uri='/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm%3Amatrix.org/event/%24YeWIEdNMgq4yMi0AJbi17mFHvPT52VuhmI4rCBUZkio' clientproto='HTTP/1.0' site='8008'> SynapseError: 404 - Could not find event $YeWIEdNMgq4yMi0AJbi17mFHvPT52VuhmI4rCBUZkio
2021-08-06 21:44:42,980 - synapse.http.server - 88 - INFO - GET-4486- <SynapseRequest at 0x7fb000a8f430 method='GET' uri='/_matrix/federation/v1/event_auth/%21OGEhHVWSdvArJzumhm%3Amatrix.org/%248veMnKnCN_T_dKZIfStJ4Sqdw4kiFfvIcVu4AQ_VaA0' clientproto='HTTP/1.0' site=12015> SynapseError: 404 - Could not find event $8veMnKnCN_T_dKZIfStJ4Sqdw4kiFfvIcVu4AQ_VaA0

In last 7 days there is about 7500 count of lines per day matched to Could not find event $8veMnKnCN_T_dKZIfStJ4Sqdw4kiFfvIcVu4AQ_VaA0 event id, and about 0-7 (average 2.73) matched to other event ids.

And I see no other log record types matched to $8veMnKnCN_T_dKZIfStJ4Sqdw4kiFfvIcVu4AQ_VaA0 string, only the "Could not find event".

MurzNN commented 3 years ago

Hmm, I even see this id in my events Synapse database, here is this row:

 {
   "topological_ordering": 324921,
   "event_id": "$8veMnKnCN_T_dKZIfStJ4Sqdw4kiFfvIcVu4AQ_VaA0",
   "type": "m.room.member",
   "room_id": "!OGEhHVWSdvArJzumhm:matrix.org",
   "content": null,
   "unrecognized_keys": null,
   "processed": true,
   "outlier": true,
   "depth": 324921,
   "origin_server_ts": 1628275419183,
   "received_ts": 1628278668700,
   "sender": "@timo:conduit.rs",
   "contains_url": false,
   "instance_name": "master",
   "stream_ordering": 45554143
 }

And here is from event_json table:

 {
   "event_id": "$8veMnKnCN_T_dKZIfStJ4Sqdw4kiFfvIcVu4AQ_VaA0",
   "room_id": "!OGEhHVWSdvArJzumhm:matrix.org",
   "internal_metadata": "{\"outlier\":true}",
   "json": "{\"auth_events\":[\"$MaPUZG2fNleIE_gUYDmOy-nj0Zom8mQ5FNM7gEAVI6I\",\"$cqtvWbNUS-f2uwi7Lp6RUbwNnvtorTXDBaH0Qx1-Ej4\",\"$Gkjeo5PQl__unpfqml2KQsO1Z3cI0ul7TEJW4rLA6dI\",\"$0QMsU7XVvBu1aJbnq8GZpB47MdbJTvvrWrTM1YaqgHE\"],\"content\":{\"avatar_url\":\"mxc://conduit.rs/q9B6LVnmTd1IM4PTdKWDo0oMfbzqH4yE\",\"displayname\":\"Timo on Conduit \\u26a1\\ufe0f\",\"membership\":\"join\"},\"depth\":324921,\"hashes\":{\"sha256\":\"qOxnFLSF1nxjw2afcBR9iL5rOiRJSCY4RdCpE9Y3Tbo\"},\"origin\":\"conduit.rs\",\"origin_server_ts\":1628275419183,\"prev_events\":[\"$DSHtZKCzitARzWhC6uEjcDAzYZxQKh7C5sTrziNxWeE\"],\"prev_state\":[],\"room_id\":\"!OGEhHVWSdvArJzumhm:matrix.org\",\"sender\":\"@timo:conduit.rs\",\"state_key\":\"@timo:conduit.rs\",\"type\":\"m.room.member\",\"signatures\":{\"conduit.rs\":{\"ed25519:s6MDvQDG\":\"kDt2DJpSteF0Zpt/v9h1gIlBP2bNXscm5wbtsQ0Anq5lAFNtGV8xWr43ieuMxT3wIpcp1TJlutDWJxhQxAjQBw\"}},\"unsigned\":{\"prev_content\":{\"avatar_url\":\"mxc://conduit.rs/q9B6LVnmTd1IM4PTdKWDo0oMfbzqH4yE\",\"displayname\":\"Timo on Conduit \\u26a1\\ufe0f\",\"membership\":\"join\"},\"prev_sender\":\"@timo:conduit.rs\",\"replaces_state\":\"$Gkjeo5PQl__unpfqml2KQsO1Z3cI0ul7TEJW4rLA6dI\"}}",
   "format_version": 3
 }

But why Synapse can't find it - does anybody have any ideas?

wol-win commented 3 years ago

FWIW - "room_id": "!OGEhHVWSdvArJzumhm:matrix.org" is the MatrixHQ room

richvdh commented 3 years ago

And here is example of last log lines, matched to "Could not find event" text (excuiding any id):

can you show us all the log lines for one of these requests - try grepping for PUT-1922933, for example.

MurzNN commented 3 years ago

can you show us all the log lines for one of these requests - try grepping for PUT-1922933, for example.

Here are all log lines, that matched to PUT-1922933 from main Synapse log and workers:

2021-08-12 12:00:57,937 - synapse.federation.transport.server - 445 - INFO - PUT-1922933- Received txn 1628611386500 from matrix.org. (PDUs: 1, EDUs: 0)
2021-08-12 12:00:57,947 - synapse.http.server - 88 - INFO - PUT-1922933- <SynapseRequest at 0x7f2bf47735b0 method='PUT' uri='/_matrix/federation/v1/send/1628611386500' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 12:00:57,948 - synapse.access.http.12011 - 389 - INFO - PUT-1922933- ::ffff:127.0.0.1 - 12011 - {matrix.org} Processed request: 0.014sec/-0.000sec (0.001sec, 0.000sec) (0.002sec/0.002sec/1) 101B 404 "PUT /_matrix/federation/v1/send/1628611386500 HTTP/1.0" "Synapse/1.40.0rc2 (b=matrix-org-hotfixes,53d7d9251)" [1 dbevts]

Maybe I need to enable some verbose or debug logging to see deeper?

Or maybe try to execute the exact same SQL query, that uses Synapse to search events in federation transaction operation, to understand is this a Synapse or database side problem? Can you post that SQL query template?

MurzNN commented 3 years ago

And the main question: is this Could not find event 404 error can be the source of described federation problems with delays of hours, or it isn't crucial for the whole federation process?

As I understand, it must broke federation only with one "MatrixHQ" room (id !OGEhHVWSdvArJzumhm:matrix.org), and keep federation process of other rooms untouched, isn't it?

MurzNN commented 3 years ago

And the main question: is this Could not find event 404 error can be the source of described federation problems with delays of hours, or it isn't crucial for the whole federation process?

As I understand, it must broke federation only with one "MatrixHQ" room (id !OGEhHVWSdvArJzumhm:matrix.org), isn't it?

Seems yes, because if I create a new DM room in "tchncs.de" server and send messages to it, on sent time I see the same error in Synapse logs, example of last try:

2021-08-12 14:39:45,287 - synapse.access.http.12011 - 389 - INFO - PUT-1970495- ::ffff:127.0.0.1 - 12011 - {tchncs.de} Processed request: 0.356sec/-0.000sec (0.006sec, 0.000sec) (0.031sec/0.003sec/1) 101B 404 "PUT /_matrix/federation/v1/send/1628602889429 HTTP/1.0" "Synapse/1.40.0" [1 dbevts]
2021-08-12 14:39:45,286 - synapse.http.server - 88 - INFO - PUT-1970495- <SynapseRequest at 0x7f2cd7797d60 method='PUT' uri='/_matrix/federation/v1/send/1628602889429' clientproto='HTTP/1.0' site=12011> SynapseError: 404 - Could not find event $Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4
2021-08-12 14:39:45,082 - synapse.federation.transport.server - 445 - INFO - PUT-1970495- Received txn 1628602889429 from tchncs.de. (PDUs: 1, EDUs: 0)

And seems all servers, that have MatrixHQ room, have federation problems with ru-matrix.org, but which are not member of it - federates well wit us, I guess.

Seems this issue becomes more a separate bug, than a current support issue, so I fill separate bug report exactly about only this problem: https://github.com/matrix-org/synapse/issues/10589

I hope, resolving the problem with that event_id ($Dv3wmat7XbtBm4O37eT+Oa2kxHKUfCj36nCMiHyj5F4) must solve the whole federation problem :crossed_fingers:

And let's keep here the discussion of methods to analyze incoming federation problems.

MurzNN commented 3 years ago

10589 is resolved now with great help from @richvdh, but I guess better to keep this issue opened to document the ways of analyzing incoming federation problems. And maybe good idea is to integrate some workaround into next Synapse releases to automatically resolve problems like described above? Because, I guess, there are many other homeservers, which can catch bad event and gain federation problems.

MurzNN commented 3 years ago

I've filled https://github.com/matrix-org/synapse/issues/10597 to make federation errors more visible in logs.

erikjohnston commented 3 years ago

There are some prometheus graphs that help track delays: https://github.com/matrix-org/synapse/issues/10553#issuecomment-905369041