matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.79k stars 2.13k forks source link

Messages not being sent over federation #7099

Closed ErrorProne closed 4 years ago

ErrorProne commented 4 years ago

Description

I'll try to do this as exact as I'm able to in reverse.

After upgrading our matrix server running on ubuntu 18.04 to 1.11.1+bionic1 we did not receive any messages from servers running on older instances (only in 1:1 chats, groups have been working fine).

We are explicitly talking with a another company running a matrix instance and a freelancer and his own setup.

Both problems were solved after upgrading their versions to the matrix synapse 1.11.1 version. But since at least one party was running 1.11.0+bionic1 before, it looks like the bug has been introduced in the patch release.

All messages and presence updates going from 1.11.1+bionic1 to the other sides went through without problems. But nothing the other way around.

One major pain point was that we discovered this by getting suspicious that some people did not respond in time and thus tried other communication channels. So the users where totally unaware of this.

A last note: As mentioned bringing all instances to 1.11.1 (or the latest docker image) solved it, no message went missing! As soon as the other parties wrote a new message all the missing ones got redelivered.

Edit: As far as I can tell it did not matter on which side the 1:1 chatroom has been initially created.

Steps to reproduce

(I did not re-test this, but this should result in the problematic setup)

You should now only be able to send 1:1 messages from 1.11.1+bionic1 to 1.11.0+bionic1 but not the other way around (At least using any riot the user won't get any feedback about this. I'll write an issue for that project later).

I could not find any suspicious log entries on our 1.11.1 server. But I've requested logs from the other side, maybe we can find something there.

Version information

ErrorProne commented 4 years ago

Update: This continues, sometimes messaging in 1:1 rooms between two server works, sometimes it fails. It looks like old 1:1 rooms could be bricked. We just tried to create new direct chats and that seems to work for now

Addon: We have now checked logs on both server but could not find any error messages regarding the broken rooms

richvdh commented 4 years ago

It sounds like the upgrade from 1.11.0 to 1.11.1 is irrelevant. Possibly it just seemed related because things coincidentally broke at the same time.

I'm wondering if this is somehow related to #7065: there was a bug in synapse 1.10 which could cause some incorrect database updates. There is a fix for this (#7070) in the forthcoming 1.12.0 release.

ErrorProne commented 4 years ago

@richvdh thanks for the info. After all parties had updated their instances to 1.12.0 it looks like it works for everyone again =) The problems still occurs on some 1:1 chats (only via federation). However if a 1:1 chat as been created after the problem first hit it seems to be not influenced by the bug.

This all fells like a some form of data corruption bug for 1:1 federation chats which at least can be worked around be starting new 1:1 chats

richvdh commented 4 years ago

ok if you want to investigate further, please send a message in one of the affected rooms, and then share the logs from both the sending server and a server which didn't receive the message.

ErrorProne commented 4 years ago

By chance we got a broken room again, we'll try to filter the logs on both servers accordingly. There is a lot going on in there. I'll keep you posted

ErrorProne commented 4 years ago

Sorry for the delay! We have some logs now, also we've seen that:

Here are logs from the admins of server1 and server2. user1@server1 -> user2@server2: Works user2@server2 -> user1@server1: Does not work

All systems should be latest greatest (latest stable riot web and latest stable synapse version)

We've tried to get the correct point in time as good as we can and sanitize the logs

Server1 sending:


> 2020-04-14 07:51:23,694 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-7824- TX [server2] {1586268922514} Sending transaction [1586268922514], (PDUs: 0, EDUs: 1)
> 2020-04-14 07:51:23,695 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-7824- {PUT-O-10788} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268922514; timeout 60.000000s
> 2020-04-14 07:51:23,706 - synapse.access.http.8008 - 302 - INFO - GET-181487- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 1.129sec/0.003sec (0.007sec, 0.001sec) (0.000sec/0.000sec/0) 759B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s316693_1032142_10579_84647_137398_58_247_4197_57 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:23,707 - synapse.access.http.8008 - 302 - INFO - PUT-181490- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.015sec/0.002sec (0.004sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/typing/%40user1%3Aserver1 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:23,749 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-7824- {PUT-O-10788} [server2] Got response headers: 200 OK
> 2020-04-14 07:51:23,750 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-7824- {PUT-O-10788} [server2] Completed: 200 OK
> 2020-04-14 07:51:23,750 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-7824- TX [server2] {1586268922514} got 200 response
> 2020-04-14 07:51:24,811 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-7825- TX [server2] {1586268922515} Sending transaction [1586268922515], (PDUs: 0, EDUs: 1)
> 2020-04-14 07:51:24,813 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-7825- {PUT-O-10789} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268922515; timeout 60.000000s
> 2020-04-14 07:51:24,828 - synapse.access.http.8008 - 302 - INFO - GET-181491- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 1.056sec/0.007sec (0.008sec, 0.000sec) (0.000sec/0.000sec/0) 733B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s316693_1032142_10580_84647_137398_58_247_4197_57 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:24,829 - synapse.access.http.8008 - 302 - INFO - PUT-181493- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.015sec/0.007sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/typing/%40user1%3Aserver1 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:24,834 - synapse.metrics - 464 - INFO - - Collecting gc 1
> 2020-04-14 07:51:24,870 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-7825- {PUT-O-10789} [server2] Got response headers: 200 OK
> 2020-04-14 07:51:24,872 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-7825- {PUT-O-10789} [server2] Completed: 200 OK
> 2020-04-14 07:51:24,873 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-7825- TX [server2] {1586268922515} got 200 response
> 2020-04-14 07:51:24,903 - synapse.access.http.8008 - 302 - INFO - PUT-181494- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.064sec/0.001sec (0.009sec, 0.002sec) (0.003sec/0.008sec/4) 60B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/send/m.room.message/m1586850684575.73 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:24,934 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-7826- TX [server2] {1586268922516} Sending transaction [1586268922516], (PDUs: 1, EDUs: 0)
> 2020-04-14 07:51:24,937 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-7826- {PUT-O-10790} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268922516; timeout 60.000000s
> 2020-04-14 07:51:25,045 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-7826- {PUT-O-10790} [server2] Got response headers: 200 OK
> 2020-04-14 07:51:25,047 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-7826- {PUT-O-10790} [server2] Completed: 200 OK
> 2020-04-14 07:51:25,047 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-7826- TX [server2] {1586268922516} got 200 response
> 2020-04-14 07:51:25,139 - synapse.access.http.8008 - 302 - INFO - GET-181496- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.016sec/0.001sec (0.010sec, 0.000sec) (0.002sec/0.002sec/1) 1337B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s316693_1032142_10581_84647_137398_58_247_4197_57 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:25,269 - synapse.access.http.8008 - 302 - INFO - POST-181497- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.018sec/0.001sec (0.002sec, 0.001sec) (0.001sec/0.004sec/2) 6008B 200 "POST /_matrix/client/r0/keys/query HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]

Server 2 Receiving:

> 2020-04-14 07:51:23,719 - synapse.federation.transport.server - 406 - INFO - PUT-77621 - Received txn 1586268922514 from server1. (PDUs: 0, EDUs: 1)
> 2020-04-14 07:51:23,726 - synapse.state - 497 - INFO - PUT-77621 - Resolving state for !nmOxkCqCxiHFcJqVrX:server1 with 2 groups
> 2020-04-14 07:51:23,727 - synapse.state - 520 - INFO - PUT-77621 - Resolving conflicted state for '!nmOxkCqCxiHFcJqVrX:server1'
> 2020-04-14 07:51:23,735 - synapse.handlers.typing - 237 - INFO - PUT-77621 - Got typing update from @user1:server1: {'room_id': '!nmOxkCqCxiHFcJqVrX:server1', 'typing': True, 'user_id': '@user1:server1'}
> 2020-04-14 07:51:23,740 - synapse.access.http.8008 - 302 - INFO - PUT-77621 - 194.95.62.117 - 8008 - {server1} Processed request: 0.025sec/0.000sec (0.017sec, 0.000sec) (0.002sec/0.007sec/3) 12B 200 "PUT /_matrix/federation/v1/send/1586268922514 HTTP/1.0" "Synapse/1.12.3" [0 dbevts]
> 2020-04-14 07:51:24,838 - synapse.federation.transport.server - 406 - INFO - PUT-77622 - Received txn 1586268922515 from server1. (PDUs: 0, EDUs: 1)
> 2020-04-14 07:51:24,846 - synapse.handlers.typing - 237 - INFO - PUT-77622 - Got typing update from @user1:server1: {'room_id': '!nmOxkCqCxiHFcJqVrX:server1', 'typing': False, 'user_id': '@user1:server1'}
> 2020-04-14 07:51:24,859 - synapse.access.http.8008 - 302 - INFO - PUT-77622 - 194.95.62.117 - 8008 - {server1} Processed request: 0.023sec/0.001sec (0.013sec, 0.000sec) (0.002sec/0.005sec/2) 12B 200 "PUT /_matrix/federation/v1/send/1586268922515 HTTP/1.0" "Synapse/1.12.3" [0 dbevts]
> 2020-04-14 07:51:24,948 - synapse.federation.transport.server - 406 - INFO - PUT-77623 - Received txn 1586268922516 from server1. (PDUs: 1, EDUs: 0)
> 2020-04-14 07:51:24,957 - synapse.handlers.federation - 194 - INFO - PUT-77623-$RA-DUABeG2Ob6pmFoAs-2Uf-XAxjSnnnpOqBAMHhbuY - handling received PDU: <FrozenEventV3 event_id='$RA-DUABeG2Ob6pmFoAs-2Uf-XAxjSnnnpOqBAMHhbuY', type='m.room.message', state_key='None'>
> 2020-04-14 07:51:25,034 - synapse.access.http.8008 - 302 - INFO - PUT-77623 - 194.95.62.117 - 8008 - {server1} Processed request: 0.087sec/0.002sec (0.017sec, 0.000sec) (0.007sec/0.019sec/7) 62B 200 "PUT /_matrix/federation/v1/send/1586268922516 HTTP/1.0" "Synapse/1.12.3" [1 dbevts]
> 2020-04-14 07:51:25,037 - synapse.push.httppusher - 195 - INFO - httppush.process-181 - Processing 1 unprocessed push actions for @user2:server2/im.vector.app.android/xxxx starting at stream_ordering 7820

Server 2 sending:

> 2020-04-14 07:53:42,024 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-42873 - TX [server1] {1585132221531} Sending transaction [1585132221531], (PDUs: 0, EDUs: 1)
> 2020-04-14 07:53:42,026 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-42873 - {PUT-O-25694} [server1] Sending request: PUT matrix://server1/_matrix/federation/v1/send/1585132221531; timeout 60.000000s
> 2020-04-14 07:53:42,029 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-42873 - Connecting to chat3.server1:443
> 2020-04-14 07:53:42,041 - synapse.access.http.8008 - 302 - INFO - GET-77751 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 2.598sec/0.005sec (0.012sec, 0.000sec) (0.000sec/0.000sec/0) 738B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9681_261342_1945_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:42,042 - synapse.access.http.8008 - 302 - INFO - PUT-77753 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.027sec/0.002sec (0.009sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/typing/@user2:server2 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:42,081 - synapse.access.http.8008 - 302 - INFO - OPTIONS-77754 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.003sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9681_261342_1946_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:42,159 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-42873 - {PUT-O-25694} [server1] Got response headers: 200 OK
> 2020-04-14 07:53:42,161 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-42873 - {PUT-O-25694} [server1] Completed: 200 OK
> 2020-04-14 07:53:42,161 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-42873 - TX [server1] {1585132221531} got 200 response
> 2020-04-14 07:53:43,791 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-42874 - TX [server1] {1585132221532} Sending transaction [1585132221532], (PDUs: 0, EDUs: 1)
> 2020-04-14 07:53:43,793 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-42874 - {PUT-O-25695} [server1] Sending request: PUT matrix://server1/_matrix/federation/v1/send/1585132221532; timeout 60.000000s
> 2020-04-14 07:53:43,806 - synapse.access.http.8008 - 302 - INFO - GET-77755 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 1.689sec/0.004sec (0.008sec, 0.000sec) (0.000sec/0.000sec/0) 713B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9681_261342_1946_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,807 - synapse.access.http.8008 - 302 - INFO - PUT-77756 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.021sec/0.002sec (0.006sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/typing/@user2:server2 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,809 - synapse.metrics - 464 - INFO -  - Collecting gc 1
> 2020-04-14 07:53:43,814 - synapse.access.http.8008 - 302 - INFO - OPTIONS-77757 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/send/m.room.message/m1586850823763.0 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,839 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-42874 - {PUT-O-25695} [server1] Got response headers: 200 OK
> 2020-04-14 07:53:43,842 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-42874 - {PUT-O-25695} [server1] Completed: 200 OK
> 2020-04-14 07:53:43,843 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-42874 - TX [server1] {1585132221532} got 200 response
> 2020-04-14 07:53:43,857 - synapse.access.http.8008 - 302 - INFO - OPTIONS-77759 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.003sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9681_261342_1947_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,892 - synapse.access.http.8008 - 302 - INFO - PUT-77758 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.041sec/0.006sec (0.014sec, 0.001sec) (0.002sec/0.004sec/3) 60B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/send/m.room.message/m1586850823763.0 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,917 - synapse.access.http.8008 - 302 - INFO - GET-77760 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.017sec/0.001sec (0.006sec, 0.000sec) (0.003sec/0.004sec/2) 942B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9681_261342_1947_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,988 - synapse.access.http.8008 - 302 - INFO - OPTIONS-77762 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.003sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9682_261342_1947_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,994 - synapse.access.http.8008 - 302 - INFO - POST-77761 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.017sec/0.000sec (0.002sec, 0.004sec) (0.002sec/0.004sec/3) 2B 200 "POST /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/read_markers HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:44,039 - synapse.access.http.8008 - 302 - INFO - GET-77763 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.018sec/0.001sec (0.009sec, 0.001sec) (0.004sec/0.002sec/2) 2007B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9682_261342_1947_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:44,077 - synapse.access.http.8008 - 302 - INFO - OPTIONS-77764 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9682_261342_1947_10059_200_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]

Server 1 receiving:

> 2020-04-14 07:53:42,143 - synapse.federation.transport.server - 406 - INFO - PUT-181595- Received txn 1585132221531 from server2. (PDUs: 0, EDUs: 1)
> 2020-04-14 07:53:42,156 - synapse.access.http.8008 - 302 - INFO - PUT-181595- 10.0.11.10 - 8008 - {server2} Processed request: 0.024sec/0.001sec (0.002sec, 0.003sec) (0.002sec/0.005sec/2) 12B 200 "PUT /_matrix/federation/v1/send/1585132221531 HTTP/1.0" "Synapse/1.12.0" [0 dbevts]
> 2020-04-14 07:53:42,221 - synapse.rest.media.v1.preview_url_resource - 404 - INFO - expire_url_cache_data-58190- Running url preview cache expiry
> 2020-04-14 07:53:42,228 - synapse.rest.media.v1.preview_url_resource - 483 - INFO - expire_url_cache_data-58190- Deleted 0 media from url cache
> 2020-04-14 07:53:43,822 - synapse.federation.transport.server - 406 - INFO - PUT-181596- Received txn 1585132221532 from server2. (PDUs: 0, EDUs: 1)
> 2020-04-14 07:53:43,837 - synapse.access.http.8008 - 302 - INFO - PUT-181596- 10.0.11.10 - 8008 - {server2} Processed request: 0.017sec/0.001sec (0.008sec, 0.004sec) (0.002sec/0.008sec/2) 12B 200 "PUT /_matrix/federation/v1/send/1585132221532 HTTP/1.0" "Synapse/1.12.0" [0 dbevts]
richvdh commented 4 years ago

looks like server2 isn't sending the events to server1 for some reason.

can you enable DEBUG logging for synapse.federation.sender and then sharing the logs from the sending server.

it's also worth grepping the logs for ERROR and CRITICAL lines to see if anything looks odd.

richvdh commented 4 years ago
... "PUT /_matrix/federation/v1/send/1585132221531 HTTP/1.0" 

this is odd. have you got a reverse-proxy which is converting HTTP/1.1 requests to HTTP/1.0?

ErrorProne commented 4 years ago
... "PUT /_matrix/federation/v1/send/1585132221531 HTTP/1.0" 

this is odd. have you got a reverse-proxy which is converting HTTP/1.1 requests to HTTP/1.0?

That is nginx default for proxy connections (https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_http_version) So yeah, there are actually two nginx proxies in front. One in the DMZ and one on the VM itself to separate the slack-bridge from the matrix requests.

The admin of server2 changed the configuration as per your recommendation. As we found out before messages started flowing again after the restart, but we'll try until it breaks again.

ErrorProne commented 4 years ago

Today it finally broke again.

Here are the logs with debug enabled for synapse.federation.sender on server2.

First I (user1) have sent a direct message to server 2, and than user2 on server 2 instantly tried to respond, which did not come through (also no read or typing notifications). If we would restart server2 now, it would work again and also old messages would be re-sent with the next new new message user 2 writes in the direct channel.

Server1 log (Can send, but does not receive)

> 2020-04-29 07:32:58,666 - synapse.access.http.8008 - 302 - INFO - GET-870110- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.905sec/0.003sec (0.008sec, 0.001sec) (0.000sec/0.000sec/0) 759B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340843_1120144_41282_99343_162034_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:58,668 - synapse.access.http.8008 - 302 - INFO - PUT-870111- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.013sec/0.002sec (0.005sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/typing/%40user1%3Aserver1 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:58,671 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-30313- TX [server2] {1586268953496} Sending transaction [1586268953496], (PDUs: 0, EDUs: 1)
> 2020-04-29 07:32:58,672 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-30313- {PUT-O-42488} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268953496; timeout 60.000000s
> 2020-04-29 07:32:58,691 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-30313- Connecting to server2:8448
> 2020-04-29 07:32:58,724 - synapse.metrics - 464 - INFO - - Collecting gc 1
> 2020-04-29 07:32:58,794 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-30313- {PUT-O-42488} [server2] Got response headers: 200 OK
> 2020-04-29 07:32:58,796 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-30313- {PUT-O-42488} [server2] Completed: 200 OK
> 2020-04-29 07:32:58,796 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-30313- TX [server2] {1586268953496} got 200 response
> 2020-04-29 07:32:59,962 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-30314- TX [server2] {1586268953497} Sending transaction [1586268953497], (PDUs: 0, EDUs: 1)
> 2020-04-29 07:32:59,963 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-30314- {PUT-O-42489} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268953497; timeout 60.000000s
> 2020-04-29 07:32:59,974 - synapse.access.http.8008 - 302 - INFO - GET-870112- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 1.256sec/0.003sec (0.008sec, 0.000sec) (0.000sec/0.000sec/0) 733B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340843_1120144_41283_99343_162034_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:59,975 - synapse.access.http.8008 - 302 - INFO - PUT-870113- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.015sec/0.002sec (0.004sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/typing/%40user1%3Aserver1 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,017 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-30314- {PUT-O-42489} [server2] Got response headers: 200 OK
> 2020-04-29 07:33:00,019 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-30314- {PUT-O-42489} [server2] Completed: 200 OK
> 2020-04-29 07:33:00,019 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-30314- TX [server2] {1586268953497} got 200 response
> 2020-04-29 07:33:00,092 - synapse.access.http.8008 - 302 - INFO - PUT-870115- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.066sec/0.001sec (0.001sec, 0.003sec) (0.002sec/0.006sec/3) 60B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/send/m.room.message/m1588145579925.49 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,109 - synapse.access.http.8008 - 302 - INFO - GET-870114- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.093sec/0.004sec (0.008sec, 0.001sec) (0.003sec/0.008sec/1) 1336B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340843_1120144_41284_99343_162034_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,123 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-30315- TX [server2] {1586268953498} Sending transaction [1586268953498], (PDUs: 1, EDUs: 0)
> 2020-04-29 07:33:00,125 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-30315- {PUT-O-42490} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268953498; timeout 60.000000s
> 2020-04-29 07:33:00,228 - synapse.access.http.8008 - 302 - INFO - POST-870116- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.014sec/0.001sec (0.000sec, 0.001sec) (0.001sec/0.009sec/2) 6008B 200 "POST /_matrix/client/r0/keys/query HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,247 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-30315- {PUT-O-42490} [server2] Got response headers: 200 OK
> 2020-04-29 07:33:00,248 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-30315- {PUT-O-42490} [server2] Completed: 200 OK
> 2020-04-29 07:33:00,249 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-30315- TX [server2] {1586268953498} got 200 response
> 2020-04-29 07:33:00,480 - synapse.access.http.8008 - 302 - INFO - POST-870118- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.018sec/0.004sec (0.008sec, 0.001sec) (0.002sec/0.008sec/3) 2B 200 "POST /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/read_markers HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,493 - synapse.access.http.8008 - 302 - INFO - GET-870117- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.273sec/0.001sec (0.010sec, 0.001sec) (0.003sec/0.004sec/2) 781B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340844_1120144_41284_99343_162034_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,950 - synapse.federation.transport.server - 406 - INFO - PUT-870120- Received txn 1587630318384 from matrix.gambio-server.net. (PDUs: 0, EDUs: 1)
> 2020-04-29 07:33:01,979 - synapse.access.http.8008 - 302 - INFO - GET-870119- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 1.433sec/0.015sec (0.004sec, 0.000sec) (0.000sec/0.000sec/0) 498B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340844_1120144_41284_99343_162035_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,987 - synapse.access.http.8008 - 302 - INFO - PUT-870120- 10.0.11.10 - 8008 - {matrix.gambio-server.net} Processed request: 0.042sec/0.000sec (0.009sec, 0.002sec) (0.004sec/0.011sec/4) 12B 200 "PUT /_matrix/federation/v1/send/1587630318384 HTTP/1.1" "Synapse/1.12.3" [0 dbevts]
> 2020-04-29 07:33:02,210 - synapse.federation.transport.server - 406 - INFO - PUT-870125- Received txn 1587630318385 from matrix.gambio-server.net. (PDUs: 0, EDUs: 1)
> 2020-04-29 07:33:02,238 - synapse.rest.media.v1.preview_url_resource - 404 - INFO - expire_url_cache_data-187666- Running url preview cache expiry
> 2020-04-29 07:33:02,249 - synapse.access.http.8008 - 302 - INFO - GET-870122- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.188sec/0.024sec (0.004sec, 0.000sec) (0.000sec/0.000sec/0) 517B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340844_1120145_41284_99343_162035_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:02,254 - synapse.access.http.8008 - 302 - INFO - PUT-870125- 10.0.11.10 - 8008 - {matrix.gambio-server.net} Processed request: 0.046sec/0.000sec (0.006sec, 0.001sec) (0.003sec/0.014sec/3) 12B 200 "PUT /_matrix/federation/v1/send/1587630318385 HTTP/1.1" "Synapse/1.12.3" [0 dbevts]
> 2020-04-29 07:33:02,260 - synapse.rest.media.v1.preview_url_resource - 483 - INFO - expire_url_cache_data-187666- Deleted 0 media from url cache
> 2020-04-29 07:33:03,620 - synapse.access.http.8008 - 302 - INFO - GET-870127- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 1.317sec/0.005sec (0.009sec, 0.000sec) (0.000sec/0.000sec/0) 828B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340844_1120146_41284_99343_162035_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]

Server2 log (Did receive all, but no messages are flowing back)

> 2020-04-29 07:32:48,958 - synapse.access.http.8008 - 302 - INFO - GET-39300 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 10.123sec/0.003sec (0.012sec, 0.000sec) (0.000sec/0.000sec/0) 711B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_721_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:48,959 - synapse.access.http.8008 - 302 - INFO - PUT-39302 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.018sec/0.002sec (0.011sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/typing/@user2:server2 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:49,003 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39303 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_722_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:58,774 - synapse.federation.transport.server - 406 - INFO - PUT-39305 - Received txn 1586268953496 from server1. (PDUs: 0, EDUs: 1)
> 2020-04-29 07:32:58,778 - synapse.handlers.typing - 237 - INFO - PUT-39305 - Got typing update from @user1:server1: {'room_id': '!nmOxkCqCxiHFcJqVrX:server1', 'typing': True, 'user_id': '@user1:server1'}
> 2020-04-29 07:32:58,785 - synapse.access.http.8008 - 302 - INFO - GET-39304 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 9.746sec/0.003sec (0.008sec, 0.000sec) (0.000sec/0.000sec/0) 737B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_722_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:58,787 - synapse.access.http.8008 - 302 - INFO - PUT-39305 - 194.95.62.117 - 8008 - {server1} Processed request: 0.020sec/0.000sec (0.010sec, 0.000sec) (0.004sec/0.004sec/3) 12B 200 "PUT /_matrix/federation/v1/send/1586268953496 HTTP/1.0" "Synapse/1.12.3" [0 dbevts]
> 2020-04-29 07:32:58,829 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39306 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_723_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:59,984 - synapse.federation.transport.server - 406 - INFO - PUT-39308 - Received txn 1586268953497 from server1. (PDUs: 0, EDUs: 1)
> 2020-04-29 07:32:59,993 - synapse.handlers.typing - 237 - INFO - PUT-39308 - Got typing update from @user1:server1: {'room_id': '!nmOxkCqCxiHFcJqVrX:server1', 'typing': False, 'user_id': '@user1:server1'}
> 2020-04-29 07:33:00,005 - synapse.access.http.8008 - 302 - INFO - GET-39307 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 1.144sec/0.006sec (0.007sec, 0.000sec) (0.000sec/0.000sec/0) 711B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_723_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,008 - synapse.access.http.8008 - 302 - INFO - PUT-39308 - 194.95.62.117 - 8008 - {server1} Processed request: 0.027sec/0.000sec (0.013sec, 0.000sec) (0.002sec/0.006sec/2) 12B 200 "PUT /_matrix/federation/v1/send/1586268953497 HTTP/1.0" "Synapse/1.12.3" [0 dbevts]
> 2020-04-29 07:33:00,051 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39309 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,137 - synapse.federation.transport.server - 406 - INFO - PUT-39311 - Received txn 1586268953498 from server1. (PDUs: 1, EDUs: 0)
> 2020-04-29 07:33:00,147 - synapse.handlers.federation - 194 - INFO - PUT-39311-$ZfnYHy5QusO2J6fFhyVqJhLmBpMpURpJ5URTETEMqw0 - handling received PDU: <FrozenEventV3 event_id='$ZfnYHy5QusO2J6fFhyVqJhLmBpMpURpJ5URTETEMqw0', type='m.room.message', state_key='None'>
> 2020-04-29 07:33:00,232 - synapse.federation.sender - 164 - DEBUG - process_event_queue_for_federation-315 - Handling 10162 -> 10163
> 2020-04-29 07:33:00,240 - synapse.access.http.8008 - 302 - INFO - PUT-39311 - 194.95.62.117 - 8008 - {server1} Processed request: 0.097sec/0.008sec (0.019sec, 0.001sec) (0.006sec/0.013sec/6) 62B 200 "PUT /_matrix/federation/v1/send/1586268953498 HTTP/1.0" "Synapse/1.12.3" [1 dbevts]
> 2020-04-29 07:33:00,244 - synapse.push.httppusher - 195 - INFO - httppush.process-46 - Processing 1 unprocessed push actions for @user2:server2/im.vector.app.android/cu04JX1a3so:APA91bGko9o762To6cI6Sg31_S8kmo2qXbE54uIG3Cq4GXIMXfzx8TIeSCpHka6T6_8fIcgM6qb_c07pe9EpWrPqwtErBEBe1ok9z3QEEiJ4uDmZX2q9uo4FA6lRzOhB9rbg15t3OIpj starting at stream_ordering 10160
> 2020-04-29 07:33:00,246 - synapse.http.client - 281 - INFO - httppush.process-46 - Sending request POST https://matrix.org/_matrix/push/v1/notify
> 2020-04-29 07:33:00,251 - synapse.access.http.8008 - 302 - INFO - GET-39310 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.161sec/0.007sec (0.011sec, 0.000sec) (0.006sec/0.013sec/2) 897B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,255 - synapse.federation.sender - 164 - DEBUG - process_event_queue_for_federation-315 - Handling 10163 -> 10163
> 2020-04-29 07:33:00,324 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39312 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10163_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,397 - synapse.http.client - 318 - INFO - httppush.process-46 - Received response to POST https://matrix.org/_matrix/push/v1/notify: 200
> 2020-04-29 07:33:00,767 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39314 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/send/m.room.message/m1588145580713.2 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,816 - synapse.metrics - 464 - INFO -  - Collecting gc 1
> 2020-04-29 07:33:00,868 - synapse.access.http.8008 - 302 - INFO - PUT-39315 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.065sec/0.005sec (0.013sec, 0.000sec) (0.002sec/0.005sec/3) 60B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/send/m.room.message/m1588145580713.2 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,884 - synapse.access.http.8008 - 302 - INFO - GET-39313 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.524sec/0.006sec (0.011sec, 0.001sec) (0.003sec/0.004sec/1) 1378B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10163_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,887 - synapse.federation.sender - 164 - DEBUG - process_event_queue_for_federation-316 - Handling 10163 -> 10164
> 2020-04-29 07:33:00,889 - synapse.federation.sender - 204 - DEBUG - process_event_queue_for_federation-316 - Sending <FrozenEventV3 event_id='$FlSDMyWVO-iVXSLfi4aAok4AUHi58zpFv_xiWczhFvc', type='m.room.message', state_key='None'> to {'server2'}
> 2020-04-29 07:33:00,889 - synapse.federation.sender - 265 - DEBUG - process_event_queue_for_federation-316 - Sending to: set()
> 2020-04-29 07:33:00,899 - synapse.access.http.8008 - 302 - INFO - GET-39316 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.030sec/0.000sec (0.010sec, 0.001sec) (0.003sec/0.005sec/4) 7458B 200 "GET /_matrix/client/r0/sync?filter=0&set_presence=offline&timeout=0&since=s10160_290014_716_11423_73_5_3_7771_10 HTTP/1.1" "Riot.im/0.9.10 (Linux; U; Android 8.0.0; XT1635-02 Build/OPNS27.76-12-22-; Flavour GooglePlay; MatrixAndroidSDK 0.9.33)" [0 dbevts]
> 2020-04-29 07:33:00,904 - synapse.federation.sender - 164 - DEBUG - process_event_queue_for_federation-316 - Handling 10164 -> 10164
> 2020-04-29 07:33:00,952 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39317 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.002sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/keys/query HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,954 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39318 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.003sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10164_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,996 - synapse.access.http.8008 - 302 - INFO - POST-39319 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.015sec/0.001sec (0.005sec, 0.000sec) (0.002sec/0.003sec/2) 3344B 200 "POST /_matrix/client/r0/keys/query HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,251 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39321 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/read_markers HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,336 - synapse.http.client - 281 - INFO - http_pusher.on_new_receipts-59 - Sending request POST https://matrix.org/_matrix/push/v1/notify
> 2020-04-29 07:33:01,338 - synapse.access.http.8008 - 302 - INFO - GET-39320 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.344sec/0.003sec (0.006sec, 0.000sec) (0.004sec/0.009sec/3) 808B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10164_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,341 - synapse.access.http.8008 - 302 - INFO - POST-39322 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.050sec/0.000sec (0.011sec, 0.001sec) (0.006sec/0.022sec/6) 2B 200 "POST /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/read_markers HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,393 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39323 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10164_290056_724_11424_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,453 - synapse.access.http.8008 - 302 - INFO - GET-39324 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.019sec/0.000sec (0.014sec, 0.001sec) (0.001sec/0.002sec/2) 2565B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10164_290056_724_11424_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,499 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39325 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.001sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10164_290056_724_11424_80_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:02,005 - synapse.http.client - 318 - INFO - http_pusher.on_new_receipts-59 - Received response to POST https://matrix.org/_matrix/push/v1/notify: 200
> 2020-04-29 07:33:02,098 - synapse.access.http.8008 - 302 - INFO - GET-39327 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.013sec/0.000sec (0.007sec, 0.000sec) (0.001sec/0.002sec/2) 2711B 200 "GET /_matrix/client/r0/sync?filter=0&set_presence=offline&timeout=0&since=s10164_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Riot.im/0.9.10 (Linux; U; Android 8.0.0; XT1635-02 Build/OPNS27.76-12-22-; Flavour GooglePlay; MatrixAndroidSDK 0.9.33)" [0 dbevts]
> 2020-04-29 07:33:02,415 - synapse.access.http.8008 - 302 - INFO - POST-39329 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.005sec/0.000sec (0.004sec, 0.000sec) (0.001sec/0.002sec/2) 3344B 200 "POST /_matrix/client/unstable/keys/query HTTP/1.1" "Riot.im/0.9.10 (Linux; U; Android 8.0.0; XT1635-02 Build/OPNS27.76-12-22-; Flavour GooglePlay; MatrixAndroidSDK 0.9.33)" [0 dbevts]
> 2020-04-29 07:33:15,795 - synapse.handlers.presence - 256 - INFO - persist_presence_changes-12502 - Persisting 1 unpersisted presence updates
richvdh commented 4 years ago
> 2020-04-29 07:33:00,887 - synapse.federation.sender - 164 - DEBUG - process_event_queue_for_federation-316 - Handling 10163 -> 10164
> 2020-04-29 07:33:00,889 - synapse.federation.sender - 204 - DEBUG - process_event_queue_for_federation-316 - Sending <FrozenEventV3 event_id='$FlSDMyWVO-iVXSLfi4aAok4AUHi58zpFv_xiWczhFvc', type='m.room.message', state_key='None'> to {'server2'}
> 2020-04-29 07:33:00,889 - synapse.federation.sender - 265 - DEBUG - process_event_queue_for_federation-316 - Sending to: set()

For some reason, server2 has decided that server1 is no longer in the room :/.

Figuring out why is going to take some deep digging. Is the problem limited to a small number of rooms? Perhaps just creating new rooms is a viable workaround.

Otherwise, set up a manhole and prepare for the long haul...

ErrorProne commented 4 years ago

Figuring out why is going to take some deep digging. Is the problem limited to a small number of rooms? Perhaps just creating new rooms is a viable workaround.

The pattern we've seen so far is that only direct chats are impacted by this.

Workarounds right now are restarting server2 (or server3, there is another one behaving exactly the same but with >50 people on it, so the logs are quite exhausting) or re-creating the room.

Given that there is another server acting the same way, could the problem be with server1? I'm just not sure why restarting server2/3 than fixes the problem for the restarted server.

If we can provide any insights with more debug output etc. we are happy to help =), sadly it always takes a while until a room brakes again.

Side note which comes into my mind right now: Some direct chats with server2 and server3 did break in the same timeframe. However, most of the direct chats remained working.

richvdh commented 4 years ago

do you have a manhole configured on each of the affected servers? if not, set one up, so that we can do some digging next time it happens. Docs are in the docs directory.

ErrorProne commented 4 years ago

We'll set one up at server1 and server2. Could you provide us with the commands and information we should get through the manhole as soon as a room breaks again?

richvdh commented 4 years ago

suggest once it happens you dm me at @richvdh:sw1v.org and I'll work through it with you

ErrorProne commented 4 years ago

Sounds like a plan. We'll contact you. Thank you for all the effort =)

richvdh commented 4 years ago

After investigation: it appears that this is due to custom status events corrupting the get_joined_users cache in synapse. The original bug was introduced by #2229, but has gone unnoticed until custom statuses were used.

richvdh commented 4 years ago

(many thanks to @ErrorProne and team for their patience in tracking this one down)

richvdh commented 4 years ago

I'm labelling this as a release-blocker because it's becoming more of an issue now that custom statuses are a thing, and the fix is trivial (#7376)

ErrorProne commented 4 years ago

I'm labelling this as a release-blocker because it's becoming more of an issue now that custom statuses are a thing, and the fix is trivial (#7376)

Great to hear, thank you again =)

TitanSpirit commented 3 years ago

I had exactly the same problem, however I fixed it via changing my config on my apache reverse proxy. This helped me solving direct message problem. https://github.com/matrix-org/synapse/blob/develop/docs/reverse_proxy.md Add first line with AllowEncodedSlashes, add "nocanon" on the end of ProxyPass and add SecRuleEngine off. It fixed problem for me in my case.


    <IfModule security2_module>
           SecRuleEngine off
    </IfModule>

    AllowEncodedSlashes NoDecode
    ProxyPass /_matrix http://127.0.0.1:8008/_matrix nocanon
    ProxyPassReverse /_matrix http://127.0.0.1:8008/_matrix