matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.82k stars 2.13k forks source link

`/sync` randomly tightloops #15824

Open turt2live opened 1 year ago

turt2live commented 1 year ago

Description

In my test environment for matrix-media-repo I've got a localhost homeserver running on sqlite in docker which suddenly returns immediately to /sync requests. This causes Element to dutifully send the next request, to which Synapse responds immediately, and repeat.

Steps to reproduce

Unclear.

Homeserver

localhost

Synapse Version

v1.85.2

Installation Method

Docker (matrixdotorg/synapse)

Database

sqlite3

Workers

Single process

Platform

Running in Docker.

Configuration

# Configuration file for Synapse.
#
# This is a YAML file: see [1] for a quick introduction. Note in particular
# that *indentation is important*: all the elements of a list or dictionary
# should have the same indentation.
#
# [1] https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html
#
# For more information on how to configure Synapse, including a complete accounting of
# each option, go to docs/usage/configuration/config_documentation.md or
# https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html
server_name: "localhost"
pid_file: /data/homeserver.pid
listeners:
  - port: 8008
    tls: false
    type: http
    x_forwarded: true
    resources:
      - names: [client, federation]
        compress: false
database:
  name: sqlite3
  args:
    database: /data/homeserver.db
log_config: "/data/localhost.log.config"
media_store_path: /data/media_store
registration_shared_secret: "redacted, but it's public anyways"
report_stats: false
macaroon_secret_key: "redacted, but it's public anyways"
form_secret: "redacted, but it's public anyways"
signing_key_path: "/data/localhost.signing.key"

# Project-specific settings (not part of Synapse defaults)
enable_media_repo: false
enable_registration: true
enable_registration_without_verification: true
trusted_key_servers: []

# vim:ft=yaml

Relevant log output

media_repo_nginx    | 172.19.0.1 - - [24/Jun/2023:03:24:45 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30870_0_15_114_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse  | 2023-06-24 03:24:45,058 - synapse.access.http.8008 - 461 - INFO - GET-13264 - ::ffff:172.19.0.2 - 8008 - {@travis:localhost} Processe
d request: 0.005sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30870_0_15_114_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]      
media_repo_nginx    | 172.19.0.1 - - [24/Jun/2023:03:24:45 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30870_0_15_114_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse  | 2023-06-24 03:24:45,093 - synapse.access.http.8008 - 461 - INFO - GET-13265 - ::ffff:172.19.0.2 - 8008 - {@travis:localhost} Processe
d request: 0.005sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30870_0_15_114_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]

Anything else that would be useful to know?

Restarting the Docker Compose stack can sometimes help, but it eventually tightloops again.

turt2live commented 1 year ago

the logs at the beginning of the loop don't reveal much:

media_repo_synapse  | 2023-06-24 03:27:15,741 - synapse.access.http.8008 - 461 - INFO - GET-25 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed r
equest: 30.007sec/0.000sec (0.000sec, 0.001sec) (0.001sec/0.003sec/4) 438B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30871_0_14_109_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]        
media_repo_nginx    | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30871_0_14_109_1_4_30_0_1 HTTP/1.1" 200 450 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse  | 2023-06-24 03:27:15,781 - synapse.access.http.8008 - 461 - INFO - POST-29 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed 
request: 0.006sec/0.000sec (0.000sec, 0.001sec) (0.002sec/0.003sec/3) 2687B 200 "POST /_matrix/client/r0/keys/query HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx    | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "POST /_matrix/client/r0/keys/query HTTP/1.1" 200 2699 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse  | 2023-06-24 03:27:15,784 - synapse.access.http.8008 - 461 - INFO - GET-30 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed r
equest: 0.008sec/0.000sec (0.000sec, 0.002sec) (0.002sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx    | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse  | 2023-06-24 03:27:15,829 - synapse.access.http.8008 - 461 - INFO - GET-31 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed r
equest: 0.005sec/0.000sec (0.000sec, 0.001sec) (0.001sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx    | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse  | 2023-06-24 03:27:15,864 - synapse.access.http.8008 - 461 - INFO - GET-32 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed r
equest: 0.005sec/0.000sec (0.000sec, 0.001sec) (0.001sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx    | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
media_repo_synapse  | 2023-06-24 03:27:15,899 - synapse.access.http.8008 - 461 - INFO - GET-33 - ::ffff:172.19.0.4 - 8008 - {@travis:localhost} Processed r
equest: 0.005sec/0.000sec (0.000sec, 0.001sec) (0.001sec/0.003sec/4) 283B 200 "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.0" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" [0 dbevts]
media_repo_nginx    | 172.19.0.1 - - [24/Jun/2023:03:27:15 +0000] "GET /_matrix/client/r0/sync?filter=2&timeout=30000&since=s122_30872_0_14_109_1_4_30_0_1 HTTP/1.1" 200 295 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36" "-"
ThomasHalwax commented 1 year ago

I think that's related to #15082.

thebalaa commented 1 year ago

Also seeing this in a local development environment with Synapse 1.85.2, doesn't appear to be related to #15082 because payloads are not empty.

Element web's calls to /sync are returning immediately with the following responses:

3 examples, notice they are all exactly the same:

http://localhost:8008/_matrix/client/r0/sync?filter=0&timeout=30000&since=s1_447_0_1_1_1_1_36_0_1

{
    "next_batch": "s1_450_0_1_1_1_1_36_0_1",
    "account_data": {
        "events": [
            {
                "type": "org.matrix.msc3890.local_notification_settings.ZSBKLRYBUA",
                "content": {
                    "is_silenced": false
                }
            }
        ]
    },
    "device_one_time_keys_count": {
        "signed_curve25519": 50
    },
    "org.matrix.msc2732.device_unused_fallback_key_types": [
        "signed_curve25519"
    ],
    "device_unused_fallback_key_types": [
        "signed_curve25519"
    ]
}

http://localhost:8008/_matrix/client/r0/sync?filter=0&timeout=30000&since=s1_447_0_1_1_1_1_36_0_1

{
    "next_batch": "s1_450_0_1_1_1_1_36_0_1",
    "account_data": {
        "events": [
            {
                "type": "org.matrix.msc3890.local_notification_settings.ZSBKLRYBUA",
                "content": {
                    "is_silenced": false
                }
            }
        ]
    },
    "device_one_time_keys_count": {
        "signed_curve25519": 50
    },
    "org.matrix.msc2732.device_unused_fallback_key_types": [
        "signed_curve25519"
    ],
    "device_unused_fallback_key_types": [
        "signed_curve25519"
    ]
}

http://localhost:8008/_matrix/client/r0/sync?filter=0&timeout=30000&since=s1_447_0_1_1_1_1_36_0_1

{
    "next_batch": "s1_450_0_1_1_1_1_36_0_1",
    "account_data": {
        "events": [
            {
                "type": "org.matrix.msc3890.local_notification_settings.ZSBKLRYBUA",
                "content": {
                    "is_silenced": false
                }
            }
        ]
    },
    "device_one_time_keys_count": {
        "signed_curve25519": 50
    },
    "org.matrix.msc2732.device_unused_fallback_key_types": [
        "signed_curve25519"
    ],
    "device_unused_fallback_key_types": [
        "signed_curve25519"
    ]
}
DMRobertson commented 1 year ago

@thebalaa

3 examples, notice they are all exactly the same:

The request URLs you have given us are also exactly the same. Synapse caches /sync responses so that it can avoid the work of recalculating them if the requests are identical due to e.g. a retransmission; it is not surprising that you get the same response to the same request!

The since token in your examples is s1_447_0_1_1_1_1_36_0_1, and the response's next_batch token is s1_450_0_1_1_1_1_36_0_1 is different. But Element Web did not use the next_batch token as the since parameter in its next request; I don't see any evidence of a problem in Synapse here.

(I note that the logs in the description and in https://github.com/matrix-org/synapse/issues/15824#issuecomment-1605249509 also repeat a request with the same since token.)

I would guess that either

MidhunSureshR commented 1 year ago

I have a similar issue on Synapse 1.88.0rc1 running locally on docker where sync (via Hydrogen) is tight looped: image

since of the sync request matches the next_batch of the returned response so I don't think it's a client issue.

homeserver.log har.zip

davidegirardi commented 1 year ago

I noticed this yesterday and today. Turning off presence seems to stop the continuous sync. Just a workaround but I hope it helps finding the root cause.

kegsay commented 1 year ago

Same problem on docker.io/matrixdotorg/synapse:v1.92.1 with sqlite. Can reliably get it to fail when I trigger an exception in another part of the code.

kegsay commented 1 year ago

Repro steps involve using docker/sqlite synapse with MAS enabled and hooked up correctly. For some reason, sending messages into a room can fail:

ex-full-stack-docker-synapse-1             | 2023-09-20 13:48:18,624 - synapse.http.server - 140 - ERROR - PUT-62 - Failed handle request via 'RoomSendEventRestServlet': <XForwardedForRequest at 0xffffaab8c5d0 method='PUT' uri='/_matrix/client/v3/rooms/%21IitGfgiPurWkJFrNOJ:synapse/send/m.room.message/1' clientproto='HTTP/1.1' site='8008'>
ex-full-stack-docker-synapse-1             | Traceback (most recent call last):
ex-full-stack-docker-synapse-1             |   File "/usr/local/lib/python3.11/site-packages/twisted/internet/defer.py", line 1697, in _inlineCallbacks
ex-full-stack-docker-synapse-1             |     result = context.run(gen.send, result)
ex-full-stack-docker-synapse-1             |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ex-full-stack-docker-synapse-1             |   File "/usr/local/lib/python3.11/site-packages/synapse/rest/client/room.py", line 366, in _do
ex-full-stack-docker-synapse-1             |     ) = await self.event_creation_handler.create_and_send_nonmember_event(
ex-full-stack-docker-synapse-1             |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
ex-full-stack-docker-synapse-1             |   File "/usr/local/lib/python3.11/site-packages/synapse/storage/database.py", line 484, in _do_execute
ex-full-stack-docker-synapse-1             |     return func(sql, *args, **kwargs)
ex-full-stack-docker-synapse-1             |            ^^^^^^^^^^^^^^^^^^^^^^^^^^
ex-full-stack-docker-synapse-1             | sqlite3.IntegrityError: FOREIGN KEY constraint failed
ex-full-stack-docker-synapse-1             | 2023-09-20 13:48:18,628 - synapse.access.http.8008 - 465 - INFO - PUT-62 - ::ffff:172.21.0.1 - 8008 - {@bob:synapse} Processed request: 0.053sec/0.000sec (0.009sec, 0.000sec) (0.002sec/0.020sec/15) 55B 500 "PUT /_matrix/client/v3/rooms/%21IitGfgiPurWkJFrNOJ:synapse/send/m.room.message/1 HTTP/1.1" "curl/8.1.2" [0 dbevts]

When this happens, the next /sync request will always tightloop. The problem is this line is always being hit

https://github.com/matrix-org/synapse/blob/v1.92.1/synapse/handlers/sync.py#L1993

... because this is always being hit:

        # _cache is not valid at or before the earliest known stream position, so
        # return that the entity has changed.
        if stream_pos <= self._earliest_known_stream_pos:
            self.metrics.inc_misses()
            return True

.. because stream_pos == self._earliest_known_stream_pos. I'm guessing the exception is causing some Fun which isn't propagating the increased stream position or something?

turt2live commented 1 year ago

fwiw, I see this without sending messages. MAS is not configured. sqlite is used, though.

It will be sitting completely idle for several minutes and then suddenly tightloop. No user interaction required.

thebalaa commented 1 year ago

@DMRobertson the sync token is not advanced by Synapse, there's a mistake in my example above. The query string returned by synapse is provided by element but Synapse returns the same since token that was passed in the query string.

@kegsay we tried disabling presence and that seems to resolve the issue for a moment but then it returned

davidegirardi commented 1 year ago

It will be sitting completely idle for several minutes and then suddenly tightloop. No user interaction required.

I had the same experience yesterday but I can't reproduce anymore. 1 container with synapse with a sqlite backend and the default autogenerated config plus URL preview enabled 1 container with Element Web 1 low-privileged user in a room with a few links

It suddenly started to loop extremely fast, even if I restarted the containers or refreshed the browser page, until I deactivated presence.

erikjohnston commented 1 year ago

Hopefully https://github.com/matrix-org/synapse/pull/16540 fixes this

erikjohnston commented 1 year ago

Sounds like it hasn't https://github.com/matrix-org/synapse/pull/16540#issuecomment-1775949595

thebalaa commented 1 year ago

@erikjohnston should #16540 be reverted?

clokep commented 1 year ago

should #16540 be reverted?

No, that fix is still correct, but not enough to fix this it seems.

thebalaa commented 1 year ago

Anything we can do to help move this along? It's quite problematic for us.

erikjohnston commented 1 year ago

@thebalaa if you have an example request URL and response body of a /sync that would be useful

justin-russell commented 1 year ago

Here's two separate client responses from sync:

From Element:

Request URL http://localhost:8008/_matrix/client/r0/sync?filter=0&timeout=30000&since=s385_307_0_11_70_1_1_19_0_1

Response

{
    "next_batch": "s385_307_0_11_70_1_1_19_0_1",
    "device_lists": {
        "changed": [
            "@admin:localhost"
        ]
    },
    "device_one_time_keys_count": {
        "signed_curve25519": 50
    },
    "org.matrix.msc2732.device_unused_fallback_key_types": [
        "signed_curve25519"
    ],
    "device_unused_fallback_key_types": [
        "signed_curve25519"
    ]
}

From a Python client (matrix-nio): Request URL http://localhost:8008/_matrix/client/r0/sync?access_token=syt_YWRtaW4_KWzSqMuAsUEaRUsAQqcJ_1wY9BF&since=s385_312_0_11_70_1_1_19_0_1

Response

{
    "next_batch": "s385_312_0_11_70_1_1_19_0_1",
    "device_lists": {
        "changed": [
            "@admin:localhost"
        ]
    },
    "device_one_time_keys_count": {
        "signed_curve25519": 50
    },
    "org.matrix.msc2732.device_unused_fallback_key_types": [
        "signed_curve25519"
    ],
    "device_unused_fallback_key_types": [
        "signed_curve25519"
    ]
}

I'm running Synapse version 1.96.0rc1 using Docker with the image matrixdotorg/synapse:develop. I'm also using an sqlite database.

Here's my homeserver.yaml. I generated it using the generate script in the container

# Configuration file for Synapse.
#
# This is a YAML file: see [1] for a quick introduction. Note in particular
# that *indentation is important*: all the elements of a list or dictionary
# should have the same indentation.
#
# [1] https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html
#
# For more information on how to configure Synapse, including a complete accounting of
# each option, go to docs/usage/configuration/config_documentation.md or
# https://matrix-org.github.io/synapse/latest/usage/configuration/config_documentation.html
server_name: "localhost"
pid_file: /data/homeserver.pid
listeners:
  - port: 8008
    tls: false
    type: http
    x_forwarded: true
    resources:
      - names: [client, federation]
        compress: false
database:
  name: sqlite3
  args:
    database: /data/homeserver.db
log_config: "/data/localhost.log.config"
media_store_path: /data/media_store
registration_shared_secret: "NKeqBJ3@=aEpq@m6&tyg,7Mp.PHsS=SKH#0gJWVH*Nhig0H9nt"
report_stats: false
macaroon_secret_key: "mRL:x_0p*VuKb2ZAio,^h&YCY77ISYs.OKr*+PBM:E0XJz:ZzV"
form_secret: "gr1E3TptS,_v-UfE6vN+l*dgc6:eSk;V3S6ZYm6B:D30O.St-h"
signing_key_path: "/data/localhost.signing.key"
trusted_key_servers:
  - server_name: "matrix.org"

# vim:ft=yaml

I'm able to reproduce this issue by doing the following:

  1. Starting with a fresh synapse container (no existing users. I've simply just generated the homeserver.yaml file and started the server), I created the @admin user using the register_new_matrix_user script inside the synapse container.
  2. I also disabled rate limiting for that user (for something else I've been messing around with): INSERT INTO ratelimit_override values ("@admin:localhost", 0, 0);. I found that command from here: https://github.com/matrix-org/synapse/issues/6286#issuecomment-646944920
  3. I then logged in as the admin account in Element and typed in a few messages in a room for a few seconds.
  4. I restart the synapse container
  5. Once synapse is back running, I refresh my logged in Element client. Usually the tight loop starts happening after the refresh. If it doesn't happen immediately, refresh a few times more. It usually happens pretty quick.

I was messing around a little more and found that when my client sync is doing the tight loop, if I exec into the synapse container and create another user with register_new_matrix_user, the tight loop immediately stops. Could it possibly be something to do with me disabling ratelimiting for that user? That or maybe something to do with only having a single user in the database? Not sure.

kegsay commented 11 months ago

This is particularly frustrating in test environments where it seems easily reproducible due to it being a low traffic environment. I'm guessing actual deployments see this less because new events are somehow able to unwedge things. In my scenario, the error on foreign key constraint will reliably reproduce this bug, but it's a convoluted setup as it involves MAS. I'll see if I can dig into this more this week.

qet119 commented 11 months ago

I used the account and password in the Mjolnir configuration file to log in to Element-web, which caused a large number of never-ending _matrix/client/v3/sync requests in the console. đŸ˜‚