matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Appservice stream position got stuck ~10k behind current, preventing requests to appservices #13950

Open turt2live opened 1 year ago

turt2live commented 1 year ago

Description

A repeat of https://github.com/matrix-org/synapse/issues/1834 essentially

[screenshot attached in the original issue]

Steps to reproduce

Unclear - it suddenly became sad.

Homeserver

t2bot.io

Synapse Version

1.68.0 + custom patches

Installation Method

pip (from PyPI)

Platform

Ubuntu physical hardware.

Relevant log output

Available upon request.

Anything else that would be useful to know?

After manually fast-forwarding the stream position and restarting the worker, it appeared to be running about 3-5 minutes behind for longer than expected. This may have been due to a larger server restart causing caches to be evicted during peak hours, though.
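For reference, a quick way to gauge how far behind the appservice sender is (a sketch, assuming direct access to the PostgreSQL database and the standard Synapse schema, using the same tables as the queries in the next comment) is to compare the two positions in a single query:

-- Sketch: lag between the newest event and the appservice sender's position.
-- Assumes the stock `events` and `appservice_stream_position` tables.
SELECT (SELECT max(stream_ordering) FROM events) - stream_ordering AS appservice_lag
FROM appservice_stream_position;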

turt2live commented 1 year ago

Suppose the stream position information itself would be useful:

synapse=# select max(stream_ordering) from events;
    max
-----------
 806649123
(1 row)

synapse=# select * from appservice_stream_position;
 lock | stream_ordering
------+-----------------
 X    |       806639937
(1 row)

synapse=# update appservice_stream_position set stream_ordering = (select max(stream_ordering) from events);
UPDATE 1

richvdh commented 1 year ago

duplicate of https://github.com/matrix-org/synapse/issues/11629 ?

DMRobertson commented 1 year ago

Is this correlated with an upgrade to 1.68.0?

custom patches

Are these publicly shareable?

turt2live commented 1 year ago

duplicate of #11629 ?

Aside from the title, possibly. This affected all bridges on t2bot.io, which are not new.

Is this correlated with an upgrade to 1.68.0?

Negative. 1.68.0 was applied on Tuesday (2 days ago).

custom patches

Are these publicly shareable?

Yup: https://github.com/matrix-org/synapse/compare/develop...t2bot:synapse:t2bot.io (relevant patches might be around the appservice transaction optimization, but it was working "fine" up until the incident, and was working fine once fast-forwarded)

turt2live commented 1 year ago

For the record, I ran into a variation of this today where the stream position was ~3600 behind and fluctuating, at times getting worse. It eventually caught up on its own, however.

Not sure how the appservice sender is able to fall behind like this.
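One way to dig into that (a sketch, assuming the same direct database access and standard schema as the queries above) is to look at the first few events past the stuck position, in case a particular event or room is stalling the sender:

-- Sketch: inspect the events immediately after the stuck appservice position.
SELECT stream_ordering, event_id, room_id, type
FROM events
WHERE stream_ordering > (SELECT stream_ordering FROM appservice_stream_position)
ORDER BY stream_ordering
LIMIT 20;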