richvdh opened 3 years ago
This is currently complicated by the fact that the code does some batching of lookups. It's not obvious that the batching achieves much (at least on Postgres), so we could maybe strip it out.
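For context, a minimal sketch (not Synapse's actual code) of the kind of batching in question: a large set of state groups is split into fixed-size chunks and each chunk is resolved with its own database query, so one logical lookup fans out into several round trips, and concurrent callers can easily end up issuing the same ones.

```python
from typing import Dict, Iterable, Iterator, List, Sequence

def batch_iter(items: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive chunks of at most `size` items."""
    batch: List[int] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def lookup_state_groups(groups: Sequence[int], run_query) -> Dict[int, dict]:
    """Resolve `groups` one chunk at a time.

    `run_query` stands in for the real database call; with several callers
    arriving at once, nothing stops them all querying the same chunks.
    """
    results: Dict[int, dict] = {}
    for chunk in batch_iter(groups, 100):
        results.update(run_query(chunk))
    return results
```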
Not sure if this is related, but in EMS land we've noticed with the last few Synapse releases that joining HQ sometimes puts a small host into an OOM loop from which it never recovers. Previously, hosts took much longer to join HQ and OOMed for a while while trying, but eventually stabilized. Lately it feels like the host sometimes never recovers until we give it more headroom in RAM.
The limit we have for these kinds of small hosts is 1GB before the cluster kills the host to protect server stability. The issue we've seen lately is resolved by raising that limit to 1.5GB, which seems to be enough to process whatever it is trying (and immediately failing) to do.
During the last two days this has impacted my single-user server, too. I'm not in big rooms like HQ (the biggest one has around 200 users). I can't connect this to any event, like joining a room; it just happened out of the blue. Since then I can barely send messages to any rooms, and my sync requests take a really long time, if they succeed at all.
It also seems to “come to its senses” every now and then; during those times everything works like nothing happened, but it doesn't take long, maybe a few minutes, before it goes back to PostgreSQL hell.
I tried reindexing and vacuuming my DB hoping it would speed up these queries, but to no avail.
Until this gets fixed, is it safe (and useful) to downgrade to 1.46 (the version I used before 1.48.0)?
Also, if I can help with any debug data, let me know.
The issue predates 1.46, so I wouldn't assume downgrading is going to help.
Thatʼs strange, because our company HS works just fine with 1.46.
Also, does this mean thereʼs no workaround available? Anything one can do to use Synapse until it gets fixed?
FTR this also seems to affect the WhatsApp bridge somehow as not all my messages get forwarded to WhatsApp.
@gergelypolonkai
Have you verified that the queries causing you trouble are the same as the ones in the description of this issue? (I just want to make sure we're talking about the same issue and not something merely similar.)
Although this issue existed before 1.46, it's always possible that something new has aggravated it further for you, so you're welcome to try a downgrade; we always try to keep at least one version of rollback possible, and Synapse won't start up if you roll back too far, so it's harmless to return to 1.46 (or 1.47).
It seems like you can roll back to 1.46 (and further, if you wanted to) as the database is compatible. If you'd like to try that and report back, it could be useful (and if you're lucky you might get your server back, which would give us some time to try and investigate what's going on).
The only thing that doesn't match in my query is the ::bigint part; mine has 9690528 instead of 3405820.
@gergelypolonkai Do you think you could try rolling back to 1.46 and seeing how that works for you?
I just did that. After starting it, it feels better (at least message sending doesn't time out), but let me use it for a few hours before I jump to conclusions.
Nope, after like 15 minutes itʼs the same 😔
Let me know if I can help with anything else; I'm happy to help when I'm behind my keyboard.
Thanks for trying! How are you installing Synapse? (wondering in case you'd be willing to try a branch to see if something improves the situation for you)
I'm using virtualenv/pip install on bare metal. So sure, shoot me the branch name and I can easily try it.
@gergelypolonkai The branch rei/p/stcache contains a way of deduplicating these queries, which might help. (Though I am surprised to see that you're having this issue on a single-user homeserver, to be honest!) It's been running on librepush.net since yesterday (with additional code that runs the old implementation and verifies they give the same result), plus I've tried to be reasonably paranoid with the testing. You're welcome to give it a try.
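Roughly, the verification being described works like this (a sketch, not the actual branch code; the function and parameter names are illustrative): the new deduplicated path produces the result that gets used, while the old path runs alongside it and any mismatch is logged.

```python
import logging

logger = logging.getLogger(__name__)

async def get_state_verified(groups, new_impl, old_impl):
    """Return the new implementation's result, cross-checking it against the old one.

    `new_impl` and `old_impl` are stand-ins for the two code paths being compared.
    """
    new_result = await new_impl(groups)
    old_result = await old_impl(groups)
    if new_result != old_result:
        logger.error(
            "state lookup mismatch for %s: new=%r old=%r",
            groups, new_result, old_result,
        )
    return new_result
```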
@reivilibre I'm also surprised that I'm affected, not just because my HS is single-user, but because all the rooms I participate in are small (<250 users) and most of them don't have a long state history.
FTR, here's what I used, in case someone with less Python-fu wants to give it a try:
pip install git+https://github.com/matrix-org/synapse.git@rei/p/stcache
It installed smoothly and started up. I’ll check back within a few hours to let you know if it looks good from my server.
Sorry for not coming back earlier; yesterday I had a terrible migraine.
I can still see the query occasionally firing up (almost every time I switch rooms in Nheko). However, it feels smoother; at least sending messages isn't slowed down, which is a great win in my book.
I am also having this problem (single-user system). Currently 8 such queries are running. After 6 hours they are still running and eating up all my free disk space (227GB). I saw that 27 times during the night I got errors like psycopg2.errors.DiskFull: could not write to file "base/pgsql_tmp/pgsql_tmp28.181": No space left on device.
The problem is that with these queries running I cannot get my homeserver to work again :-(. My client is losing its connection due to the load and the filling disk space.
This sounds like the recursive part of the query never gets to a point where it no longer returns a tuple, causing it to run forever. That would almost suggest some kind of cyclical relationship we're unable to break?
Probably. I tried to work around the problem by deleting the biggest rooms via the API (they are all federated ones from other servers, like synapse:matrix.org), but no luck. However, I also have a problem deleting one room; I don't know if this is a coincidence.
> This sounds like the recursive part of the query never gets to a point where it no longer returns a tuple, causing it to run forever. That would almost suggest some kind of cyclical relationship we're unable to break?
let's be clear: this issue is about the fact that we run multiple identical queries in parallel, which is inefficient, even if the queries themselves perform correctly.
If the queries aren't terminating (or if you're not seeing identical queries with identical constants in the VALUES(3405820::bigint) part), then it's a separate bug; please open a different issue. (Though see also https://github.com/matrix-org/synapse/issues/9826 and https://github.com/matrix-org/synapse/issues/7772, both of which may be related.)
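For reference, the query in question has roughly this shape (a sketch reconstructed from the table names in Synapse's state store; the exact column list, filters and ordering may differ between versions). The single ::bigint constant is the state group being resolved, which is why seeing many copies with the same constant means duplicated work:

```python
# Sketch of the recursive state-group walk, parameterised psycopg2-style.
STATE_GROUP_QUERY = """
    WITH RECURSIVE sgs(state_group) AS (
        VALUES(%(state_group)s::bigint)
        UNION ALL
        SELECT prev_state_group
        FROM state_group_edges AS e, sgs AS s
        WHERE s.state_group = e.state_group
    )
    SELECT DISTINCT ON (type, state_key) type, state_key, event_id
    FROM state_groups_state
    WHERE state_group IN (SELECT state_group FROM sgs)
    ORDER BY type, state_key, state_group DESC
"""

def fetch_state_for_group(cursor, state_group: int):
    """Run the recursive walk for a single state group (e.g. 3405820)."""
    cursor.execute(STATE_GROUP_QUERY, {"state_group": state_group})
    return cursor.fetchall()
```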
@richvdh: Thanks for the clarification. I also have multiple queries running but the constants are different. I will open a new issue for this then as they are not terminating.
@reivilibre does your branch get updated from master occasionally? I just upgraded to 1.54 and this issue persists; can I use your branch without essentially downgrading (not that I mind if it does mean a downgrade, though)?
Upgrading to 1.54, and thus reverting this change, brought a significant performance regression, so I reverted to this branch.
This was biting again, so I've updated the branch to 1.57.0 (as a new branch, rei/p/stcache-1.57; merely reverting #12126) to get us out of a pickle. Assuming this helps us out with the current situation, we may want to prioritise doing it properly and getting something akin to this solution mainlined.
Edit: it seemed to help a bit, but we still had OOMing afterwards.
We had a federation sender instance which died. On restart, it rapidly consumed all available RAM and OOMed again.
Inspection from the Postgres side shows it is doing many duplicate queries of the form described above.
This query is _get_state_groups_from_groups, which is called from _get_state_for_groups. Although the latter has a cache, if many threads hit it at the same time, all will find the cache empty and go on to hit the database. I think we need a zero-timeout ResponseCache on _get_state_groups_from_groups.
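The idea being proposed is in-flight deduplication: concurrent callers asking for the same key share one lookup, and the entry is forgotten the moment it completes, so nothing stale is ever served. A minimal self-contained sketch of that pattern follows (this is not Synapse's actual ResponseCache; the class and method names are illustrative):

```python
import asyncio
from typing import Awaitable, Callable, Dict, Hashable, TypeVar

T = TypeVar("T")

class InFlightDeduplicator:
    """Share one in-flight lookup between concurrent callers for the same key."""

    def __init__(self) -> None:
        self._pending: Dict[Hashable, asyncio.Task] = {}

    async def wrap(self, key: Hashable, callback: Callable[[], Awaitable[T]]) -> T:
        task = self._pending.get(key)
        if task is None:
            task = asyncio.ensure_future(callback())
            self._pending[key] = task
            # "Zero timeout": drop the entry as soon as the lookup finishes,
            # so completed results are never cached, only deduplicated.
            task.add_done_callback(lambda _t: self._pending.pop(key, None))
        return await task


async def _demo() -> None:
    dedup = InFlightDeduplicator()

    async def expensive_lookup() -> str:
        print("hitting the database")
        await asyncio.sleep(0.1)
        return "state"

    # Ten concurrent callers, one database hit.
    results = await asyncio.gather(
        *(dedup.wrap("state_group:3405820", expensive_lookup) for _ in range(10))
    )
    assert results == ["state"] * 10


if __name__ == "__main__":
    asyncio.run(_demo())
```

With something like this in front of _get_state_groups_from_groups, a burst of callers that all miss the higher-level cache would result in a single database query instead of one each.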