matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

many duplicate `_get_state_groups_from_groups` queries leading to OOMs #10301

Open · richvdh opened this issue 3 years ago

richvdh commented 3 years ago

We had a federation sender instance which died. On restart, it rapidly consumed all available RAM and OOMed again.

Inspection from the postgres side shows it is doing many duplicate queries of the form:

```sql
WITH RECURSIVE state(state_group) AS (
    VALUES(3405820::bigint)
    UNION ALL
    SELECT prev_state_group FROM state_group_edges e, state s
    WHERE s.state_group = e.state_group
)
SELECT DISTINCT ON (type, state_key) type, state_key, event_id
FROM state_groups_state
WHERE state_group IN ( SELECT state_group FROM state )
ORDER BY type, state_key, state_group DESC
```

This query is `_get_state_groups_from_groups`, which is called from `_get_state_for_groups`. Although the latter has a cache, if many threads hit it at the same time, all will find the cache empty and go on to hit the database. I think we need a zero-timeout `ResponseCache` on `_get_state_groups_from_groups`.
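
To make the deduplication idea concrete, here is a minimal standalone asyncio sketch of the pattern a zero-timeout cache provides: concurrent callers asking for the same key share one in-flight lookup, and nothing is retained once that lookup completes. This is not Synapse's actual `ResponseCache` (which lives in the Twisted world); the `ZeroTimeoutDedupCache` and `fetch_state_for_group` names are invented for illustration.

```python
import asyncio


class ZeroTimeoutDedupCache:
    """Share the result of an in-flight lookup between concurrent callers.

    Nothing is retained once the lookup completes, which is the
    "zero-timeout" behaviour described above.
    """

    def __init__(self) -> None:
        self._in_flight: dict = {}

    async def wrap(self, key, callback, *args):
        task = self._in_flight.get(key)
        if task is None:
            task = asyncio.ensure_future(callback(*args))
            self._in_flight[key] = task
            # Drop the entry as soon as the lookup finishes, so later
            # callers trigger a fresh query rather than reading stale data.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        return await task


async def fetch_state_for_group(state_group: int) -> dict:
    # Stand-in for the expensive recursive state query shown above.
    await asyncio.sleep(0.1)
    return {("m.room.create", ""): "$some_event_id"}


async def main() -> None:
    cache = ZeroTimeoutDedupCache()
    # Ten concurrent callers asking for the same state group share a single
    # lookup instead of issuing ten identical database queries.
    results = await asyncio.gather(
        *(cache.wrap(3405820, fetch_state_for_group, 3405820) for _ in range(10))
    )
    print(f"{len(results)} callers shared one lookup")


asyncio.run(main())
```

Because the cache only covers the lifetime of the in-flight call, correctness is unaffected: once the call completes, the next request goes back to the database.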

richvdh commented 3 years ago

This is currently complicated by the fact that the code does some batching of lookups. It's not obvious that the batching achieves much (at least on postgres), so we could maybe strip it out.
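
For readers unfamiliar with what the batching refers to, the sketch below shows the general shape of the idea: chunk the requested state groups and issue one query per chunk rather than one per group. This is purely illustrative and not Synapse's actual code; `fetch_many_state_groups`, the `execute` callable, and the flattened query are assumptions made for the example.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def batch_iter(items: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield successive fixed-size chunks of a sequence."""
    for i in range(0, len(items), size):
        yield list(items[i : i + size])


def fetch_many_state_groups(execute, groups: Sequence[int], batch_size: int = 100):
    """Chunk the requested groups and run one query per chunk, rather than
    one query per group. `execute` stands in for a database cursor and is
    expected to return rows of (state_group, type, state_key, event_id)."""
    results: Dict[int, Dict[Tuple[str, str], str]] = {}
    for chunk in batch_iter(groups, batch_size):
        placeholders = ", ".join("?" for _ in chunk)
        rows = execute(
            "SELECT state_group, type, state_key, event_id"
            " FROM state_groups_state"
            f" WHERE state_group IN ({placeholders})",
            chunk,
        )
        for state_group, typ, state_key, event_id in rows:
            results.setdefault(state_group, {})[(typ, state_key)] = event_id
    return results
```

Whether this kind of chunking buys much on postgres is exactly the question being raised above; if it doesn't, the batching layer mostly adds complexity.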

jaywink commented 3 years ago

Not sure if this could be related, but in EMS land we've noticed that with the last few Synapse releases, joining HQ sometimes puts a small host into an OOM loop from which it never recovers. Previously, hosts took much longer to join HQ and OOMed for a while in the process, but eventually stabilized. Lately it feels like the host sometimes never recovers until we give it more headroom in RAM.

The limit we have for these kinds of small hosts is 1GB, before the cluster kills the host to protect server stability. The issue we've seen lately is resolved by raising that limit to 1.5GB, which seems to be enough for the host to process whatever it is trying (and otherwise immediately failing) to do.

gergelypolonkai commented 2 years ago

During the last two days this has been impacting my single-user server, too. I’m not in big rooms like HQ (the biggest one has around 200 users). I can’t connect this to any event, like me joining a room or something; it just happened out of the blue. Since then, I can barely send messages to any rooms, and my sync requests take a really long time, if they succeed at all.

It also seems to “come to its senses” every now and then; during these periods everything works as if nothing had happened, but it doesn’t take long, maybe a few minutes, before it goes back to PostgreSQL hell.

I tried reindexing and vacuuming my DB, hoping that it would speed up these queries, but to no avail.

Until this gets fixed, is it safe (and useful) to downgrade to 1.46 (the version I used before 1.48.0)?

Also, if I can help with any debug data, let me know.

daenney commented 2 years ago

The issue predates 1.46, so I wouldn't assume downgrading is going to help.

gergelypolonkai commented 2 years ago

Thatʼs strange, because our company HS works just fine with 1.46.

Also, does this mean thereʼs no workaround available? Anything one can do to use Synapse until it gets fixed?

gergelypolonkai commented 2 years ago

FTR, this also seems to affect the WhatsApp bridge somehow, as not all my messages get forwarded to WhatsApp.

reivilibre commented 2 years ago

@gergelypolonkai

Have you verified that the queries that are causing you trouble are the same as the ones in the description of this issue? (I just want to make sure we're talking about the same problem and not about something merely similar.)

Although this issue existed before 1.46, it's always possible that something new has aggravated it further for you, so you're welcome to try a downgrade; we always try to keep at least one version of rollback possible. Synapse won't start up if you roll back too far (it refuses rather than running against a schema it doesn't understand), so it's harmless to try returning to 1.46 (or 1.47).

It seems like you can roll back to 1.46 (and further, if you wanted to), as the database is compatible. If you'd like to try that and report back, it could be useful (and if you're lucky you might get your server back, which would also buy some time to investigate what's going on).

gergelypolonkai commented 2 years ago

The only thing that doesn't match in my query is that `::bigint` part; mine has `9690528` instead of `3405820`.

callahad commented 2 years ago

@gergelypolonkai Do you think you could try rolling back to 1.46 and seeing how that works for you?

gergelypolonkai commented 2 years ago

I just did that. After starting it, it feels better (at least message sending doesn't time out), but let me use it for a few hours before I jump to conclusions.

gergelypolonkai commented 2 years ago

Nope, after like 15 minutes itʼs the same 😔

Let me know if I can help with anything else; Iʼm happy to help when Iʼm behind my keyboard.

reivilibre commented 2 years ago

Thanks for trying! How are you installing Synapse? (wondering in case you'd be willing to try a branch to see if something improves the situation for you)

gergelypolonkai commented 2 years ago

I’m using a virtualenv/pip install on bare metal. So sure, shoot me the branch name and I can easily try it.

reivilibre commented 2 years ago

@gergelypolonkai The branch `rei/p/stcache` contains a way of deduplicating these queries, which might help. (Though I am surprised to see that you're having this issue on a single-user homeserver, to be honest!) It's been running on librepush.net since yesterday (with additional code that runs the old implementation and verifies that both give the same result), plus I've tried to be reasonably paranoid with the testing. You're welcome to give it a try.

gergelypolonkai commented 2 years ago

@reivilibre I’m also surprised that I’m affected, not just because my HS is single-user but because all the rooms I participate in are small (<250 users) and most of them don’t have a long state history.

FTR, here’s what I used, in case someone with less Python-fu wants to give it a try:

```
pip install git+https://github.com/matrix-org/synapse.git@rei/p/stcache
```

It installed smoothly and started up. I’ll check back within a few hours to let you know if it looks good from my server.

gergelypolonkai commented 2 years ago

Sorry for not coming back earlier; yesterday I had a terrible migraine.

I can still see the query occasionally firing up (almost every time I switch rooms in Nheko). However, it feels smoother; at least sending messages isn’t slowed down, which is a great win in my book.

foxcris commented 2 years ago

I am also having this problem (single-user system). Currently 8 such queries are running. After 6 hours they are still running and eating up all my free disk space (227GB). I also saw that, 27 times during the night, I got errors like `psycopg2.errors.DiskFull: could not write to file "base/pgsql_tmp/pgsql_tmp28.181": No space left on device`.

The problem is that with these queries running I cannot get my homeserver to work again :-(. My client is losing its connection due to the load and the filling disk space.

daenney commented 2 years ago

This sounds like the recursive part of the query never gets to a point where it no longer returns a tuple, causing it to run forever. That would almost suggest some kind of cyclical relationship we're unable to break?
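
If someone wanted to check that hypothesis against their own database, a rough standalone diagnostic sketch could walk `state_group_edges` directly and look for a repeated group. This is hypothetical and not part of Synapse: the DSN is a placeholder, and it assumes at most one predecessor per state group (the usual shape of that table).

```python
from typing import List, Optional

import psycopg2  # assumes direct access to the Synapse state database


def find_cycle(dsn: str, start_group: int) -> Optional[List[int]]:
    """Follow prev_state_group links from start_group and return the chain
    if it ever revisits a group (which would make the recursive CTE quoted
    above loop forever); return None if the chain terminates normally."""
    chain: List[int] = []
    visited = set()
    group: Optional[int] = start_group
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            while group is not None:
                if group in visited:
                    return chain + [group]  # cycle detected
                visited.add(group)
                chain.append(group)
                cur.execute(
                    "SELECT prev_state_group FROM state_group_edges"
                    " WHERE state_group = %s",
                    (group,),
                )
                row = cur.fetchone()
                group = row[0] if row else None
    return None


# Hypothetical usage; fill in your own DSN and take the state group from
# the constant in the slow query's VALUES(...) clause:
# print(find_cycle("dbname=synapse user=synapse_user", 3405820))
```

(As richvdh notes below, a genuinely non-terminating query would be a separate bug from the duplicate-query problem this issue tracks.)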

foxcris commented 2 years ago

Probably. I tried to work around the problem by deleting the biggest rooms via the API (they are all federated ones from other servers, like synapse:matrix.org), but no luck. However, I also have a problem deleting one room; I don't know if this is a coincidence.

richvdh commented 2 years ago

> This sounds like the recursive part of the query never gets to a point where it no longer returns a tuple, causing it to run forever. That would almost suggest some kind of cyclical relationship we're unable to break?

let's be clear: this issue is about the fact that we run multiple identical queries in parallel, which is inefficient, even if the queries themselves perform correctly.

If the queries aren't terminating (or if you're not seeing identical queries with identical constants in the `VALUES(3405820::bigint)` part), then it's a separate bug; please open a different issue. (Though see also https://github.com/matrix-org/synapse/issues/9826 and https://github.com/matrix-org/synapse/issues/7772, both of which may be related.)

foxcris commented 2 years ago

@richvdh: Thanks for the clarification. I also have multiple queries running, but the constants are different. I will open a new issue for this, then, as they are not terminating.

gergelypolonkai commented 2 years ago

@reivilibre does your branch get updated from master occasionally? I just upgraded to 1.54 and this issue persists; can I use your branch without essentially downgrading (not that I mind if it does mean a downgrade)?

gergelypolonkai commented 2 years ago

Upgrading to 1.54, and thus reverting this change, brought a significant performance regression, so I reverted to this branch.

reivilibre commented 2 years ago

This was biting us again, so I've updated the branch to 1.57.0 (as a new branch, `rei/p/stcache-1.57`, merely reverting #12126) to get us out of a pickle. Assuming this helps with the current situation, we may want to prioritise doing it properly and getting something akin to this solution mainlined.

Edit: it seemed to help a bit, but we still had OOMing afterwards.