matrix-org / rust-synapse-compress-state

A tool to compress some state in a Synapse instance's database
https://pypi.org/project/synapse-auto-compressor/
Apache License 2.0

Postgres running out of memory (12+ GB) when processing large room → Panic with error `Closed` #123

Open · JeWe37 opened this issue 1 year ago

JeWe37 commented 1 year ago

Running with matrix-docker-ansible-deploy, I get the following error:

Jul 23 00:00:07 jendrikserver systemd[1]: Starting Synapse State Auto Compressor...
Jul 23 00:00:07 jendrikserver matrix-synapse-auto-compressor[270871]: 9dc6498df3811da4e068dbc8454a5dd9880b41a50e8b177ec0e694b50bd81871
Jul 23 00:00:07 jendrikserver matrix-synapse-auto-compressor[270885]: [2023-07-23T00:00:07Z INFO  synapse_auto_compressor] synapse_auto_compressor started
Jul 23 00:00:08 jendrikserver matrix-synapse-auto-compressor[270885]: [2023-07-23T00:00:08Z INFO  synapse_auto_compressor::manager] Running compressor on room !YTvKGNlinIzlkMTVRl:matrix.org with chunk size 500
Jul 23 00:05:54 jendrikserver matrix-synapse-auto-compressor[270885]: [2023-07-23T00:05:54Z ERROR panic] thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Closed, cause: None }': src/database.rs:443

I believe this corresponds to the `unwrap()` at src/database.rs:443, and that line doesn't seem to have changed in the current code.
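For reference, here is a minimal sketch of the kind of pattern that produces such a panic, assuming the synchronous postgres crate; the function name and query are hypothetical, not the actual contents of src/database.rs:

// Hypothetical sketch, not the real src/database.rs: if the connection to
// the server is lost mid-query, the client returns an
// `Error { kind: Closed, .. }`, and `unwrap()` turns it into a panic.
use postgres::{Client, Row};

fn load_edges(client: &mut Client, groups: &[i64]) -> Vec<Row> {
    client
        .query(
            "SELECT prev_state_group FROM state_group_edges WHERE state_group = ANY($1)",
            &[&groups],
        )
        // This call panics with the error seen in the log above
        // when the server goes away while the query is running.
        .unwrap()
}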

The room in question (!YTvKGNlinIzlkMTVRl:matrix.org) is the Element room.

I'm guessing you will need some more information; how can I provide that?

reivilibre commented 1 year ago

Do you get this error every time?

It's hard to say, since the error is a bit ... nondescript ... but is there any reason you might expect your database connection to get closed? (Maybe the database was restarted for some reason?)

How are you connecting to your database: TCP or a UNIX socket? Not saying it would necessarily help, but you could try using a UNIX socket connection instead of TCP if you're not already (in case it's some TCP timeout or similar).
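For illustration, the two forms of connection string (credentials and paths here are placeholders; libpq-style URLs accept a percent-encoded socket directory as the host):

# TCP:
postgresql://synapse_user:secret@localhost:5432/synapse
# UNIX socket (the host is the URL-encoded socket directory):
postgresql://synapse_user:secret@%2Fvar%2Frun%2Fpostgresql/synapse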

JeWe37 commented 1 year ago

The error happens on every run. This seems to be where matrix-docker-ansible-deploy sets the connection: https://github.com/spantaleev/matrix-docker-ansible-deploy/blob/856a328e96c679bc7d51760d00708d921a08cb17/roles/custom/matrix-synapse-auto-compressor/defaults/main.yml#L45

I have no real idea why the connection might be getting closed. If I look at the Postgres logs, I see this:

Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] LOG:  background worker "parallel worker" (PID 36969) was terminated by signal 9: Killed
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] DETAIL:  Failed process was running:
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:                 SELECT target.prev_state_group, source.prev_state_group, state.type, state.state_key, state.event_id
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:                 FROM state_group_edges AS target
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:                 LEFT JOIN state_group_edges AS source ON (target.prev_state_group = source.state_group)
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:                 LEFT JOIN state_groups_state AS state ON (target.prev_state_group = state.state_group)
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:                 WHERE target.prev_state_group = ANY($1)
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] LOG:  terminating any other active server processes
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.114 UTC [37002] FATAL:  the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.119 UTC [37003] FATAL:  the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.123 UTC [37004] FATAL:  the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.127 UTC [37005] FATAL:  the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.133 UTC [37006] FATAL:  the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.137 UTC [37007] FATAL:  the database system is in recovery mode
Jul 24 00:05:05 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:05.091 UTC [1] LOG:  all server processes terminated; reinitializing
Jul 24 00:05:10 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:10.884 UTC [37008] LOG:  database system was interrupted; last known up at 2023-07-23 23:57:25 UTC

which to me looks like the query is crashing Postgres? Judging by the timings in the auto compressor's logs, it's also definitely not a timeout:

Jul 24 00:00:02 jendrikserver matrix-synapse-auto-compressor[563082]: 71f93083a03dd2c8be5a90deae601d4c5547d009268ba24f3dba937f4899a7ce
Jul 24 00:00:02 jendrikserver matrix-synapse-auto-compressor[563103]: [2023-07-24T00:00:02Z INFO  synapse_auto_compressor] synapse_auto_compressor started
Jul 24 00:00:03 jendrikserver matrix-synapse-auto-compressor[563103]: [2023-07-24T00:00:03Z INFO  synapse_auto_compressor::manager] Running compressor on room !YTvKGNlinIzlkMTVRl:matrix.org with chunk size 500
Jul 24 00:05:04 jendrikserver matrix-synapse-auto-compressor[563103]: [2023-07-24T00:05:04Z ERROR panic] thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Closed, cause: None }': src/database.rs:443
DMRobertson commented 1 year ago

> Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] LOG:  background worker "parallel worker" (PID 36969) was terminated by signal 9: Killed

This sounds like some other mechanism killed a Postgres process because it ate up too much RAM (the OOM killer?). Check your system logs for clues.
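For example, on a systemd-based host, something like this (a hypothetical one-liner; adjust to your setup) would surface OOM-killer activity in the kernel log:

journalctl -k | grep -i -E "out of memory|oom-?kill"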

JeWe37 commented 1 year ago

It was indeed the OOM killer. Processing that room seems to be eating 12 GB+ of RAM. Is that expected behavior for large rooms like that?
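(If memory usage scales with how much state is handled per chunk, a smaller chunk size might be worth experimenting with. A hypothetical invocation, assuming the -c (chunk size) and -n (number of chunks) flags described in the project README:

synapse_auto_compressor -p postgresql://user:pass@localhost/synapse -c 100 -n 500)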

reivilibre commented 1 year ago

It doesn't sound like intended behaviour, at the very least!