JeWe37 opened this issue 1 year ago
Do you get this error every time?
Hard to really say since the error is a bit ... nondescript ... but any reason you might expect your database connection to get closed? (Maybe the database has been restarted for some reason?)
How are you connecting to your database? TCP or a UNIX socket? Not saying it would help, but you could try using a UNIX socket connection instead of TCP if you're not already (in case it's some TCP timeout or so...)
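(For illustration, with a libpq-style connection string the difference is just the host field; the user and database names below are made up, not taken from the actual deployment:

TCP: host=127.0.0.1 port=5432 user=synapse dbname=synapse
UNIX socket: host=/var/run/postgresql user=synapse dbname=synapse

A host value that is an absolute path should be treated as the directory containing the Postgres socket, at least with libpq-style parsing.)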
The error happens on every run. This seems to be where matrix-docker-ansible-deploy sets the connection string: https://github.com/spantaleev/matrix-docker-ansible-deploy/blob/856a328e96c679bc7d51760d00708d921a08cb17/roles/custom/matrix-synapse-auto-compressor/defaults/main.yml#L45
I have no real idea why the database might be getting closed. If I look at the Postgres logs, I see this:
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] LOG: background worker "parallel worker" (PID 36969) was terminated by signal 9: Killed
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] DETAIL: Failed process was running:
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: SELECT target.prev_state_group, source.prev_state_group, state.type, state.state_key, state.event_id
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: FROM state_group_edges AS target
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: LEFT JOIN state_group_edges AS source ON (target.prev_state_group = source.state_group)
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: LEFT JOIN state_groups_state AS state ON (target.prev_state_group = state.state_group)
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: WHERE target.prev_state_group = ANY($1)
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] LOG: terminating any other active server processes
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.114 UTC [37002] FATAL: the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.119 UTC [37003] FATAL: the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.123 UTC [37004] FATAL: the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.127 UTC [37005] FATAL: the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.133 UTC [37006] FATAL: the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.137 UTC [37007] FATAL: the database system is in recovery mode
Jul 24 00:05:05 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:05.091 UTC [1] LOG: all server processes terminated; reinitializing
Jul 24 00:05:10 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:10.884 UTC [37008] LOG: database system was interrupted; last known up at 2023-07-23 23:57:25 UTC
which to me looks like it's crashing Postgres. Going by the timings in the auto compressor's logs, it's also definitely not a timeout:
Jul 24 00:00:02 jendrikserver matrix-synapse-auto-compressor[563082]: 71f93083a03dd2c8be5a90deae601d4c5547d009268ba24f3dba937f4899a7ce
Jul 24 00:00:02 jendrikserver matrix-synapse-auto-compressor[563103]: [2023-07-24T00:00:02Z INFO synapse_auto_compressor] synapse_auto_compressor started
Jul 24 00:00:03 jendrikserver matrix-synapse-auto-compressor[563103]: [2023-07-24T00:00:03Z INFO synapse_auto_compressor::manager] Running compressor on room !YTvKGNlinIzlkMTVRl:matrix.org with chunk size 500
Jul 24 00:05:04 jendrikserver matrix-synapse-auto-compressor[563103]: [2023-07-24T00:05:04Z ERROR panic] thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Closed, cause: None }': src/database.rs:443
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] LOG: background worker "parallel worker" (PID 36969) was terminated by signal 9: Killed
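(For context on the panic itself: Error { kind: Closed, cause: None } appears to be what the Postgres client library reports when the server side of the connection goes away mid-query, and the unwrap() at src/database.rs:443 turns that into the panic above. A minimal sketch of the same failure mode, assuming the synchronous postgres crate; the DSN and query here are hypothetical, not taken from the compressor:

use postgres::{Client, NoTls};

fn main() {
    // Hypothetical DSN; the real one is assembled by the ansible role linked above.
    let mut client =
        Client::connect("host=/var/run/postgresql user=synapse dbname=synapse", NoTls)
            .expect("failed to connect");

    // If Postgres dies mid-query (e.g. a worker is OOM-killed and the postmaster
    // restarts), the query returns Err(Error { kind: Closed, .. }) and this
    // unwrap() panics just like the log line above.
    let rows = client.query("SELECT 1", &[]).unwrap();
    println!("got {} row(s)", rows.len());
}

Matching on the Err, or retrying, instead of unwrap()ing would give a clearer message, but the root cause is still whatever is taking the server down.)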
This sounds like some other mechanism killed a Postgres process because it ate up too much RAM (the OOM killer?). Check your system logs for clues.
It was indeed the OOM killer. Processing that room seems to be eating 12GB+ of RAM. Is that expected behavior for large rooms like that?
It doesn't sound like intended behaviour, at the very least!
Running with matrix-docker-ansible-deploy, I get the error quoted above (the panic at src/database.rs:443).
This corresponds to this version of the code, I believe, which is now this and doesn't seem to have changed.
The room in question (!YTvKGNlinIzlkMTVRl:matrix.org) is the Element room. I'm guessing you will need some more information; how could I provide that?