Closed by scott-materials 1 year ago
As a test, I changed this setting in /etc/pgbouncer/pgbouncer.ini:

```ini
server_idle_timeout = 1209600
```

This is two weeks. We'll see if this offers a temporary fix.
Yeah, it looks like PGBouncer's default timeout settings are pretty strict (e.g., 600 s for idling). I've never had to mess with these because I'm pretty sure DigitalOcean adjusted many of the timeout settings for me. Looks like this page has other timeout settings that you might need. `server_idle_timeout` looks to be the key parameter, though; others might be ones like `query_wait_timeout`. I can't remember where I read it, but I'm pretty sure you can set these to 0 in order to disable the timeouts.
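For reference, disabling those timeouts in pgbouncer.ini would look roughly like this (the parameter names come from the discussion above; I believe 0 means "disabled" for both, but double-check against the PgBouncer docs):

```ini
; /etc/pgbouncer/pgbouncer.ini -- 0 should disable each timeout
server_idle_timeout = 0
query_wait_timeout = 0
```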
> (1) The simmate worker checks in with the database at a regular interval so the connection doesn't become stale.
This would be difficult to implement because it requires a separate process running alongside the workflow/task (one process to ping the db and a second to run the workflow). Prefect does this out of the box, but Simmate's executor runs the workflow in the main thread. I know Prefect has run into a lot of challenging bugs because of switching out of the main thread, so I think it'd be preferable to just disable idle timeouts. I can't think of any downsides to it.
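To illustrate why a second thread/process is needed, here's a minimal sketch of the heartbeat idea. It is not Simmate code: `connect` is a hypothetical placeholder for however the worker opens its database connection, and sqlite3 just stands in for the cloud Postgres so the snippet is runnable.

```python
import sqlite3
import threading

def start_heartbeat(connect, interval=600.0):
    """Ping the database every `interval` seconds from a daemon thread
    so the pooler never sees the connection as idle.

    The thread opens its own connection because DB-API connections are
    generally not safe to share across threads.
    """
    stop = threading.Event()

    def beat():
        conn = connect()
        # Event.wait returns False on timeout, True once stop is set,
        # so this loops every `interval` seconds until told to stop.
        while not stop.wait(interval):
            conn.execute("SELECT 1")  # cheap no-op query keeps the link warm

    threading.Thread(target=beat, daemon=True).start()
    return stop  # call stop.set() to end the heartbeat

# usage sketch: an in-memory database stands in for the cloud database
stop = start_heartbeat(lambda: sqlite3.connect(":memory:"), interval=1.0)
stop.set()
```

The point of the sketch is that the heartbeat cannot live in the main thread, which is exactly the complication described above.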
You're right that 0 disables timeouts. I'm not sure what consequences that will have over the long term. Maybe I can just bump it up to 8 weeks, so it'll gradually clear out stale connections.
The approach actually didn't solve the problem. After more sleuthing, I found this:
https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html
Ubuntu's default is 7200 seconds -- i.e., the kernel starts sending keepalive probes after 2 hrs of idling, and drops the connection if they go unanswered.
My solution is to set the tcp_keepalive_time to some large value. I set it to 720000 seconds, i.e., roughly 8.3 days.
The change is made permanent by doing

```
sudo nano /etc/sysctl.conf
```

and adding this line:

```
net.ipv4.tcp_keepalive_time=720000
```

then rebooting the server.
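For what it's worth, the same change can also be applied without rebooting; this is standard sysctl usage rather than something from the thread above:

```shell
# apply immediately to the running kernel (lost on reboot)
sudo sysctl -w net.ipv4.tcp_keepalive_time=720000

# or, after editing /etc/sysctl.conf, reload it
sudo sysctl -p

# verify the current value
cat /proc/sys/net/ipv4/tcp_keepalive_time
```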
There should be a way to establish a new Django connection when one closes/fails. It's just something I wasn't able to figure out / didn't have enough time to fix. I'm sure others will run into random timeout issues, so it might be worth digging into.
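A rough sketch of that reconnect-on-failure idea, written against the generic DB-API with sqlite3 standing in for Postgres so it's runnable. None of this is Simmate code: `run_with_reconnect` and its arguments are hypothetical. In a Django worker the natural hooks would instead be `django.db.close_old_connections()` (or closing `django.db.connection` and letting the ORM reopen it on the next query), catching `django.db.OperationalError`.

```python
import sqlite3
import time

def run_with_reconnect(connect, query, retries=2, delay=0.1):
    """Run a query, reconnecting if the connection has gone stale.

    `connect` is any zero-argument callable returning a DB-API
    connection.  On failure we discard the (possibly dead) connection,
    open a fresh one, and retry up to `retries` times.
    """
    conn = connect()
    for attempt in range(retries + 1):
        try:
            cur = conn.execute(query)
            return cur.fetchall()
        except sqlite3.Error:  # with Django/psycopg2 this would be OperationalError
            if attempt == retries:
                raise
            time.sleep(delay)
            conn = connect()  # drop the dead connection, open a fresh one

# usage: a trivial in-memory database stands in for the cloud Postgres
rows = run_with_reconnect(lambda: sqlite3.connect(":memory:"), "SELECT 1")
```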
So fixing this is also tied to #78.
Describe the bug
My simmate worker seems to time out on the cloud database. For example, if I'm doing an NEB calculation:
Presumably, the cloud database terminates its connection to the worker after some period of inactivity. If this is the case, we could fix this in one of two ways:
(1) The simmate worker checks in with the database at a regular interval so the connection doesn't become stale.
(2) The timeout period is extended on the cloud server. This is probably a pgbouncer configuration option rather than a postgres option.
My guess is that the first option is better, since pgbouncer's ability to drop a connection after some period of inactivity is a desirable feature, in general. I.e., it should be the worker's responsibility to say, 'I'm still active' every 10 minutes or so, rather than allowing connections to remain active for, say, up to 2 weeks (or longer).
To Reproduce
No response
Error
Versions
Additional details
No response