jacksund / simmate

The Simulated Materials Ecosystem (Simmate) is a toolbox and framework for computational materials research.
https://simmate.org
BSD 3-Clause "New" or "Revised" License
30 stars 9 forks source link

Database timing out #338

Closed scott-materials closed 1 year ago

scott-materials commented 1 year ago

Describe the bug

My simmate worker seems to time out on the cloud database. For example, if I'm doing an NEB calculation:

                    INFO     Using
                             /21dayscratch/scr/s/c/scw/simmate/start/run3/diffus
                             ion.vasp.neb-single-path-mit.00/relaxation.vasp.mvl
                             -neb-endpoint.start
                    INFO     Running 'mpirun vasp_std > vasp.out'
2022-12-19 16:42:34 INFO     Saving to database and writing outputs
Traceback (most recent call last):
  File "/nas/longleaf/home/scw/mambaforge/envs/scotts_simmate/lib/python3.10/site-packages/django/db/backends/utils.py", line 89, in _execute
    return self.cursor.execute(sql, params)
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

...

│ /nas/longleaf/home/scw/mambaforge/envs/scotts_simmate/lib/python3.10/site-pa │
│ ckages/django/db/backends/postgresql/base.py:306 in _set_autocommit          │
│                                                                              │
│   303 │                                                                      │
│   304 │   def _set_autocommit(self, autocommit):                             │
│   305 │   │   with self.wrap_database_errors:                                │
│ ❱ 306 │   │   │   self.connection.autocommit = autocommit                    │
│   307 │                                                                      │
│   308 │   def check_constraints(self, table_names=None):                     │
│   309 │   │   """                                                            │
│                                                                              │
│ ╭────────────────────────────── locals ──────────────────────────────╮       │
│ │ autocommit = False                                                 │       │
│ │       self = <DatabaseWrapper vendor='postgresql' alias='default'> │       │
│ ╰────────────────────────────────────────────────────────────────────╯       │
╰──────────────────────────────────────────────────────────────────────────────╯
InterfaceError: connection already closed

Presumably, the cloud database terminates its connection to the worker after some period of inactivity. If this is the case, we could fix this in one of two ways:

(1) The simmate worker checks in with the database at a regular interval so the connection doesn't become stale. (2) The time out period is extended on the cloud server. This is probably a pgbouncer configuration option rather than a postgres option.

My guess is that the first option is better, since pgbouncer's ability to drop a connection after some period of inactivity is a desirable feature, in general. I.e., it should be the worker's responsibility to say, 'I'm still active' every 10 minutes or so, rather than allowing connections to remain active for, say, up to 2 weeks (or longer).

To Reproduce

No response

Error

Copy complete error message here.

Versions

Additional details

No response

scott-materials commented 1 year ago

As a test, I changed this setting in /etc/pgbouncer/pgbouncer.ini:

server_idle_timeout = 1209600

This is two weeks. We'll see if this offers a temporary fix.

jacksund commented 1 year ago

Yeah, it looks like PGBouncer's default timeout settings are pretty strict (like 600s for idling). I've never had to mess with these because I'm pretty sure DigitalOcean switched many of the timeout settings for me. Looks like this page has other timeout settings that you might need. server_idle_timeout looks to be the key parameter though. Others might be ones like query_wait_timeout. I can't remember where I read it, but I'm pretty sure you can set these to 0 in order to disable timeouts.

jacksund commented 1 year ago

(1) The simmate worker checks in with the database at a regular interval so the connection doesn't become stale.

This would be difficult to implement because it requires a separate process running in addition to the workflow/task (one process to ping the db and the second to run the workflow). Prefect does this out of the box, but Simmate's executor runs the workflow in the main thread. I know Prefect has ran into a lot of challenging bugs because of switching out of the main thread, so I think it'd be preferable to just disable idle timeouts. I can't think of any downsides to it.

scott-materials commented 1 year ago

You're right that 0 disables timeouts. I'm not sure what consequences that will have over the long term. Maybe I can just bump it up to 8 weeks, so it'll gradually clear out stale connections.

scott-materials commented 1 year ago

The approach actually didn't solve the problem. After more sleuthing, I found this:

https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html

Ubuntu's default is 7200 seconds -- i.e., the kernel kills connection after 2 hrs of idling.

My solution is to set the tcp_keepalive_time to some large value. I did it for 720000 seconds, i.e., 83 days.

The change is made permanent by doing

sudo nano /etc/sysctl.conf

and adding this line:

net.ipv4.tcp_keepalive_time=720000

then rebooting the server.

jacksund commented 1 year ago

There should be a way to establish a new django connection when one closes/fails. It's just something I wasn't able to figure out / didn't have enough time to fix. I'm sure others will run into random timeout issues, so it might be worth digging into.

So fixing this is also tied to #78.