opendatacube / datacube-core

Open Data Cube analyses continental scale Earth Observation data through time
http://www.opendatacube.org
Apache License 2.0
493 stars 175 forks source link

Postgres Connection Timeout Crash #1606

Open omad opened 6 days ago

omad commented 6 days ago

I'm hunting a crash for some users running ODC code on the NCI, somewhere in database connection handling. A unique quirk of the NCI environment is that idle TCP Connections are killed after 5 minutes, whereas in most environments it's 2 hours. This is probably relevant, but I'm not sure.

An example stack trace ```Python traceback File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 60, in decorated raise ex File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 52, in decorated result = fn(*args, **kwargs) File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 851, in scene_select l1_count, uuids2archive = l1_scenes_to_process( File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 588, in l1_scenes_to_process files2process, uuids2archive, duplicates = l1_filter( File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 446, in l1_filter for l1_dataset in dc.index.datasets.search(product=l1_product, time=month_as_range(year, month)): File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 548, in search for product, datasets in self._do_search_by_product(query, File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 691, in _do_search_by_product with self._db_connection() as connection: File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__ next(self.gen) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_transaction.py", line 60, in _db_connection with self._index._active_connection(transaction=transaction) as conn: File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__ next(self.gen) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/index.py", line 157, in _active_connection with self._db._connect() as conn: File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__ next(self.gen) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/drivers/postgres/_connections.py", line 231, in _connect connection.close() File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1251, in close conn.close() File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1166, in close self._checkin() File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1008, in _checkin _finalize_fairy( File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy connection_record.checkin() File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 542, in checkin finalizer(connection) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 677, in _reset_characteristics characteristic.reset_characteristic(self, dbapi_connection) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/characteristics.py", line 50, in reset_characteristic dialect.reset_isolation_level(dbapi_conn) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 748, in reset_isolation_level self.set_isolation_level(dbapi_conn, self.default_isolation_level) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 851, in set_isolation_level connection.set_isolation_level(level) psycopg2.OperationalError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. ```

I can reliably cause exceptions using plain psycopg2, by attempting to use a connection that's been idle for more than 5 minutes.

I haven't been able to reproduce the exception using the ODC Index API. I've been inserting time.sleep(330) between different calls, which is what I could do with plain psycopg2.

I suspect that ODC is more resilient due to setting pool_recycle=60 on the SQLAlchemy connection pool. https://github.com/opendatacube/datacube-core/blob/4f1e636f8969e553d12835c73a6fedb28844f065/datacube/drivers/postgres/_connections.py#L131-L134

One line in the traceback that I'm suspicious of is this line 231, since I think exiting the context would close the connection. Manually calling .close() seems unnecessary. https://github.com/opendatacube/datacube-core/blob/4f1e636f8969e553d12835c73a6fedb28844f065/datacube/drivers/postgres/_connections.py#L226-L231

Summary