I'm hunting a crash for some users running ODC code on the NCI, somewhere in database connection handling. A unique quirk of the NCI environment is that idle TCP Connections are killed after 5 minutes, whereas in most environments it's 2 hours. This is probably relevant, but I'm not sure.
An example stack trace
```Python traceback
File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 60, in decorated
raise ex
File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 52, in decorated
result = fn(*args, **kwargs)
File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 851, in scene_select
l1_count, uuids2archive = l1_scenes_to_process(
File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 588, in l1_scenes_to_process
files2process, uuids2archive, duplicates = l1_filter(
File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 446, in l1_filter
for l1_dataset in dc.index.datasets.search(product=l1_product, time=month_as_range(year, month)):
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 548, in search
for product, datasets in self._do_search_by_product(query,
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 691, in _do_search_by_product
with self._db_connection() as connection:
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_transaction.py", line 60, in _db_connection
with self._index._active_connection(transaction=transaction) as conn:
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/index.py", line 157, in _active_connection
with self._db._connect() as conn:
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
next(self.gen)
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/drivers/postgres/_connections.py", line 231, in _connect
connection.close()
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1251, in close
conn.close()
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1166, in close
self._checkin()
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1008, in _checkin
_finalize_fairy(
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy
connection_record.checkin()
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 542, in checkin
finalizer(connection)
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 677, in _reset_characteristics
characteristic.reset_characteristic(self, dbapi_connection)
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/characteristics.py", line 50, in reset_characteristic
dialect.reset_isolation_level(dbapi_conn)
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 748, in reset_isolation_level
self.set_isolation_level(dbapi_conn, self.default_isolation_level)
File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 851, in set_isolation_level
connection.set_isolation_level(level)
psycopg2.OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```
I can reliably cause exceptions using plain psycopg2, by attempting to use a connection that's been idle for more than 5 minutes.
I haven't been able to reproduce the exception using the ODC Index API. I've been inserting time.sleep(330) between different calls, which is what I could do with plain psycopg2.
This isn't the only time we've seen crashes that look like this on the NCI.
I've run out of steam investigating this, since I can't get a small reproducible example. For now at least.
I really want something reproducible before attempting a fix.
The connection handling code in ODC is far more complicated than I would have expected, and I'm not convinced it needs to be that way. However, maybe it does, I'm just not sure. :)
I'm hunting a crash for some users running ODC code on the NCI, somewhere in database connection handling. A unique quirk of the NCI environment is that idle TCP Connections are killed after 5 minutes, whereas in most environments it's 2 hours. This is probably relevant, but I'm not sure.
An example stack trace
```Python traceback File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 60, in decorated raise ex File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 52, in decorated result = fn(*args, **kwargs) File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 851, in scene_select l1_count, uuids2archive = l1_scenes_to_process( File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 588, in l1_scenes_to_process files2process, uuids2archive, duplicates = l1_filter( File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 446, in l1_filter for l1_dataset in dc.index.datasets.search(product=l1_product, time=month_as_range(year, month)): File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 548, in search for product, datasets in self._do_search_by_product(query, File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 691, in _do_search_by_product with self._db_connection() as connection: File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__ next(self.gen) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_transaction.py", line 60, in _db_connection with self._index._active_connection(transaction=transaction) as conn: File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__ next(self.gen) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/index.py", line 157, in _active_connection with self._db._connect() as conn: File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__ next(self.gen) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/drivers/postgres/_connections.py", line 231, in _connect connection.close() File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1251, in close conn.close() File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1166, in close self._checkin() File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1008, in _checkin _finalize_fairy( File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy connection_record.checkin() File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 542, in checkin finalizer(connection) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 677, in _reset_characteristics characteristic.reset_characteristic(self, dbapi_connection) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/characteristics.py", line 50, in reset_characteristic dialect.reset_isolation_level(dbapi_conn) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 748, in reset_isolation_level self.set_isolation_level(dbapi_conn, self.default_isolation_level) File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 851, in set_isolation_level connection.set_isolation_level(level) psycopg2.OperationalError: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. ```I can reliably cause exceptions using plain psycopg2, by attempting to use a connection that's been idle for more than 5 minutes.
I haven't been able to reproduce the exception using the ODC Index API. I've been inserting
time.sleep(330)
between different calls, which is what I could do with plain psycopg2.I suspect that ODC is more resilient due to setting
pool_recycle=60
on the SQLAlchemy connection pool. https://github.com/opendatacube/datacube-core/blob/4f1e636f8969e553d12835c73a6fedb28844f065/datacube/drivers/postgres/_connections.py#L131-L134One line in the traceback that I'm suspicious of is this line 231, since I think exiting the context would close the connection. Manually calling
.close()
seems unnecessary. https://github.com/opendatacube/datacube-core/blob/4f1e636f8969e553d12835c73a6fedb28844f065/datacube/drivers/postgres/_connections.py#L226-L231Summary