Database Connection Failure on AML clusters using kedro `ThreadRunner`

gitgud5000 commented 2 weeks ago

Description

I have an issue running Kedro with ThreadRunner to execute the following pipeline: Pasted image 20240613004553

The primary layer shown in the Kedro Viz above is a series of 21 SQLScriptDataset objects (a pandas.sql_dataset.SQLQueryDataset subclass which formats input queries in a special way using parameters in the catalog and then calls super().__init__).

This Kedro pipeline is triggered as part of a CommandJob in Azure Machine Learning (AML), using a command_job.py which runs a Kedro session with something like this:

if __name__ == "__main__":
...
with KedroSession.create() as session:
    session.run(...,runner = runner)

Problem/Error

After most or all of the datasets in the primary layer are loaded, SQLAlchemy produces the following error:

sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)

Context

In AML, these jobs can be run on two types of compute: a Compute Instance, which is an Ubuntu VM used for development, and Clusters, which are managed infrastructures that allow for the creation of single/multi-node computes for deployment.

When executing the CommandJob, essentially running kedro run with ThreadRunner on a Cluster, the job fails. However, this issue does not occur when running the same job on a Compute Instance, or when run locally from source using kedro run.

These command jobs run with the same environment image in both cases.

Steps to Reproduce

Set up an AML Cluster with the specified environment.
Execute the CommandJob to run the Kedro pipeline with ThreadRunner.

Observe the error.

Expected Result

I would expect the job to run successfully on the cluster, as it does on other compute instances with the same configuration.

Actual Result

The job fails with the following error:

sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)

Attempts to Resolve

Using a small max_workers in the runner configuration.
Used connection parameters for the engine

Tried different engine parameters, including:

{
"pool_size": 5,
"max_overflow": 10,
"pool_timeout": 120,
"pool_recycle": 1800,
"pool_reset_on_return": null,
"isolation_level": "AUTOCOMMIT",
"pool_pre_ping": true,
"echo_pool": "debug"
}

Different Oracle (ugh, I know) drivers
Different versions of oracledb and cx-Oracle no luck.

Logs

Here is a log file of a run with 'echo_pool': 'debug' and a similar setup, with 5 SQLScriptDataset as input. Running in AzureML.log Pasted image 20240613014731

Your Environment

Python version: 3.11.9
kedro version: 0.19.6
kedro-datasets version: 3.0.0
cx-Oracle version: 8.3.0
OS: mcr.microsoft.com/azureml/inference-base-2204:20240531.v1
Oracle client: oracle/instantclient_21_8
Azure ML compute type: Standard_D16_v3
Oracle DB version: 19

astrojuanlu commented 2 weeks ago

Hi @gitgud5000, thanks for opening this issue and sorry you had a bumpy experience. We will look into this shortly.

ArmandoRl1 commented 5 days ago

I'm having a similar problem where the issue appears when running on a cluster, but not running locally or on a compute instance.

astrojuanlu commented 4 days ago

Hi @ArmandoRl1, could you give more details on your setup? @gitgud5000 already gave a good writeup but the more information we have about this the better.

kedro-org / kedro