kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Database Connection Failure on AML clusters using kedro `ThreadRunner` #3951

Open gitgud5000 opened 2 weeks ago

gitgud5000 commented 2 weeks ago

Description

I have an issue running Kedro with ThreadRunner to execute the following pipeline: [image: Kedro Viz view of the pipeline]

The primary layer shown in the Kedro Viz above is a series of 21 SQLScriptDataset objects (a pandas.sql_dataset.SQLQueryDataset subclass which formats input queries in a special way using parameters in the catalog and then calls super().__init__).
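For illustration only, the query-formatting step can be sketched roughly as follows. This is not the actual SQLScriptDataset code: `render_query`, the `$placeholder` syntax, and the example schema, table, and date values are all hypothetical. In a subclass of the kind described, the rendered string would presumably be what gets passed as the `sql` argument to `super().__init__`.

```python
from string import Template


def render_query(template: str, params: dict[str, str]) -> str:
    """Fill $placeholders in a SQL script with catalog parameters (hypothetical helper)."""
    # substitute() raises KeyError on a missing parameter, so a misconfigured
    # catalog entry fails before any database connection is attempted.
    return Template(template).substitute(params)


# Example values are made up; real ones would come from the catalog parameters.
sql = render_query(
    "SELECT * FROM $schema.$table WHERE load_date >= DATE '$start_date'",
    {"schema": "sales", "table": "orders", "start_date": "2024-01-01"},
)
print(sql)
```

Plain string substitution like this is only reasonable when the parameters come from trusted configuration, since it does not protect against SQL injection.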

This Kedro pipeline is triggered as part of a CommandJob in Azure Machine Learning (AML), using a command_job.py which runs a Kedro session with something like this:

from kedro.framework.session import KedroSession

if __name__ == "__main__":
    ...
    with KedroSession.create() as session:
        session.run(..., runner=runner)

Problem/Error

After most or all of the datasets in the primary layer are loaded, SQLAlchemy produces the following error:

sqlalchemy.pool.impl.QueuePool Error on connect(): ORA-28547: connection to server failed, probable Oracle Net admin error
...
DatabaseError: (cx_Oracle.DatabaseError) ORA-28547: connection to server failed, probable Oracle Net admin error
(Background on this error at: https://sqlalche.me/e/20/4xp6)

Context

In AML, these jobs can be run on two types of compute: a Compute Instance, which is an Ubuntu VM used for development, and a Cluster, which is managed infrastructure that allows creating single-node or multi-node compute for deployment.

When the CommandJob is executed on a Cluster (essentially running kedro run with ThreadRunner), the job fails. However, the failure does not occur when the same job runs on a Compute Instance, or when the pipeline is run locally from source using kedro run.

These command jobs run with the same environment image in both cases.
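One way to narrow this down might be to check whether the Cluster rejects connections that are opened concurrently, independent of Kedro. ThreadRunner executes nodes in a thread pool, so the datasets in the primary layer may open their connections at nearly the same time. The following is a minimal sketch under stated assumptions: `sqlite3` is only a stand-in so the snippet runs anywhere, and `open_and_query` and `run_concurrently` are hypothetical names. To be meaningful, the connect call would have to be replaced with the real Oracle connection and the script run on both compute types.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def open_and_query(index: int) -> tuple[int, str]:
    """Open a fresh connection, run a trivial query, and report success or the error."""
    try:
        # Stand-in connection (assumption); swap in the real Oracle connect call.
        conn = sqlite3.connect(":memory:")
        try:
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
        return index, "ok"
    except Exception as exc:
        # Record the error instead of aborting the remaining attempts.
        return index, f"failed: {exc}"


def run_concurrently(n_connections: int, max_workers: int) -> list[tuple[int, str]]:
    """Open n_connections connections from a thread pool of max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(open_and_query, range(n_connections)))


# 21 matches the number of datasets in the primary layer.
results = run_concurrently(n_connections=21, max_workers=21)
failures = [r for r in results if r[1] != "ok"]
print(f"{len(results) - len(failures)} ok, {len(failures)} failed")
```

If every attempt succeeds with `max_workers=1` but some fail with many workers on the Cluster only, that would point toward a connection limit or network setting on the Cluster side rather than toward Kedro.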

Steps to Reproduce

Attempts to Resolve

Logs

Here is a log file of a run with 'echo_pool': 'debug' and a similar setup, with 5 SQLScriptDataset objects as input: Running in AzureML.log [image: screenshot]

Your Environment

astrojuanlu commented 2 weeks ago

Hi @gitgud5000, thanks for opening this issue and sorry you had a bumpy experience. We will look into this shortly.

ArmandoRl1 commented 5 days ago

I'm having a similar problem: the issue appears when running on a cluster, but not when running locally or on a compute instance.

astrojuanlu commented 4 days ago

Hi @ArmandoRl1, could you give more details on your setup? @gitgud5000 already gave a good writeup but the more information we have about this the better.