Universal Kedro deployment (Part 2) - Offer a unified interface to external compute and storage backends

Galileo-Galilei commented 3 years ago

Preamble

This is the second part of my serie of design documentation on refactoring Kedro to make deployment easier:

The first part is described in the issue #770 and focuses on refactoring configuration to separate external and applicative configuration.
This second part focuses on DataCatalog entries which have a compute/storage backend different than "python / in memory operations". This includes Spark, SQL, SAS, Mlflow... The goal is to suggest an API to manage these external connections in a kedronic way (credits to @BenjaminLevyQB for the name). This has some overlaps with #891 regarding DataCatalog management and with #880 regarding on how to solve it.

Defining the feature: The need for a well identified place in the KedroSession and the template for managing connections

Identifying the uses cases: perform operations with connections to external backend / servers

As Kedro's popularity is increasing and more and more people are using it in an enterprise context, they tend to have more needs to interact with external/legacy systems. The main use cases are:

Load data in memory from this backend (e.g. SQLTableDataSet)
Save data from memory to this backend (e.g. SQLTableDataSet)
Perform computation in this backend (e.g. SparkContext, SQLQueryDataSet)
Use a local client to connect to a remote server and perform/trigger operations (kedro-mlflow MlflowArtifactDataSet, kedro-dolt, kedro-neptune...)

As of now (kedro==0.17.4) Kedro offers very limited support for these use cases, and it hurts maintenability and transition to deployment, mainly because it is hard to modify these backend connections credentials for production.

Overview of possible solutions in kedro==0.17.4

Above use cases are currently handled on a per-backend basis, with very different choices in the implementation. Hereafter is a non exhaustive list of examples where a connection to a remote serve is instantiated:

Backend	Current "best" solution	kedro object	Connection object	Connection Registration location	Access within kedro objects
Spark	Kedro documentation	ProjectContext	SparkSession.builder.appName()	ProjectContext.init	get singleton inside nodes with getOrCreate
SQL	Two datasets SQLQueryDataSet to perform computation and SQLTableDataSet to load existing data	AbstractDataSet	sqlalchemy.engine() inside pd.read_sql_query	SQLTableDataSet._load	session-> context -> dataset -> load(): no easy access inside nodes
SAS	No official plugin, but my team has built an internal plugin which mimics SQL behaviour with `SASTableDataSet` and `SASQueryDataSet`	AbstractDataSet	saspy.SASsession()	SASTableDataSet.init	session-> hook: no easy access inside nodes
Big Query	Two datasets that mimic SQL ones	AbstractDataSet	bigquery.Client()	GBQQueryDataSet.init	session-> context -> dataset -> load(): no easy access inside nodes
Mlflow	kedro-mlflow, pipelineX	Custom configuration class in the plugin + Custom Hooks (for both)	mlflow.tracking.MlflowClient()	MlflowPipelineHook.before_pipeline_run	session-> hook: no easy access inside nodes
Dolt	kedro-dolt	Custom Hook	pymysql.connect()	DoltHook.init	session-> hook: no easy access inside nodes
Neptune	kedro-neptune	Custom AbstractDataSet	neptune.init()	NeptuneRunDataSet.init	session-> context -> dataset -> load(): no easy access inside nodes
Dataiku	kedro_to_dataiku	Instantation on the fly where needed	dataiku.api_client()	Instantation on the fly where needed	import package
Rest API	APIDataSet	APIDataSet	requests.request(**self._request_args)	APIDataSet._load	session-> context -> dataset -> load(): no easy access inside nodes

We can make the following observations:

there is no unified place where we can easily declare "connectors" to the backend (say backend_config.yml, see a better name further). As a consequence:
- these backends are not singleton or configuration: they are python objects recreated each time they are needed.
- it is hardly possible to change the backend (e.g. in production to change its credentials) without accessing the code, and this is a no go for deployment.
there is no unified place where we can easily "setup" these connections. They are initialised either in hooks, in custom project context, on dataset instantiation or on dataset loading. When you are looking in the code it it really hard to find out what side effect occurs behind the scene "I don't understand why it behaves this way.... Oh look someone modified the ProjectContext and add some hidden logic inside an auto registered hook"). It would make sense to have a dedicated "time" in the workflow where these objects are instantiated (maybe lazily when the KedroSession is activated?, maybe before the first use, and then reuse it for the next times through the session?).
There is only one example where the connection is retrieved inside the nodes (the SparkSession). This is because this connection is a singleton thanks to the original package design, but you cannot do something similar for, say, SQL (issue #880 tries to solve it by introducing a SQLConnection DataSet though)

Understanding the limitations: why these solutions are not sustainable in the long run

Limit 1 : Computation should be part of the nodes instead of catalog for maintenance

Many computation are performed in the catalog while they belong to the nodes: according to the principles of kedro because only I/O operations should take place in the catalog. This causes several maintenance issues:

the catalog.yml is overcrowded and hard to read / understand / modify / test (#781 even if not directly related, #880 with the ability to load sql queries from another file)
the DAG cannot be visualised (you have to create a "fake" node and fake output to create the DAG and trigger the DataSet.load methods in a specific order, see #164 or this discourse discussion!)
the "code" is not visible inside kedro-viz right panel, because it is in the dataset
When computations is performed inside a dataset load method, it is "hidden" from the user

this makes Kedro hard to use for DataWarehousing not written in pure python as required in #360.

Limit 2 : Performance issues arise because of current implementation

Perfoming calculations inside the DataCatalog raises several other issues:

Instability: Connections are instantiated several times (once per dataset, and eventually once per use) which could fire limit rates
RAM overflow: You do not want to load everything in memory by default: it is slow, causes RAM overflow and do not leverage compute backend (#575, #752, #880).
Lack of genericity: A new DataSet should be created for each new backend (Impala #504, Presto #801, GBQ #442, requests.Session #364, SparkHiveDataset #725) , and even just for more complex queries (templated query #214, looping on table #126, special backend options #752, deleting rows in SQL table #545)
Impossibility to launch asynchronous remote execution: Modifying the same database several times raise pieline DAG resolution errors (#806), triggering events on server (dolt #746

Limit 3: It is hard to bypass existing issues

As of kedro==0.17.X, it is hard to modify the existing implementation for your custom use case without the DataCatalog because:

You do not have a well identified place to create this connection, unlike other Kedro objects (e.g. functions belong to nodes which are in pipeline which are registered in pipeline_registry; datasets belong to the DataCatalog which takes its configs from the catalog.yml), so it is hard to find out the best way for you use case (a hook? a dataset? a custom ProjectContext?)
It is very hard to trick the computation to be performed in a node because you often need to access credentials (and you want to leverage ProjectContext and ConfigLoader for this) to create the connections. Accessing the credentials inside nodes is unsecured and make reuse hard (#801, .

The current solutions are the following, and None is satisfying:

Create custom datasets: tend to multiply the datasets for custom needs, and fail to achieve an "universal" solutio that matches most use cases
Modify objects at runtime with hooks: lead to dangerous side effects and hard to maintain behaviour
Create a custom ProjectContext: lead to dangerous side effects and hard to maintain behaviour

Thoughts and suggestions on API design changes

Desired properties for remote connections

Here is a minimal set of property I can think of (feel free to add some if you think some are missing):

Offer a clear interface for remote connections: quoting the Zen of Python "There should be one-- and preferably only one --obvious way to do it." In Kedro words, there should have one (and preferably only one) location in the template where objects of a given type should be declared.
Lazy instantiation: we should not trigger the connection unless it is needed
Support for both catalog/dataset and pipelines/nodes access. This means that we should be able to call a dataset with an "engine" parameter (e.g. say you want to read an SQL table with SQLTableDataSet, you want to reuse the existing connection and not instantiate it inside the DataSet) and inside a node (e.g. run a complex query with python code).
Leverage kedro's existing mechanism to handle credentials

Benefits for kedro user

Remove catalog cluttering and reduce its size (this fits well in the wider story described in #891)
Simplify overhead and maintenance (Kedro tells you were to put such connections, it is NOT up to you to choose)
Avoid creating a lot of Custom DataSets for their connections specificities
Have support of kedro-viz / auto completion / enable testing for better & faster development

API design suggestion

I suggest to have an API very similar to Kedro's DataCatalog to manage external "engines" (an engine is a client to interact with a remote server, a database or any other backend).

We would have the following correspondance:	data	clients for external connections
`AbstractDataSet`	`AbstractEngine`
`DataCatalog`	`EngineCatalog`
`catalog.yml`	`engines.yml`

An AbstractEngine would implement the following methods:


class AbstractEngine:
    def __init__(self, init_args, credentials)
        # store parameters as attributes here
        self.init_args=init_args
        self._credentials=credentials

    def get_or_create(self):
        # [the connection is a singleton](https://python-3-patterns-idioms-test.readthedocs.io/en/latest/Singleton.html)
        # this functions tries to retrieve the connection it, or create if if the singleton is not instantiated yet
        # the goal is to lazily create the connection object, but not connect to the backend yet
        # the goal is to call it at each call, so we instantiate it only when needed
        return my_connection_object(**self.init_args, self._credentials)

    def ping(self):
        # (optional) eventually connect to the backend and check if connection is healthy?
        pass

    def describe(self):
        # (optional) some information of the connection?
        pass

On a per project basis, a dedicated configuration file wil enable to declare the connections in a catalog-like way. As for catalog, anyone can create custom engine that are intened to be used inside nodes. The following notations are not well defined, but I guess they are close enough to the DataCatalog to be self-explainable:

#conf/local/engines.yml

my_sql_connection:
    type: sql.SQLAlchemyEngine
    init_args: 
        arg1: value1 #whatever you want 
    credentials: my_sql_creds  # retrieved from `credentials.yml` to leverage existing mechanism

my_spark_session:
    type: spark.SparkEngine
    init_args: 
        arg1: value1 #the .config() args? API to be designed more precisely
    credentials: my_spark_creds  # retrieved from `credentials.yml` to leverage existing mechanism

my_neptune_client:
    type: neptune.NeptuneEngine
    credentials: my_neptune_creds  # retrieved from `credentials.yml` to leverage existing mechanism

# whatever engine you need, e.g. to interact with mlflow, requests, any other sql backend, google big query...

When session.load_context(), the EngineCatalog object is instantiated and accessible like pipelines and catalog:

# example.py
with KedroSession.create(my_package):
    print(session.engines)
    # > kedro.engines.EngineCatalog object
    print(session.engines.list()
    # > ["my_sql_connection","my_spark_session", "my_neptune_client"]  # similar to DataCatalog's API

The key part is that these connection objects are accessible from the nodes as for the catalog, hereafter is an example with sql:

# src/<project_name>/pipelines/<my_pipeline>/pipeline.py

def create_pipeline(**kwargs):
# src/<project_name>/pipelines/<my_pipeline>/pipeline.py

def create_pipeline(**kwargs):
    return Pipeline(
    [node(
        func=func_for_sql,
        inputs=dict(
            engine="my_sql_engine",
            args="params:param1"
            ),
        outputs="my_sql_table"
    )]
)

Important note for developers: the key idea is that the connection is lazily instantiated and created at the first call and stored in a singleton; any further call will reuse the same connection

This refers to a custom func_for_sql function declared by the developer:

def func_for_sql(engine, args):
    query="""
        create table as ... from ...
    """
    res=engine.execute(query)
    return res

Note that there is maximum flexibility here: you don't have to load the results in memory, you can use python wrappers to write your SQL query instead of writing a big string (and benefits autocompletion, testing...): in short you can do anything you can code.

This feature request is very similar to the SQLConnectionDataset described in #880 and the associated PR #886. Above example focuses on SQL because it seems to be one of the most common use case, but I hope it it clear that this implementation would cover much more use cases (including, but not limited too, SparkSession and tracking backends which also seem to be common use cases). As a side consequence, it also helps minimizing the catalog size which partially solves somes issues discussed in #891.

Other impacts

If this approach was kept, I would incline to remove all datasets which perform computations, including all xxxQueryDataSet (where xxx=SQl, GBQ...) eventually APIDataSet (not sure about this one) and the documentation bout SparkContext. We should keep the xxxTableDataSet which only perform I/O operations and no computations though.

Other possible implementations

Instead of creating new "Engine" objects, we could only create new catalog objects to instantiate these connections, but since these objects are not real dataset (they don't have save methods), I think it is more consistent to separate them from the catalog.yml. Furthermore, we may want to pass an engine connection to a DataSet object (say a SQLTableDataSet) to be backend independent, and this assume we have instantiated the EngineCatalog object before the DataCatalog.
We could create a plugin to manage these objects quite easily; however, the interaction with the session (when are these "engines" instantiated ? How to interact with nodes in a catalog-like way?) may be difficult and need specific tricks.

datajoely commented 3 years ago

Hi @Galileo-Galilei - as always thank you for such a fantastic contribution to the project. I too have been thinking about this and am very keen to make this work. I need a bit more time to digest this but - there is some good work to do here!

datajoely commented 3 years ago

Hi @Galileo-Galilei - this is a great piece of work and thank you for putting so much time and care into this.

I'm going to break down my thinking into two distinct parts:

Moving to a singleton pattern for all datasets objects which work via a remote connection and are capable of performing remote execution
Should we as Kedro allow for arbitrary execution of statement that we have no visibility of in terms of lineage and has no guarantee of being acyclic. That being said, it would be sweet to work with SQL data without serialising data locally

Regarding the first point, ✅ we've actually been discussing internally and are 100% aligned that we need to implement this - great minds think alike 😛 . I'm quite a fan of your engines.yml pattern - we will do some prototyping on our side to see what the least breaking change looks like.

Now the second point is a bit trickier 🤷 - I had been pushing for this on our side for a while, but have recently come around to @idanov 's idea that this breaks too many of our core principles to support natively. In this world - our pipelines are no longer guaranteed to be reproducible and I think it becomes hard for us to argue the DAG generated by kedro-viz reflects reality. As it stands we are also not planning to merge the #886 into Kedro for these reasons.

For transformations, SQL support in Kedro is poor - we can use it as a point of consumption, but serialising back and forth to Python to perform transformations is silly. Your proposed solution would allow users to leverage SQL as an execution engine, but I struggle to fit it into Kedro's dataset-orientated DAG.

If you want to do transformations in SQL it's hard for me not to recommend using something like dbt and their ELT pattern for the data engineering side of things and focus on Kedro for the ML engineering parts of your workflow. They solve this lineage point by building a DAG via jinjafied SQL ref() and source() macros. I've spent a lot of time thinking about how Kedro should work in this space and am prototyping some ways that we can better compliment such setups.

I'm keen to see what the community thinks here - the former data engineer in me wants this functionality to just get things done, but the PM in me feels this will be a nightmare to support.

Galileo-Galilei commented 3 years ago

Hello @datajoely, thank you very much for the answer.

Good to see we're align on the first point :wink:

Regarding the second point, I think we may have slightly different objectives:

from a PM / PO / Tech Lead point of view, I understand that we're opening the gates to hell here. We may break existing functionalities (e.g. resuming a pipeline from a node, having unconsistent behaviour because asynchronous execution of a query is erformed behind the scene while the pipeline continue running, eventually breaking the DAG...) and these drawbacks must be thoroughly thought in regards to the expected benefits.
from the point of view of a lead data scientist (who performs code review) or a responsible of internal framework/best practices in an enterprise, you likely just want to "get things done" in the best possible way. I have seen & reviewed many kedro projects in the past 2 years, and it turns out that even if you don't implement it, people will do it anyway, in a way or another :) I can hardly blame them for that because I understand it is much more convenient to centralise some database interactions inside your kedro project instead of using a different tool. Interacting with Spark is also a very common use case in my enterprise. It is not acceptable for me to let people do whatever they want on a per project basis to deal with these use cases.

Maybe the best solution is to create a plugin (not integrated to the core framework) to enable this possibility, with adequate warnings on the risks of doing so. The MVP for this plugin would only contain new datasets like SQLConnectionDataSet described above and not implement the whole "engine" system, waiting for you to come up with something more integrated to the core framework. This would be very easy to implement and maitain on the short term, and easily reversible in case you implement something better in the future, even if it would not be as handy as the "engine" design pattern.

P.S: Thanks for pointing dbt out, I'll have a look!

datajoely commented 2 years ago

Some concerns discussed here will be addressed by https://github.com/kedro-org/kedro/pull/1163

deepyaman commented 1 year ago

I wonder if using something like https://github.com/fugue-project/fugue (or https://github.com/ibis-project/ibis?) under the hood makes sense if want to support engines in this way. Both really took off well after @Galileo-Galilei's initial post, of course. :)

astrojuanlu commented 3 weeks ago

Read this a bit quickly, but IIUC the idea is quite bold: to make the I/O layer thinner by abstracting only the connection/engine/transport and not baking in the in-memory representation.

Funny how we ended up re-discovering very similar ideas in one of the first Tech Design sessions I participated in... https://github.com/kedro-org/kedro/issues/1936#issuecomment-1597474696

I'm guessing some of the issues presented have been sort of addressed with the introduction of ibis.TableDataset. But others remain, like the limitations of credentials.

I think some of these views can be applied to the way we handle pretrained models, specifically LLMs, see https://github.com/kedro-org/kedro/discussions/3979

My question then would be: with the core of Kedro intact, can a custom DataCatalog (EngineCatalog?) be created that demonstrate these ideas?

As I mentioned elsewhere we're doing an issue cleanup to use Discussions for enhancement proposals #3767. I think this one is a bit more self-contained than part 1, so I'm taking the liberty of moving it to a Discussion directly. Let's continue the conversation there.

kedro-org / kedro