Closed Galileo-Galilei closed 3 weeks ago
Hi @Galileo-Galilei - as always thank you for such a fantastic contribution to the project. I too have been thinking about this and am very keen to make this work. I need a bit more time to digest this but - there is some good work to do here!
Hi @Galileo-Galilei - this is a great piece of work and thank you for putting so much time and care into this.
I'm going to break down my thinking into two distinct parts:
Regarding the first point, ✅ we've actually been discussing internally and are 100% aligned that we need to implement this - great minds think alike 😛 . I'm quite a fan of your `engines.yml` pattern - we will do some prototyping on our side to see what the least breaking change looks like.
Now the second point is a bit trickier 🤷 - I had been pushing for this on our side for a while, but have recently come around to @idanov's idea that this breaks too many of our core principles to support natively. In this world, our pipelines are no longer guaranteed to be reproducible and I think it becomes hard for us to argue the DAG generated by kedro-viz reflects reality. As it stands we are also not planning to merge #886 into Kedro for these reasons.
For transformations, SQL support in Kedro is poor - we can use it as a point of consumption, but serialising back and forth to Python to perform transformations is silly. Your proposed solution would allow users to leverage SQL as an execution engine, but I struggle to fit it into Kedro's dataset-orientated DAG.
If you want to do transformations in SQL it's hard for me not to recommend using something like dbt and their ELT pattern for the data engineering side of things and focus on Kedro for the ML engineering parts of your workflow. They solve this lineage point by building a DAG via jinjafied SQL `ref()` and `source()` macros. I've spent a lot of time thinking about how Kedro should work in this space and am prototyping some ways that we can better complement such setups.
I'm keen to see what the community thinks here - the former data engineer in me wants this functionality to just get things done, but the PM in me feels this will be a nightmare to support.
Hello @datajoely, thank you very much for the answer.
Good to see we're aligned on the first point :wink:
Regarding the second point, I think we may have slightly different objectives:
Maybe the best solution is to create a plugin (not integrated into the core framework) to enable this possibility, with adequate warnings on the risks of doing so. The MVP for this plugin would only contain new datasets like the `SQLConnectionDataSet` described above and not implement the whole "engine" system, waiting for you to come up with something more integrated into the core framework. This would be very easy to implement and maintain in the short term, and easily reversible in case you implement something better in the future, even if it would not be as handy as the "engine" design pattern.
P.S: Thanks for pointing dbt out, I'll have a look!
Some concerns discussed here will be addressed by https://github.com/kedro-org/kedro/pull/1163
I wonder if using something like https://github.com/fugue-project/fugue (or https://github.com/ibis-project/ibis?) under the hood makes sense if we want to support engines in this way. Both really took off well after @Galileo-Galilei's initial post, of course. :)
Read this a bit quickly, but IIUC the idea is quite bold: to make the I/O layer thinner by abstracting only the connection/engine/transport and not baking in the in-memory representation.
Funny how we ended up re-discovering very similar ideas in one of the first Tech Design sessions I participated in... https://github.com/kedro-org/kedro/issues/1936#issuecomment-1597474696
I'm guessing some of the issues presented have been sort of addressed with the introduction of `ibis.TableDataset`. But others remain, like the limitations of credentials.
I think some of these views can be applied to the way we handle pretrained models, specifically LLMs, see https://github.com/kedro-org/kedro/discussions/3979
My question then would be: with the core of Kedro intact, can a custom `DataCatalog` (`EngineCatalog`?) be created that demonstrates these ideas?
As I mentioned elsewhere we're doing an issue cleanup to use Discussions for enhancement proposals #3767. I think this one is a bit more self-contained than part 1, so I'm taking the liberty of moving it to a Discussion directly. Let's continue the conversation there.
Preamble
This is the second part of my series of design documentation on refactoring Kedro to make deployment easier:
`DataCatalog` entries which have a compute/storage backend different from "python / in-memory operations". This includes Spark, SQL, SAS, Mlflow... The goal is to suggest an API to manage these external connections in a kedronic way (credits to @BenjaminLevyQB for the name). This has some overlaps with #891 regarding `DataCatalog` management and with #880 regarding how to solve it.
Defining the feature: The need for a well identified place in the KedroSession and the template for managing connections
Identifying the use cases: perform operations with connections to external backends / servers
As Kedro's popularity is increasing and more and more people are using it in an enterprise context, they tend to have more needs to interact with external/legacy systems. The main use cases are:
SQLTableDataSet
)SQLTableDataSet
)SparkContext
,SQLQueryDataSet
)MlflowArtifactDataSet
, kedro-dolt, kedro-neptune...)
As of now (`kedro==0.17.4`), Kedro offers very limited support for these use cases, and it hurts maintainability and the transition to deployment, mainly because it is hard to modify these backend connection credentials for production.
Overview of possible solutions in kedro==0.17.4
The above use cases are currently handled on a per-backend basis, with very different choices in the implementation. Hereafter is a non-exhaustive list of examples where a connection to a remote server is instantiated:
`SASTableDataSet` and `SASQueryDataSet`
We can make the following observations:
`backend_config.yml`, see a better name further). As a consequence:
`SparkSession`). This is because this connection is a singleton thanks to the original package design, but you cannot do something similar for, say, SQL (issue #880 tries to solve it by introducing a SQLConnection DataSet though).
Understanding the limitations: why these solutions are not sustainable in the long run
Limit 1 : Computation should be part of the nodes instead of catalog for maintenance
Many computations are performed in the catalog although they belong in the nodes: according to the principles of Kedro, only I/O operations should take place in the catalog. This causes several maintenance issues:
`catalog.yml` is overcrowded and hard to read / understand / modify / test (#781 even if not directly related, #880 with the ability to load sql queries from another file)
`load` method, it is "hidden" from the user
this makes Kedro hard to use for DataWarehousing not written in pure python as required in #360.
Limit 2 : Performance issues arise because of current implementation
Performing calculations inside the `DataCatalog` raises several other issues:
Limit 3: It is hard to bypass existing issues
As of `kedro==0.17.X`, it is hard to modify the existing implementation for your custom use case without the DataCatalog because:
`catalog.yml`), so it is hard to find out the best way for your use case (a hook? a dataset? a custom ProjectContext?)
`ProjectContext` and `ConfigLoader` for this) to create the connections. Accessing the credentials inside nodes is insecure and makes reuse hard (#801, .
The current solutions are the following, and none is satisfying:
`ProjectContext`: leads to dangerous side effects and hard-to-maintain behaviour
Thoughts and suggestions on API design changes
Desired properties for remote connections
Here is a minimal set of properties I can think of (feel free to add some if you think some are missing):
`SQLTableDataSet`, you want to reuse the existing connection and not instantiate it inside the DataSet) and inside a node (e.g. run a complex query with python code).
Benefits for kedro user
API design suggestion
I suggest having an API very similar to Kedro's `DataCatalog` to manage external "engines" (an engine is a client to interact with a remote server, a database or any other backend).

| Existing object | Engine counterpart |
| --- | --- |
| `AbstractDataSet` | `AbstractEngine` |
| `DataCatalog` | `EngineCatalog` |
| `catalog.yml` | `engines.yml` |
An `AbstractEngine` would implement the following methods:
On a per-project basis, a dedicated configuration file will make it possible to declare the connections in a catalog-like way. As for the catalog, anyone can create custom engines that are intended to be used inside nodes. The following notation is not well defined, but I guess it is close enough to the DataCatalog to be self-explanatory:
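To make the idea more concrete, here is a minimal sketch of what declaring and building engines could look like. Everything in it is an assumption made for illustration and not an existing Kedro API: the `connect` method, the `from_config` constructor, the `SQLiteEngine` class, the `type` / `database` keys and the `my_sql_engine` entry name are all hypothetical, and the dictionary stands in for the content of an `engines.yml` file.

```python
import abc
import sqlite3
from typing import Any, Dict


class AbstractEngine(abc.ABC):
    """Hypothetical base class: a client to an external backend."""

    @abc.abstractmethod
    def connect(self) -> Any:
        """Create and return the underlying client / connection."""


class SQLiteEngine(AbstractEngine):
    """Hypothetical engine built on the standard-library sqlite3 module."""

    def __init__(self, database: str) -> None:
        self._database = database

    def connect(self) -> sqlite3.Connection:
        return sqlite3.connect(self._database)


class EngineCatalog:
    """Hypothetical registry of named engines, mirroring the DataCatalog."""

    # Maps the "type" key of the configuration to an engine class.
    _engine_types = {"SQLiteEngine": SQLiteEngine}

    def __init__(self, engines: Dict[str, AbstractEngine]) -> None:
        self._engines = engines

    @classmethod
    def from_config(cls, config: Dict[str, Dict[str, Any]]) -> "EngineCatalog":
        engines = {}
        for name, entry in config.items():
            params = dict(entry)  # copy so the caller's config is untouched
            engine_cls = cls._engine_types[params.pop("type")]
            engines[name] = engine_cls(**params)
        return cls(engines)

    def get(self, name: str) -> AbstractEngine:
        return self._engines[name]


# Stand-in for a hypothetical engines.yml entry:
#   my_sql_engine:
#     type: SQLiteEngine
#     database: ":memory:"
ENGINES_CONFIG = {"my_sql_engine": {"type": "SQLiteEngine", "database": ":memory:"}}

engine_catalog = EngineCatalog.from_config(ENGINES_CONFIG)
connection = engine_catalog.get("my_sql_engine").connect()
```

The point of the sketch is only that engines are declared by name in configuration and resolved by the catalog, so that switching the backend for production means editing the configuration and not the nodes.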
When `session.load_context()` is called, the `EngineCatalog` object is instantiated and accessible like pipelines and catalog:
The key part is that these connection objects are accessible from the nodes as for the catalog; hereafter is an example with sql:
Important note for developers: the key idea is that the connection is lazily instantiated: it is created at the first call and stored in a singleton, and any further call will reuse the same connection.
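The lazy, create-once behaviour described in this note could be sketched as follows. This is only an illustration under stated assumptions: `LazyEngineCatalog`, its `get` method and the `connect_calls` counter are hypothetical names, and a plain factory function stands in for whatever an engine would do to open its connection.

```python
import sqlite3
from typing import Any, Callable, Dict


class LazyEngineCatalog:
    """Hypothetical catalog that opens each connection at most once."""

    def __init__(self, factories: Dict[str, Callable[[], Any]]) -> None:
        self._factories = factories
        self._connections: Dict[str, Any] = {}
        self.connect_calls = 0  # only here to make the behaviour observable

    def get(self, name: str) -> Any:
        # Nothing is opened until the first request for this name.
        if name not in self._connections:
            self.connect_calls += 1
            self._connections[name] = self._factories[name]()
        # Every later request reuses the stored connection.
        return self._connections[name]


lazy_catalog = LazyEngineCatalog({"my_sql_engine": lambda: sqlite3.connect(":memory:")})
calls_before_first_use = lazy_catalog.connect_calls

first = lazy_catalog.get("my_sql_engine")
second = lazy_catalog.get("my_sql_engine")
```

Here `first` and `second` are the same object, and no connection exists until a node actually asks for one.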
This refers to a custom `func_for_sql` function declared by the developer:
Note that there is maximum flexibility here: you don't have to load the results in memory, and you can use python wrappers to write your SQL query instead of writing a big string (and benefit from autocompletion, testing...). In short, you can do anything you can code.
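As an illustration of such a function, here is a minimal sketch in which the computation runs inside the database and only a row count comes back to Python. The `orders` and `big_orders` tables, their columns and the `min_amount` parameter are hypothetical, and an in-memory `sqlite3` connection stands in for whatever connection the engine would provide.

```python
import sqlite3


def func_for_sql(connection: sqlite3.Connection, min_amount: float) -> int:
    """Hypothetical node: filter a table in SQL without loading it in memory."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS big_orders (id INTEGER, amount REAL)"
    )
    connection.execute("DELETE FROM big_orders")
    # The filtering happens in the backend; no rows are pulled into Python.
    connection.execute(
        "INSERT INTO big_orders SELECT id, amount FROM orders WHERE amount >= ?",
        (min_amount,),
    )
    connection.commit()
    (n_rows,) = connection.execute("SELECT COUNT(*) FROM big_orders").fetchone()
    return n_rows


# Small setup so the sketch can be run on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 5.0), (2, 50.0), (3, 500.0)]
)
n_big_orders = func_for_sql(conn, min_amount=50.0)
```

Because the function only depends on a connection object, it can be unit tested against a throwaway database as done here.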
This feature request is very similar to the `SQLConnectionDataset` described in #880 and the associated PR #886. The above example focuses on SQL because it seems to be one of the most common use cases, but I hope it is clear that this implementation would cover many more use cases (including, but not limited to, `SparkSession` and tracking backends, which also seem to be common use cases). As a side consequence, it also helps minimize the catalog size, which partially solves some issues discussed in #891.
Other impacts
If this approach were adopted, I would be inclined to remove all datasets which perform computations, including all `xxxQueryDataSet` (where xxx=SQL, GBQ...), possibly APIDataSet (not sure about this one) and the documentation about SparkContext. We should keep the `xxxTableDataSet`, which only perform I/O operations and no computations, though.
Other possible implementations
`save` methods), I think it is more consistent to separate them from the `catalog.yml`. Furthermore, we may want to pass an engine connection to a DataSet object (say a `SQLTableDataSet`) to be backend independent, and this assumes we have instantiated the EngineCatalog object before the DataCatalog.
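The ordering constraint mentioned above (engines first, datasets second) could look like the following sketch. It is not Kedro's `SQLTableDataSet`: the `EngineBackedTableDataSet` class, the `engine` configuration key and the `labels` table are hypothetical, and a plain dictionary of `sqlite3` connections stands in for an instantiated EngineCatalog.

```python
import sqlite3
from typing import Dict, List, Tuple


class EngineBackedTableDataSet:
    """Hypothetical dataset that receives its connection instead of creating it."""

    def __init__(self, table_name: str, connection: sqlite3.Connection) -> None:
        self._table_name = table_name
        self._connection = connection

    def save(self, rows: List[Tuple[int, str]]) -> None:
        self._connection.execute(
            f'CREATE TABLE IF NOT EXISTS "{self._table_name}" (id INTEGER, label TEXT)'
        )
        self._connection.executemany(
            f'INSERT INTO "{self._table_name}" VALUES (?, ?)', rows
        )
        self._connection.commit()

    def load(self) -> List[Tuple[int, str]]:
        query = f'SELECT id, label FROM "{self._table_name}" ORDER BY id'
        return self._connection.execute(query).fetchall()


# Step 1: the engines exist before any dataset is built.
engines: Dict[str, sqlite3.Connection] = {"my_sql_engine": sqlite3.connect(":memory:")}

# Step 2: datasets only name the engine they rely on.
catalog_config = {"labels": {"table_name": "labels", "engine": "my_sql_engine"}}
datasets = {
    name: EngineBackedTableDataSet(cfg["table_name"], engines[cfg["engine"]])
    for name, cfg in catalog_config.items()
}

datasets["labels"].save([(1, "a"), (2, "b")])
loaded_rows = datasets["labels"].load()
```

With this layout the dataset performs I/O only and never sees credentials, which stay with the engine definition.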