kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Dolt // Kedro #746

Closed max-hoffman closed 3 years ago

max-hoffman commented 3 years ago

Introduction

The Dolt team is interested in exposing DoltDB as a Kedro DataSet type. We are also excited about the idea of exposing diffing and other SQL features for change capture if useful to the Kedro team.

I briefly filled out the bullet points below, but the write-up in my draft PR is more straight-to-the-point.

Draft PR -> https://github.com/dolthub/kedro/pull/1

Included in the PR -- brief tutorial notes/comments.

The starter integration is not heavily tested; we don't intend these additions to make it into a final PR, and we are most interested in design feedback.

Background

Dolt is a SQL database with Git-style versioning. Standalone, it can be a data source for workflow managers. Without custom code it just does what MySQL or SQLite does. Varying levels of Git functionality can be included in integrations to provide versioning, diffing, merging and reproducibility for tabular datasets, in a way that is unique to our storage layer (we have quite a few blogs on this). We have gotten a lot of positive feedback so far in this space and hope we can help solve thorny versioning problems!
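For example, since Dolt speaks the MySQL wire protocol, reading a table into pandas requires no custom code. A minimal sketch (database and table names are illustrative, and a local dolt sql-server is assumed to be running):

import pandas as pd
from sqlalchemy import create_engine

# Dolt's SQL server is MySQL-compatible, so standard drivers work as-is.
engine = create_engine("mysql+pymysql://root@localhost:3306/my_dolt_db")
df = pd.read_sql_table("my_table", engine)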

Problem

What's in scope

  1. Generic database integration
  2. Commits in database that end-user manages themselves
  3. Metadata that helps users and/or Kedro track lineage
  4. Application database that extends workflow change-capture

What's not in scope

Design

I've focused on Kedro's remote-object interface:

I made an example Dolt integration that behaves similarly to Pandas DataFrames for end-users, but uses Dolt to capture lineage and deltas of those tables.

Metadata storage, remotes, and advanced branching logic are all optional extensions beyond an otherwise pd.DataFrame experience.
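To make that shape concrete, here is a minimal sketch of what such a DataSet could look like (illustrative only; the class name and implementation are my assumptions, not the draft PR's actual code):

import pandas as pd
from kedro.io import AbstractDataSet
from sqlalchemy import create_engine


class DoltTableDataSet(AbstractDataSet):
    """Loads/saves a pandas DataFrame through a running Dolt SQL server."""

    def __init__(self, table: str, uri: str):
        self._table = table
        self._engine = create_engine(uri)  # MySQL-compatible Dolt server

    def _load(self) -> pd.DataFrame:
        return pd.read_sql_table(self._table, self._engine)

    def _save(self, data: pd.DataFrame) -> None:
        data.to_sql(self._table, self._engine, if_exists="replace", index=False)

    def _describe(self):
        return {"table": self._table}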

Two friction points that I haven't addressed in my sample code: journaling scope being limited to data catalogs, and versioning having a different meaning in Dolt.

Alternatives considered

Two other integration patterns:

  1. Expose the Dolt database itself, that users can interact with natively
  2. Context manager that can "squash" the metadata log by wrapping an execution runtime.

Neither of these struck me as particularly suited to Kedro's existing UX.

edit: SQL-server integration was mentioned as more appealing than an FS-based approach in our intro call. The two are interchangeable; FS is just easier to demo and test currently.

Testing

Explain the testing strategies to verify your design correctness (if possible). TODO

Rollout strategy

Is the change backward compatible? If not, what is the migration strategy? TODO (short answer yes)

Future iterations

Will there be future iterations of this design?

Hopefully! We are excited for feedback!

max-hoffman commented 3 years ago

Hey @limdauto! I'm interested in demoing/getting feedback on this sample code. Should I email the group that met last week? Do you guys have a separate public forum we could communicate on async?

limdauto commented 3 years ago

Hey @max-hoffman thank you very much for this. I was trying out your dataset last night and I promise I will write a longer response by the end of the week. In general, I think the dataset is great. But I'm trying to make sure I understand a few points, such as "Metadata that helps users and/or Kedro track lineage", more clearly. I'm not sure how lineage is being tracked here.

Re comm channel, we can continue the discussion here for sure.

max-hoffman commented 3 years ago

Hey Lim, that's a good question!

Database and workflow metadata systems can complement one another. Here's a version we did with Metaflow.

I apologize in advance for the long explanation and if I incorrectly generalized Kedro's journaling system!

Lineage in Dolt means the commit graph -- the atomic unit of change is a commit. A commit has data changes, one or more parents (more if merging two branches), and metadata associated with the commit. Dolt's dolt_commits and dolt_log tables contain, respectively, the entire commit history and the lineage specific to the current working branch:

> dolt sql -q "select * from dolt_commits"
+----------------------------------+-----------------+---------------------+-----------------------------------+----------------------------+
| commit_hash                      | committer       | email               | date                              | message                    |
+----------------------------------+-----------------+---------------------+-----------------------------------+----------------------------+
| 2bf1bh2l53sf5qs048qg619en84livu0 | Bojack Horseman | bojack@horseman.com | 2021-04-13 12:46:58.901 -0700 PDT | Initialize data repository |
+----------------------------------+-----------------+---------------------+-----------------------------------+----------------------------+

Dolt's data storage format and SQL components are unique, but otherwise this should sound like Git's model.

Lineage in workflow managers usually means something else -- connecting runtime/execution/session variables with inputs and outputs. Kedro's journaling system tracks lineage metadata as an append-only log. Kedro's CSV DataSet copies files by timestamp to maintain a change-set history.

When a user's data source is Dolt, they don't have to download two CSVs and pull them into Jupyter to compare side-by-side. Dolt tracks schema changes between the two, row diffs, row-level blames, and the entire history of data and commits. Everything is "under one roof," so to speak.
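For instance, diffs are queryable as SQL system tables. A minimal sketch (database and table names are illustrative; the dolt_diff_<table> system tables and the HASHOF function are documented Dolt features):

import pandas as pd
from sqlalchemy import create_engine

# Connect to a running Dolt SQL server over the MySQL wire protocol.
engine = create_engine("mysql+pymysql://root@localhost:3306/my_dolt_db")

# Each user table gets a dolt_diff_<table> system table; filtering on
# to_commit scopes the result to the changes introduced by one commit.
diff = pd.read_sql(
    "SELECT * FROM dolt_diff_my_table WHERE to_commit = HASHOF('HEAD')",
    engine,
)
print(diff.head())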

A different use-case would be if, for example, Kedro's DataCatalog was implemented as a Dolt table that tracked versions of DataSet inputs/outputs between executions: create table journal (node_name text, dataset_name text, save text, load text). If the save and load columns are md5 tree hashes of CSV files or folders, the workflow manager can check dolt status to know whether any node inputs have changed. When files do change, the history of their changes is recorded in the commit graph.
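A minimal sketch of that hypothetical journal check (every name here is illustrative, not an existing Kedro or Dolt API):

import hashlib
from pathlib import Path


def tree_hash(path: Path) -> str:
    # md5 over a single file, or over the sorted files of a folder.
    md5 = hashlib.md5()
    files = [path] if path.is_file() else sorted(path.rglob("*"))
    for f in files:
        if f.is_file():
            md5.update(f.read_bytes())
    return md5.hexdigest()


def inputs_changed(journal_row: dict, dataset_path: Path) -> bool:
    # Compare the hash recorded at the last save with what's on disk now.
    return journal_row["save"] != tree_hash(dataset_path)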

limdauto commented 3 years ago

Hi @max-hoffman, thank you very much for taking the time to write the issue and for making the demo. Apologies for the delay in response, partially because I have been in some training all week and partially because I really wanted to wrap my head around what exactly we are trying to accomplish here. First things first: such an awesome piece of technology you and the Dolt team have built there! I can't express enough how excited I am about Dolt. It feels like having a superpower I don't yet know what to do with.

Regarding an integration with Kedro, you have touched on many great ideas in your issue and in the demo. However, please allow me to take a step back and look at this from a Kedro user perspective first. As a Kedro user, I believe I can already use Dolt right now as a data source in Kedro without any extra dataset, thanks to your SQL interface. I would use it wherever I want to track different versions of my tabular datasets. It would be an alternative option to Kedro's path-based VersionedDataSet for different tabular formats, e.g. csv.

The workflow is:

  1. Since Kedro allows users to inject extra behaviours into its execution timeline through a mechanism called Hooks, I'd write a before_pipeline_run hook to start a Dolt SQL server and an after_pipeline_run hook to commit the data and stop the SQL server:

    # Imports assumed by this sketch (doltpy import paths may vary by version):
    from pathlib import Path
    from typing import Any, Dict

    from doltpy.cli import Dolt, DoltException
    from doltpy.sql import DoltSQLServerContext, ServerConfig
    from kedro.framework.hooks import hook_impl


    class ProjectHooks:
        def __init__(self):
            project_path = Path(__file__).parent.parent.parent
            self.dolt = Dolt(project_path)
            self.dolt_sql_server = DoltSQLServerContext(self.dolt, ServerConfig())

        @hook_impl
        def before_pipeline_run(self):
            self.dolt_sql_server.start_server()

        @hook_impl
        def after_pipeline_run(self, run_params: Dict[str, Any]):
            self.dolt.add(".")
            try:
                self.dolt.commit(
                    message=f"Update data from Kedro run {run_params['run_id']} with params {run_params['extra_params']}"
                )
            except DoltException as e:
                if "no changes added to commit" not in str(e):
                    raise
            finally:
                self.dolt_sql_server.stop_server()
  2. Then whenever I want to write data to Dolt, I'd just use the SQL interface through the built-in pandas.SQLTableDataSet. For example:

example_test_x:
  type: pandas.SQLTableDataSet
  table_name: example_test_x
  credentials:
    con: mysql://root@localhost:3306/kedro_dolt_demo
  save_args:
    if_exists: replace

And voila! If your data changes between kedro runs, it'll show up as Dolt commits in dolt log. For example, I have set up an example project here to demonstrate this. It's exactly the same as a default project created with our pandas-iris starter:

kedro new --starter=pandas-iris

with a modified hooks.py and catalog.yml to integrate with Dolt as explained above*. The pipeline contains a node that splits data for training and testing purposes based on some parameters. When I run the pipeline with different train/test split ratios:

kedro run --params example_test_data_ratio:0.1
kedro run --params example_test_data_ratio:0.2

there are corresponding commits in dolt:

D:\kedro-dolt-demo (main -> origin) 
(kedro-38) λ dolt log
commit m3112s3uuird3rtjt28cdeitp5prp6td
Author: Lim Hoang <limdauto@gmail.com>
Date:   Fri Apr 23 23:46:04 +0100 2021

        Update data from Kedro run 2021-04-23T22.45.45.157Z with params {'example_test_data_ratio': 0.2}

commit jc77hh54t97na1hs8i6k8b5pfrh7tiej
Author: Lim Hoang <limdauto@gmail.com>
Date:   Fri Apr 23 23:45:01 +0100 2021

        Update data from Kedro run 2021-04-23T22.44.41.926Z with params {'example_test_data_ratio': 0.1}

We can now use all of Dolt's tools to interact with the data, e.g. dolt diff:

[Screenshot: dolt diff output]

I believe this workflow is more familiar and idiomatic to Kedro users while still showcasing the value that Dolt would bring. If you are happy with this approach, we could definitely write it up in our documentation's Tools integration section, next to Spark. A further idea to improve upon this would be to allow users to check out different data branches by passing in an extra param from the CLI, e.g. kedro run --params dolt_branch:yesterday_data, and using dolt.checkout programmatically in a before_pipeline_run hook. The dream here would be to incorporate this concept of data branches into data scientists' experiment tracking tools, which we also support through Hooks. Writing this up takes a bit more time so I will leave it till another day.
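For concreteness, the branch-checkout idea could look something like this (a sketch only; the hook signature follows Kedro's spec, and the dolt.checkout call is an assumption about doltpy's API):

from typing import Any, Dict

from kedro.framework.hooks import hook_impl


class DoltBranchHooks:
    def __init__(self, dolt):
        self.dolt = dolt  # a doltpy Dolt instance, as in ProjectHooks above

    @hook_impl
    def before_pipeline_run(self, run_params: Dict[str, Any]):
        # extra_params carries values passed via `kedro run --params ...`
        branch = (run_params.get("extra_params") or {}).get("dolt_branch")
        if branch:
            self.dolt.checkout(branch)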


(*) I lied a little bit here. Even though I recommend starting and stopping the Dolt SQL server programmatically, I actually had to do it manually in my demo project with dolt sql-server --max-connections=10 from another terminal. When I start the server from another terminal, I get the nice diff of my data as presented above. However, when I start it programmatically, the diff simply says table deleted/table added. Do you have any idea why? Our SQLTableDataSet uses pandas' read_sql_table and to_sql underneath. Also, thanks for fixing --max-connections yesterday, haha... otherwise it was hanging for me before.

limdauto commented 3 years ago

The previous comment was written from the perspective of a Kedro user. I want to write up a lot more thoughts from the perspective of a Kedro developer because I think Dolt is a great tool and could solve a lot of thorny problems for us like Change Data Capture, but I still need to organise my thoughts a bit more. This is very much like a superpower that I'm still trying to wrap my head around.

max-hoffman commented 3 years ago

Thanks for giving such a thorough response!

It might take me a little while to go through this line-by-line, but my first reaction is that hooks sound like a promising direction. I did a write-up here where I wrap generic reads/writes with Dolt's features. Let me know if that helps clarify.

I have my next week planned out but I'd love to take another attempt using before_pipeline_run the following week after studying Kedro a bit more.

max-hoffman commented 3 years ago

@limdauto your demo is great! I'll do one iteration and try to write a blog draft next week if that's OK. Feel free to ask as many questions as you want in the meantime.

max-hoffman commented 3 years ago

Hey Lim, I'm working on reproducing your demo right now. I'll push some changes after I work through setup issues.

I forgot to respond to your questions:

limdauto commented 3 years ago

Hey, thank you for the update and the extra information. It all looks good from my side.

> Branching, remotes and auditing would need some custom code.

I agree. At the same time, Kedro is, among other things, a codification of workflow. I'd love to design a workflow that makes sense for our users first; I'd imagine the implementation is the simpler issue here. So I'm happy to progress first with the simplest, most obvious workflow, as outlined in the hook-based approach. Then we can think of further improvements once there is adoption.

Re diff: Thanks for the explanation. Let me see if I can reproduce the problem.

limdauto commented 3 years ago

@max-hoffman relatedly, I need to sort out a proposal for https://ep2021.europython.eu/events/call-for-proposals/ this week. I'm thinking of submitting a proposal around a Dolt // Kedro integration because I truly believe there is a much bigger future here. I might send it over to you for comments. cc @yetudada

max-hoffman commented 3 years ago

@limdauto We'd be excited to speak at events. Let me know if I can help. I got a Kedro-Dolt blog draft together last night if that's helpful at all. I was going to make a plugin package and do a second draft with hardened examples before sending it your way.

oscarbatori commented 3 years ago

@limdauto / @yetudada - shall we arrange a meeting to discuss how to take this forward from a marketing/possible GTM perspective?

max-hoffman commented 3 years ago

@limdauto I am going to spend time making dolt handle "replace" diffs more intuitively. I'll try to get those changes in and a blog draft to you by Monday/Tuesday next week.

max-hoffman commented 3 years ago

Hey @limdauto, took me a bit longer than expected but I fixed the diff bugs and added an initial blog doc here https://github.com/dolthub/kedro-dolt-demo/blob/max/branches/blog.md.

The diff fixes will roll into the next dolt release on Monday.

We will make a push to tighten up and publish this blog soon. We'd love your feedback!

max-hoffman commented 3 years ago

I made and released a plugin here -- https://github.com/dolthub/kedro-dolt.

datajoely commented 3 years ago

I think we can close this :) Thanks @max-hoffman

datajoely commented 3 years ago

I'm also adding a ticket to our backlog to add some Dolt references in our docs