kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

[spike] Investigate possible integration with DVC #2691

Open astrojuanlu opened 1 year ago

astrojuanlu commented 1 year ago

Description

It would be nice if we had a page in our docs that described how DVC and Kedro can be used together.

Context

Kedro users have been asking for DVC for some time. For example:

The scope of this issue is to document how such a thing can be done, but there's a chance that the best way of documenting it is to contribute a plugin.

stichbury commented 1 year ago

I asked ChatGPT; what's the quality like for this response? Just wondering if we can use it to start off a page...

Kedro and DVC are both tools that can be used for data science and machine learning projects. While they serve different purposes, they can be used together to enhance the reproducibility and version control aspects of your projects. Here's how you can integrate Kedro and DVC:

Initialize a Kedro project: Start by creating a new Kedro project using the kedro new command. This will set up the basic project structure and configuration files.

Set up DVC: Install DVC and initialize it within your Kedro project directory using the dvc init command. This will create the necessary DVC files and directories.

Define your Kedro pipelines: Use Kedro to define your data processing pipelines. This involves creating modular pipeline nodes that represent individual steps in your data science workflow.

Version your data with DVC: Use DVC to version your data files and track their changes over time. By using the dvc add command, you can tell DVC to track specific data files in your Kedro project. This ensures that your data remains under version control and can be easily reproduced.

Integrate DVC with Kedro pipelines: Modify your Kedro pipeline definitions to include DVC commands. For example, you can use the dvc run command within a Kedro pipeline node to run a data processing step and have DVC track the inputs and outputs of that step.

Use DVC for data versioning and sharing: DVC provides features for managing large datasets and collaborating with others. You can use DVC's remote storage options (such as AWS S3 or Google Cloud Storage) to store your data and share it with your team or collaborators.

By combining Kedro and DVC, you can leverage Kedro's pipeline management capabilities with DVC's data versioning and reproducibility features. This allows you to track and reproduce your data science experiments effectively, ensuring the integrity and consistency of your results.
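The steps in the generated answer can be sketched as a short Python script that assembles the corresponding CLI calls. This is only a sketch under assumptions: the data path, the remote name `storage`, and the bucket URL are hypothetical placeholders, and `integration_commands` and `run_commands` are illustrative helpers, not part of Kedro or DVC. It deliberately leaves out the `dvc run` step: calling DVC from inside a Kedro node is not an established pattern, and recent DVC releases appear to have replaced `dvc run` with `dvc stage add` and `dvc repro`.

```python
import shlex
import subprocess
from pathlib import Path


def integration_commands(data_path: str, remote_url: str) -> list[list[str]]:
    """Return the CLI calls to run inside an existing Kedro project.

    The project itself is assumed to have been created with `kedro new`.
    """
    return [
        ["git", "init"],            # `dvc init` expects a Git repository by default
        ["dvc", "init"],            # creates the .dvc/ directory and its config
        ["dvc", "add", data_path],  # writes <data_path>.dvc and git-ignores the data
        ["dvc", "remote", "add", "-d", "storage", remote_url],  # default remote
        ["dvc", "push"],            # uploads the tracked data to the remote
    ]


def run_commands(commands, project_dir: Path, dry_run: bool = True) -> list[str]:
    """Print each command; execute it only when dry_run is False."""
    rendered = []
    for cmd in commands:
        line = shlex.join(cmd)
        rendered.append(line)
        print(f"$ {line}")
        if not dry_run:
            subprocess.run(cmd, cwd=project_dir, check=True)
    return rendered


if __name__ == "__main__":
    cmds = integration_commands(
        data_path="data/01_raw/companies.csv",  # hypothetical dataset
        remote_url="s3://my-bucket/kedro-dvc",  # hypothetical bucket
    )
    run_commands(cmds, project_dir=Path("."), dry_run=True)
```

With `dry_run=True` the script only prints the commands, so the sequence can be reviewed before anything touches the project. Note that a file tracked this way should not also be committed to Git.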

noklam commented 1 year ago

@stichbury brilliant idea!

JaynouOliver commented 8 months ago

Sure, please assign me, I want to contribute and learn on the go

astrojuanlu commented 8 months ago

Hi @JaynouOliver, go ahead! No need to assign the issue; just start working on a new documentation page and open a pull request when it's ready for a first review.

JaynouOliver commented 8 months ago

Sure!

astrojuanlu commented 8 months ago

Interesting perspective from a DVC user: https://fosstodon.org/@blakeNaccarato/111256190959866234

I appreciate the separation of concerns that working with DVC facilitates. Stages as shell commands make non-Python stages trivial. It's good for general processing outside research pipelines too, e.g. document processing.

Stage caching is enabled by hash comparison of deps/outs on disk and avoids costly recompute.

But this design forces disk access between each stage and lots of intermediate files. An abstraction enabling all-in-memory stages could help at the expense of caching.
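The caching idea in the quote can be illustrated with a toy example. This is a minimal sketch of hash-based stage skipping, not DVC's actual implementation: `run_stage`, the JSON lock file, and the use of SHA-256 are all assumptions made for illustration. It also shows the trade-off mentioned above, since every stage has to read its inputs from disk and write its outputs back to disk for the comparison to work.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Content hash of a file (SHA-256 chosen here for illustration)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_stage(name, deps, outs, action, lock_path: Path) -> bool:
    """Run `action` unless deps and outs match the hashes recorded in lock_path.

    Returns True if the stage ran, False if it was skipped.
    """
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    recorded = lock.get(name, {})
    dep_hashes = {str(p): file_hash(p) for p in deps}
    # Outputs must still exist on disk and be unchanged since the last run.
    outs_intact = all(
        p.exists() and recorded.get("outs", {}).get(str(p)) == file_hash(p)
        for p in outs
    )
    if recorded.get("deps") == dep_hashes and outs_intact:
        return False  # nothing changed, skip the costly recompute
    action()
    lock[name] = {
        "deps": dep_hashes,
        "outs": {str(p): file_hash(p) for p in outs},
    }
    lock_path.write_text(json.dumps(lock, indent=2))
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp, "raw.txt")
        clean = Path(tmp, "clean.txt")
        lock = Path(tmp, "stages.lock")
        raw.write_text("a\nb\n")

        def upper():
            clean.write_text(raw.read_text().upper())

        print(run_stage("clean", [raw], [clean], upper, lock))  # first run executes
        print(run_stage("clean", [raw], [clean], upper, lock))  # unchanged, skipped
        raw.write_text("c\n")
        print(run_stage("clean", [raw], [clean], upper, lock))  # dep changed, reruns
```

An all-in-memory pipeline has no files to hash between steps, which is why the quote frames in-memory stages as coming at the expense of this kind of caching.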

astrojuanlu commented 8 months ago

Today @datajoely mentioned this in our Slack. I hadn't realized that our dataset versioning sort of overlaps with DVC: https://linen-slack.kedro.org/t/16014653/hello-very-much-new-to-the-ml-world-i-m-trying-to-setup-a-fr#e111a9d2-188c-4cb3-8a64-37f938ad21ff

DVC and Kedro don’t gel super nicely together. It can be done, but our support for native DataSet versioning and Delta (spark) (non-spark) also works in this space

stichbury commented 8 months ago

Hi @JaynouOliver -- how are you? Today is the last day of October so please do slip any PRs into our queue if you have them for Hacktoberfest.

JaynouOliver commented 8 months ago

Hi. I was not doing it for Hacktoberfest. Mind if I submit it by tomorrow?

stichbury commented 8 months ago

Then that's grand, yes please, that would work for us. Thank you.

astrojuanlu commented 5 months ago

For the record, yesterday two users asked me how to combine Kedro and DVC.

stichbury commented 5 months ago

For the record, yesterday two users asked me how to combine Kedro and DVC.

Did you tell them? Did you write it down? If not, is the above generated content any use? Shall we publish?

I have many questions.

astrojuanlu commented 5 months ago

It was an in-person chat after my talk. I told them to try https://github.com/FactFiber/kedro-dvc/ but also warned them that Kedro versioning is not easily configurable, so it might be hard (see https://github.com/kedro-org/kedro/issues/2355). I think this has to be an engineering spike before it becomes a documentation issue.

stichbury commented 5 months ago

Perfect, thanks for the background and also for the change in the ticket, makes sense to me.