kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Kedro dataset CLI commands #3714

Open datajoely opened 4 months ago

Description

Related to the overall plug-in epic in #583, I've been thinking about both the Kedro team's own maintenance burden and the friction I see users hit when working with dataset contributions today.

Context

At a high level the following points contribute to this status quo:

Possible Implementation

I suggest Kedro introduce a set of CLI commands focused on this dataset workflow. We have a history of these ideas in the micropackaging journey as well.

They would all follow the kedro dataset <command> pattern:

| command | priority | description |
| --- | --- | --- |
| pull | P0 | This would accept a kedro-datasets name as per the catalog, e.g. polars.GenericDataSet. It would pull the source code, add the dependencies, and provide an example catalog entry. Longer term we could think about how 3rd party polyrepos could work e.g. (1) (2) |
| create | P0 | Create a class in the user's environment with the correct structure; this may need separate workflows for file-based (fsspec) and non-file-based datasets. Get the user contribution-ready on day 1; it can even include test and lint rules. |
| install | P2 | Provide an easy wrapper over the correct pip command, adding the dependency to your project and providing an example catalog entry. |
| contribute | P2 | Provide a workflow for pushing the results of pulls/creates back into the open source project. |
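
To make the proposed `kedro dataset <command>` pattern concrete, here is a minimal sketch of what the command surface could look like. It is illustrative only: none of these commands exist in Kedro today, and Kedro's real CLI is built on Click, while this sketch uses the standard library's `argparse` purely to stay dependency-free. The function names (`build_parser`, `pip_command`, `example_catalog_entry`, `main`), the `--file-based` flag, and the placeholder entry name `my_dataset` are all hypothetical. The pip extra format `kedro-datasets[<name>]` is an assumption about how the "correct pip command" might be derived from the catalog name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical `kedro dataset <command>` parser (not a real Kedro API)."""
    parser = argparse.ArgumentParser(prog="kedro dataset")
    sub = parser.add_subparsers(dest="command", required=True)

    # P0: pull a dataset's source code into the user's project.
    pull = sub.add_parser("pull", help="Pull dataset source into the project")
    pull.add_argument("name", help="Catalog-style name, e.g. polars.GenericDataSet")

    # P0: scaffold a new dataset class; the flag is a hypothetical way to
    # pick the file-based (fsspec) workflow mentioned in the proposal.
    create = sub.add_parser("create", help="Scaffold a new dataset class")
    create.add_argument("name", help="Name of the new dataset class")
    create.add_argument("--file-based", action="store_true",
                        help="Scaffold an fsspec-backed dataset")

    # P2: thin wrapper over pip.
    install = sub.add_parser("install", help="Install a dataset's dependencies")
    install.add_argument("name", help="Catalog-style name")

    # P2: push local work back upstream (workflow left undefined here).
    sub.add_parser("contribute", help="Contribute a dataset back upstream")
    return parser


def pip_command(name: str) -> list:
    # Assumption: the pip extra is named after the catalog type.
    return ["pip", "install", f"kedro-datasets[{name}]"]


def example_catalog_entry(name: str) -> str:
    # `my_dataset` is a placeholder; real entries usually need more keys.
    return f"my_dataset:\n  type: {name}\n"


def main(argv: list) -> str:
    """Return what the command would do, without running anything."""
    args = build_parser().parse_args(argv)
    if args.command == "install":
        return " ".join(pip_command(args.name)) + "\n" + example_catalog_entry(args.name)
    return f"'{args.command}' is not implemented in this sketch"
```

Calling `main(["install", "polars.GenericDataSet"])` returns the pip command this sketch would run followed by a starter catalog entry, which mirrors the "install, then show an example catalog entry" flow described for `install`. The other three commands only parse their arguments here, since their behaviour (where source is copied to, what the scaffold contains, how contributions are pushed) is still open in the proposal.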