kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Overview: Kedro's dependencies and what to do about Cookiecutter #3967

Open lrcouto opened 6 days ago

lrcouto commented 6 days ago

The original issue: Kedro has a lot of dependencies

cookiecutter
├── Jinja2<4.0.0,>=2.7
│   └── MarkupSafe>=2.0
├── arrow
│   ├── python-dateutil>=2.7.0
│   │   └── six>=1.5
│   └── types-python-dateutil>=2.8.10
├── binaryornot>=0.4.4
│   └── chardet>=3.0.2
├── click<9.0.0,>=7.0
├── python-slugify>=4.0.0
│   └── text-unidecode>=1.3
├── pyyaml>=5.3.1
├── requests>=2.23.0
│   ├── certifi>=2017.4.17
│   ├── charset-normalizer<4,>=2
│   ├── idna<4,>=2.5
│   └── urllib3<3,>=1.21.1
└── rich
    ├── markdown-it-py>=2.2.0
    │   └── mdurl~=0.1
    └── pygments<3.0.0,>=2.13.0
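A tree like the one above can be regenerated for whatever versions are installed locally. The following is a minimal sketch using only the standard library's `importlib.metadata`. The helper names (`requirement_name`, `is_extra_only`, `dependency_tree`) are made up for illustration, and the regex is a simplification of full requirement parsing, so the output may differ from what a dedicated tool prints.

```python
import re
from importlib.metadata import PackageNotFoundError, requires

# Simplified pattern: the distribution name is the leading run of name characters.
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def requirement_name(requirement: str) -> str:
    """Extract the distribution name from a requirement string."""
    match = _NAME.match(requirement)
    if match is None:
        raise ValueError(f"Cannot parse requirement: {requirement!r}")
    return match.group(1)


def is_extra_only(requirement: str) -> bool:
    """Return True if the requirement only applies when an extra is requested."""
    _, _, marker = requirement.partition(";")
    return "extra" in marker


def dependency_tree(package: str, depth: int = 0, seen=None) -> list[str]:
    """Return indented lines describing the installed dependencies of `package`.

    Other environment markers (for example Python version) are not evaluated,
    so some listed dependencies may not apply to the current interpreter.
    """
    seen = set() if seen is None else seen
    lines = ["    " * depth + package]
    key = package.lower().replace("_", "-")
    if key in seen:  # do not expand the same package twice
        return lines
    seen.add(key)
    try:
        declared = requires(package) or []
    except PackageNotFoundError:  # not installed: nothing to expand
        return lines
    for requirement in declared:
        if is_extra_only(requirement):
            continue
        lines.extend(dependency_tree(requirement_name(requirement), depth + 1, seen))
    return lines


if __name__ == "__main__":
    print("\n".join(dependency_tree("cookiecutter")))
```

Running it in an environment where cookiecutter is installed prints one indented line per dependency; packages that are not installed appear as leaves.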

Attempting to remove Rich

The Cookiecutter Issue

graph TD
    A[kedro new]
    B[Initialize flag_inputs]
    C[Validate flag_inputs]
    D[Get starters_dict]
    E{starter_alias in starters_dict?}
    F[Set template_path and directory]
    G[Set selected_tools to lowercase]
    H[Create tmpdir]
    I[Get cookiecutter_dir]
    J[Get prompts_required]
    K{config_path provided?}
    L[Make cookiecutter_context]
    M[Cleanup tmpdir]
    N[Get extra_context]
    O[Make cookiecutter_args]
    P{telemetry_consent provided?}
    Q[Validate telemetry_consent]
    R[Call create_project]
    S[Call cookiecutter]

    A --> B --> C --> D --> E
    E -- Yes --> F
    E -- No --> F
    F --> G --> H --> I --> J --> K
    K -- No --> L
    K -- Yes --> M
    L --> M --> N --> O --> P
    P -- Yes --> Q --> R
    P -- No --> R
    R --> S

Current ideas for solutions

Further questions to discuss

datajoely commented 6 days ago

I wonder if we can invoke cookiecutter via pipx; it's literally only needed once.
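As a rough sketch of what that could look like, the snippet below shells out to `pipx run cookiecutter` instead of importing cookiecutter. It assumes pipx is on the PATH and relies on cookiecutter's `--no-input` and `--output-dir` options plus `key=value` extra context arguments. The function names and the example template path are hypothetical; nothing here is existing Kedro code.

```python
import shutil
import subprocess
from typing import Dict, List, Optional


def build_pipx_command(
    template: str,
    output_dir: str,
    extra_context: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build the argument list for running cookiecutter in a throwaway pipx environment."""
    command = ["pipx", "run", "cookiecutter", "--no-input", "--output-dir", output_dir]
    command.append(template)
    # cookiecutter accepts extra context as key=value pairs after the template.
    for key, value in (extra_context or {}).items():
        command.append(f"{key}={value}")
    return command


def run_cookiecutter_via_pipx(
    template: str,
    output_dir: str,
    extra_context: Optional[Dict[str, str]] = None,
) -> None:
    """Run cookiecutter through pipx, failing early if pipx is unavailable."""
    if shutil.which("pipx") is None:
        raise RuntimeError("pipx is not installed or not on PATH")
    subprocess.run(build_pipx_command(template, output_dir, extra_context), check=True)
```

For example, `run_cookiecutter_via_pipx("path/to/template", ".", {"project_name": "demo"})` would render a template without cookiecutter ever being installed next to Kedro. The tradeoff is a hard requirement on pipx and on network access the first time it runs.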

astrojuanlu commented 6 days ago

Notice that both kedro new and kedro pipeline create use cookiecutter, but refactoring the former is much more difficult than refactoring the latter. So, regarding @lrcouto's ideas for solutions, we could consider making kedro pipeline create independent of cookiecutter and then focus on what to do with kedro new.

noklam commented 5 days ago

I cannot join today's Tech Design session, so I will watch the recording. I am leaving some comments on the issue to clarify:

The only way we can currently run Kedro without needing Rich is by downgrading Cookiecutter to a version before they themselves added Rich as one of their dependencies, which is hacky and not ideal.

cookiecutter is not needed as a "runtime" dependency; by runtime I mean kedro run. If users still need kedro new or kedro pipeline create, then cookiecutter is needed.

To me, the problem right now is that users cannot INSTALL kedro without installing cookiecutter. Either of the solutions I propose addresses this, with different tradeoffs (see the summary):

  1. kedro / kedro-core
  2. Move cookiecutter and rich to optional dependencies. Essentially, the core dependency set will be equivalent to kedro-core as a package; users who need more can install kedro[standard] (arbitrary name, following the FastAPI convention).
  3. There is a 3rd option that I didn't mention before: we could vendor cookiecutter within kedro (this increases the size of the kedro library but reduces dependencies); see this thread for the full discussion. I feel this is a heavyweight solution and not worth the effort, but I want to bring it up as an alternative.
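To make option 2 concrete, here is a minimal sketch of how a command such as kedro new could import cookiecutter lazily and point users at the extra when it is missing. The `require_optional` helper, the `MissingOptionalDependency` exception, and the `standard` extra name are all hypothetical (the extra name is the arbitrary one used above); this is not how Kedro currently works.

```python
import importlib
from types import ModuleType


class MissingOptionalDependency(ImportError):
    """Raised when a feature needs a package that only an optional extra installs."""


def require_optional(module_name: str, extra: str, feature: str) -> ModuleType:
    """Import `module_name` on demand, or explain which extra provides it."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise MissingOptionalDependency(
            f"'{feature}' needs the optional dependency '{module_name}'. "
            f'Install it with: pip install "kedro[{extra}]"'
        ) from exc
```

A command implementation would then call `cookiecutter = require_optional("cookiecutter", "standard", "kedro new")` inside the function body rather than importing at module level, so kedro run never touches cookiecutter or rich.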

Replace cookiecutter

I will not consider this option unless we aim at expanding the feature. For example, there have been quite a lot of issues running kedro new in Databricks (network, permission issues). Do we have an alternative that can handle this better?

How would a possible split in two packages, or having one install option with extra dependencies, affect our user experience?

This is mostly explained in Spike: Make cookiecutter optional / not a core dependency of kedro

  1. Move cookiecutter/rich out from core to kedro[something]

Pro:

Con:

  2. Two-package approach, i.e. kedro and kedro-core

Pro:

Con:

datajoely commented 5 days ago

One last idea - pip vendors certain tools (like rich) so there is no risk of conflicts. Maybe that's what we need to do here? https://github.com/pypa/pip/tree/main/src/pip/_vendor

lrcouto commented 3 days ago

Here's the summary of what we discussed in the Tech Design session on Jun 26th:

Some interesting remarks:

Proposed solutions:

astrojuanlu commented 22 hours ago

To clarify on the two packages solution, there are 2 approaches:

  1. Disjoint kedro and kedro-slim, aka the FastAPI approach as described by @noklam here

basically fastapi and fastapi-slim do not rely on each other. They are essentially duplicate but standalone packages as I understand.

Indeed, they're generated from the same codebase but they don't depend on each other, see https://github.com/tiangolo/fastapi/pull/11503. Compare https://pyoven.org/package/fastapi with https://pyoven.org/package/fastapi-slim .

  2. kedro depending on kedro-core, aka the Dask Conda approach:

https://github.com/conda-forge/dask-feedstock/blob/18eb09f9125074b37541f8c8fffd704e32837686/recipe/meta.yaml#L16-L19

There is https://anaconda.org/conda-forge/dask, depending on dask-core, distributed, pandas etc (hence equivalent to pip install dask[complete]) and https://anaconda.org/conda-forge/dask-core, with minimal dependencies.

Other packages doing the same:


The "kedro[new]" approach would then be similar to the Dask PyPI approach.

merelcht commented 10 hours ago

Thank you so much for the great write-up of the problem and the discussion summary @lrcouto 👏 ⭐

I'd like to look at this with a short-term and long-term solution view.

Aside from these two solutions, we might need to find an alternative for cookiecutter if it is indeed being maintained less and less. I don't think that necessarily solves any of our issues though, because it would just replace the cookiecutter dependency with e.g. copier, and there's a chance that any replacement introduces Rich again at some point. So although this is related, I wouldn't consider replacing cookiecutter a solution for anything other than making sure we use up-to-date packages as dependencies.

lrcouto commented 3 hours ago

I am leaning towards separating Kedro into two packages as a solution as well. Of the two approaches, having kedro depend on kedro-core is my favorite. It would be a big endeavor to implement, but I think it would prevent this kind of issue from happening in the future as well. We could keep kedro-core as lean as possible, with only what's strictly necessary for kedro run, and put other amenities and extra features in the larger kedro package.