kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.88k stars 894 forks source link

Spike: Make `cookiecutter` optional / not a core dependency of `kedro` #3884

Closed astrojuanlu closed 2 months ago

astrojuanlu commented 4 months ago

Description

cookiecutter is a mandatory dependency of kedro, because it's what powers kedro new and kedro pipeline create.

However, it's a heavy dependency. From #3659:

cookiecutter
├── Jinja2<4.0.0,>=2.7
│   └── MarkupSafe>=2.0
├── arrow
│   ├── python-dateutil>=2.7.0
│   │   └── six>=1.5
│   └── types-python-dateutil>=2.8.10
├── binaryornot>=0.4.4
│   └── chardet>=3.0.2
├── click<9.0.0,>=7.0
├── python-slugify>=4.0.0
│   └── text-unidecode>=1.3
├── pyyaml>=5.3.1
├── requests>=2.23.0
│   ├── certifi>=2017.4.17
│   ├── charset-normalizer<4,>=2
│   ├── idna<4,>=2.5
│   └── urllib3<3,>=1.21.1
└── rich
    ├── markdown-it-py>=2.2.0
    │   └── mdurl~=0.1
    └── pygments<3.0.0,>=2.13.0

Currently we are forcing Kedro users to install dependencies that are needed for development (kedro new, kedro pipeline create) also in their production environments.

One of those problematic transitive dependencies is rich. As of #3838, kedro can work even if rich is not installed, but the current method to do that is very brittle and difficult for users. As a result, we are effectively blocked from making progress with #2928.

For backwards compatibility reasons, pip install "kedro==0.19.*" will need to keep carrying cookiecutter as a dependency. The task then is to come up with a roadmap in which pip install "kedro>=0.20" does not.

astrojuanlu commented 4 months ago

2928 was prioritised as "High" and this issue blocks progress on that one, so I am prioritising this one as "High" too.

lrcouto commented 3 months ago

A couple more details about the issue:

Our whole project creation flow currently relies on cookiecutter for creating new projects, and for creating pipelines as well. This bumps into the problems mentioned by @astrojuanlu, since cookiecutter is pretty intense when it comes to dependencies.

As of now, cookiecutter is performing three functions on Kedro:

Project templates

This is what advertises itself for. It's a templating tool. We use it to fetch directory structure and pre-created files from both the Kedro repo itself and the kedro-starters repo. Which leads to...

Git operations

To fetch those templates, cookiecutter uses functions that do operations like cloning a repo or checking out a branch. This can become a bit of an issue from the developer side, because we cede control to an external library to do these operations and it can be tricky to debug. For example, this bit on the _create_project() function:

result_path = cookiecutter(template=template_path, **cookiecutter_args)

From then on, cookiecutter takes over and will do the whole cloning, selecting the right branch or tag of the repo, and creating the project itself. It's convenient, and at this moment our whole project creation from kedro new on is pretty much preparing data structures that will subsequently be fed to this one function. It effectively acts as a chokepoint, and because it's not ours, debugging it can be a little bit of a pain, having to dive down multiple functions on the cookiecutter library to understand what's going on. They have their own clone/checkout function so we're forced to adapt to the way they want to do it.

Prompting

This was the one that caused the issue when we tried to remove Rich as a mandatory dependency. That one cookiecutter function is also handling our prompts, we just pass which ones we require for that specific project creation on that **cookiecutter_args structure. That function is essentially equivalent to a CLI call to cookiecutter with whichever arguments you want to use, so it makes sense.

The TL;DR is that our whole project creation flow is a funnel that ends on cookiecutter.

merelcht commented 3 months ago

Thanks for the comprehensive write-up @lrcouto ! So am I correct in thinking that if a user already has a Kedro project (e.g. cloned from a repo) they can do everything they want in Kedro without needing cookiecutter?

lrcouto commented 3 months ago

Thanks for the comprehensive write-up @lrcouto ! So am I correct in thinking that if a user already has a Kedro project (e.g. cloned from a repo) they can do everything they want in Kedro without needing cookiecutter?

We're also using it to create new pipelines using the kedro pipeline create command. But as long as they have the project and all of the pipelines they're going to use, they would not need cookiecutter anymore.

noklam commented 3 months ago

In other words, if we consider kedro new or kedro pipeline create not a "core" functionalities, we can move it out from the core dependencies.

The tricky bit here is for 0.19, we need to maintain backward compatibility. In 0.20, we can definitely split it to kedro[all] and move cookiecutter out from core.

I have test it today, after pip uninstall cookiecutter, I can perform these common Kedro's action

In fact, ipython and jupyter are not Kedro's core dependencies, we could handle cookiecutter in similar way. We try to intercept this error in CLI, since all prompts are CLI. i.e.

image

Code: https://github.com/kedro-org/kedro/blob/b6e585f80ace1896e6bd3c31d827a1991ed2c664/kedro/framework/cli/utils.py#L362-L369

noklam commented 3 months ago

For the records, there are proposal for optional opt-out dependencies (from the Pydantic's author): https://discuss.python.org/t/optional-dependency-groups-omitting-package-requirements/20833

This problem is also not unique to us, see this quote from the FastAPI's author in the same thread, he has issue with rich as well:

I just came to chime in and say that I agree this would be useful to me for FastAPI and Typer at least

For example, pip install typer should include Rich and Shellingham by default, and if someone explicitly wants to strip that out they would add some additional syntax (e.g. pip install typer[minimal]).

Long story short, there are plenty of discussion but until today it's not supported (PEP doesn't exist yet). I thought it's a good idea to take inspiration from pydantic and FastAPI as they are arguably the most popular libraries in Python ecosystem. It's beneficial if we don't deviate too much from them.

Pydantic: There were discussion about pydantic-core and pydantic-core-lite (a smaller binary), I cannot find pydantic-core-lite. I suspect it's because of the discussion and there is no satisfying mechanism to support the "opt-out" dependencies, as a result it's not worth the effort to maintain this separately.

FastAPI:

In fastapi[standard], it install fastapi-cli as a dependency: https://github.com/tiangolo/fastapi/blob/653315c496304be928b46c34d35097d0ce847646/pyproject.toml#L56-L61

In FastAPI case, it makes sense since it's a web application and the CLI part isn't a runtime requirements when served. For Kedro, the CLI is a core part. If we only split the "cookiecutters" part, it's too small and IMO not worth the effort with little gain.

When one install pip install fastapi, it installs fastapi-slim[standard] (I couldn't find the code for fastapi-slim, it's probably using the same fastapi code base but packaged separately for convenience). It automatically load the pdm_build.py with https://backend.pdm-project.org/migration/#setup-script

It is something quite new as well (2months old) image

noklam commented 3 months ago

This is my proposed solution:

1. Move cookiecutter out from core and to something like kedro[new] /kedro[init]

When user run kedro new: If they have cookiecutter installed, it should run as is. If they don't, it will show a message like:

 f"Module '{module_name}' not found. Make sure to install requirements with `pip install kedro[new]`

Docs: We need to update the docs for "Get started", we need to replace pip install kedro with pip install kedro[new]

Pro: pip install kedro will work as the same for most people, except those need to create new project. (less often) Con:

2. Two-package approach, i.e. kedro and kedro-core

The FastAPI way of having fastapi and fastapi-slim

It uses a magic constant to have different build: https://github.com/tiangolo/fastapi/blob/653315c496304be928b46c34d35097d0ce847646/pdm_build.py#L6

It uses pdm, though I am not sure is that necessary to support these kind of workflow.

Pro:

Con:

astrojuanlu commented 3 months ago

Big fan of (2), pip install kedro including ["kedro-core", "cookiecutter", "rich"]. This retains backwards compatibility (we don't even have to wait until 0.20) and gives an exit path for users who want to install the "slim" or "lightweight" version.

lrcouto commented 3 months ago

Besides Nok's proposal, another idea to at least stop relying on Rich would be to try to handle the prompting outside of the cookiecutter context. That would allow us to keep everything mostly as it and not have to downgrade the Cookiecutter version if we want to remove Rich. I do think it would be a bit of reinventing the wheel though, since we'd have to create something that Cookiecutter is already doing.

lrcouto commented 3 months ago

Besides Nok's proposal, another idea to at least stop relying on Rich would be to try to handle the prompting outside of the cookiecutter context. That would allow us to keep everything mostly as it and not have to downgrade the Cookiecutter version if we want to remove Rich. I do think it would be a bit of reinventing the wheel though, since we'd have to create something that Cookiecutter is already doing.

On that idea, @ankatiyar suggested looking into Click's prompting functions to possibly replace Cookiecutter's.

@noklam opened an issue on the Cookiecuter repo regarding their own dependency on Rich.

noklam commented 3 months ago

Big fan of (2), pip install kedro including ["kedro-core", "cookiecutter", "rich"]. I will say this is (3). The most notable difference is that pip install kedro will automtically install kedro-core, the FastAPI approach will not, the benefit is that you can see exactly how many people are downloading the slim package on purpose.

(2) that FastAPI takes is a bit different, basically fastapi and fastapi-slim does not rely on each other. They are essentially duplicate but standalone packages as I understand. I am trying to ask why the author takes that approach and the lesson he learnt, one drawback that I found is that it will cause some issues to the plugins. As most likely they will pin kedro instead of kedro-core, this make it harder to actually install the lean dependencies.

I do like that both fastapi and fastapi-slim share the same codebase so we don't need to maintain a separate one.

noklam commented 3 months ago

Besides Nok's proposal, another idea to at least stop relying on Rich would be to try to handle the prompting outside of the cookiecutter context. That would allow us to keep everything mostly as it and not have to downgrade the Cookiecutter version if we want to remove Rich. I do think it would be a bit of reinventing the wheel though, since we'd have to create something that Cookiecutter is already doing.

On that idea, @ankatiyar suggested looking into Click's prompting functions to possibly replace Cookiecutter's.

@noklam opened an issue on the Cookiecuter repo regarding their own dependency on Rich.

Cookiecutter use click.prompt before they switch to rich, the implementation can be found in the https://github.com/cookiecutter/cookiecutter/pull/1901 linked.

noklam commented 3 months ago

See this from the fastapi community. Some arguments I see there:

https://github.com/tiangolo/fastapi/discussions/11733

astrojuanlu commented 3 months ago

I get it: plugins depending on kedro would need to migrate to kedro-core if they want to drop the unwanted dependencies. And for projects, the moment there is 1 plugin depending on the full kedro, they're stuck.

And yet, as @/tiangolo explains here:

So, the idea with having the new setup with fastapi apart from fastapi-slim is to make the simplest developer experience for the majority of cases, in particular for newcomers, as it's much easier to install pip install fastapi and have the default recommended packages.

For a newbie, pip install fastapi[standard] means explaining what the [] means, what is this standard keyword, what is the difference between pip install fastapi and pip install fastapi[standard], and in many cases that command won't even work, because people might be using Zsh (e.g. in Mac by default) where it needs quotes around pip install "fastapi[standard]", so, what should be the simplest use case for the newcomers becomes the most complex one, more difficult to explain and use.

He got quite a bit of opposition, but I completely agree with his stance (and Sebastián's intuition for DX is what brought FastAPI this far anyway).

One insightful comment by the way:

in my humble opinion cli is not the thing that is needed for web server.

(back to #3659 and #143)

merelcht commented 3 months ago

I actually prefer option 1 of @noklam's proposals. We already have the concept of pip install ...[...] in the Kedro ecosystem with kedro-datasets so I feel it's more consistent with other things we have done.

I can see why people like option 2, but with that one we need to think more about whether we would then break up Kedro in a modular way, where you can "add-on" parts by doing pip install kedro-... or if it's similar to fastapi where the packages are duplicates. I'm also still concerned if this is worth the engineering effort as well as the future maintenance. What I'd like to know: are Kedro's core dependencies a reason for people not to adopt Kedro?

lrcouto commented 2 months ago

We will be closing this issue for now, based on this discussion on Kedro's dependencies. We have decided on a longer term approach to deal with Cookiecutter, and by extension Rich, that will take some time to design and implement. This issue will be addressed once we're ongoing with this process.