kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.95k stars 903 forks

Remove the default entry point from the Kedro project template and starters #2495

Open jmholzer opened 1 year ago

jmholzer commented 1 year ago

Description

Currently, we expose an entry point to packaged projects that corresponds to cookiecutter.repo_name. This is done by src/setup.py in our project template and starters:

entry_point = (
    "{{ cookiecutter.repo_name }} = {{ cookiecutter.python_package }}.__main__:main"
)

Update 22/10/2024: This is now defined in pyproject.toml:

[project.scripts]
{{ cookiecutter.repo_name }} = "{{ cookiecutter.python_package }}.__main__:main"

This allows a user to run their installed, packaged projects from the command line by using the 'repo name' of their project, which is defined in cookiecutter.json as follows:

"repo_name": "{{ cookiecutter.project_name.strip().replace(' ', '-').replace('_', '-').lower() }}",

This has been a part of the code base since Kedro 0.14.0, though it is not documented anywhere. I do not think we should include undocumented features in Kedro, so we have two options:

  1. Document the feature by explaining what a 'repo name' is.
  2. Remove the undocumented feature.

Neither @noklam nor I prefer option 1: the concept of a 'repo name' is not documented anywhere, and introducing it would confuse users because its meaning is not intuitive. In addition, there is already a more intuitive way to run a packaged project: python -m <package_name>. For these reasons, @noklam and I prefer option 2.
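To make the 'repo name' derivation concrete, the Jinja expression from cookiecutter.json can be mirrored in plain Python. This is only an illustrative sketch: the function name derive_repo_name is hypothetical and is not part of Kedro or the project template.

```python
def derive_repo_name(project_name: str) -> str:
    """Mirror the template expression for repo_name (hypothetical helper).

    Trims surrounding whitespace, replaces spaces and underscores
    with hyphens, then lower-cases the result.
    """
    return project_name.strip().replace(" ", "-").replace("_", "-").lower()


if __name__ == "__main__":
    for name in ("Space Flights", "my_project", "  Demo Project_v2 "):
        print(repr(name), "->", derive_repo_name(name))
```

Under this rule a project named my_project would get the command my-project, which differs from both the project name and the Python package name.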

Possible alternatives

We could discuss changing the entry point to a more intuitive command. However, package_name contains underscores, which is inconsistent with the CLI exposed by kedro run, and project_name is unsuitable because it can contain spaces.

merelcht commented 1 month ago

This feature makes it possible to run Kedro on, for example, Databricks jobs. I get the point that "repo-name" isn't super intuitive, but it is only visible in the "raw" framework and starter templates anyway; by the time a user has created a project, it has already been translated into the actual name of the project. We've also never had users flag this, so I think we can just close this issue. @astrojuanlu ?

noklam commented 1 month ago

If I understand correctly, this is not actually the entry point we use for Databricks. That will work as long as __main__.py exists, because this is how Python executes a module, i.e. python -m <package>.
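As a rough illustration of why python -m <package> only needs a __main__.py, the standard-library runpy module can be used to execute a package the same way from Python. The package demo_pkg and both helper functions below are made up for this sketch and have nothing to do with the Kedro template itself.

```python
import runpy
import sys
from pathlib import Path


def make_demo_package(root: Path, package_name: str = "demo_pkg") -> str:
    """Write a tiny throwaway package containing a __main__.py (hypothetical)."""
    pkg = root / package_name
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "__main__.py").write_text(
        "def main():\n"
        "    return 'pipeline ran'\n"
        "\n"
        "if __name__ == '__main__':\n"
        "    RESULT = main()\n"
    )
    return package_name


def run_package_as_module(package_name: str, search_path: str) -> dict:
    """Execute <package_name>/__main__.py, similar to `python -m <package_name>`."""
    sys.path.insert(0, search_path)
    try:
        # For a package, runpy imports it and then runs its __main__ submodule.
        return runpy.run_module(package_name, run_name="__main__")
    finally:
        sys.path.remove(search_path)
```

No entry point declaration is involved here: the presence of __main__.py inside the package is what makes it runnable.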

The extra entry point is a weird one that we never documented. For example, when you create a spaceflights project called my_project, it creates a CLI command so that you can run my_project in the terminal, as if you were doing kedro run. IMO we should remove it, since no one is using it and an extra way to run a project is confusing without adding much benefit.
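For context, a script entry of the form name = "module:attr" tells the installer to generate a small command that imports the named object and calls it. The sketch below approximates that lookup by hand; resolve_entry_point is a hypothetical helper, and the real wrapper script generated at install time may differ in its details.

```python
import importlib
from typing import Any


def resolve_entry_point(spec: str) -> Any:
    """Resolve a 'module:attr' string to the object it names (hypothetical helper)."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name.strip())
    # Walk dotted attribute paths such as 'Class.method'; skip empty parts.
    for attr in filter(None, attr_path.strip().split(".")):
        obj = getattr(obj, attr)
    return obj


if __name__ == "__main__":
    # Uses a standard-library target so the sketch runs anywhere.
    dumps = resolve_entry_point("json:dumps")
    print(dumps({"ok": True}))
```

For a packaged project, the spec would be the right-hand side of the template entry, i.e. the package's __main__ module followed by :main, so the command and python -m <package> end up calling the same function.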

astrojuanlu commented 1 month ago

So if I understand correctly, python -m <package> will always work, and this issue is about removing the package_name CLI, right?

I see the package_name CLI is not mentioned in our tutorial https://docs.kedro.org/en/stable/tutorial/package_a_project.html#run-a-packaged-project nor in our single-machine deployment page https://docs.kedro.org/en/stable/deployment/single_machine.html#package-based

My guess is that having this extra way of running the project adds very little value, given that users can already:

Also, this is just about the defaults in the template, right? Users can still define their own entry point.

merelcht commented 2 weeks ago

Ah sorry, I was confused! It does look like we can just remove that entry point. Removing repo_name from the cookiecutter settings might not be entirely straightforward, though. I ran into some issues testing it just now, so we'll have to make sure removing it doesn't cause issues with older versions.