kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Update how to use Kedro install/new? #753

Closed WaylonWalker closed 3 years ago

WaylonWalker commented 3 years ago

Introduction

It would be nice to see consistent behavior from global commands such as kedro new across different machines, regardless of the version installed. Some kedro commands need to run before an environment is set up, which can lead users to install kedro globally, as mentioned in #681.

Problem

Kedro versions differ drastically from one another. kedro new creates a different template in each version, and that template likely will not run on other versions, so it is very important to pick the right version from the start. Something as simple as a user having tried kedro 6 months ago and still having that old version installed on their machine could throw them off and make the experience very frustrating.

What's in scope

Global kedro commands that need to be run before a project's environment is fully set up and that need a specific kedro version.

What's not in scope

Project commands that are used from within a virtual environment after the user is already fully set up and working.

Design

Document using pipx for kedro new

pipx can produce reproducible results from kedro new regardless of the version that is currently on the user's path. If this were documented in the docs and in the project's README, it would help prevent users from ending up with a globally installed kedro or starting a project from the wrong version.

pipx run --spec git+https://github.com/quantumblacklabs/kedro.git@0.16.1 kedro new
pipx run --spec kedro==0.16.1 kedro new
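
For a team that wants to script this, here is a minimal Python sketch that builds the same command line from a version string. The function name pipx_new_command is hypothetical, not part of kedro or pipx; the only pipx usage assumed is the pipx run --spec form shown above, and the unpinned pipx run kedro new form that lets pipx pick a recent release.

```python
import shlex
from typing import List, Optional


def pipx_new_command(version: Optional[str] = None) -> List[str]:
    """Build the argv for running `kedro new` through pipx.

    Hypothetical helper: with a version it pins kedro via --spec,
    without one it lets pipx choose the release to run.
    """
    if version is None:
        return ["pipx", "run", "kedro", "new"]
    return ["pipx", "run", "--spec", f"kedro=={version}", "kedro", "new"]


if __name__ == "__main__":
    # Print the command instead of executing it, so nothing is downloaded here.
    print(shlex.join(pipx_new_command("0.16.1")))
```

To actually run it, the returned list could be passed to subprocess.run, which keeps the version in one place instead of in everyone's shell history.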

Alternatives considered

Using pipx seems to be the simplest way to have different versions of kedro in the global scope to run the install and new commands.

Testing

I opened a new replit.com instance and ran the following commands with success.

pipx run --spec git+https://github.com/quantumblacklabs/kedro.git@0.16.1 kedro new
pipx run --spec kedro==0.16.1 kedro new

Since pipx runs inside a sandbox, it does not appear to be a good solution for kedro install.

lorenabalan commented 3 years ago

kedro new creates a different template that likely will not run on different versions so it is very important to pick the right version from the start.

I'm not sure I follow the problem here. Can you help me understand why it's important to know the version up front? I would assume that at the very beginning of a project the user would just pick the most recently released version, rather than go back in time. kedro new is at the very top of the workflow, and it creates a template that is guaranteed to work with the installed version*, and aims to still work with all patch releases. Admittedly, the complexity of some architecture changes meant that we accidentally introduced bugs in the 0.17s, particularly with kedro install, which is not ideal.

* Starting with version 0.17.0. Between 0.16.3 and 0.17.0, kedro new with starters fetched the latest starter version by default, rather than a guaranteed working version.

lorenabalan commented 3 years ago

Took a look at pipx - wasn't familiar with it before. pipx run indeed looks interesting - "trying out" a package before installing. Was the intention to have this documented in our docs or something else? I'm not sure what the expected output/action is.

WaylonWalker commented 3 years ago

The issue is that I need to have a globally installed kedro to run new and install, and a locally installed kedro in each of my project's virtual environments. I can guarantee everyone on my team has the same kedro version installed in our projects by pinning the version, and we are all using and encouraging good use of virtual environments.

I cannot control what kedro is installed globally on everyone's machine. For someone who has used kedro previously their globally installed version is likely not the latest. So two people running the same kedro new command may get different results without understanding why if they are not following kedro closely.
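
As a rough illustration of that mismatch, here is a Python sketch that reads the pinned version out of a requirements file's text and compares it with whatever version is installed globally. The helper names are hypothetical, and it assumes the project pins kedro with a plain kedro==X.Y.Z line; other pin styles are not handled.

```python
import re
from typing import Optional

# Matches a pinned line such as "kedro==0.17.3" or "kedro[pandas]==0.17.3",
# but not other packages such as "kedro-viz==3.10".
_KEDRO_PIN = re.compile(
    r"^\s*kedro(?:\[[^\]]*\])?\s*==\s*([^\s;#]+)",
    re.MULTILINE | re.IGNORECASE,
)


def pinned_kedro_version(requirements_text: str) -> Optional[str]:
    """Return the version from a `kedro==X.Y.Z` line, or None if not pinned."""
    match = _KEDRO_PIN.search(requirements_text)
    return match.group(1) if match else None


def global_matches_pin(pinned: Optional[str], installed: Optional[str]) -> bool:
    """True only when both versions are known and identical."""
    return pinned is not None and pinned == installed


if __name__ == "__main__":
    # Assumed example content; a real project would read its own requirements file.
    sample = "pandas==1.2\nkedro-viz==3.10\nkedro==0.17.3  # pinned by the team\n"
    pinned = pinned_kedro_version(sample)
    # "0.16.1" stands in for an older global install left over on a machine.
    print(pinned, global_matches_pin(pinned, "0.16.1"))
```

The installed version could be looked up with importlib.metadata.version("kedro"), which raises PackageNotFoundError when kedro is not installed in the current environment.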

I think pipx would be one solution to provide consistency across different users with these global commands. I do not think that kedro install currently works through pipx, as it seemed to install into the pipx sandbox rather than the venv I wanted. With pipx you can guarantee that a user runs either the latest version or a specific version, rather than whatever version they happened to install the last time they installed kedro.

Here are a few snippets from the pipx readme:

How is it Different from pip?

pip is a general-purpose package installer for both libraries and apps with no environment isolation. pipx is made specifically for application installation, as it adds isolation yet still makes the apps available in your shell: pipx creates an isolated environment for each application and its associated packages.


Walkthrough: Running an Application in a Temporary Virtual Environment

This is an alternative to pipx install.

pipx run downloads and runs the above mentioned Python "apps" in a one-time, temporary environment, leaving your system untouched afterwards.

This can be handy when you need to run the latest version of an app, but don't necessarily want it installed on your computer.

You may want to do this when you are initializing a new project and want to set up the right directory structure, when you want to view the help text of an application, or if you simply want to run an app in a one-off case and leave your system untouched afterwards.

For example, the blog post How to set up a perfect Python project uses pipx run to kickstart a new project with cookiecutter, a tool that creates projects from project templates.

A nice side benefit is that you don't have to remember to upgrade the app since pipx run will automatically run a recent version for you.


WaylonWalker commented 3 years ago

What are your thoughts on converting the kedro new docs over to pipx? I've converted myself over to using pipx run kedro new for the past few weeks and it has worked well for the few sample projects I have created. I had actually stumbled into 0.17.3 this way shortly after release.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

datajoely commented 3 years ago

Hi @WaylonWalker I don't think we're going to migrate to pipx in the short term - we already step outside beginner workflows with pip-compile so I think it's best we don't introduce too many 'new' things.