kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0
9.94k stars 903 forks source link

Evaluating CLI command usage #1293

Closed antonymilne closed 2 years ago

antonymilne commented 2 years ago

As per discussions https://github.com/kedro-org/kedro/issues/1077 and https://github.com/kedro-org/kedro/discussions/844 we want to separate dev requirements into their own file outside requirements.txt. This means any requirements that are not required for kedro run or running a packaged kedro project (which only exposes kedro run). Remember in v0.18 there is no kedro install command.

Things to check and think about:

Also worth checking whether any kedro requirements should be moved to project requirements. e.g. pip-tools is there but after deprecation of kedro install can presumably be moved to a project requirement? Do we need jupyter_client in kedro requirements? etc.

yetudada commented 2 years ago

I have been thinking about this ticket because I think we should base it on how useful these dependencies are for users. Here's how our users have used CLI-commands which have dependencies in the project's requirements.txt, since the 1st of September 2021:

CLI Command Unique Runs Dependencies
kedro lint 127 black==21.5b1, flake8>=3.7.9, <4.0 and isort~=5.0
kedro test 212 pytest-cov~=3.0, pytest-mock>=1.7.1, <2.0 and pytest~=6.2
kedro activate-nbstripout 10 nbstripout~=0.4
kedro ipython 374 ipython==7.0
kedro jupyter lab 124 ipython==7.0, jupyter~=1.0, jupyter_client>=5.1.0, <7.0 and jupyterlab~=3.0
kedro jupyter notebook 266 ipython==7.0, jupyter~=1.0 and jupyter_client>=5.1.0, <7.0

kedro run has been run 7,747 times and kedro viz has been run 933 times

If you look at this table, we can ask some questions like:

yetudada commented 2 years ago

Additionally @AntonyMilneQB messaged with this comment: Just for the record in case it gets discussed next week when I’m not here, because this is something I’ve been thinking about for ages also... When I asked @idanov whether these kedro lint, kedro activate-nbstripout, kedro test, etc. commands belonged in Kedro at all, his opinion was “no, not really, but people like them for convenience”. So if telemetry data shows that no one is using them then great 😄

With that said, part of Kedro’s current offering is that it simplifies setting up a data science project because it bundles all of the things you need (isort, black, etc.). Maybe this isn’t so important any more, but if we do remove them then we might need to change our “Kedro’s value” messaging slightly. One way of making these tools still available but not part of the core Kedro project would be to move them into custom starter a bit like this… but that does add a maintenance burden.

If it turns out that we do want to keep some of these commands then I think the really belong in a CLI group together, say kedro develop or something. My reasoning is thatkedro lint etc. are each just a single command rather than a group they horribly clutter up the CLI and do not belong at the top-level of the hierarchy along with core commands like run or components like pipeline, catalog

So personally I would be very happy if, as part of the CLI cleanup, we could either remove or at least hide away these commands 😄

yetudada commented 2 years ago

And I'll respond with:

antonymilne commented 2 years ago

Just a quick note on telemetry data for kedro jupyter and kedro ipython - even though these might be executed relatively few times, the number of times the command is run isn't really comparable to the number of times a command like kedro lint is run. kedro jupyter starts a potentially long-running Jupyter session which the user can interact with over an extended period of time rather than just it just being a one-off process that exits when complete, like kedro lint. So you could be interacting heavily with Jupyter even though you run the command only once a week.

antonymilne commented 2 years ago

Also @yetudada did you deliberately not look at use of kedro build-docs and kedro build-reqs? I suspect they're both very low. If we're thinking of removing kedro lint etc. then I think the future of these commands should also be considered.

antonymilne commented 2 years ago

One more point for consideration. Keeping these requirements in requirements.txt adds a maintenance burden for keeping on top of dependencies, e.g. https://github.com/kedro-org/kedro/issues/1327 https://github.com/kedro-org/kedro/pull/1322. Moving dependencies to dev_requirements.txt would improve the situation, and obviously removing them altogether would remove the maintenance burden completely.

merelcht commented 2 years ago

Outcome of this ticket would be to check with users whether we can remove the mentioned CLI commands or not. That would give us evidence about whether we need dev-requirements or not.

antonymilne commented 2 years ago

I think kedro jupyter notebook/lab and kedro ipython should be left out of this and saved for the ongoing work on improving the interactive workflow. kedro jupyter convert is a good candidate to kill though. In my opinion the following commands are the ones that we should consider killing/moving in this issue:

activate-nbstripout
build-docs
build-reqs
jupyter convert
lint
test
merelcht commented 2 years ago

Summary of survey results + telemetry data

For comparison: unique user calls to kedro run from 1 Jan - 10 June: 3068, and unique user calls to kedro viz: 726

CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro activate-nbstripout 6
yes no
1 21
yes no
1 10

Comments from users about kedro activate-nbstripout:

CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro build-docs 88
yes no
4 18
yes no
2 11

Comments from users about kedro build-docs:

CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro build-reqs 263
yes no
7 15
yes no
7 5

Comments from users about kedro build-reqs:

CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro jupyter convert 19
yes no
0 22
yes no
3 9

Comments from users about kedro jupyter convert:

CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro lint 159
yes no
4 18
yes no
5 6

Comments from users about kedro lint:

CLI command Unique calls 1 Jan-10 June Survey results Slack Survey results Discord
kedro test 459
yes no
4 18
yes no
5 5

Comments from users about kedro test:

merelcht commented 2 years ago

Based on the above results my suggestions for next steps are as follows:

  1. Deprecate and remove the kedro jupyter convert command
  2. Deprecate and remove the kedro activate-nbstripout command
  3. I'm leaning towards deprecating the kedro build-docscommand, but we could do some additional research to re-evaluate its purpose. Some users shared that they really like this command, but would also like it to have more extensive functionality, such as automatically including the kedro viz diagram. Because of the nature of this command it isn't run that often, likely only at the end of a project (which is reflected by the telemetry data) and one user reported that it the command didn't work, with another elaborating "[building documentation] it's an afterthought for most projects (for handover or something), and the command would probably fail or raise so many issues that it might not be worth it"
  4. Re-evaluate the purpose of kedro lint and see if we can offer an alternative to users
  5. Re-evaluate the purpose of kedro test and see if we can offer an alternative to users. My guess is that kedro test is also run as part of CI builds and that's the reason why it seems to be used quite a lot in telemetry compared to some of the other commands. kedro test is really only a wrapper around pytest so we could re-educate our users to just start using pytest like we did with kedro install and pip install -r requirements.txt.
  6. kedro build-reqs: I would leave this command for now. This command (compared to the others) has the highest usage numbers combined with self-proclaimed usage by both OS and internal QB users. It's also one of the more complex commands and we'd have to investigate what alternative we could offer our users.
antonymilne commented 2 years ago

Thanks for all the research and the most excellent write up of all this! Do you have the number of unique calls to kedro run and kedro viz in the same time period just for comparison?

Based on this research, personally I would remove all these commands in favour of my alternative proposal (to follow in next post). If we don't want to go down that route then:

  1. kedro activate-nbstripout: completely agree, kill it
  2. kedro jupyter convert: completely agree, kill it
  3. kedro build-docs: agree with you, but more strongly in favour of killing it. In one sense it seems slightly more valuable than the very thin wrapper ones like kedro lint and kedro test because it does something that is a bit tricky (i.e. sphinx, which we provide some config for). On the flip side if we killed it then we could also remove docs folder from the template which would be nice. If we keep this then completely agree we should re-think what it's really for and whether we want to continue using sphinx for it. There are probably easier to use alternatives available now. N.B. if we want to do mermaid diagrams as one user mentioned, we could use https://github.com/kedro-org/kedro/discussions/1269
  4. kedro lint. if we keep it then I think we should consider again whether pre-commit might be useful here. I think the argument against it before was difficulty on windows, but I'm not sure.
  5. kedro test: agree with you, but more strongly in favour of killing it. Unlike kedro lint this just wraps a single CLI command. Agree with you that's it's like deprecation of kedro install in favour of pip install, which is a way more important command and AFAIK we haven't received any complaints about that at all. And I don't think there is a good alternative to pytest here really
  6. kedro build-reqs: disagree here, I think the case for keeping this isn't any stronger than for kedro test (both in terms of telemetry data and user feedback). It did used to be a complex command but is not so any more in 0.18. It's really just a thin wrapper for pip-tools compile but with non-standard file names (i.e. requirements.txt -> .lock rather than requirements.in -> .txt). Now that pip has ok dependency resolution built in I think bundling piptools is much less important than it used to be. Personally I do like the piptools workflow (just as I like pytest), but I don't think it belongs in core kedro. So IMO the same re-education journey as for deprecation of kedro test would apply here

If we do keep some of these commands in kedro core then I think we should consider hiding them away inside a new group (name tbc) so they clutter the CLI less:

kedro dev lint
kedro dev test
kedro dev build-reqs
antonymilne commented 2 years ago

Proposal

We strip out all these commands from core kedro, all our current starters, remove packages from requirements.txt, etc.

We (or someone else) make a new plugin kedro-with-tools (name completely made up for now) that includes:

Important note: I say "we or someone else" because I think could be a very good candidate for an unofficial community-maintained plugin. Or we could get it started and then hand over, or we could maintain it but be much more relaxed in accepting updates than we are with the core kedro template.

User journey

Note that pip install immediately makes the new kedro-with-tools-starter alias available thanks to #1592.

Starting a new project:

pip install kedro-with-tools 
kedro new -s kedro-with-tools-starter  # Makes new project template with pre-commit etc. config
kedro tools lint etc. commands are now available

In existing project:

pip install kedro-with-tools 
kedro tools init  # Not sure if we'll actually need this command at all. But it would copy pre-commit etc. config to already existing project
kedro tools lint etc. commands are now available

Pros

References that inspired this idea: https://github.com/kedro-org/kedro/issues/826; https://github.com/kedro-org/kedro/discussions/844#discussioncomment-1034549; https://github.com/kedro-org/kedro-starters/pull/40

Cons

Overall I think both the above can be addressed by documentation and deprecating warnings.

merelcht commented 2 years ago

@AntonyMilneQB Thanks for sharing and elaborating on your thoughts!

I like the idea of moving the commands we want to keep to a separate plugin if we're keeping a reasonable amount. If we decide we're removing most of them anyway, I'm not sure it's even worth creating this plugin. From your earlier comment it sounded like you'd only really want to keep kedro lint, so then a new plugin might be overkill. But 100% agree to create a plugin if we decide to keep several of the commands.

antonymilne commented 2 years ago

Just to clarify: if we (or someone else) made this plugin, I think it could still contain the build-docs, build-reqs and test functionality. If it's not part of the core kedro package I feel much more relaxed about putting in functionality that is deemed useful for software engineering best practice but not commonly used at the moment.

Also, slightly simpler than a whole plugin: my original thought was that this could be done with just a custom (ideally community-maintained) starter that would provide:

I still think this is fine, just doing it as a plugin is slightly neater since that way you also get a nice kedro new --starter= alias even if's not in our official starters repo.

NeroOkwa commented 2 years ago

Thanks @MerelTheisenQB and @AntonyMilneQB for the research, and excellent summary. I agree with @AntonyMilneQB on removing these commands to declutter the CLI, or alternatively moving the additional commands we decide to keep to a separate plugin.

yetudada commented 2 years ago

This stellar research @MerelTheisenQB 🎉 Thank you so much for this. I've read through your thoughts; thanks for the additions @MerelTheisenQB, @AntonyMilneQB and @NeroOkwa.

Here are my summary thoughts:

So with these changes, we still maintain some of our value proposition. But it's clear our users like other parts of Kedro and not necessarily our support of tooling that helps them create better software.

deepyaman commented 2 years ago

Proposal

We strip out all these commands from core kedro, all our current starters, remove packages from requirements.txt, etc.

We (or someone else) make a new plugin kedro-with-tools (name completely made up for now) that includes:

  • kedro tools group of commands (like the kedro dev group I said above) including lint, test, build-docs, whatever
  • dependencies to make all these work
  • a starter template (call it kedro with-tools-starter) that is heavier than the core one and includes any required configuration (e.g. pyproject.toml, .pre-commit-config.yaml). Along these lines: add simple starter kedro-starters#40

@AntonyMilneQB--and everybody else in favor of killing kedro lint and kedro test, but kedro lint especially--I don't entirely see how this workflow would work.

I do think a non-trivial part of the value proposition of Kedro is introducing users to and enforcing basic software engineering best practices. Anecdotally, I remember being one of the first people on the client-facing side in QuantumBlack (at least in North America?) to enforce coding standards on projects over 3 years ago. More recently, I think a much larger percentage of projects do so--yes, in parallel to greater adoption of tools like Black and isort by the broader community, but also I think at higher rates due to it being bundled with Kedro. In my humble opinion, Kedro has always been opinionated, and part of the value comes from this opinionated nature.

I think removing the kedro lint command and documenting how to replicate its current behavior by leveraging the underlying tools (or something like pre-commit) directly is fine. Maybe this will help some users understand that there's no "magic".

However, going so far as to remove linter configuration needs to be considered much more carefully:

When it comes to kedro test, I honestly think most users still aren't writing tests (hence the low command usage), but making the config optional isn't going to make that any better. Again, totally fine with just pointing people to the equivalent pytest command here.

With regards to making these commands part of an optional plugin, if anything, I would make them part of a plugin installed by default. This doesn't reduce the maintenance burden, and you can't outsource the development of such a core plugin to the broader community, but it does make it optional for those who may have good reason to not want the functionality.

Finally, FWIW, I think the set of core tools included as part of the project template is actually pretty good. While I personally would include things like mypy and Prettier, I think the black/isort/Flake8 suite is a good representation of "the bare minimum" to say you're following software engineering best practices on this front, without being too difficult for the majority of users to follow. Outsourced to the community, I'm sure somebody will throw in random things like importlinter that may be nice to have, but definitely not at the same level as Black or isort in terms of getting somebody on board with better software development practices.

merelcht commented 2 years ago

I've created follow up tickets for all actions mentioned in the discussion above:

Please add any further comments to the specific follow ups.