Closed antonymilne closed 2 years ago
I have been thinking about this ticket because I think we should base it on how useful these dependencies are for users. Here's how our users have used CLI-commands which have dependencies in the project's requirements.txt
, since the 1st of September 2021:
CLI Command | Unique Runs | Dependencies |
---|---|---|
kedro lint |
127 | black==21.5b1 , flake8>=3.7.9, <4.0 and isort~=5.0 |
kedro test |
212 | pytest-cov~=3.0 , pytest-mock>=1.7.1, <2.0 and pytest~=6.2 |
kedro activate-nbstripout |
10 | nbstripout~=0.4 |
kedro ipython |
374 | ipython==7.0 |
kedro jupyter lab |
124 | ipython==7.0 , jupyter~=1.0 , jupyter_client>=5.1.0, <7.0 and jupyterlab~=3.0 |
kedro jupyter notebook |
266 | ipython==7.0 , jupyter~=1.0 and jupyter_client>=5.1.0, <7.0 |
kedro run
has been run 7,747 times andkedro viz
has been run 933 times
If you look at this table, we can ask some questions like:
kedro lint
, kedro test
, kedro ipython
, kedro jupyter lab
and kedro jupyter notebook
?kedro activate-nbstripout
and its dependency?kedro ipython
, kedro jupyter lab
and kedro jupyter notebook
series? And should we remove this workflow?Additionally @AntonyMilneQB messaged with this comment:
Just for the record in case it gets discussed next week when I’m not here, because this is something I’ve been thinking about for ages also... When I asked @idanov whether these kedro lint
, kedro activate-nbstripout
, kedro test
, etc. commands belonged in Kedro at all, his opinion was “no, not really, but people like them for convenience”. So if telemetry data shows that no one is using them then great 😄
With that said, part of Kedro’s current offering is that it simplifies setting up a data science project because it bundles all of the things you need (isort
, black
, etc.). Maybe this isn’t so important any more, but if we do remove them then we might need to change our “Kedro’s value” messaging slightly. One way of making these tools still available but not part of the core Kedro project would be to move them into custom starter a bit like this… but that does add a maintenance burden.
If it turns out that we do want to keep some of these commands then I think the really belong in a CLI group together, say kedro develop
or something. My reasoning is thatkedro lint
etc. are each just a single command rather than a group they horribly clutter up the CLI and do not belong at the top-level of the hierarchy along with core commands like run or components like pipeline, catalog
So personally I would be very happy if, as part of the CLI cleanup, we could either remove or at least hide away these commands 😄
And I'll respond with:
Just a quick note on telemetry data for kedro jupyter
and kedro ipython
- even though these might be executed relatively few times, the number of times the command is run isn't really comparable to the number of times a command like kedro lint
is run. kedro jupyter
starts a potentially long-running Jupyter session which the user can interact with over an extended period of time rather than just it just being a one-off process that exits when complete, like kedro lint
. So you could be interacting heavily with Jupyter even though you run the command only once a week.
Also @yetudada did you deliberately not look at use of kedro build-docs
and kedro build-reqs
? I suspect they're both very low. If we're thinking of removing kedro lint
etc. then I think the future of these commands should also be considered.
One more point for consideration. Keeping these requirements in requirements.txt adds a maintenance burden for keeping on top of dependencies, e.g. https://github.com/kedro-org/kedro/issues/1327 https://github.com/kedro-org/kedro/pull/1322. Moving dependencies to dev_requirements.txt would improve the situation, and obviously removing them altogether would remove the maintenance burden completely.
Outcome of this ticket would be to check with users whether we can remove the mentioned CLI commands or not. That would give us evidence about whether we need dev-requirements or not.
I think kedro jupyter notebook/lab
and kedro ipython
should be left out of this and saved for the ongoing work on improving the interactive workflow. kedro jupyter convert
is a good candidate to kill though. In my opinion the following commands are the ones that we should consider killing/moving in this issue:
activate-nbstripout
build-docs
build-reqs
jupyter convert
lint
test
Summary of survey results + telemetry data
For comparison: unique user calls to
kedro run
from 1 Jan - 10 June: 3068, and unique user calls tokedro viz
: 726
CLI command | Unique calls 1 Jan-10 June | Survey results Slack | Survey results Discord | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
kedro activate-nbstripout |
6 |
|
|
Comments from users about kedro activate-nbstripout
:
CLI command | Unique calls 1 Jan-10 June | Survey results Slack | Survey results Discord | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
kedro build-docs |
88 |
|
|
Comments from users about kedro build-docs
:
kedro cli build-docs
command"CLI command | Unique calls 1 Jan-10 June | Survey results Slack | Survey results Discord | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
kedro build-reqs |
263 |
|
|
Comments from users about kedro build-reqs
:
CLI command | Unique calls 1 Jan-10 June | Survey results Slack | Survey results Discord | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
kedro jupyter convert |
19 |
|
|
Comments from users about kedro jupyter convert
:
CLI command | Unique calls 1 Jan-10 June | Survey results Slack | Survey results Discord | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
kedro lint |
159 |
|
|
Comments from users about kedro lint
:
kedro lint
explicitly with pre-commit in place"CLI command | Unique calls 1 Jan-10 June | Survey results Slack | Survey results Discord | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
kedro test |
459 |
|
|
Comments from users about kedro test
:
Based on the above results my suggestions for next steps are as follows:
kedro jupyter convert
commandkedro activate-nbstripout
commandkedro build-docs
command, but we could do some additional research to re-evaluate its purpose.
Some users shared that they really like this command, but would also like it to have more extensive functionality, such as
automatically including the kedro viz
diagram. Because of the nature of this command it isn't run that often, likely only at the end of a project (which is reflected by the telemetry data) and one user reported that it the command didn't work, with another elaborating "[building documentation] it's an afterthought for most projects (for handover or something), and the command would probably fail or raise so many issues that it might not be worth it"kedro lint
and see if we can offer an alternative to userskedro test
and see if we can offer an alternative to users.
My guess is that kedro test
is also run as part of CI builds and that's the reason why it seems to be used quite a lot in telemetry compared to some of the other commands. kedro test
is really only a wrapper around pytest
so we could re-educate our users to just start using pytest
like we did with kedro install
and pip install -r requirements.txt
. kedro build-reqs
: I would leave this command for now. This command (compared to the others) has the highest usage numbers combined with self-proclaimed usage by both OS and internal QB users. It's also one of the more complex commands and we'd have to investigate what alternative we could offer our users. Thanks for all the research and the most excellent write up of all this! Do you have the number of unique calls to kedro run
and kedro viz
in the same time period just for comparison?
Based on this research, personally I would remove all these commands in favour of my alternative proposal (to follow in next post). If we don't want to go down that route then:
kedro activate-nbstripout
: completely agree, kill itkedro jupyter convert
: completely agree, kill itkedro build-docs
: agree with you, but more strongly in favour of killing it. In one sense it seems slightly more valuable than the very thin wrapper ones like kedro lint
and kedro test
because it does something that is a bit tricky (i.e. sphinx, which we provide some config for). On the flip side if we killed it then we could also remove docs
folder from the template which would be nice. If we keep this then completely agree we should re-think what it's really for and whether we want to continue using sphinx for it. There are probably easier to use alternatives available now. N.B. if we want to do mermaid diagrams as one user mentioned, we could use https://github.com/kedro-org/kedro/discussions/1269kedro lint
. if we keep it then I think we should consider again whether pre-commit might be useful here. I think the argument against it before was difficulty on windows, but I'm not sure.kedro test
: agree with you, but more strongly in favour of killing it. Unlike kedro lint
this just wraps a single CLI command. Agree with you that's it's like deprecation of kedro install
in favour of pip install
, which is a way more important command and AFAIK we haven't received any complaints about that at all. And I don't think there is a good alternative to pytest
here reallykedro build-reqs
: disagree here, I think the case for keeping this isn't any stronger than for kedro test
(both in terms of telemetry data and user feedback). It did used to be a complex command but is not so any more in 0.18. It's really just a thin wrapper for pip-tools compile
but with non-standard file names (i.e. requirements.txt -> .lock
rather than requirements.in -> .txt
). Now that pip has ok dependency resolution built in I think bundling piptools is much less important than it used to be. Personally I do like the piptools workflow (just as I like pytest), but I don't think it belongs in core kedro. So IMO the same re-education journey as for deprecation of kedro test
would apply hereIf we do keep some of these commands in kedro core then I think we should consider hiding them away inside a new group (name tbc) so they clutter the CLI less:
kedro dev lint
kedro dev test
kedro dev build-reqs
We strip out all these commands from core kedro, all our current starters, remove packages from requirements.txt, etc.
We (or someone else) make a new plugin kedro-with-tools
(name completely made up for now) that includes:
kedro tools
group of commands (like the kedro dev
group I said above) including lint
, test
, build-docs
, whateverkedro with-tools-starter
) that is heavier than the core one and includes any required configuration (e.g. pyproject.toml, .pre-commit-config.yaml). Along these lines: https://github.com/kedro-org/kedro-starters/pull/40Important note: I say "we or someone else" because I think could be a very good candidate for an unofficial community-maintained plugin. Or we could get it started and then hand over, or we could maintain it but be much more relaxed in accepting updates than we are with the core kedro template.
Note that pip install
immediately makes the new kedro-with-tools-starter
alias available thanks to #1592.
Starting a new project:
pip install kedro-with-tools
kedro new -s kedro-with-tools-starter # Makes new project template with pre-commit etc. config
kedro tools lint etc. commands are now available
In existing project:
pip install kedro-with-tools
kedro tools init # Not sure if we'll actually need this command at all. But it would copy pre-commit etc. config to already existing project
kedro tools lint etc. commands are now available
kedro pipeline
, kedro catalog
become much more discoverable because they're not hidden among lower-level commands all sitting at the same levelkedro lint
etc. commands. These are hard to keep up to date since best practice Python tooling changes (e.g. sphinx; this article. The current "recommendations" we make through the project template are kind of arbitrary (e.g. why do we do isort
but not mypy
in lint? Just because we haven't updated it for ages I think). We're not really best placed to judge what are the best tools, and it's always changing. If this is community maintained then we don't need to worry about staying on top of it at all! kedro-datasets
, i.e. a lean core which is extendiblepip install
. None of the current functionality is actually lost - it's just moved. In fact, user will have access to better, more up to date versions of these commandsReferences that inspired this idea: https://github.com/kedro-org/kedro/issues/826; https://github.com/kedro-org/kedro/discussions/844#discussioncomment-1034549; https://github.com/kedro-org/kedro-starters/pull/40
kedro
. On the other hand, discoverability of the "core" kedro commands is enhancedlint
etc. commands. @yetudada above didn't see this is a problem. Given how few people use the commands, I don't think they can be a major part of the kedro value propositionOverall I think both the above can be addressed by documentation and deprecating warnings.
@AntonyMilneQB Thanks for sharing and elaborating on your thoughts!
I like the idea of moving the commands we want to keep to a separate plugin if we're keeping a reasonable amount. If we decide we're removing most of them anyway, I'm not sure it's even worth creating this plugin. From your earlier comment it sounded like you'd only really want to keep kedro lint
, so then a new plugin might be overkill. But 100% agree to create a plugin if we decide to keep several of the commands.
Just to clarify: if we (or someone else) made this plugin, I think it could still contain the build-docs, build-reqs and test functionality. If it's not part of the core kedro package I feel much more relaxed about putting in functionality that is deemed useful for software engineering best practice but not commonly used at the moment.
Also, slightly simpler than a whole plugin: my original thought was that this could be done with just a custom (ideally community-maintained) starter that would provide:
kedro lint
etc. commands that you wantI still think this is fine, just doing it as a plugin is slightly neater since that way you also get a nice kedro new --starter=
alias even if's not in our official starters repo.
Thanks @MerelTheisenQB and @AntonyMilneQB for the research, and excellent summary. I agree with @AntonyMilneQB on removing these commands to declutter the CLI, or alternatively moving the additional commands we decide to keep to a separate plugin.
This stellar research @MerelTheisenQB 🎉 Thank you so much for this. I've read through your thoughts; thanks for the additions @MerelTheisenQB, @AntonyMilneQB and @NeroOkwa.
Here are my summary thoughts:
kedro activate-nbstripout
and kedro jupyter convert
kedro build-docs
could help them but didn't. However, this ticket is a low priority. pytest
in a Kedro workflow and maybe a ticket to document some of the linting tools. So with these changes, we still maintain some of our value proposition. But it's clear our users like other parts of Kedro and not necessarily our support of tooling that helps them create better software.
Proposal
We strip out all these commands from core kedro, all our current starters, remove packages from requirements.txt, etc.
We (or someone else) make a new plugin
kedro-with-tools
(name completely made up for now) that includes:
kedro tools
group of commands (like thekedro dev
group I said above) includinglint
,test
,build-docs
, whatever- dependencies to make all these work
- a starter template (call it
kedro with-tools-starter
) that is heavier than the core one and includes any required configuration (e.g. pyproject.toml, .pre-commit-config.yaml). Along these lines: add simple starter kedro-starters#40
@AntonyMilneQB--and everybody else in favor of killing kedro lint
and kedro test
, but kedro lint
especially--I don't entirely see how this workflow would work.
I do think a non-trivial part of the value proposition of Kedro is introducing users to and enforcing basic software engineering best practices. Anecdotally, I remember being one of the first people on the client-facing side in QuantumBlack (at least in North America?) to enforce coding standards on projects over 3 years ago. More recently, I think a much larger percentage of projects do so--yes, in parallel to greater adoption of tools like Black and isort by the broader community, but also I think at higher rates due to it being bundled with Kedro. In my humble opinion, Kedro has always been opinionated, and part of the value comes from this opinionated nature.
I think removing the kedro lint
command and documenting how to replicate its current behavior by leveraging the underlying tools (or something like pre-commit
) directly is fine. Maybe this will help some users understand that there's no "magic".
However, going so far as to remove linter configuration needs to be considered much more carefully:
When it comes to kedro test
, I honestly think most users still aren't writing tests (hence the low command usage), but making the config optional isn't going to make that any better. Again, totally fine with just pointing people to the equivalent pytest
command here.
With regards to making these commands part of an optional plugin, if anything, I would make them part of a plugin installed by default. This doesn't reduce the maintenance burden, and you can't outsource the development of such a core plugin to the broader community, but it does make it optional for those who may have good reason to not want the functionality.
Finally, FWIW, I think the set of core tools included as part of the project template is actually pretty good. While I personally would include things like mypy and Prettier, I think the black/isort/Flake8 suite is a good representation of "the bare minimum" to say you're following software engineering best practices on this front, without being too difficult for the majority of users to follow. Outsourced to the community, I'm sure somebody will throw in random things like importlinter
that may be nice to have, but definitely not at the same level as Black or isort in terms of getting somebody on board with better software development practices.
I've created follow up tickets for all actions mentioned in the discussion above:
Please add any further comments to the specific follow ups.
As per discussions https://github.com/kedro-org/kedro/issues/1077 and https://github.com/kedro-org/kedro/discussions/844 we want to separate dev requirements into their own file outside requirements.txt. This means any requirements that are not required for
kedro run
or running a packaged kedro project (which only exposeskedro run
). Remember in v0.18 there is nokedro install
command.kedro build-docs
) to do the right thingkedro lint
) do the right thingThings to check and think about:
kedro lint
,kedro build-docs
etc. commands need to run a pip install to install dev requirements? Is this annoying and slow? Is Waylon's idea of pre-commit a better alternative?kedro pipeline
,kedro catalog
have any dependencies outside kedro, and if so where should those dependencies go (requirements or dev requirements)?kedro build-reqs
do?Also worth checking whether any kedro requirements should be moved to project requirements. e.g. pip-tools is there but after deprecation of
kedro install
can presumably be moved to a project requirement? Do we need jupyter_client in kedro requirements? etc.