kedro-org / kedro

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
https://kedro.org
Apache License 2.0

Review deployment pages and consider how to integrate GetInData plugins docs and improve them overall #2435

Closed stichbury closed 1 year ago

stichbury commented 1 year ago

Following discussion with the GetInData team, we should look to include more official documentation about Kedro plugins for deployment within our official guides.

One option is to add some docs on our side and point through to the plugin docs, e.g. https://kedro-azureml.readthedocs.io/en/0.3.6/ for Azure ML (which should probably be the first one, as it's the most battle-tested and feature-complete).

An alternative is to bring those plugin docs inside our docs entirely. This has the benefit that the user stays in one location and has one style of docs to read, but it adds to the content load, which is already heavy.

I didn't have a ticket about this so have created one for discussion. Tagging in https://github.com/marrrcin

marrrcin commented 1 year ago

@stichbury We've agreed that we should include something like a "quickstart" or "tutorial" in the Kedro docs and then put a reference to more in-depth documentation (ours) at the end. This way our plugins' development cycles stay uninterrupted and don't depend on the Kedro docs release lifecycle.

How can we proceed on that?

stichbury commented 1 year ago

@marrrcin We are still looking at changes to the information architecture, so this is difficult to pin down at present. In the current table of contents, what would you propose? A section in the Kedro plugins page? Or a new section about plugins with tutorials listed? You probably have some great ideas on how to position these in the current layout, which we can take forward as we think about the new one as part of https://github.com/kedro-org/kedro/issues/1866

marrrcin commented 1 year ago

There is a section called "Deployment" already; it's a good fit for our plugins. Actually, some of the parts that are currently included there (e.g. SageMaker) can be replaced with the plugin-based approach.

astrojuanlu commented 1 year ago

cc @deepyaman should we raise the priority of this one?

stichbury commented 1 year ago

This is in the current sprint w/c 17-04

stichbury commented 1 year ago

I've done a little bit of reorganisation on the table of contents in the docs recently, which is unreleased at present, but should go out soon (you can see it in the latest docs). Let's consider how to make some changes to what we have in the set of deployment docs.

  1. I think each "How to deploy a Kedro project to X" page should have a set of subsections, something along the lines of Introduction, Prerequisites, Deployment process, and Summary. Within those sections, subsections are completely freeform, but it would be good to keep a consistent layout at the top level.

  2. Each of the pages should have a note on when it was last tested (and against which version of Kedro + other prerequisite tools), or at least some indicator of how confident we are in the content.

  3. Where there are two options (e.g. use what we describe or use the GetInData plugin) we should explain the circumstances that make you prefer one over the other. If there's no difference, do we need both? I'm not sure.
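
As a rough illustration of points 1 and 2, a page skeleton along these lines might work. This is only a sketch: the four headings come from the list above, while the page title pattern, the note format, and every `<...>` placeholder are assumptions to be filled in per target.

```markdown
# How to deploy a Kedro project to <target>

> Last tested: <date>, with Kedro <version> and <prerequisite tool> <version>.
> Confidence: <green | amber | red>

## Introduction

<!-- What the target is; if a plugin alternative exists, when to prefer it (point 3). -->

## Prerequisites

## Deployment process

<!-- Subsections here are freeform. -->

## Summary
```

Keeping the "last tested" note directly under the title would let a reader judge how current the page is before starting on the steps.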

Turning to the deployment targets, I have these so far:

| Deployment target/action | Notes | Technical reviewer input |
| --- | --- | --- |
| Airflow | Existing Airflow docs for QB-supported Airflow plugin. These docs are well-structured but I can't speak for correctness.<br>Get In Data's kedro-airflow-k8s plugin -- documentation suggests not to use for versions of Kedro > 17.0 ?? | Reviewer: Is the documentation complete and up to date? How confident are you (green, amber or red)? Do we need both or should we just point through to GetInData docs? |
| Argo | Kedro docs are for use without a plugin but also mention/link to an unsupported 3rd party plugin, last updated in summer 2020 | Reviewer: Is the documentation complete and up to date? How confident are you (green, amber or red)? |
| AWS Batch | Kedro docs are comprehensive | Reviewer: Is the documentation complete and up to date? How confident are you (green, amber or red)? |
| AWS EMR | Written up (as a blog post) but as yet unpublished | This will stay as a blog post for now unless I'm persuaded otherwise, since it's nice to have the technical content. I will make a ticket to expand it and convert to docs though, if this makes sense to reviewers? |
| AWS SageMaker | Existing SageMaker docs<br>GetInData have a kedro-sagemaker plugin | Reviewer: Is the documentation complete and up to date? How confident are you (green, amber or red)? Do we need both or should we just point through to GetInData docs? |
| Azure | Battle-tested kedro-azureml plugin from GetInData | Reviewer: Is the documentation complete and up to date? How confident are you (green, amber or red)? |
| Dask | Existing Dask docs | Reviewer: Is the documentation complete and up to date? How confident are you (green, amber or red)? |
| Kubeflow | Existing Kubeflow Workflows docs<br>kedro-kubeflow plugin from GetInData. | Reviewer: Is the documentation complete and up to date? How confident are you (green, amber or red)? Do we need both or should we just point through to GetInData docs? |
| Prefect | Existing Prefect docs have not been tested with Prefect 2.0 | Reviewer: Is the documentation complete and up to date? How confident are you (green, amber or red)? |
| VertexAI | kedro-vertexai plugin from GetInData. | Reviewer: Is the documentation complete and up to date? How confident are you (green, amber or red)? |

stichbury commented 1 year ago

OK, I've got a little table going up in the previous comment, to track our confidence and the completeness of various deployment pages.

Please could I ask for some technical help from the usual suspects: @deepyaman @noklam @merelcht @marrrcin @astrojuanlu to answer the 3 questions above and noted in the table:

  1. Are the docs for a target complete?
  2. Are we confident in them?
  3. Do we need both the existing text and to point to the GetInData plugin, or should we phase out our docs?

Feel free to either drop a comment below for anything you want to comment on, or edit the table directly above if you're brave enough/foolish enough to want to wrangle a markdown table.

From your input, I'll build a set of tickets to plan out updates to the deployment content (if not the location in the docs).


Also, another question. Are there any missing targets? We don't have Databricks in this section, for example, but should provide a link to the docs stored elsewhere (and reconsider the distribution of Databricks docs in due course).

merelcht commented 1 year ago

My thoughts on the deployment targets listed above (fyi I haven't recently tried any of this so I'm totally guessing if these recommendations still work):

  1. Airflow: these docs are indeed in good shape, but haven't been changed since 2021. Without trying the steps I'd probably give it an amber 🟠 rating. It seems like our recommendation, Airflow with Astronomer, is slightly different from the GetInData one, which uses k8s. I'm not enough of an Airflow expert to say which approach is better, so for the time being I'd keep both.
  2. Argo: I know nothing about Argo. These docs are pretty old and the team member who wrote them isn't on the team anymore. I'd give it a red 🔴 rating.
  3. AWS Batch: these look good, but also haven't been changed in a long time. I'm not personally confident that this would still work without trying it so would give it a red 🔴.
  4. AWS SageMaker: similar to Argo, these are old docs and written by a member who isn't at QB anymore. I'd probably recommend the GetInData plugin instead.
  5. Azure: very happy to recommend the GetInData plugin here.
  6. Dask: These are fairly recent and added by Deepyaman, so I'm more confident these are in a good state and would rate them green 🟢
  7. Kubeflow: same as for AWS SageMaker: I'd recommend the GetInData plugin instead of our old docs.
  8. Prefect: the code in these docs has been updated by someone from QB fairly recently, so I'd be happy to keep them and rate them green 🟢
  9. VertexAI: again very happy to recommend the GetInData plugin.

stichbury commented 1 year ago

Thanks @merelcht that is amazingly useful.

Given that you're unsure about Airflow, Argo and Batch, I'll ask @deepyaman for a second opinion on those, but TBH, I'm happy to just slate those for an update when there's opportunity (and look at usage logs to see which to prioritise)

marrrcin commented 1 year ago

My two cents:

noklam commented 1 year ago

I agree with Merel mostly, I have some minor comments.

stichbury commented 1 year ago

Thanks @marrrcin, that's very useful. I'll take your input on Airflow on board, and likewise for Azure. I plan to add some text for that as you suggest.

And to @noklam also, thank you 🙏 I have no idea how I missed AWS Steps. I'll add it to my list, and add it to the flowchart.

Also, we don't have any copy about "Which AWS to use?" but that would be very useful. Let me get that on my list too.

marrrcin commented 1 year ago

I've revamped the quickstart guide for AzureML here: https://kedro-azureml.readthedocs.io/en/0.4.1/source/03_quickstart.html

astrojuanlu commented 1 year ago

I'm a bit late to the party, but regarding Prefect, note that the docs are written for Prefect 1.x, and Prefect 2.0 changed a few things (https://github.com/kedro-org/kedro/issues/2431), so I'd give those an amber rating too 🟠

stichbury commented 1 year ago

I will create a pair of tickets for updating the Prefect docs and Airflow/Astronomer docs to the latest versions, and note in the docs which version was used so readers are aware.

stichbury commented 1 year ago