Improve docs for Training Operator 1.8

andreyvelich commented 5 months ago

On the recent AutoML and Training WG call, we discussed how we can improve the documentation for Training Operator and the onboarding for new contributors.

We identified several action items that we can work on before the release:

[x] Add a section "Why use Kubeflow Training Operator?" where we can explain user stories and how Training Operator can manage distributed training for various ML frameworks in a single place, so ML engineers can easily train their ML models using a unified operator.
[x] Add detailed architecture diagram for Training Operator in addition to this one.
[x] Identify which docs should live on GitHub and which on the Kubeflow website.
[ ] Automate SDK doc generation for TrainingClient, ref issue in Katib repo: https://github.com/kubeflow/katib/issues/2081

Please let me know if we should add something else @kubeflow/release-managers @kubeflow/wg-training-leads @tenzen-y @shashank-iitbhu.

tenzen-y commented 5 months ago

Thank you for raising this great issue! Describing all features in the doc would be great. For example, we don't have any doc for TFJob with enableDynamicWorker.

So, as a first iteration, we should identify which features don't have any documentation.

andreyvelich commented 5 months ago

cc @andreeamun

StefanoFioravanzo commented 2 months ago

@andreyvelich @tenzen-y As discussed, I looked into the training operator docs and I want to propose an initial refactoring to better align with best practices in how technical docs are organized.

A little premise to my proposal: in general, you want tech docs to be organized in macro sections that roughly address:

"Overview/Installation/GettingStarted"
"HowTos/UserGuides"
"Reference" (anything from autogen API docs to arch diagrams, implementation details, etc.)
"Explanation" (anything that concerns explaining in free form why the project took some decisions, or discussions of the ecosystem, integrations, etc.)

In our case we may also want to consider a "Developer" section, particularly useful for OSS projects.

Now, I can see clear ways to improve the current doc structure to better align with that model. Here are some suggestions:

Split "Overview" into
- "Overview" - trimmed down to only contain an intro to the project, how it fits within the ecosystem, who should care and why
- "Getting Started" - one or two simple examples to experiment with the training operator. No explanation required, just something that works end to end
- "Installation" - particularly important for those who want to install without Kubeflow Platform
- Move the Architecture part to a new section "Reference"
Move "Job Scheduling" under a new section called "User Guides", with the name "Advanced Scheduling". The main page provides an overview, and then we have two child pages called "Volcano" and "Scheduler Plugins"
Revisit each framework page with the following process:
1. Create a “<Framework> Training” under “User Guides” -> all the “how do I do something” content goes here
2. Create a “<Framework>” under “Reference” -> all the CRD reference + implementation details go here.

This doesn't have to happen all in one PR, which is why I split it into sequential steps. Let me know what you think. We can start iterating on some of these points in draft PRs, and I am happy to get this started.

andreyvelich commented 2 months ago

Thank you so much for this @StefanoFioravanzo, I really like your ideas. A few questions:

Should we order the Installation page before the Getting Started page, like in the Model Registry docs?
Do we want to separate guides between Users, Administrators, and Developers, like in the KServe docs or Jupyter docs, or can we do it in the next iteration?
- For example, initially we can move all guides to the User Guides.

all the CRD reference + implementation details go here.

We don't have CRD reference right now, how should we split these sections?

@kubeflow/wg-training-leads what are your thoughts?

StefanoFioravanzo commented 2 months ago

@andreyvelich

Should we order Installation before Getting Started page?

Yes, let's keep Installation before Getting Started. It makes sense for folks who need to go through the installation before getting hands-on.

Do we want to separate guides between Users, Administrators, and Developers

I am in favour of having additional grouping based on the persona. But, as a first step, I recommend limiting the amount of change. So, as you suggest, let's move all how-tos/guides to a generic "user guides" section. Once we go through this initial restructuring exercise, we can further refine.

We don't have CRD reference right now, how should we split these sections?

I think we do. I think I saw some generic CRD reference for some of the frameworks. If we don't have enough details, we can still add a "TBD" under a framework's reference/API guide.

andreyvelich commented 2 months ago

@StefanoFioravanzo I think we have only this one: https://github.com/kubeflow/training-operator/blob/master/docs/api/kubeflow.org_v1_generated.asciidoc, but I am not sure if we keep this doc updated. Is that right, @kubeflow/wg-training-leads?

StefanoFioravanzo commented 2 months ago

@andreyvelich since we merged https://github.com/kubeflow/website/pull/3719, can we revisit the first comment of this issue? What do we want to address for training operator 1.8 (Kubeflow 1.9)?

andreyvelich commented 2 months ago

I think we completed all items as part of Kubeflow 1.9. Let me close this issue.

kubeflow / training-operator

Improve docs for Training Operator 1.8 #1998