kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Improve docs for Training Operator 1.8 #1998

Closed by andreyvelich 2 months ago

andreyvelich commented 5 months ago

On the recent AutoML and Training WG call, we discussed how we can improve the documentation for Training Operator and the onboarding for new contributors.

We identified several action items that we can work on before the release:

Please let me know if we should add something else @kubeflow/release-managers @kubeflow/wg-training-leads @tenzen-y @shashank-iitbhu.

tenzen-y commented 5 months ago

Thank you for raising this great issue! Describing all features in the doc would be great. For example, we don't have any doc for TFJob with enableDynamicWorker.

So, as a first iteration, we should identify which features don't have any documentation.

andreyvelich commented 5 months ago

cc @andreeamun

StefanoFioravanzo commented 2 months ago

@andreyvelich @tenzen-y As discussed, I looked into the training operator docs and I want to propose an initial refactoring to better align with best practices in how technical docs are organized.

A little premise to my proposal: in general you want tech docs to be organized in macro sections that roughly address

In our case we may also want to consider a "Developer" section, particularly useful for OSS projects.

Now, I can see clear ways to improve the current doc structure to better align with that model. Here are some suggestions:

  1. Split "Overview" into
    • "Overview" - trimmed down to only contain an intro to the project, how it fits within the ecosystem, who should care and why
    • "Getting Started" - one or two simple examples to experiment with the training operator. No explanation required, something that just works end to end
    • "Installation" - particularly important for those who want to install without Kubeflow Platform
    • Move the Architecture part to a new section "Reference"
  2. Move "Job Scheduling" under a new section called "User Guides", with the name "Advanced Scheduling". The main page provides an overview, and then we have two child pages called "Volcano" and "Scheduler Plugins", respectively
  3. Revisit each framework page with the following process:
    1. Create a “<Framework> Training” under “User Guides” -> all the “how do I do something” goes here
    2. Create a “<Framework>” under “Reference” -> all the CRD reference + implementation details go here.

This doesn't have to happen all in one PR, which is why I split it into sequential steps. Let me know what you think. We can start iterating on some of these points in draft PRs, and I am happy to get this started.

andreyvelich commented 2 months ago

Thank you so much for this @StefanoFioravanzo, I really like your ideas. A few questions:

> all the CRD reference + implementation details go here.

We don't have CRD reference right now, how should we split these sections?

@kubeflow/wg-training-leads what are your thoughts?

StefanoFioravanzo commented 2 months ago

@andreyvelich

> Should we order the Installation page before the Getting Started page?

Yes, let's keep Installation before Getting Started. It makes sense for folks who need to go through the installation before getting hands-on.

> Do we want to separate guides between Users, Administrators, and Developers?

I am in favour of having additional grouping based on the persona. But, as a first step, I recommend limiting the amount of change. So, as you suggest, let's move all how-tos/guides to a generic "user guides" section. Once we go through this initial restructuring exercise, we can further refine.

> We don't have CRD reference right now, how should we split these sections?

I think we do. I think I saw some generic CRD reference for some of the frameworks. If we don't have enough details, we can still add a "TBD" under a framework's reference/API guide.

andreyvelich commented 2 months ago

@StefanoFioravanzo I think we have only this one: https://github.com/kubeflow/training-operator/blob/master/docs/api/kubeflow.org_v1_generated.asciidoc, but I am not sure whether we keep this doc updated. Do we, @kubeflow/wg-training-leads?

StefanoFioravanzo commented 2 months ago

@andreyvelich since we merged https://github.com/kubeflow/website/pull/3719, can we revisit the first comment of this issue? What do we want to address for training operator 1.8 (Kubeflow 1.9)?

andreyvelich commented 2 months ago

I think we completed all items as part of Kubeflow 1.9. Let me close this issue.